Conference item
Video representation learning by dense predictive coding
- Abstract:
- The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions. First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos. This learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations. Second, we propose a curriculum training scheme to predict further into the future with progressively less temporal context. This encourages the model to encode only slowly varying spatio-temporal signals, therefore leading to semantic representations. Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then fine-tuning the representation on a downstream task, i.e. action recognition. With a single stream (RGB only), DPC pre-trained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 accuracy) and HMDB51 (35.7% top-1 accuracy), outperforming all previous learning methods by a significant margin, and approaching the performance of a baseline pre-trained on ImageNet. The code is available at https://github.com/TengdaHan/DPC.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Authors
- Publisher:
- Computer Vision Foundation
- Publication date:
- 2019-11-02
- Acceptance date:
- 2019-08-15
- Event title:
- First Workshop on Large Scale Holistic Video Understanding
- Event series:
- IEEE International Conference on Computer Vision 2019
- Event location:
- Seoul, Korea
- Event website:
- https://holistic-video-understanding.github.io/workshops/iccv2019.html
- Event start date:
- 2019-10-27
- Event end date:
- 2019-11-02
- Language:
- English
- Keywords:
- Pubs id:
- pubs:1060202
- UUID:
- uuid:16d379d6-776e-4a97-a440-5180a2d782a5
- Local pid:
- pubs:1060202
- Source identifiers:
- 1060202
- Deposit date:
- 2019-10-04
Terms of use
- Copyright holder:
- Han et al.
- Copyright date:
- 2019
- Rights statement:
- © The Authors 2019.
- Notes:
- This paper was presented at the First International Workshop on Large Scale Holistic Video Understanding, part of the International Conference on Computer Vision 2019, Seoul, South Korea, October-November 2019. This is the publisher's version of the paper, provided Open Access by the Computer Vision Foundation.
- Licence:
- CC Attribution (CC BY)