Thesis

Self-supervised video understanding

Abstract:

The advent of deep learning has brought about great progress on many fundamental computer vision tasks such as classification, detection, and segmentation, which describe the categories and locations of objects in images and video. There has also been much work on supervised learning: teaching machines to solve these tasks using human-annotated labels. However, it is insufficient for machines to know only the names and locations of certain objects; many tasks require a deeper understanding of the complex physical world, for example how objects interact with their surroundings (often by creating shadows, reflections, surface deformations, and other visual effects). Furthermore, training models that rely heavily on human supervision is costly and impractical to scale. This thesis therefore explores two directions: first, we go beyond segmentation and address a wholly new task, grouping objects with their correlated visual effects (e.g. shadows, reflections, or attached objects); second, we address the fundamental task of video object segmentation in a self-supervised manner, without relying on any human annotation.

To automatically group objects with their correlated visual effects, we adopt a layered approach: we decompose a video into object-specific layers, each containing all elements that move with its object. One application of these layers is that they can be recombined in new ways to produce a highly realistic, altered version of the original video (e.g. removing or duplicating objects, or changing the timing of their motions). The key is to leverage natural properties of convolutional neural networks to obtain a layered decomposition of the input video: we design a neural network that produces these layers by overfitting to a single input video. We first introduce a human-specific method, then show how it can be adapted to arbitrary object classes, such as animals or cars.
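The recombination step described above can be illustrated with standard back-to-front alpha compositing. This is a minimal sketch, not the thesis's actual network: it assumes each layer supplies an RGB image and an alpha matte, and the helper name `composite` is hypothetical.

```python
import numpy as np

def composite(layers):
    """Composite object-specific layers back-to-front.

    Each layer is a (rgb, alpha) pair: rgb has shape (H, W, 3) in [0, 1],
    alpha has shape (H, W, 1) in [0, 1]. The first layer is the
    background; later layers are composited over it.
    """
    canvas = np.zeros_like(layers[0][0])
    for rgb, alpha in layers:
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return canvas

# Toy demo: a grey background plus one fully transparent object layer.
# Editing the video amounts to editing the layer list before compositing;
# dropping an object layer would remove the object together with its
# shadow or reflection, since those live in the same layer.
H, W = 4, 4
background = (np.full((H, W, 3), 0.5), np.ones((H, W, 1)))
obj = (np.ones((H, W, 3)), np.zeros((H, W, 1)))
frame = composite([background, obj])
```

Because each layer carries the object together with its correlated effects, simple list operations on the layers (remove, duplicate, time-shift) translate directly into plausible edits of the rendered video.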

Our second task is video object segmentation: producing pixel-wise labels (segments) for objects in videos. Whereas our previous method is optimized on a single video, here we take a data-driven approach and train on a large corpus of videos in a self-supervised manner. We consider two task settings: (1) semi-supervised object segmentation, where an initial object mask is provided for a single frame and the method must propagate it to the remaining frames, and (2) moving object discovery, where no mask is given and the method must segment the salient moving object. We explore two input streams, RGB and optical flow, and discuss their connection to the human visual system.
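The semi-supervised setting, propagating a first-frame mask through the video, can be sketched in its simplest form with forward warping by optical flow. This is an illustrative baseline under assumed inputs, not the learned method developed in the thesis; the function name `propagate_mask` is hypothetical.

```python
import numpy as np

def propagate_mask(mask, flow):
    """Warp a binary mask from frame t to frame t+1 using optical flow.

    mask: (H, W) binary array for frame t.
    flow: (H, W, 2) forward flow (dx, dy) from frame t to frame t+1.
    Naive forward splatting with nearest-pixel rounding; real systems
    learn correspondences and handle occlusion and drift.
    """
    H, W = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    for y, x in zip(ys, xs):
        dx, dy = flow[y, x]
        nx, ny = int(round(x + dx)), int(round(y + dy))
        if 0 <= nx < W and 0 <= ny < H:
            out[ny, nx] = 1
    return out

# Toy demo: a single-pixel mask carried one pixel to the right.
mask = np.zeros((3, 3), dtype=np.uint8)
mask[1, 1] = 1
flow = np.zeros((3, 3, 2))
flow[..., 0] = 1.0
warped = propagate_mask(mask, flow)
```

Chaining this step frame-to-frame accumulates errors, which is one reason the two input streams matter: flow captures motion directly, while RGB appearance helps re-anchor the segment when flow is unreliable.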

Authors


Division:
MPLS
Department:
Engineering Science
Role:
Author

Contributors

Role:
Supervisor
Role:
Examiner
Role:
Examiner


Funders

Funder identifier:
http://dx.doi.org/10.13039/501100000266
Grant:
EP/M013774/1
Programme:
EPSRC Programme Grant Seebibyte

Programme:
Oxford-Google DeepMind Graduate Scholarship


Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford


Language:
English
Keywords:
Deposit date:
2022-04-15
