Thesis
Unsupervised object learning
- Abstract:
- The visual world consists of discrete, meaningful objects, which humans effortlessly perceive and segment without supervision. Emulating this ability in machines is a fundamental problem in computer vision, offering a more cognitively plausible and scalable alternative to supervised methods. In this thesis, we explore principled methods to uncover visual objects without relying on dense mask annotations. Firstly, we explore the principle of compositionality, which posits that scenes are composed of discrete, reusable objects. Numerous methods based on this principle exist, yet we note that they are limited to simplistic environments. We introduce a series of new benchmark datasets to analyse whether current methods can scale to visually complex inputs. Most formulations do not handle the complexity of scenes well, requiring simpler, uniform appearances to produce a good segmentation. Secondly, we explore the principle of common fate, which posits that entities that move together should be grouped together. We design several loss functions that connect mask predictions with estimates of scene motion to handle binary and multi-object scenarios. Our proposed formulations can be applied to various existing segmentation methods, complementing their learning principles with learning from motion. We then consider the limitations of instantaneous motion and propose incorporating long-term motion information using sparse point trajectories. To enable this, we design a loss function that enforces the idea that trajectories within an object should be highly redundant. Finally, we explore how existing structures in language can be used to learn object segmentation without the need for any dense mask annotations. We construct a method for open-vocabulary segmentation that uses a pre-trained text-to-image diffusion model to connect language with visual representations of objects. It avoids the need for any further training, showing that text-to-image diffusion models are also powerful open-vocabulary segmentation methods.
Authors
Contributors
+ Vedaldi, A
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Research group:
- Visual Geometry Group (VGG)
- Role:
- Supervisor
- ORCID:
- 0000-0003-1374-2858
+ Rupprecht, C
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Computer Science
- Research group:
- Visual Geometry Group (VGG)
- Role:
- Supervisor
+ Laina, I
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Research group:
- Visual Geometry Group (VGG)
- Role:
- Supervisor
+ Prisacariu, V
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Research group:
- Active Vision Lab
- Role:
- Examiner
+ Brox, T
- Institution:
- University of Freiburg
- Role:
- Examiner
+ Engineering and Physical Sciences Research Council
- Funder identifier:
- https://ror.org/0439y7842
- Funding agency for:
- Karazija, L
- Grant:
- EP/S024050/1
- Programme:
- EPSRC CDT in Autonomous Intelligent Machines and Systems
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Subjects:
- Deposit date:
- 2025-11-22
Terms of use
- Copyright holder:
- Laurynas Karazija
- Copyright date:
- 2025