Thesis
From video to virtual: object-centric 3D scene understanding from videos
- Abstract:
Understanding the 3D structure of our world from casual videos remains a central challenge in computer vision. Videos provide natural supervision through motion and viewpoint changes, yet inferring geometry, objectness, and semantics directly from such unconstrained input is still difficult. This thesis develops methods for object-centric 3D scene understanding from video, combining geometric priors, neural fields, and foundation models for both static and dynamic environments.
The work begins by learning how different views relate to one another -- a prerequisite for any model that aims to understand 3D structure. We teach vision transformers to internalize multi-view geometry through an epipolar-aware attention objective that softly enforces geometric consistency across viewpoints, yielding viewpoint-invariant correspondences without requiring camera poses at inference.
Once geometric reasoning is established, modeling object structure within static scenes becomes a natural next step. We achieve this by "lifting" 2D instance predictions from large segmentation models into a neural feature field with a slow-fast contrastive objective. This approach fuses inherently multi-view inconsistent information, viz. untracked 2D instance segmentations, to recover coherent 3D object instances in cluttered environments, without any 3D supervision.
We extend this formulation to represent not only objects but also their hierarchical and semantic relations. Our proposed nested neural feature field encodes part-object-scene structure and aligns it with language-based embeddings, enabling open-vocabulary reasoning and efficient querying of complex indoor scenes.
Finally, we add knowledge of motion by proposing a framework for dynamic 3D scene understanding in egocentric videos. It integrates segmentation, 2D-to-3D lifting, and geometry-aware association to track objects over time, maintaining identity under motion and occlusion while supporting amodal reconstruction. Together, these contributions unify geometry, structure, semantics, and dynamics for learning object-centric 3D representations from everyday videos.
Authors
Contributors
+ Henriques, J
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
+ Zisserman, A
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
- ORCID:
- 0000-0002-8945-8573
+ Vedaldi, A
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
- ORCID:
- 0000-0003-1374-2858
+ Laina, I
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
+ Rupprecht, C
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Computer Science
- Role:
- Examiner
+ Engineering and Physical Sciences Research Council
- Funder identifier:
- https://ror.org/0439y7842
- Funding agency for:
- Bhalgat, YS
- Grant:
- EP/S024050/1
- Programme:
- Autonomous Intelligent Machines and Systems CDT
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2026-05-04
- ARK identifier:
Terms of use
- Copyright holder:
- Yash Sanjay Bhalgat
- Copyright date:
- 2025
- Notes:
- The following works are derived from this thesis: "N2F2: hierarchical scene understanding with nested neural feature fields"; "3D-aware instance segmentation and tracking in egocentric videos"; "A light touch approach to teaching transformers multi-view geometry"; and "Contrastive lift: 3D object instance segmentation by slow-fast contrastive fusion".