
Thesis

From video to virtual: object-centric 3D scene understanding from videos

Abstract:
Understanding the 3D structure of our world from casual videos remains a central challenge in computer vision. Videos provide natural supervision through motion and viewpoint changes, yet inferring geometry, objectness, and semantics directly from such unconstrained input is still difficult. This thesis develops methods for object-centric 3D scene understanding from video, combining geometric priors, neural fields, and foundation models for both static and dynamic environments.

The work begins by learning how different views relate to one another, a prerequisite for any model that aims to understand 3D structure. We teach vision transformers to internalize multi-view geometry through an epipolar-aware attention objective that softly enforces geometric consistency across viewpoints, yielding viewpoint-invariant correspondences without requiring camera poses at inference.

Once geometric reasoning is established, modeling object structure within static scenes becomes a natural next step. We achieve this by "lifting" 2D instance predictions from large segmentation models into a neural feature field with a slow-fast contrastive objective. This approach fuses information that is inherently inconsistent across views, namely untracked 2D instance segmentations, to recover coherent 3D object instances in cluttered environments without any 3D supervision.

We extend this formulation to represent not only objects but also their hierarchical and semantic relations. Our proposed nested neural feature field encodes part-object-scene structure and aligns it with language-based embeddings, enabling open-vocabulary reasoning and efficient querying of complex indoor scenes.

Finally, we add knowledge of motion by proposing a framework for dynamic 3D scene understanding in egocentric videos. It integrates segmentation, 2D-to-3D lifting, and geometry-aware association to track objects over time, maintaining identity under motion and occlusion while supporting amodal reconstruction. Together, these contributions unify geometry, structure, semantics, and dynamics for learning object-centric 3D representations from everyday videos.

Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Research group:
Visual Geometry Group
Oxford college:
St Cross College
Role:
Author

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Supervisor
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Supervisor
ORCID:
0000-0002-8945-8573
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Supervisor
ORCID:
0000-0003-1374-2858
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Supervisor
Institution:
University of Oxford
Division:
MPLS
Department:
Computer Science
Role:
Examiner


Funder identifier:
https://ror.org/0439y7842
Funding agency for:
Bhalgat, YS
Grant:
EP/S024050/1
Programme:
Autonomous Intelligent Machines and Systems CDT


DOI:
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford

