From video to virtual: object-centric 3D scene understanding from videos

Bhalgat, YS

Thesis

Abstract: Understanding the 3D structure of our world from casual videos remains a central challenge in computer vision. Videos provide natural supervision through motion and viewpoint changes, yet inferring geometry, objectness, and semantics directly from such unconstrained input is still difficult. This thesis develops methods for object-centric 3D scene understanding from video, combining geometric priors, neural fields, and foundation models for both static and dynamic environments.

The work begins by learning how different views relate to one another -- a prerequisite for any model that aims to understand 3D structure. We teach vision transformers to internalize multi-view geometry through an epipolar-aware attention objective that softly enforces geometric consistency across viewpoints, yielding viewpoint-invariant correspondences without requiring camera poses at inference.
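As a rough illustration of the geometric constraint behind "epipolar-aware" attention, the sketch below computes the epipolar line of a pixel in a second view and turns the distance of a candidate match from that line into a soft weight. It is not the thesis's objective: the function names, the Gaussian weighting, the `sigma` parameter, and the toy fundamental matrix (a pure sideways camera translation with identity intrinsics) are all assumptions chosen for illustration.

```python
import math

def epipolar_line(F, x):
    """Epipolar line l' = F x in the second view, as coefficients (a, b, c)."""
    xh = (x[0], x[1], 1.0)  # pixel in homogeneous coordinates
    return tuple(sum(F[i][j] * xh[j] for j in range(3)) for i in range(3))

def point_line_distance(line, p):
    """Perpendicular distance from pixel p to the line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * p[0] + b * p[1] + c) / math.hypot(a, b)

def soft_epipolar_weight(F, x, x_prime, sigma=1.0):
    """Gaussian weight (an assumed form) that decays with distance from the epipolar line."""
    d = point_line_distance(epipolar_line(F, x), x_prime)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

# Hypothetical fundamental matrix for a camera translating along the x-axis
# (identity intrinsics): epipolar lines are horizontal, y' = y.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

on_line = soft_epipolar_weight(F, (10.0, 5.0), (30.0, 5.0))   # same row: full weight
off_line = soft_epipolar_weight(F, (10.0, 5.0), (30.0, 8.0))  # 3 px off the line: small weight
```

A weight of this kind could, for instance, serve as a soft target for cross-view attention during training, so that no camera poses are needed once the model is trained; how the thesis actually formulates the objective is not specified in this abstract.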

Once geometric reasoning is established, modeling object structure within static scenes becomes a natural next step. We achieve this by "lifting" 2D instance predictions from large segmentation models into a neural feature field with a slow-fast contrastive objective. This approach fuses information that is inherently inconsistent across views, namely untracked 2D instance segmentations, to recover coherent 3D object instances in cluttered environments, without any 3D supervision.

We extend this formulation to represent not only objects but also their hierarchical and semantic relations. Our proposed nested neural feature field encodes part-object-scene structure and aligns it with language-based embeddings, enabling open-vocabulary reasoning and efficient querying of complex indoor scenes.
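Open-vocabulary querying of a language-aligned feature field is commonly done by comparing a text embedding against per-object (or per-point) features with cosine similarity. The sketch below shows that generic pattern only; the three-dimensional toy vectors, the object names, and the helper functions are hypothetical stand-ins, and a real system would obtain both feature sets from a learned vision-language model rather than hand-written lists.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_query(object_features, query_embedding):
    """Return object names sorted from most to least similar to the query."""
    return sorted(
        object_features,
        key=lambda name: cosine(object_features[name], query_embedding),
        reverse=True,
    )

# Toy, hand-written embeddings (assumptions for illustration only).
object_features = {
    "chair": [0.9, 0.1, 0.0],
    "table": [0.1, 0.9, 0.0],
    "lamp": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.2, 0.0]  # stands in for an embedded text query

ranking = rank_by_query(object_features, query_embedding)
```

In a nested part-object-scene representation, the same comparison could be run at each level of the hierarchy, so a query might match a whole object or only one of its parts.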

Finally, we add knowledge of motion by proposing a framework for dynamic 3D scene understanding in egocentric videos that integrates segmentation, 2D-to-3D lifting, and geometry-aware association to track objects over time, maintaining identity under motion and occlusion while supporting amodal reconstruction. Together, these contributions unify geometry, structure, semantics, and dynamics for learning object-centric 3D representations from everyday videos.

Cite this record

APA Style

Bhalgat, Y. S. (2025). From video to virtual: object-centric 3D scene understanding from videos [PhD thesis]. University of Oxford.

MLA Style

Bhalgat, YS. From Video to Virtual: Object-Centric 3D Scene Understanding from Videos. 2025. University of Oxford, PhD thesis.

Chicago Style

Bhalgat, YS. 2025. “From Video to Virtual: Object-Centric 3D Scene Understanding from Videos.” PhD thesis, University of Oxford.