Thesis
Visual understanding of the physical world
- Abstract:
We live in a 3D physical world, and a first step towards artificial general intelligence is to enable machines to understand it. That is the goal of this thesis, which is structured around three themes: (1) understanding and handling occlusion, (2) understanding the 3D and physical properties of a scene, and (3) bridging visual understanding with language. Across all three themes, we build our methods on top of large-scale pre-trained models or their representations.
For occlusion understanding and handling, we design a tri-layer plugin for conventional pre-trained object detectors that improves object detection and instance segmentation under occlusion. As an additional contribution on occlusion, we advance amodal completion, recovering the complete shape of occluded objects by exploiting the prior of a pre-trained Stable Diffusion model.
For 3D physical understanding, we start with static 3D physical properties in images, and set up a protocol to probe large-scale pre-trained visual foundation models for their understanding of such properties. We also study dynamic 3D physical properties in videos, and explore predicting these properties from different types of large-scale pre-trained video foundation models.
For visual-language understanding, we focus on improving visual-language foundation models. For CLIP-like large-scale pre-trained models, we improve text-to-image retrieval performance by introducing a learnable prompt for the visual encoder that is conditioned on the text. For ChatGPT-like large-scale pre-trained models, we improve visual grounding performance and efficiency by equipping a small model with multi-modal reasoning capability.
- Files:
- Dissemination version (pdf, 45.6MB)
Authors
Contributors
+ Zisserman, A
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
- ORCID:
- 0000-0002-8945-8573
+ Xie, W
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Engineering Science
- Role:
- Supervisor
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2026-04-20
- ARK identifier:
Terms of use
- Copyright holder:
- Guanqi Zhan
- Copyright date:
- 2025