When LLMs step into the 3D world: a survey and meta-analysis of 3D tasks via multi-modal Large Language Models

Ma, X; Bhalgat, Y; Smart, B; Chen, S; Li, X; Ding, J; Gu, J; Chen, DZ; Peng, S; Bian, JW; Torr, P; Pollefeys, M; Nießner, M; Reid, ID; Chang, AX; Laina, I; Prisacariu, VA

Conference item

Abstract:: As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: https://github.com/ActiveVisionLab/Awesome-LLM-3D.

Publication status:: Accepted

Peer review status:: Peer reviewed

Cite this record

APA Style

Ma, X., Bhalgat, Y., Smart, B., Chen, S., Li, X., Ding, J., Gu, J., Chen, D. Z., Peng, S., Bian, J. W., Torr, P., Pollefeys, M., Nießner, M., Reid, I. D., Chang, A. X., Laina, I., & Prisacariu, V. A. (2024). When LLMs step into the 3D world: a survey and meta-analysis of 3D tasks via multi-modal Large Language Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024).

MLA Style

Ma, X, et al. “When LLMs Step into the 3D World: a Survey and Meta-Analysis of 3D Tasks via Multi-Modal Large Language Models.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), 2024.

Chicago Style

Ma, X, Y Bhalgat, B Smart, S Chen, X Li, J Ding, J Gu, et al. 2024. “When LLMs Step into the 3D World: a Survey and Meta-Analysis of 3D Tasks via Multi-Modal Large Language Models.” In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024). IEEE.

Access Document

Files:: Ma_et_al_2024_When_LLMs_step.pdf

(Accepted manuscript, pdf, 1.9MB)

Authors

Ma, X

Role:: Author

Bhalgat, Y

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Smart, B

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Chen, S

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Li, X

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

More authors...

Publisher:: IEEE
Acceptance date:: 2024-02-26
Event title:: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Event location:: Seattle, WA, USA
Event website:: https://cvpr.thecvf.com/
Event start date:: 2024-06-17
Event end date:: 2024-06-21

Language:: English
Keywords:: vision language models, computer vision, 3D scene understanding, large language models
Pubs id:: 2013444
Local pid:: pubs:2013444
Deposit date:: 2024-07-10
ARK identifier:: ark:/29072/ora_1384745ffe864759ab940724b157111c

Terms of use

Notes:: This paper was presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), 17th-21st June 2024, Seattle, WA, USA. This is the accepted manuscript version of the article. The final version will be available online from a forthcoming edition of the conference proceedings.

Licence:: Terms and Conditions of Use for Oxford University Research Archive
