Journal article

AI models collapse when trained on recursively generated data

Abstract:
Stable Diffusion revolutionized image creation from descriptive text. GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high performance across a variety of language tasks. ChatGPT introduced such language models to the public. It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as ‘model collapse’ and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models. We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, data collected about genuine human interactions with systems will be increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.
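The effect described in the abstract can be illustrated with a toy sketch. It is not the authors' experimental setup: it assumes a single one-dimensional Gaussian as the "model", a small fixed sample size, and refitting by maximum likelihood at every generation, with each generation trained only on samples drawn from the previous fit. The function name `recursive_gaussian_fit` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def recursive_gaussian_fit(n_samples=20, n_generations=200, seed=0):
    """Toy illustration (an assumption, not the paper's method).

    Generation 0 is fitted to samples from a standard normal. Every later
    generation is fitted only to samples drawn from the previous fit.
    Returns the fitted standard deviation of each generation.
    """
    rng = random.Random(seed)
    # "Real" data: samples from the original distribution N(0, 1).
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    stds = []
    for _ in range(n_generations):
        # Fit the model: maximum-likelihood mean and standard deviation.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        stds.append(sigma)
        # The next generation sees only model-generated samples.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    return stds


stds = recursive_gaussian_fit()
print(f"generation 0 std: {stds[0]:.3f}")
print(f"generation {len(stds) - 1} std: {stds[-1]:.3g}")
```

In this sketch the fitted standard deviation tends to shrink across generations, because each finite sample slightly under-represents the spread of the distribution it came from and that estimation error compounds. This is a simplified analogue of the disappearing tails the abstract describes; it says nothing quantitative about LLMs, VAEs, or GMMs.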
Publication status:
Published
Peer review status:
Peer reviewed

Files:
Publisher copy:
10.1038/s41586-024-07566-y

Authors


More by this author
Institution:
University of Oxford
Role:
Author
More by this author
Role:
Author
ORCID:
0000-0002-3727-7463
More by this author
Role:
Author
ORCID:
0000-0001-8697-5682


Publisher:
Nature Research
Journal:
Nature
Volume:
631
Issue:
8022
Pages:
755-759
Publication date:
2024-07-24
Acceptance date:
2024-05-14
DOI:
10.1038/s41586-024-07566-y
EISSN:
1476-4687
ISSN:
0028-0836


Language:
English
Source identifiers:
2137072
Deposit date:
2024-07-25
