Conference item

Towards interpretable sequence continuation: analyzing shared circuits in large language models

Abstract:
While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit has effects on various math-related prompts, such as on intervaled circuits, Spanish number word and months continuation, and natural language word problems. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.
Publication status:
Published
Peer review status:
Peer reviewed

Publisher copy:
10.18653/v1/2024.emnlp-main.699

Authors


Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author
Institution:
University of Oxford
Division:
MPLS
Department:
Engineering Science
Role:
Author


Publisher:
Association for Computational Linguistics
Host title:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Pages:
12576–12601
Publication date:
2024-11-01
Acceptance date:
2024-09-20
Event title:
Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)
Event location:
Miami, Florida, USA
Event website:
https://2024.emnlp.org/
Event start date:
2024-11-12
Event end date:
2024-11-16
DOI:
10.18653/v1/2024.emnlp-main.699
Language:
English
Pubs id:
2074856
Local pid:
pubs:2074856
Deposit date:
2025-01-08
