Conference item
Towards interpretable sequence continuation: analyzing shared circuits in large language models
- Abstract:
- While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit has effects on various math-related prompts, such as on intervaled circuits, Spanish number word and months continuation, and natural language word problems. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Files:
- Accepted manuscript (pdf, 3.8MB)
- Publisher copy:
- 10.18653/v1/2024.emnlp-main.699
Authors
- Publisher:
- Association for Computational Linguistics
- Host title:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Pages:
- 12576–12601
- Publication date:
- 2024-11-01
- Acceptance date:
- 2024-09-20
- Event title:
- Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)
- Event location:
- Miami, Florida, USA
- Event website:
- https://2024.emnlp.org/
- Event start date:
- 2024-11-12
- Event end date:
- 2024-11-16
- DOI:
- 10.18653/v1/2024.emnlp-main.699
- Language:
- English
- Pubs id:
- 2074856
- Local pid:
- pubs:2074856
- Deposit date:
- 2025-01-08
Terms of use
- Copyright holder:
- Association for Computational Linguistics
- Copyright date:
- 2024
- Rights statement:
- © 2024 Association for Computational Linguistics
- Notes:
- This paper was presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), 12th-16th November 2024, Miami, FL, U.S.A. This is the accepted manuscript version of the article. The final version is available online from Association for Computational Linguistics at: https://dx.doi.org/10.18653/v1/2024.emnlp-main.699