Towards interpretable sequence continuation: analyzing shared circuits in large language models

Conference item

Abstract:: While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit has effects on various math-related prompts, such as on intervaled circuits, Spanish number word and months continuation, and natural language word problems. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.

Files:: Lan_et_al_2024_Towards_interpretable_sequence.pdf

(Accepted manuscript, pdf, 3.8MB)

Publisher:: Association for Computational Linguistics
Host title:: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Pages:: 12576–12601
Publication date:: 2024-11-01
Acceptance date:: 2024-09-20
Event title:: Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)
Event location:: Miami, Florida, USA
Event website:: https://2024.emnlp.org/
Event start date:: 2024-11-12
Event end date:: 2024-11-16
DOI:: 10.18653/v1/2024.emnlp-main.699

Copyright holder:: Association for Computational Linguistics
Notes:: This paper was presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), 12th-16th November 2024, Miami, FL, U.S.A. This is the accepted manuscript version of the article. The final version is available online from Association for Computational Linguistics at: https://dx.doi.org/10.18653/v1/2024.emnlp-main.699

Licence:: Terms and Conditions of Use for Oxford University Research Archive
