Journal article

A unified perspective on the dynamics of deep transformers

Abstract:
Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the iterative application of attention across layers induces complex dynamics that remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure and model its evolution as a Vlasov equation called Transformer PDE, whose velocity field is non-linear in the probability measure. Our first set of contributions focuses on compactly supported initial data. We show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system, thus generalizing and extending previous analysis to several variants of self-attention: multi-head attention, ℓ2 attention, Sinkhorn attention, Sigmoid attention, and masked attention, leveraging a conditional Wasserstein framework. In a second set of contributions, we are the first to study non-compactly supported initial conditions, by focusing on Gaussian initial data. Again for different types of attention, we show that the Transformer PDE preserves the space of Gaussian measures, which allows us to analyze the Gaussian case theoretically and numerically to identify typical behaviors. This Gaussian analysis captures the evolution of data anisotropy through a deep Transformer. In particular, we highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
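Illustrative sketch:
The abstract does not write out the interacting particle system, so the following is only a minimal sketch under an assumed form: each token x_i moves with velocity sum_j softmax_j(<Q x_i, K x_j>) V x_j, which is the usual single-head softmax self-attention, and one explicit Euler step stands in for one residual layer. The names `attention_velocity`, `euler_step`, `Q`, `K`, `V`, and `dt` are hypothetical and are not taken from the article.

```python
import math


def attention_velocity(tokens, Q, K, V):
    """Velocity of every token under softmax self-attention (assumed form)."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    queries = [matvec(Q, x) for x in tokens]
    keys = [matvec(K, x) for x in tokens]
    values = [matvec(V, x) for x in tokens]
    dim = len(values[0])

    velocities = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Softmax-weighted average of the value vectors.
        velocities.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(dim)
        ])
    return velocities


def euler_step(tokens, Q, K, V, dt):
    """One explicit Euler step of the particle system (one 'layer')."""
    vel = attention_velocity(tokens, Q, K, V)
    return [[x + dt * v for x, v in zip(tok, vt)]
            for tok, vt in zip(tokens, vel)]


# Hypothetical toy data: three tokens in the plane, identity weight matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
stepped = euler_step(tokens, identity, identity, identity, dt=0.1)
```

In this sketch the empirical measure of the tokens plays the role of the probability measure in the abstract; the mean-field statement concerns the limit of many tokens, which this toy example does not attempt to reproduce.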
Publication status:
Accepted
Peer review status:
Peer reviewed

Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Mathematical Institute
Role:
Author


Funder identifier:
https://ror.org/00rbzpz17
Grant:
ANR-23-IACL-0008
Programme:
“France 2030” program

Funder identifier:
https://ror.org/019w4f821
Grant:
883363
Programme:
Horizon 2020 research and innovation programme

Funder identifier:
https://ror.org/05r0vyz12
Grant:
CEX2020-001105-M
Programme:
María de Maeztu Units of Excellence programme

Funder identifier:
https://ror.org/0439y7842
Grant:
EP/V051121/1

Funder identifier:
https://ror.org/0472cxd90
Programme:
project WOLF


Publisher:
Springer
Journal:
Foundations of Computational Mathematics
Acceptance date:
2026-06-04
EISSN:
1615-3383
ISSN:
1615-3375


Language:
English
Pubs id:
2085622
Local pid:
pubs:2085622
Deposit date:
2026-06-05
