Preprint

Higher-order transformer derivative estimates for explicit pathwise learning guarantees

Abstract:
An inherent challenge in computing fully explicit generalization bounds for transformers is obtaining covering number estimates for the given transformer class T. Crude estimates rely on a uniform upper bound on the local Lipschitz constants of the transformers in T, whereas finer estimates require an analysis of their higher-order partial derivatives. Unfortunately, precise higher-order derivative estimates for (realistic) transformer models are not currently available in the literature, as they are combinatorially delicate due to the intricate compositional structure of transformer blocks.
This paper fills this gap by precisely estimating the derivatives of all orders for the transformer model. We consider realistic transformers with multiple (non-linearized) attention heads per block and layer normalization. We obtain fully explicit estimates of all constants in terms of the number of attention heads, the depth and width of each transformer block, and the number of normalization layers. Further, we explicitly analyze the impact of various standard activation function choices (e.g., SWISH and GeLU). As an application, we obtain explicit pathwise generalization bounds for transformers on a single trajectory of an exponentially ergodic Markov process, valid at a fixed future time horizon. We conclude that real-world transformers can learn from N (non-i.i.d.) samples of a single Markov process's trajectory at a rate of O(polylog(N)/√N).
Publication status:
Published
Peer review status:
Not peer reviewed

Preprint server copy:
10.48550/arxiv.2405.16563

Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Mathematical Institute
Role:
Author
ORCID:
0000-0002-8418-7284
Institution:
University of Oxford
Division:
MPLS
Department:
Mathematical Institute
Role:
Author
ORCID:
0000-0002-6330-5480
Institution:
University of Oxford
Division:
MPLS
Department:
Mathematical Institute
Role:
Author


Funder identifier:
https://ror.org/01h531d29
Funding agency for:
Kratsios, A
Saqur, R
Grant:
RGPIN-2023-04482


Preprint server:
arXiv
Publication date:
2024-05-26

Language:
English
Pubs id:
2282237
UUID:
uuid_324ed2aa-0221-4bc3-b898-ffda49822b62
Local pid:
pubs:2282237
Source identifiers:
W4399115775
Deposit date:
2026-01-23