Beyond the mean: Fisher-orthogonal projection for natural gradient descent in large batch training

Lu, Y; Armour, W

AI Collection

Conference item

Abstract:: Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such large batch sizes. As batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods like the natural gradient with Kronecker-Factored Approximate Curvature (K-FAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively "washes out" the curvature information that gives these methods their advantage, reducing their performance to that of simple gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of the second-order method at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction by leveraging gradients from two sub-batches, enhancing the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher metric. Through extensive benchmarks, we show that FOP accelerates convergence by ×1.2–1.3 over K-FAC and ×1.5–1.7 over SGD/AdamW at the same moderate batch sizes, while at extreme scales it achieves up to a ×7.5 speedup. Unlike other methods, FOP maintains small-batch accuracy when scaling to extremely large batch sizes. Moreover, it reduces Top-1 error by 2.3–3.3% on long-tailed CIFAR benchmarks, demonstrating robust generalization under severe class imbalance. Our lightweight, geometry-aware use of intra-batch variance makes natural-gradient optimization practical on modern data-centre GPUs.
FOP is open-source and pip-installable, and it can be integrated into existing training code with a single line and no extra configuration.
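
The update direction described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation or the API of the pip-installable package. It assumes a small dense, positive-definite Fisher matrix `F` (in practice a K-FAC approximation would be used), a fixed mixing weight `alpha` for the orthogonal component, and a simple damped solve for the natural-gradient step. The function names and parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def fisher_inner(u, v, F):
    """Inner product <u, v>_F = u^T F v under the Fisher metric."""
    return float(u @ F @ v)


def fop_direction(g1, g2, F, alpha=1.0, eps=1e-12):
    """Combine two sub-batch gradients into a variance-aware direction.

    alpha is a hypothetical fixed weight; the paper may choose it differently.
    """
    g_avg = 0.5 * (g1 + g2)   # the usual mean gradient
    g_diff = g1 - g2          # intra-batch gradient difference
    # Remove from g_diff its projection onto g_avg under the Fisher metric,
    # leaving the Fisher-orthogonal component.
    coef = fisher_inner(g_diff, g_avg, F) / (fisher_inner(g_avg, g_avg, F) + eps)
    g_perp = g_diff - coef * g_avg
    return g_avg + alpha * g_perp, g_avg, g_perp


def natural_gradient_step(g, F, damping=1e-3):
    """Damped natural-gradient direction: solve (F + damping * I) x = g."""
    return np.linalg.solve(F + damping * np.eye(F.shape[0]), g)


# Toy example with a random positive-definite matrix standing in for the Fisher.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
F = A @ A.T + 0.1 * np.eye(5)
g1 = rng.normal(size=5)   # gradient from sub-batch 1 (synthetic)
g2 = rng.normal(size=5)   # gradient from sub-batch 2 (synthetic)

g_comb, g_avg, g_perp = fop_direction(g1, g2, F)
ortho = fisher_inner(g_perp, g_avg, F)   # numerically close to zero
update = natural_gradient_step(g_comb, F)
```

With `alpha=0.0` the combined direction reduces to the plain average gradient, so in this sketch the orthogonal term acts as an additive correction on top of a standard natural-gradient step.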

Publication status:: Published

Peer review status:: Peer reviewed

Cite this record

APA Style

Lu, Y., & Armour, W. (2026). Beyond the mean: Fisher-orthogonal projection for natural gradient descent in large batch training. 40th AAAI Conference on Artificial Intelligence (AAAI 2026), 40(29), 24115–24123.

MLA Style

Lu, Y, and W Armour. “Beyond the Mean: Fisher-Orthogonal Projection for Natural Gradient Descent in Large Batch Training.” 40th AAAI Conference on Artificial Intelligence (AAAI 2026), vol. 40, no. 29, 2026, pp. 24115–23.

Chicago Style

Lu, Y, and W Armour. 2026. “Beyond the Mean: Fisher-Orthogonal Projection for Natural Gradient Descent in Large Batch Training.” In 40th AAAI Conference on Artificial Intelligence (AAAI 2026), 40:24115–23. Association for the Advancement of Artificial Intelligence.
Access Document

Files:: Lu_and_Armour_2026_Beyond_the_mean.pdf

(Preview, Accepted manuscript, pdf, 633.5KB, Terms of use)

Publisher copy:: 10.1609/aaai.v40i29.39590

Authors

Lu, Y

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author

Armour, W

Institution:: University of Oxford
Division:: MPLS
Department:: Engineering Science
Role:: Author
ORCID:: 0000-0003-1756-3064

Funder:: Department for Science, Innovation and Technology

Funder identifier:: https://ror.org/028z36n30
Grant:: EP/T022205/1

Publisher:: Association for the Advancement of Artificial Intelligence
Host title:: Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence
Volume:: 40
Issue:: 29
Pages:: 24115-24123
Publication date:: 2026-03-14
Event title:: 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
Event location:: Singapore
Event website:: https://aaai.org/conference/aaai/aaai-26/
Event start date:: 2026-01-20
Event end date:: 2026-01-27
DOI:: 10.1609/aaai.v40i29.39590
EISSN:: 2374-3468
ISSN:: 2159-5399
ISBN-10:: 1577359062
ISBN-13:: 9781577359067

Language:: English
Keywords:: maxima and minima, gradient descent, generalization, projection, convergence, noise, stochastic gradient descent, scaling, curvature, video, economics, relational algebra
Pubs id:: 2405333
Local pid:: pubs:2405333
Source identifiers:: W7138078850
Deposit date:: 2026-04-29
ARK identifier:: ark:/29072/ora_27d9f05f9d444f468279d396254c7a2b

Terms of use

Copyright holder:: Association for the Advancement of Artificial Intelligence (www.aaai.org)
Notes:: The author accepted manuscript (AAM) of this paper has been made available under the University of Oxford's Open Access Publications Policy, and a CC BY public copyright licence has been applied.

Licence:: CC Attribution (CC BY)
