Conference item
Beyond the mean: Fisher-orthogonal projection for natural gradient descent in large batch training
- Abstract:
-
Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such large batch sizes. As batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods like the natural gradient with Kronecker-Factored Approximate Curvature (K-FAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively "washes out" the curvature information that gives these methods their advantage, reducing their performance to that of simple gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of the second-order method at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction by leveraging gradients from two sub-batches, enhancing the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher metric. Through extensive benchmarks, we show that FOP accelerates convergence by ×1.2–1.3 over K-FAC and ×1.5–1.7 over SGD/AdamW at the same moderate batch sizes, while at extreme scales it achieves up to a ×7.5 speedup. Unlike other methods, FOP maintains small-batch accuracy when scaling to extremely large batch sizes. Moreover, it reduces Top-1 error by 2.3–3.3% on long-tailed CIFAR benchmarks, demonstrating robust generalization under severe class imbalance. Our lightweight, geometry-aware use of intra-batch variance makes natural-gradient optimization practical on modern data-centre GPUs.
FOP is open-source and pip-installable, and it can be integrated into existing training code with a single line and no extra configuration.
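The update direction described in the abstract (the average of two sub-batch gradients, plus the part of their difference that is orthogonal to that average under the Fisher metric) can be illustrated with a minimal sketch. This is not the authors' implementation or the API of the released package. The function names, the mixing weight `alpha`, the stabilizer `eps`, and the use of a diagonal Fisher approximation in place of K-FAC are all assumptions made for illustration.

```python
def fisher_dot(u, v, fisher_diag):
    """Inner product u^T F v for a diagonal Fisher approximation (assumption)."""
    return sum(f * a * b for f, a, b in zip(fisher_diag, u, v))


def fop_direction(g1, g2, fisher_diag, alpha=1.0, eps=1e-12):
    """Combine two sub-batch gradients into one variance-aware direction.

    g1, g2      : gradients from the two sub-batches (lists of floats)
    fisher_diag : diagonal Fisher estimate (hypothetical stand-in for K-FAC)
    alpha       : hypothetical weight on the orthogonal component
    eps         : small constant guarding against division by zero
    """
    g_avg = [(a + b) / 2 for a, b in zip(g1, g2)]
    g_diff = [a - b for a, b in zip(g1, g2)]

    # Project the difference onto the average under the Fisher metric,
    # then subtract that projection to keep only the orthogonal part.
    coef = fisher_dot(g_diff, g_avg, fisher_diag) / (
        fisher_dot(g_avg, g_avg, fisher_diag) + eps
    )
    g_perp = [d - coef * a for d, a in zip(g_diff, g_avg)]

    combined = [a + alpha * p for a, p in zip(g_avg, g_perp)]
    return combined, g_avg, g_perp


if __name__ == "__main__":
    g1 = [1.0, 2.0, 0.5]
    g2 = [0.5, 1.0, 1.5]
    fisher_diag = [2.0, 1.0, 0.5]
    combined, g_avg, g_perp = fop_direction(g1, g2, fisher_diag)
    # The orthogonal component has (numerically) zero Fisher inner
    # product with the average gradient.
    print(abs(fisher_dot(g_perp, g_avg, fisher_diag)) < 1e-9)
```

In a natural-gradient setting the combined direction would then be preconditioned by the damped inverse Fisher; that step is omitted here because the abstract does not specify it.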
- Publication status:
- Published
- Peer review status:
- Peer reviewed
Access Document
- Files:
-
-
(Preview, Accepted manuscript, pdf, 633.5KB, Terms of use)
-
- Publisher copy:
- 10.1609/aaai.v40i29.39590
Authors
- Funder identifier:
- https://ror.org/028z36n30
- Grant:
- EP/T022205/1
- Publisher:
- Association for the Advancement of Artificial Intelligence
- Host title:
- Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence
- Volume:
- 40
- Issue:
- 29
- Pages:
- 24115-24123
- Publication date:
- 2026-03-14
- Event title:
- 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
- Event location:
- Singapore
- Event website:
- https://aaai.org/conference/aaai/aaai-26/
- Event start date:
- 2026-01-20
- Event end date:
- 2026-01-27
- DOI:
- EISSN:
-
2374-3468
- ISSN:
-
2159-5399
- ISBN-10:
- 1577359062
- ISBN-13:
- 9781577359067
- Language:
-
English
- Keywords:
- Pubs id:
-
2405333
- Local pid:
-
pubs:2405333
- Source identifiers:
-
W7138078850
- Deposit date:
-
2026-04-29
- ARK identifier:
Terms of use
- Copyright holder:
- Association for the Advancement of Artificial Intelligence (www.aaai.org)
- Copyright date:
- 2026
- Rights statement:
- Copyright © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
- Notes:
- The author accepted manuscript (AAM) of this paper has been made available under the University of Oxford's Open Access Publications Policy, and a CC BY public copyright licence has been applied.
- Licence:
- CC Attribution (CC BY)