Generalisation and optimisation in neural networks

Mingard, C

Abstract: The goal of this thesis is to contribute to our understanding of generalisation and optimisation in neural networks. We ask the following questions:
1. Why can highly expressive neural networks learn functions that generalise?
2. Optimiser hyperparameters can significantly affect generalisation. Why? How can we derive optimisers in a principled manner to maximise performance?

In the first part, we introduce a discrete fully-connected network (DFCN) model that offers useful insights for both questions. We prove a one-to-one correspondence between DFCN architectures and Disjunctive Normal Form (DNF) Boolean expressions. This yields an interpretable complexity measure, K_DNF(f) (the length of the shortest DNF expression for f), which maps to the network's minimum weight norm. We show that the prior over functions, P(f), exponentially favours functions with low K_DNF(f). Consequently, low-complexity functions are learnable, while high-K_DNF(f) functions are not. Finally, we show that weight decay enhances this simplicity bias, acting as a penalty on K_DNF(f) that promotes learning minimal DNF representations and significantly improves generalisation.
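To make the idea of a shortest-DNF complexity measure concrete, the following is a minimal brute-force sketch for small Boolean functions. It is not the thesis's construction: the function names are hypothetical, and it assumes that "length" means the total number of literals in the DNF, which the abstract does not specify. It enumerates prime implicants and searches every subset of them for the cheapest cover, so it is only practical for a handful of input variables.

```python
from itertools import combinations, product


def prime_implicants(f, n):
    """Return the prime implicants of f as (term, covered_inputs) pairs.

    A term is a tuple over {1, 0, None}: 1 = variable, 0 = negated variable,
    None = variable absent from the conjunction.
    """
    inputs = list(product((0, 1), repeat=n))
    implicants = []
    for term in product((0, 1, None), repeat=n):
        covered = frozenset(
            x for x in inputs
            if all(t is None or t == xi for t, xi in zip(term, x))
        )
        # Keep the term only if f is true everywhere the term is true.
        if all(f(x) for x in covered):
            implicants.append((term, covered))
    # A prime implicant is not strictly contained in any other implicant.
    return [(t, c) for t, c in implicants
            if not any(c < other for _, other in implicants)]


def min_dnf_literals(f, n):
    """Smallest total literal count of a DNF computing f (assumed measure)."""
    target = frozenset(x for x in product((0, 1), repeat=n) if f(x))
    primes = prime_implicants(f, n)
    best = None
    for k in range(len(primes) + 1):
        for combo in combinations(primes, k):
            covered = frozenset().union(*(c for _, c in combo))
            if covered == target:
                cost = sum(sum(t is not None for t in term)
                           for term, _ in combo)
                best = cost if best is None else min(best, cost)
    return best
```

Under this assumed measure, a function such as three-input parity needs every minterm written out in full, while three-input majority is covered by three two-literal terms, which illustrates the gap between high- and low-complexity functions.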

In the second part, we focus on Question 1 by studying the inductive bias towards simple functions that is inherent at initialisation. While this bias is explicit in Bayesian posteriors, we demonstrate empirically that our predictions extend to networks trained with standard optimisers. We analyse the prior of neural networks on Boolean data. We find that the prior, P(f) ≲ 2⁻ᵃᴷ⁽ᶠ⁾, can be controllably weakened by tuning the initial weight variance, σ_w, to move the network from an ordered (a = 1) to a chaotic (a < 1) regime. The latter leads to poor generalisation, as the prior cannot counteract the functional growth. We then reveal that this architectural bias follows a universal pattern: the prior probability of a function adheres to Zipf's Law (P(f) ∝ R(f)⁻¹). Finally, we prove that this Zipfian prior is a necessary condition for efficient learning.
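One way to probe a prior over functions empirically is to sample many randomly initialised networks, record the Boolean function each computes, and rank the functions by frequency. The sketch below illustrates that procedure and is not the thesis's experimental setup. The architecture (one hidden tanh layer), the width, the bias scale `sigma_b`, the use of `sigma_w` as a standard-deviation scale divided by the square root of the fan-in, and all function names are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter
from itertools import product


def sample_function(inputs, width, sigma_w, sigma_b, rng):
    """Draw one random one-hidden-layer tanh network and return the Boolean
    function it computes on `inputs`, encoded as a bit-string."""
    n = len(inputs[0])
    w1 = [[rng.gauss(0.0, sigma_w / math.sqrt(n)) for _ in range(n)]
          for _ in range(width)]
    b1 = [rng.gauss(0.0, sigma_b) for _ in range(width)]
    w2 = [rng.gauss(0.0, sigma_w / math.sqrt(width)) for _ in range(width)]
    b2 = rng.gauss(0.0, sigma_b)
    bits = []
    for x in inputs:
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(w1, b1)]
        out = sum(w * h for w, h in zip(w2, hidden)) + b2
        bits.append("1" if out > 0 else "0")  # threshold the output at zero
    return "".join(bits)


def empirical_prior(n_inputs=3, width=8, sigma_w=1.0, sigma_b=0.1,
                    n_samples=5000, seed=0):
    """Estimate P(f) by sampling; returns (function, probability) pairs
    sorted from most to least frequent."""
    rng = random.Random(seed)
    inputs = list(product((0.0, 1.0), repeat=n_inputs))
    counts = Counter(
        sample_function(inputs, width, sigma_w, sigma_b, rng)
        for _ in range(n_samples)
    )
    return [(f, c / n_samples) for f, c in counts.most_common()]


if __name__ == "__main__":
    # Under a Zipfian prior, rank * P(f) would stay roughly constant.
    for rank, (f, p) in enumerate(empirical_prior()[:10], start=1):
        print(rank, f, round(p, 4), "rank * P(f) =", round(rank * p, 4))
```

Rerunning with a larger `sigma_w` gives a rough way to see how the weight scale changes the shape of the sampled distribution, although a network this small can only hint at the ordered and chaotic regimes described above.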

Finally, in the third part, we focus on Question 2. We show that for wide networks, stochastic optimisers approximate Bayesian inference, and explore when this breaks down, leading to feature learning. We then introduce a framework to quantify feature learning, analysing how optimisers and architectures impact learned representations. We derive a new optimiser from first principles by extending mirror descent to incorporate neural architecture, yielding Automatic Gradient Descent (AGD): a first-order, hyperparameter-free optimiser that trains networks at ImageNet scale and provides a foundation for new architecture-aware algorithms.

APA Style

Mingard, C. (2025). Generalisation and optimisation in neural networks [PhD thesis]. University of Oxford.
