Thesis

Generalisation and optimisation in neural networks

Abstract:
The goal of this thesis is to contribute to our understanding of generalisation and optimisation in neural networks. We ask the following questions:
1. Why can highly expressive neural networks learn functions that generalise?
2. Optimiser hyperparameters can significantly affect generalisation. Why? How can we derive optimisers in a principled manner to maximise performance?

In the first part, we introduce a discrete fully-connected network (DFCN) model that offers useful insights for both questions. We prove a one-to-one correspondence between DFCN architectures and Disjunctive Normal Form (DNF) boolean expressions. This yields an interpretable complexity measure, KDNF(f) (shortest DNF length), which maps to the network's minimum weight norm. We show that the prior over functions, P(f), exponentially favours functions with low KDNF(f). Consequently, low-complexity functions are learnable, while high-KDNF(f) functions are not. Finally, we show that weight decay enhances this simplicity bias, acting as a penalty on KDNF(f) that promotes learning minimal DNF representations and significantly improves generalisation.
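To make the idea of a shortest-DNF complexity measure concrete, the following is a minimal sketch that finds, by brute force, the smallest number of conjunctive terms needed to express a small Boolean function in DNF. It is an illustration only, not the thesis's construction: measuring "length" as the number of terms is an assumption (the thesis may count literals or use another definition), and the names `table`, `implicants` and `shortest_dnf_terms` are hypothetical. The search is exponential in the number of variables, so it is only practical for very small n.

```python
from itertools import combinations, product


def table(fn, n):
    """Build the truth table of fn over all 2**n Boolean inputs."""
    return {x: bool(fn(x)) for x in product((0, 1), repeat=n)}


def implicants(truth, n):
    """Yield (term, covered) for every conjunction that is true only where f is true.

    A term is a tuple over {0, 1, None}; None means the variable does not
    appear, 1 means it appears unnegated, and 0 means it appears negated.
    """
    inputs = list(product((0, 1), repeat=n))
    for term in product((0, 1, None), repeat=n):
        covered = frozenset(
            x for x in inputs
            if all(t is None or t == xi for t, xi in zip(term, x))
        )
        if all(truth[x] for x in covered):
            yield term, covered


def shortest_dnf_terms(truth, n):
    """Return the minimum number of terms in any DNF for the function.

    Assumption: DNF length is measured as the number of terms.
    """
    on_set = frozenset(x for x, value in truth.items() if value)
    if not on_set:
        return 0  # the constant-false function is the empty disjunction
    candidates = list(implicants(truth, n))
    # The on-set's minterms always form a valid DNF, so k never exceeds len(on_set).
    for k in range(1, len(on_set) + 1):
        for combo in combinations(candidates, k):
            if frozenset().union(*(cov for _, cov in combo)) == on_set:
                return k


if __name__ == "__main__":
    n = 3
    print(shortest_dnf_terms(table(lambda x: x[0], n), n))            # single variable
    print(shortest_dnf_terms(table(lambda x: sum(x) % 2, n), n))      # parity
    print(shortest_dnf_terms(table(lambda x: sum(x) >= 2, n), n))     # majority
```

Under this term-count convention, a function that depends on a single variable needs one term, whereas three-bit parity needs four because no two of its true inputs can share a term. This is the sense in which parity-like functions are "high complexity" in a DNF-based measure.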

In the second part, we focus on Question 1 by studying the inductive bias towards simple functions that is inherent at initialisation. While this bias is explicit in Bayesian posteriors, we demonstrate empirically that our predictions extend to networks trained with standard optimisers. We analyse the prior of neural networks on Boolean data. We find that the prior, P(f) ≲ 2⁻ᵃᴷ⁽ᶠ⁾, can be controllably weakened by tuning the initial weight variance, σw, to move the network from an ordered (a = 1) to a chaotic (a < 1) regime. The latter leads to poor generalisation because the prior can no longer counteract the growth in the number of functions as complexity increases. We then reveal that this architectural bias follows a universal pattern: the prior probability of a function adheres to Zipf's Law (P(f) ∝ R(f)⁻¹), where R(f) is the rank of f when functions are ordered by decreasing probability. We also prove that this Zipfian prior is a necessary condition for efficient learning.
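The notion of a prior over functions, P(f), can be illustrated by sampling random networks at initialisation and recording which Boolean function each one computes. The sketch below is a toy version under stated assumptions and does not reproduce the thesis's experiments: the two-layer tanh architecture, the width, the unit-variance biases, the 1/sqrt(fan-in) weight scaling and the names `sample_function` and `empirical_prior` are all illustrative choices rather than details taken from the thesis.

```python
import math
import random
from collections import Counter
from itertools import product


def sample_function(n, width, sigma_w, rng):
    """Draw one random two-layer tanh network and return the Boolean function
    it computes on all 2**n inputs, encoded as a string of 0/1 outputs."""
    # Assumption: Gaussian weights with standard deviation sigma_w / sqrt(fan_in).
    w1 = [[rng.gauss(0.0, sigma_w / math.sqrt(n)) for _ in range(n)]
          for _ in range(width)]
    b1 = [rng.gauss(0.0, 1.0) for _ in range(width)]
    w2 = [rng.gauss(0.0, sigma_w / math.sqrt(width)) for _ in range(width)]
    b2 = rng.gauss(0.0, 1.0)
    bits = []
    for x in product((0, 1), repeat=n):
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(w1, b1)]
        out = sum(w * h for w, h in zip(w2, hidden)) + b2
        bits.append("1" if out > 0 else "0")  # threshold the output at zero
    return "".join(bits)


def empirical_prior(n=3, width=16, sigma_w=1.0, samples=2000, seed=0):
    """Estimate P(f) by sampling; returns (function, probability) pairs
    sorted from most to least frequent, so the list index + 1 is the rank R(f)."""
    rng = random.Random(seed)
    counts = Counter(sample_function(n, width, sigma_w, rng)
                     for _ in range(samples))
    return [(f, c / samples) for f, c in counts.most_common()]


if __name__ == "__main__":
    for rank, (f, p) in enumerate(empirical_prior()[:10], start=1):
        print(rank, f, round(p, 4))
```

Plotting log probability against log rank for the returned list is one way to check how closely a given setup follows a Zipf-like relation, and varying `sigma_w` gives a rough way to explore how the weight scale changes the shape of the prior. Whether this toy setup shows the behaviour described above is not guaranteed.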

Finally, in the third part, we focus on Question 2. We show that for wide networks, stochastic optimisers approximate Bayesian inference, and explore when this breaks down, leading to feature learning. We then introduce a framework to quantify feature learning, analysing how optimisers and architectures impact learned representations. We derive a new optimiser from first principles by extending mirror descent to incorporate neural architecture, yielding Automatic Gradient Descent (AGD): a first-order, hyperparameter-free optimiser that trains networks at ImageNet scale and provides a foundation for new architecture-aware algorithms.

Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Chemistry
Sub department:
Sub-Department of Physical and Theoretical Chemistry
Oxford college:
Queen's College
Role:
Author

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Physics
Sub department:
Theoretical Physics
Role:
Supervisor
ORCID:
0000-0002-8438-910X
Role:
Examiner
Institution:
University of Oxford
Division:
MPLS
Department:
Physics
Sub department:
Theoretical Physics
Role:
Examiner


Funder identifier:
https://ror.org/0439y7842
Funding agency for:
Mingard, C
Grant:
EP/S513842/1
Programme:
iCASE


DOI:
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford

