Thesis

How machine learning models encode knowledge – and what we can learn from them

Abstract:

As machine learning systems become increasingly capable, often surpassing human performance, they present both a challenge and an opportunity for understanding. These complex systems may operate in ways distinct from human reasoning, making them difficult to interpret. Yet they also hold the potential to reveal new knowledge and support critical decisions in domains such as healthcare, science, and education.

This thesis examines interpretability as a means of understanding and learning from machine learning models. We focus on two central goals: (1) to understand what knowledge is encoded in machine learning models and how it is encoded, and (2) to extract that knowledge in a form that is meaningful and accessible to humans. To this end, we propose a structured pipeline with four stages: identifying explanation desiderata, locating encoded knowledge or concepts, verifying that the concepts influence model behaviour, and translating the concepts into a human-interpretable form. Within this framework, we adapt existing methods where appropriate and develop new ones where necessary, selecting the approach best suited to the task. Our focus is not only on methodological development but also on understanding how these methods behave and the assumptions they make.

Different parts of the thesis contribute to each stage of this pipeline. We begin by investigating how models encode knowledge: first by analysing the linear representation hypothesis, and then by examining the universality of concept representations in multilingual language models. We then shift to the user-facing side of interpretability, studying how to make explanations more user-friendly by leveraging uncertainty to generate realistic and unambiguous explanations. Finally, we apply the full pipeline to develop a framework for extracting novel concepts from AlphaZero and teaching them to chess experts. This final study illustrates how interpretability can help bridge the gap between artificial and human understanding. Together, these contributions advance both our understanding of machine learning systems and our ability to learn from them, laying the groundwork for future research at the intersection of artificial intelligence and human insight.

Authors

Institution:
University of Oxford
Division:
MPLS
Department:
Computer Science
Role:
Author

Contributors

Institution:
University of Oxford
Division:
MPLS
Department:
Computer Science
Role:
Supervisor
ORCID:
0000-0002-2733-2078


Funder identifier:
https://ror.org/0439y7842
Grant:
EP/S024050/1
Programme:
EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems


Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford


Language:
English
Deposit date:
2025-12-26
