Thesis
How machine learning models encode knowledge – and what we can learn from them
- Abstract:
As machine learning systems become increasingly capable, often surpassing human performance, they present both a challenge and an opportunity for understanding. These complex systems may operate in ways that are distinct from human reasoning, making them difficult to interpret. Yet they also hold the potential to reveal new knowledge and support critical decisions in domains such as healthcare, science, and education.
This thesis examines interpretability as a means of understanding and learning from machine learning models. We focus on two central goals: (1) to understand what knowledge is encoded in machine learning models and how it is encoded, and (2) to extract that knowledge in a form that is meaningful and accessible to humans. To this end, we propose a structured pipeline with four stages: identifying explanation desiderata, locating encoded knowledge or concepts, verifying that the concepts influence model behaviour, and translating the concepts into a human-interpretable form. Within this framework, we adapt existing methods where appropriate and develop new ones where necessary, selecting the approach best suited to the task. Our focus is not only on methodological development but also on understanding how these methods behave and the assumptions they make.
Different parts of the thesis contribute to each stage of this pipeline. We begin by investigating how models encode knowledge: first by analysing the linear representation hypothesis, and then by examining the universality of concept representations in multilingual language models. We then shift to the user-facing side of interpretability, studying how to make explanations more user-friendly by leveraging uncertainty to generate realistic and unambiguous explanations. Finally, we apply the full pipeline to develop a framework for extracting novel concepts from AlphaZero and teaching them to chess experts. This final study illustrates how interpretability can help bridge the gap between artificial and human understanding. Together, these contributions advance our understanding of, and ability to learn from, machine learning systems, laying the groundwork for future research at the intersection of artificial intelligence and human insight.
Contributors
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Computer Science
- Role:
- Supervisor
- ORCID:
- 0000-0002-2733-2078
- Funder identifier:
- https://ror.org/0439y7842
- Grant:
- EP/S024050/1
- Programme:
- EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Deposit date:
- 2025-12-26
Terms of use
- Copyright holder:
- Lisa Miou Antoinette Schut
- Copyright date:
- 2025