Thesis
Quantifying and mitigating selection bias in probability and nonprobability samples
- Abstract:
The field of survey research has undergone a dramatic transformation in the last decade, driven by declining response rates, an increasingly fragmented contactability landscape, the rise of online nonprobability surveys, and, simultaneously, increased reliance on data in decision-making. These dynamics have changed the patterns of selection bias in surveys, that is, the systematic exclusion or differential participation of population subgroups, contributing to high-profile failures of polls to predict election outcomes and to declining trust in survey research. Across four distinct papers, this thesis develops novel approaches for preventing, quantifying, addressing, and communicating selection bias.
Selection bias can be introduced at any point in the data collection process, from the design of data collection through to the analysis and presentation of results. The methods introduced in this thesis address selection bias throughout this process and aim to shift the focus from reactive adjustment to proactive prevention and communication.
The first paper demonstrates the mathematical inefficiency of compensating for poor data quality by increasing data quantity. Analyzing three prominent COVID-19 vaccination surveys using Meng's data defect correlation framework, we show that a survey with 250,000 weekly responses can be no more accurate than a high-quality survey with 1,000 responses, representing a 99.99% reduction in bias-adjusted effective sample size. Selection bias introduced early in data collection dominates overall error due to amplification by large population sizes.
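For readers unfamiliar with Meng's framework, the intuition behind this result can be sketched with the central identity from Meng (2018); the notation below follows that paper rather than the thesis itself, and the effective sample size shown is the standard approximation that ignores the finite-population correction.

```latex
% Meng's (2018) decomposition of the error of a realized sample mean \bar{Y}_n
% relative to the population mean \bar{Y}_N, with response indicator R,
% outcome Y, and sampling fraction f = n/N:
\[
  \bar{Y}_n - \bar{Y}_N
    \;=\; \underbrace{\rho_{R,Y}}_{\text{data defect correlation}}
    \;\times\; \underbrace{\sqrt{\frac{1-f}{f}}}_{\text{data quantity}}
    \;\times\; \underbrace{\sigma_Y}_{\text{problem difficulty}}
\]
% Matching the resulting mean squared error to the variance of a simple random
% sample gives the bias-adjusted effective sample size:
\[
  n_{\mathrm{eff}} \;\approx\; \frac{f}{1-f}\,\frac{1}{E_R\!\left[\rho_{R,Y}^2\right]}
\]
% Because f = n/N is tiny for national populations, even a small defect
% correlation collapses n_eff, which is why adding more responses cannot
% compensate for selection bias.
```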
The second paper introduces Active Learning Sampling Design (ALSD), which leverages tools from the active learning and Bayesian optimization literatures to design samples that adapt to observed heterogeneous nonresponse patterns. We demonstrate improved accuracy both in simulations and in a real-world implementation with the World Food Programme in Zimbabwe.
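The sketch below is a generic illustration of the active-learning idea that ALSD builds on, not the ALSD procedure itself: a surrogate model (here a Gaussian process from scikit-learn) is fitted to observed response rates across strata, and the next sampling wave is directed to the stratum where the model is most uncertain. The strata, covariate, and acquisition rule are all hypothetical.

```python
# Generic active-learning-style sample allocation (an illustration only, not
# the thesis's ALSD algorithm): fit a surrogate model of response rates over
# strata, then allocate the next wave where predictive uncertainty is highest.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Hypothetical strata indexed by a single covariate (e.g. an urbanicity score).
strata = np.linspace(0, 1, 20).reshape(-1, 1)

def observe_response_rate(x):
    """Simulated field observation: response rates decline with the covariate."""
    return 0.6 - 0.4 * x + rng.normal(0, 0.05)

# Pilot wave: observe response rates in a handful of randomly chosen strata.
pilot = rng.choice(len(strata), size=5, replace=False)
X_obs = strata[pilot]
y_obs = np.array([observe_response_rate(x[0]) for x in X_obs])

gp = GaussianProcessRegressor(kernel=RBF(0.2) + WhiteKernel(0.01), normalize_y=True)

for wave in range(3):
    gp.fit(X_obs, y_obs)
    _, sd = gp.predict(strata, return_std=True)

    # Acquisition rule (illustrative): target the stratum whose response-rate
    # estimate is currently most uncertain.
    nxt = int(np.argmax(sd))
    print(f"wave {wave}: sample stratum {nxt} (predictive sd {sd[nxt]:.3f})")

    # Fold the new observation back into the training data.
    X_obs = np.vstack([X_obs, strata[nxt:nxt + 1]])
    y_obs = np.append(y_obs, observe_response_rate(strata[nxt, 0]))
```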
The third paper develops leverage, a metric quantifying the sensitivity of survey estimates to uncertainty in the population benchmarks used for adjustment (e.g. weighting). Most survey analyses treat population targets as known with certainty, but these targets are often themselves estimated from other surveys and carry their own uncertainty. Leverage provides tools for assessing this commonly overlooked source of uncertainty and for constructing more realistic confidence intervals for weighted estimates.
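As a rough illustration of the underlying problem (not of the leverage metric itself, whose definition is given in the paper), the short Monte Carlo sketch below shows how uncertainty in a post-stratification benchmark widens the interval around a weighted estimate; all numbers are hypothetical.

```python
# Monte Carlo illustration of benchmark uncertainty propagating into a weighted
# estimate (hypothetical numbers; this is not the thesis's leverage metric).
import numpy as np

rng = np.random.default_rng(2)

n = np.array([600, 400])          # respondents in two strata (e.g. degree / no degree)
ybar = np.array([0.30, 0.55])     # stratum means of the survey outcome
se_ybar = np.sqrt(ybar * (1 - ybar) / n)

benchmark = 0.35                  # estimated population share of stratum 1
benchmark_se = 0.03               # its standard error, e.g. from another survey

draws_fixed, draws_full = [], []
for _ in range(5_000):
    means = rng.normal(ybar, se_ybar)
    # Usual practice: treat the benchmark as a known constant.
    draws_fixed.append(benchmark * means[0] + (1 - benchmark) * means[1])
    # Also propagate the benchmark's own sampling uncertainty.
    share = rng.normal(benchmark, benchmark_se)
    draws_full.append(share * means[0] + (1 - share) * means[1])

for label, d in [("benchmark fixed", draws_fixed), ("benchmark uncertain", draws_full)]:
    lo, hi = np.quantile(d, [0.025, 0.975])
    print(f"{label:>20}: 95% interval ({lo:.3f}, {hi:.3f}), width {hi - lo:.3f}")
```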
The final paper applies structural causal models (SCMs) to selection bias in the UK Biobank neuroimaging cohort, comparing traditional and novel adjustment methods, including a weighting approach based on Bayesian Additive Regression Trees (BART). Through simulations and applications, we demonstrate the importance of adjusting for selection bias even in very large datasets.
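The paper's weighting approach fits a selection (propensity) model with BART; the sketch below illustrates the general inverse-probability-of-selection weighting idea on simulated data, substituting scikit-learn's gradient boosting for BART since no specific BART implementation is referenced here. All variable names and data are hypothetical.

```python
# Inverse-probability-of-selection weighting on simulated data (gradient
# boosting stands in for the BART selection model used in the paper).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)

# Hypothetical cohort: covariates X observed for everyone, `selected` flags the
# imaging sub-cohort, and the outcome y is analysed only among the selected.
n = 20_000
X = rng.normal(size=(n, 3))
p_select = 1 / (1 + np.exp(-(-1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1])))
selected = rng.random(n) < p_select
y = 2.0 + 1.5 * X[:, 0] + rng.normal(size=n)   # depends on X[:, 0], so selection biases it

# Fit the selection model on the full cohort, then weight selected individuals
# by the inverse of their estimated selection probability.
model = GradientBoostingClassifier().fit(X, selected)
p_hat = np.clip(model.predict_proba(X)[:, 1], 0.01, 0.99)
weights = 1.0 / p_hat[selected]

naive = y[selected].mean()
weighted = np.average(y[selected], weights=weights)
print(f"naive mean {naive:.2f} vs selection-weighted mean {weighted:.2f} "
      f"(full-cohort mean {y.mean():.2f})")
```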
Authors
Contributors
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Computer Science
- Role:
- Supervisor
- Role:
- Supervisor
- Institution:
- University of Oxford
- Division:
- MPLS
- Department:
- Statistics
- Role:
- Examiner
- Role:
- Examiner
- Funder identifier:
- https://ror.org/0439y7842
- Grant:
- EP/L016710/1
- Programme:
- EPSRC and MRC Centre for Doctoral Training in Next Generation Statistical Science: The Oxford-Warwick Statistics Programme
- DOI:
- Type of award:
- DPhil
- Level of award:
- Doctoral
- Awarding institution:
- University of Oxford
- Language:
- English
- Keywords:
- Subjects:
- Deposit date:
- 2025-06-30
Terms of use
- Copyright holder:
- Valerie C Bradley
- Copyright date:
- 2023