Identity-by-descent detection across 487,409 British samples reveals fine-scale population structure, evolutionary history, and trait associations

Detection of Identical-By-Descent (IBD) segments provides a fundamental measure of genetic relatedness and plays a key role in a wide range of genomic analyses. We developed a new method, called FastSMC, that enables accurate biobank-scale detection of IBD segments transmitted by common ancestors living up to several hundreds of generations in the past. FastSMC combines a fast heuristic search for IBD segments with accurate coalescent-based likelihood calculations and enables estimating the age of common ancestors transmitting IBD regions. We applied FastSMC to 487,409 phased samples from the UK Biobank and detected the presence of ∼214 billion IBD segments transmitted by shared ancestors within the past 1,500 years. We quantified time-dependent shared ancestry within and across 120 postcodes, obtaining a fine-grained picture of genetic relatedness within the past two millennia in the UK. Sharing of common ancestors strongly correlates with geographic distance, enabling the localization of a sample’s birth coordinates from genomic data. We sought evidence of recent positive selection by identifying loci with unusually strong shared ancestry within recent millennia and we detected 12 genome-wide significant signals, including 7 novel loci. We found IBD sharing to be highly predictive of the sharing of ultra-rare variants in exome sequencing samples from the UK Biobank. Focusing on loss-of-function variation discovered using exome sequencing, we devised an IBD-based association test and detected 29 associations with 7 blood-related traits, 20 of which were not detected in the exome sequencing study. These results underscore the importance of modelling distant relatedness to reveal subtle population structure, recent evolutionary history, and rare pathogenic variation.

within the recall range where all methods are able to provide predictions. B (respectively D). Running time (CPU seconds) using chromosome 20 of the UK Biobank for IBD segments detection within the past 50 generations (respectively 100 generations). Only one thread was used for each method and running time trend lines in logarithmic scale are shown, reflecting differences in the quadratic components of each algorithm. Parameters for all methods were optimized to maximize accuracy and used for both accuracy and running time benchmarking (details in Methods). E. Absolute median error of the maximum likelihood age estimate of GERMLINE in blue (based on segments length only) and the MAP age estimate of FastSMC (no filtering on the IBD quality score in orange, and a minimum IBD quality score of 0.01 in dark red). Only IBD segments longer than the minimum length represented on the X-axis were considered. We ran both algorithms with the same minimum length of 0.001 cM and other parameters from grid search results for a time threshold of 50 generations (see Methods). Error bars correspond to standard error over 10 simulations.

45
Overview of the FastSMC method. The algorithm we developed, called FastSMC, detects IBD segments using a two-step 46 procedure. In the first step (identification), FastSMC uses genotype hashing to rapidly identify IBD candidate segments, which enables us to scale to very large datasets. In the second step (verification), each candidate segment is tested using a coalescent 48 hidden Markov model (HMM), which enables us to improve accuracy, compute the posterior probability that the segment is 49 IBD (IBD quality score), and provide an estimate for the TMRCA in the genomic region.  fraction of genome covered by IBD segments, we observed that 93% of individuals in the cohort have more than 90% of their 126 genome covered by at least one IBD segment in the past 50 generations. In contrast, only ∼4% of individuals have more than 127 90% of their genome covered by at least one IBD segment in the past 10 generations ( Supplementary Fig. 8 A). Looking for geographic patterns, we noticed that, despite the large sample size of the UK Biobank cohort, the average fraction of genome A B IBD sharing (cM) to closest individual in the UKBB 5 th degree cousin 4 th degree cousin 3 rd degree cousin 2 nd degree cousin 1 st degree cousin Median distance (km) to closest individual IBD sharing (cM) to closest individual in the UKBB Individuals (%) genome (in cM) on the X-axis, we report the median distance (km, computed every 10 cM) for all pairs of (sample, closest individual) who shared at least x. Vertical dashed lines indicate the expected value of the total IBD sharing for k-th degree cousins, computed using 2G(1/2) 2(k+1) , where G = 7247.14 is the total diploid genome size (in cM) and k represents the degree of cousin relationship (e.g. k = 2 for second degree cousins, separated by 2(k + 1) generations) (10).
for Y-coordinates, 95% CI=[0.41,0.45], and r=0.31, 95% CI=[0.30,0.33], for X-coordinates). To further dissect the connection 166 between genetics and geography in the UK Biobank, we measured the physical distance between individuals as a function of 167 their predicted genetic and genealogical relationship. We computed the fraction of individuals who find at least one close genetic 168 relative within the UK Biobank dataset, as shown in Fig. 3 A. We observed that almost all individuals (99.8%) find a genetic 169 relative with IBD sharing equivalent to a 5th degree cousin (3.5 cM) or closer relationship, with 64.6% of samples finding 170 a putative 3rd degree cousin (56.6 cM) or closer relative. Furthermore, stronger genetic ties translate into greater proximity 171 of birth locations as shown in Fig. 3 B. For instance, for individuals sharing a fraction of genome equivalent to 3rd degree 172 cousin or closer, the median distance between birth locations is 17 km. Very close genetic relationships are also pervasive in 173 the dataset: about one in four individuals (23.4%) has a relative with genetic sharing equivalent to a 2nd degree cousin (226.5 174 cM) or closer; for these samples the median distance between birth locations is only 5 km. Additional details are shown in 175 Supplementary Fig. 13. These findings provide empirical support to recent hypotheses that extensive segment sharing within 176 genealogical databases may be used to recover the genotypes of target individuals (30), or to re-identify individuals through 177 long-range familial searches (31).

178
Analyzing broader patterns of IBD sharing, we found that individuals living in the North of the country (corresponding to 179 Scotland and North of England) share more common ancestry than in the South, and that more generally regions within Scot-180 land, England, and Wales tend to cluster with other regions within the same country. We estimated the effective population 181 size from 300 years ago within each postcode (see Methods) and detected significant correlation (r = 0. and Supplementary Fig. 15). Large values of the DRC 50 statistic are found at loci where a large number of individuals descend 197 from a small number of recent common ancestors, a pattern that is likely to reflect the rapid increase in frequency of a beneficial 198 allele due to recent positive selection. Although, as expected, the DRC 50 statistic computed in this analysis is strongly correlated 199 (r = 0.67) with the DRC 150 statistic that was computed using fewer samples from a previous UK Biobank data release in (22), 200 the DRC 50 statistic reflects more recent coalescence events than the DRC 150 statistic, and thus more specifically reflects natural 201 selection occurring within recent centuries.

213
IBD sharing reflects sharing of ultra-rare trait-associated variation. Individuals who co-inherit a genomic region IBD 214 from a recent common ancestor are also expected to have identical genomic sequences within that region, with the exception 215 of de-novo mutations and other variants introduced by e.g. non-crossover gene conversion events in the generations leading to 216 their recent common ancestor (44). We thus expect the sharing of IBD segments to be strongly correlated to the sharing of 217 ultra-rare genomic variants (MAF< 0.0001), which tend to be very recent in origin and are usually co-inherited through recent 218 ancestors who carried such variants. We verified this by testing for correlation between the sharing of IBD segments at various 219 time scales and the sharing of rare variants for the ∼50k individuals included in the UK Biobank's initial exome sequencing 220 data release (45) ( Supplementary Fig. 16 A). Specifically, we analyzed mutations that are carried by N out of 99, 920 exome-221 sequenced haploid genomes (for 2 < N < 500), which we refer to as F N mutations (46, 47). We compared the sharing of F N 222 mutations to the sharing of IBD segments in the past 10 generations within all postcodes (Fig. 5 A). We found that there is 223 indeed a strong correlation between the per-postcode sharing of ultra-rare variants and the per-postcode sharing of ancestors IBD segments they shared with individuals in the non-sequenced cohort (we refer to these as putative "LoF-segments", though 237 they will also include sharing of the non-LoF haplotypes because the phase of the LoF is unknown). Note that the majority 238 of these variants are singletons or doubletons (45) and would be excluded from imputation by most current algorithms (50).

239
Then, in the (independent) non-sequenced cohort, we tested for association between carrying a LoF-segment (the LoF-segment 240 burden) and the phenotype known to be associated with that gene (see Methods). This approach would be optimal when IBD 241 individuals carry a LoF variant that arose prior to the TMRCA of their shared segment. We thus tested ten transformations of

244
Using our LoF-segment burden, we replicated 11 out of 14 previously reported (45) associations with 7 blood-related traits at 245 p < 0.05/10 = 0.005 (adjusted for testing of 10 transformations; see Table 1). Strikingly, we found 8 of these associations to 246 be exome-wide significant in the non-sequenced cohort (t-test p < 0.05/(10 × 14, 249), reflecting 14, 249 genes tested using 10 247 transformations, Fig. 5  Motivated by these results, we next expanded our study to all sequenced genes for this same set of primarily blood-related traits.

259
We identified a total of 186 exome-wide significant gene associations (t-test p < 0.05/(10 × 14, 249)) spanning 33 genomic loci 260 in the non-sequenced cohort by only leveraging LoF-segments; these genes were not significant in the exome burden analysis 261 due to insufficient statistical power (see Table 1 for a comparison). We noticed that some loci included multiple significant 262 associations, suggestive of correlation between associated features as is often seen for GWAS signals. We hypothesized that, 263 in some cases, the true underlying causal variant may be better tagged by known high-frequency SNPs, which are likely to 264 have been detected in previous GWAS analyses. Repeating this analysis including previously associated common variants as 265 covariates reduced the number of significant associations to 111 across 29 loci, suggesting that inclusion of significant common

269
Our exome-wide significant signals include the association between platelet count and MPL (p = 1.99 × 10 −7 ), which encodes 270 the thrombopoietin receptor that acts as a primary regulator of megakaryopoiesis and platelet production and has not been 271 previously implicated by genome-wide scans of either rare or common variants. We also detect several associations in genes that 272 were not detected using exome sequencing but have been previously implicated in genome-wide scans for common variants,

273
including associations between eosinophil count, GFI1B (p = 1.92 × 10 −7 ) and RPH3A (p = 7.63 × 10 −14 ) (51, 52), and 274 associations between platelet count, IQGAP2 (p = 6.52 × 10 −8 ) and GP1BA (p = 1.43 × 10 −7 ) (51, 53). We also identify genes 275 that were previously associated with other blood-related phenotypes in other populations, including the association between 276 platelet distribution width and APOA5 (p = 1.94 × 10 −8 ), a gene that encodes proteins regulating the plasma triglyceride 277 levels; common variants linked to this gene have been linked to platelet count in individuals of Japanese descent (55). We detect 278 association between red blood cell distribution width and APOC3 (p = 3.67 × 10 −11 ), a gene encoding a protein that interacts 279 with proteins encoded by other genes (APOA1, APOA4) associated with the same trait. The association between APOC3 280 and platelet count was also detected with our WES-LoF burden analysis (p = 2.13 × 10 −7 ) and by previous studies based on 281 common SNPs (51). We also found an association between CHEK2 and both mean corpuscular haemoglobin (p = 1.43 × 10 −7 ) 282 and mean platelet volume (p = 1.93 × 10 −7 ). This gene plays an important role in tumor suppression and was found to 283 be associated with other blood traits, such as platelet crit, using both exome sequencing (45) and common GWAS SNPs 284 (51), and red blood cell distribution width (52). Overall, this analysis highlights the utility of applying FastSMC on a hybrid 285 sequenced/genotyped cohort to identify novel, rare variant associations and/or characterize known signals in larger cohorts.

287
We developed FastSMC, a new algorithm for IBD detection that scales well in analyses of very large biobank datasets, is more 288 accurate than existing methods, and enables estimating the time to most recent common ancestor for IBD individuals. We 289 leveraged FastSMC to analyze 487, 409 British samples from the UK Biobank dataset, detecting ∼ 214 billion IBD segments 290 transmitted by shared common ancestors in the past ∼ 1, 500 years. This enabled us to obtain high-resolution insight into recent 291 population structure and natural selection in British genomes. Lastly, we used IBD sharing between exome sequenced and non-292 sequenced samples to infer the presence of loss-of-function variants, successfully replicating known burden associations with 293 7 blood-related complex traits and revealing novel gene-based associations.

294
The level of geographic granularity that could be captured by the IBD networks emerging from our analysis underscores the 295 importance of modeling distant relationships in genetic studies. Indeed, detecting IBD segments among close and distant 296 relatives is a key step in many analyses, such as genotype imputation or haplotype phase inference (4, 13, 14), haplotype-297 based associations (11, 12), as well as in the estimation of evolutionary parameters such as recombination rates (56, 57), or 298 mutation and gene conversion rates (44, 58). More broadly, the observed geographic heterogeneity in IBD sharing and fine-299 scale structure is a reminder that human populations substantially deviate from random mating, even within small geographic 300 regions. In particular, efforts to generate optimal sequenced reference panels for imputation (59) may be greatly improved by 301 directly sampling based on distant relatedness. Our findings also provide empirical support to recent hypotheses that relatedness 302 within available genomic databases is sufficiently pervasive to enable recovering the genotypes of target individuals (30), or 303 to re-identify individuals through familial searches (31). As demonstrated by our analysis, leveraging IBD to impute ultra-304 rare variant burden from small sequenced reference panels can be an effective approach to detect novel gene associations in 305 complex traits and diseases. Although our analysis only considered association to individual genes, in principle genome-wide 306 IBD sharing could also be leveraged to enhance common SNP-based risk prediction, which is particularly relevant for non-

307
European cohorts where sequencing is limited.

308
FastSMC inherits some of the limitations and caveats of the GERMLINE and ASMC algorithms. First, as with most IBD 309 detection methods, FastSMC requires the availability of phased data. We note, however, that when very large cohorts are 310 analyzed, long-range computational phasing results in high-quality haplotype estimation with switch errors rates as low as one 311 every several tens of centimorgans (4, 13). This is particularly true in regions that harbor IBD segments, which are of interest 312 in our analysis. Nevertheless, our analysis has shown the presence of substantial regional heterogeneity in the extent to which to the subset of pairs that produce a hash collision at a given segment plus the cost of hashing the genotype data, which is 562 linear in sample size and genome length, thus dramatically reducing the cost of IBD detection. However, a limitation of this 563 strategy is that certain short haplotypes can be extremely common in the population and result in hash collisions across a large 564 fraction of samples, effectively reverting back to a nearly all-pairs analysis and monopolizing computation time (Supplementary 565 Fig. 1, gray). These common haplotypes are likely due to recombination cold-spots and typically contain little variation to 566 classify shared segments. The majority of computation is thus spent processing regions with the least information content. The

567
GERMLINE2 algorithm, which we developed in this work, proposes a novel adaptive hashing approach that adjusts to local 568 haplotype complexity to dramatically reduce computational and memory requirements. GERMLINE2 proceeds as follows: (1) 569 the input haplotype data is divided into small windows containing 16 or 32 SNPs each (depending on memory architecture); for 570 a given window w, (2) all haplotypes are converted to binary sequences and efficiently hashed into bins of identical segments;

571
(3) for each bin that contains more individuals than a fixed threshold (i.e. a low complexity bin) step 2 is recursively performed 572 for window w+1 until no more low complexity bins are found; (4) all pairs of individuals sharing within a bin are then recorded 573 in a separate hash table that stores putative segments; (5) pairs of individuals sharing contiguous windows that are sufficiently 574 long are reported for validation. The primary computation speed-up comes from the recursive hashing step, which requires 575 haplotypes to be sufficiently diverse before they are explored for pairwise analysis and stored ( Supplementary Fig. 1, green).

576
To allow for possible phasing errors, a putative shared segment is maintained through a parameterized number of non-identical 577 windows, and the total number of non-identical windows within the segment is also reported for filtering. Most phasing errors 578 either appear as ''blips", where a phase switch is immediately followed by a switch back, or by single switches followed by which can be used to tune precision and recall, e.g. by allowing a more or less permissive detection of IBD segments. Stringmatching methods (GERMLINE and RaPID) report long, approximately IBS regions and do not produce calibrated estimates of 632 segment quality. A commonly used proxy for the likelihood of a detected IBS segment being IBD is its length, with longer IBS 633 segments being more likely to be IBD than shorter ones. In lack of an interpretable measure of accuracy, precision and recall for 634 the output of GERMLINE and RaPID was thus tuned by using different segment length cut-offs. RefinedIBD and FastSMC, on 635 the other hand, both provide an explicit quantification of segment quality. RefinedIBD outputs a LOD score for each segment, 636 while FastSMC computes a segment's IBD quality score, which is the posterior of the TMRCA being between present time and 637 the user-specified time threshold. We thus used LOD score and IBD quality score to tune precision and recall of RefinedIBD   638 and FastSMC, respectively. Each method presents a number of additional parameters, which we further optimized using a 639 grid-search (see below), so that each method can be run with a set of parameters that is as close to optimal as possible. Despite 640 the extensive tuning, not all accuracy values could be explored by all methods. Namely, some recall values cannot be achieved 641 using realistic parameters, due to factors such as the minimum allowed LOD parameter for RefinedIBD, the time discretisation 642 introduced in FastSMC and the minimum length parameter for all methods. We thus evaluated all algorithms by restricting 643 the comparison to the range of recall values that could be achieved by all methods, which we refer to as common recall range.

644
Furthermore, some of these parameters affect the speed of each algorithm. The parameters we chose for comparing methods are 645 optimized for maximum accuracy, although we avoided parameters that would result in degeneracies (e.g. the minimum length 646 in FastSMC's identification step could be set to values below 0.1 cM, effectively disabling this step and reverting to a pairwise 647 ASMC analysis, which would lead to a higher accuracy and larger recall range, at the cost of unreasonable computation). one simulated dataset and select the best set of parameters. For each method, we then explored the obtained set of parameters to 651 make the algorithm faster while negligibly compromising the accuracy (in most cases, this resulted in a slightly smaller recall 652 range but with a substantial gain in speed). We finally used 10 independent simulated datasets to validate the accuracy and the 653 UK Biobank dataset to measure running time and memory usage (see Methods). Unless specified otherwise, the parameters 654 presented in Supplementary Table 1 were used in all analysis. (autosomal chromosomes only) in regions from different chromosomes or separated by centromeres.

659
Estimating the age of an IBD segment. A common way to estimate the age of an IBD segment is to use its length to obtain 660 a maximum likelihood estimator (MLE). When the time in generations g to the most recent common ancestor is known, the 661 total length of a randomly chosen shared IBD segment follows an exponential distribution with rate 1/2g per Morgan (5).

662
The likelihood function is thus given by L(g) = (2g)e −2lg , where l denotes the segments length in Morgans. As dL dg (g) = 663 2e −2lg (1 − 2lg), the MLE is given byĝ = 1 2l . FastSMC does additional modelling and provides a different age estimate, which 664 consists in the average maximum a posteriori (MAP) estimate of the TMRCA along the segment. We sometimes report segment 665 age estimates in years, rather than generations, assuming 30 years per generation. from that cell corresponding to the most represented cluster. All light gray dots correspond to individuals that are either in 705 clusters containing less than 500 samples or part of a cluster different from the one represented in the grid cell.

706
Prediction of birth location. We randomly sampled two subsets of 10, 000 individuals each from the UK Biobank cohort and 707 used the sharing of recent common ancestors in the past 600 years with the remaining 412, 866 individuals in the UK Biobank 708 dataset to predict their birth locations, applying the K-nearest-neighbors algorithm. One dataset was used to find the optimal 709 value for the parameter K, while the second one was used to validate the results (details shown in Supplementary Fig. 12).

710
We computed pairwise genetic similarity across individuals using either FastSMC-estimated pairwise IBD sharing, or using to have mean 0 and variance 1 for each column. Sharing of very close relatives is highly informative of geographic proximity.

714
We thus excluded ≤3rd degree relatives (2)  We regressed the true X (resp. Y) birth coordinates on the predicted X (resp. Y) birth coordinates using either IBD sharing for both methods, obtaining a stronger correlation when using IBD sharing (r = 0.6 for X coordinate and r = 0.74 for Y 725 coordinate) than when using allele sharing (r = 0.31 for X coordinate and r = 0.43 for Y coordinate, respectively). Correlation 726 for IBD sharing was also stronger without removing close relatives (r = 0.63 for X coordinate and r = 0.77 for Y coordinate, 727 compared to r = 0.40 and r = 0.42 respectively for allele sharing).   We partitioned these pairs depending on the distance in kilometers (kM) in the birth locations of the two individuals (less than 5kM, between 5 and 10kM, between 10 and 25kM, between 25 and 50kM, and then every 50kM up to 500kM). The blue line represents the average IBD sharing across all pairs of random samples, and the green trend lines correspond to the standard error of the mean across all pairs (assuming pairs of individuals are independent, which is approximately true when looking at sharing of very recent segments).
We observe strong correlation between genetic and geographic distance, demonstrating that close relatives tend to be geographically clustered. and from the Data Warehouse for Scotland (data only available on the area level, we computed estimates for postcodes). Regions with no data available are coloured in gray. B. Real physical distances between postcodes (top) and isomap projections using IBD sharing within the past 600 years (bottom), x-axis is longitude and y-axis is latitude. The isomap projection was obtained on the IBD sharing postcode dissimilarity matrix and by considering 8 nearest neighbours. We then applied affine transformations (rotation, translation and scaling) to minimize the root mean squared of the reconstruction error (RMSE) (74). The RMSE obtained is 179 km, 95% CI= [163,196]. shared genome (in cM) on the X-axis, we report the percentage of UK Biobank samples (Y-axis) that share x or more with their closest individual. B. For each value x of total shared genome (in cM) on the X-axis, we report the median distance (km, computed every 10 cM) for all pairs of (sample, closest individual) who shared at least x. Vertical dashed lines indicate the expected value of the total IBD sharing for k-th degree cousins, computed using 2G(1/2) 2(k+1) , where G = 7247.14 is the total diploid genome size (in cM) and k represents the degree of cousin relationship (e.g. k = 2 for second degree cousins, separated by 2(k + 1) generations) (10).
We show results obtained using either all individuals with available geographic data (N = 432, 968; black lines), or all individuals with available geographic data after removing ≤3rd degree relatives (e.g. first degree cousins), detected in (2) Table 2 reports the list of genes within the gene cluster labeled as HLA.
The LoF-segment burden analysis (with SNP adjustment) used 303, 125 UK Biobank samples not included in the exome sequencing cohort. We identified two loci previously reported by Van Hout et al. (45), KLF1 and GMPR (gene labels in black), and two novel associations at HLA and CHEK2 (labels in red).
The LoF-segment burden analysis (with SNP adjustment) used 303, 125 UK Biobank samples not included in the exome sequencing cohort. We identified two previously-reported loci (black labels), KLF1 (detected by Van Hout et al. (45)) and APOC3 (detected by our WES burden analysis), along with two additional (labels in red).

Supplementary Fig. 23
| Quantile-quantile plots for LoF-segment burden association. Quantile-quantile plots for mean platelet (thrombocyte) volume (left) and for the same trait, but with randomly permuted phenotype values (right). We observe no signal in the permuted phenotype analysis, suggesting a well-calibrated test. The genomic inflation factor for the (SNP-adjusted) LoF-segment burden test, calculated as the the ratio between the observed median chi-squared association statistic and the median chi-squared association statistic expected under the null, is 1.982 (similar values were observed for the other traits). Values larger than 1 may be caused by pervasive polygenicity (75) and are also observed in analyses of common variants (1.492 for this trait using summary statistics from (73)) and by population stratification remaining after correcting for principal components (67)  . FastSMC has two main parameters that we tuned: a time threshold (t) to indicate how deep in time the user wants to detect common ancestry, and a minimum length parameter (in cM) (length) for the IBD segments. The time threshold parameter was always set to be the same as the time threshold used for the benchmarking. We tuned three parameters in GERMLINE: the minimum length of IBD segments (in cM) (min, optimized over [0.001, 0.01, 0.1, 0.5, 0.75, 1, 1.5, 3, 5]), the numbers of bits (SNPs) in each window (bits, optimized over [32, 64, 128, 256]), the minimum number of mismatches allowed in each of them (err, optimized over [0, 2, 5, 10, 20]).