Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study

Summary Background Diagnosing drug-resistance remains an obstacle to the elimination of tuberculosis. Phenotypic drug-susceptibility testing is slow and expensive, and commercial genotypic assays screen only common resistance-determining mutations. We used whole-genome sequencing to characterise common and rare mutations predicting drug resistance, or consistency with susceptibility, for all first-line and second-line drugs for tuberculosis. Methods Between Sept 1, 2010, and Dec 1, 2013, we sequenced a training set of 2099 Mycobacterium tuberculosis genomes. For 23 candidate genes identified from the drug-resistance scientific literature, we algorithmically characterised genetic mutations as not conferring resistance (benign), resistance determinants, or uncharacterised. We then assessed the ability of these characterisations to predict phenotypic drug-susceptibility testing for an independent validation set of 1552 genomes. We sought mutations under similar selection pressure to those characterised as resistance determinants outside candidate genes to account for residual phenotypic resistance. Findings We characterised 120 training-set mutations as resistance determining, and 772 as benign. With these mutations, we could predict 89·2% of the validation-set phenotypes with a mean 92·3% sensitivity (95% CI 90·7–93·7) and 98·4% specificity (98·1–98·7). 10·8% of validation-set phenotypes could not be predicted because uncharacterised mutations were present. With an in-silico comparison, characterised resistance determinants had higher sensitivity than the mutations from three line-probe assays (85·1% vs 81·6%). No additional resistance determinants were identified among mutations under selection pressure in non-candidate genes. Interpretation A broad catalogue of genetic mutations enable data from whole-genome sequencing to be used clinically to predict drug resistance, drug susceptibility, or to identify drug phenotypes that cannot yet be genetically predicted. This approach could be integrated into routine diagnostic workflows, phasing out phenotypic drug-susceptibility testing while reporting drug resistance early. Funding Wellcome Trust, National Institute of Health Research, Medical Research Council, and the European Union.


S3
The 23 candidate-genes were identified in a review of the published literature by SF and SN, who used their expert opinion / judgement to compile a list of high-confidence resistancedetermining mutations. This was done prior and hence independent of this study. These were nevertheless taken as a starting point for our study. Genes are listed below together with the unique Pubmed identifiers for the source papers. Each gene was included on the basis of at least one plausible resistance-determining mutation. The publications linked to the pubmed identifiers are listed below the table.

Genes
Pubmed uid aphC

S4&
100 iterations of the algorithm, each time randomly assigning 1825 sequences to a trainingset and 1826 to a validation-set. Genotypic predictions for the phenotypes in each of the 100 validation-sets are shown. As in table 1, algorithmic characterisations are based on: R (resistance-determining mutation); Rx (resistance-determinant as a mixed base-call); S 0 (zero mutations present); Ss (only benign mutations present); U (uncharacterised mutation present in the absence of a resistance-determining mutation).

S5
Phylogeny defining SNPs removed by the algorithm: SNPs listed in either Coll et. al. (25176035[uid]) or Feuerriegel et. al. (24458512[uid]), that were in the 23 candidate-genes, and that occurred in any of the 3651 sequences, were idenified and set aside by the algorithm. Additional lineage or sub-lineage associated SNPs not listed in the referenced papers were identified manually and also set aside (source left as blank in

S6#
Each step in the algorithm is depicted here as a schematic. First, all sequences were mapped to the H37Rv M. tuberculosis reference genome. Across the genes of interest and their promoter regions (defined as 100 base-pairs upstream), all base-calls were identified as either the same as reference, different from reference, or 'null' (N) where no base could be called because of insufficient coverage or because there was evidence for more than one base ('mixed-call'). Five hypothetical sequences are shown here (Seq 1-5).
Where one or more sequences differed from the reference just on the basis of a null-call, and all others are identical to the reference, that nucleotide position (be it upstream or within a gene) was set aside and not considered of further interest. Positions at which at least one sequence had a SNP in a promoter region, or a SNP leading to non-synonymous amino-acid substitution, were identified for further analysis. Where codons differed from the reference only on the basis of one or more sequences that had a synonymous SNP within the codon, they were also set aside.
All loci in 23 genes that don't match the reference Gene (coding region)

H37Rv) A) C) G) T) C) C) T) T)) G)
All loci in 23 genes that don't match the reference Among the remaining mutations (SNPs, AA substitutions and indels), some were identified as lineage, or phylogeny defining and therefore set aside. Others were only seen in sequences / isolates that had not been phenotyped to the relevant drug. These were also set aside. The remaining mutations were then characterised as either resistance-determining or as benign.
Resistance-determinants were identified by focussing on all genes relevant to a drug. In the example below, the five genes believed to be relevant to isoniazid are shown for a single sample / sequence. Each position within each gene is marked by a coloured line where the reference mutation has not been called. After the null-calls are set aside from otherwise invariant positions, the synonymous mutations are set aside, and the phylogeny/lineage defining mutations ('PhyloSNPs') are set aside, only a single mutation remains (circled). If this sample in question is phenotypically resistant to isoniazid, then this remaining mutation is defined as a resistance-determinant and all other sample containing this mutation (genotypically) predicted to be resistant on this basis.
Were the phenotype to have been susceptible in this example, all other isolates containing the circled mutation would have had to have been inspected before the mutation could be characterised. Two scenarios could lead to the mutation being characterised as benign. First, if all isolates containing the mutation were phenotypically susceptible. Second, if all isolates containing only that mutation were phenotypically susceptible (see below). ! 12! ! As can be seen in the right hand panel, the benign 'blue mutation' occurs in one phenotypically resistant isolate. Benign mutations were therefore set aside before a second iteration of the algorithm was undertaken in an attempt to reveal additional lone-standing resistance-determinants. The process is illustrated below where the circled mutation characterised as 'benign' on the basis of other samples (as in the right hand panel above) is set aside to reveal a lone-standing mutation that could now be characterised as resistancedetermining, in this resistant sample.
When new sequences are added to a training-set, this has the potential to lead to the recharacterisation of mutations. For example, adding an eighth sample containing the 'blue mutation' to the left hand panel above would lead to the re-characterisation of that mutation as 'resistant' if the phenotype for that eighth sample were resistant. Resistant mutations could however only be re-characterised as benign where their original characterisation as 'resistant' was dependent on another mutation being characterised as benign (as immediately above) and where the mutation is otherwise always associated with a susceptible phenotype when occurring as the only mutation in a sample (as in the right hand panel above).

S6.1#
The full results of the algorithm as applied to the original training set are shown below.

S8
The 120 resistance-determinants derived from the original training-set are shown. Each mutation can be counted more than once if it is characterised as a resistance-determinant to more than one drug. The number of resistant and susceptible phenotypes with which each resistance-determinant is associated is shown, and a Pubmed identifier is given if the mutation has been described as resistance-determining before. 79/120 (66%) resistancedeterminant/drug combinations had previously been described in the literature. This compared to 31/772 (4%) 'benign' mutations and 20/71 (28%) mutations defining lineages not intrinsically drug-resistant (counting each mutation once for each drug it was characterised for, i.e. some counting some more than once).

S8.1%
Characterisation of mutations previously identified in the literature as resistance-determinants. The 'literature' was defined as any mutation listed in the Dream TB database project (https://tbdreamdb.ki.se/Info/), 4 any mutation contained on one of the three line-probe assays (which also cover those identified by the Cepheid MTB/RIF Xpert). An additional manual search was also performed for any of the 120 resistance-determinants that did not match any of these sources. Analysing the literature like this of course introduces bias, as we look harder for evidence of the resistance-determining mutations in the literature than we do for the other mutations. However, that lineage-defining mutations have been characterised as resistancedeterminants in the published literature at all is evidence of noise, only more of which would be found by a further manual trawl for mutations. Each mutation is counted for each drug it is associated with (i.e. some mutations are counted more than once).

S10
The performance of mutations characterised in the training-set as resistance-determinants when predicting phenotypes in the validation-set.

Phenotypically resistant
Phenotypically sensitive

S16$
Phenotypic predictions for all 3651 isolates, based on mutations derived from applying the algorithm to the same 3651 isolates. Genotypic predictions for all phenotypes for all 3651 isolates. Genotypic predictions based on: R (resistance-determinant); Rx (resistance-determinant as a mixed base-call); S0 (zero mutations present); Ss (only benign mutations present); U (uncharacterised mutations present). Weighted mean sensitivity and specificity presented as both an overall figure and as that based on characterised mutations only (R, Rx, S0, Ss columns)

S17$
Homoplasy rates vs. the proportion of phenotypic-resistance for mutations in the 23-candidate genes and the rest of the genome. ! ! $ Mutations within the candidate-genes are shown as red (characterised as resistancedeterminants) and green (others). Mutations in the rest of the genome are shown in grey. Curves represent the mean for mutations across candidate-genes and across the rest of the genome. The mutation katG S315T emerged 180 times but has been set to 90, equal to the next most homoplastic mutation, to fit all data to the current scale. The proportion resistant has not been changed and shape of the curve of the mean for candidate-genes remains unaltered. Homoplasy rates against proportion phenotypically resistant for variants in the 23-candidate genes and the rest of the genome. Variants within the candidate-genes are shown as red (classified as resistance-determinants) and green (others). Curves represent the mean across candidate-genes versus the rest of the genome. The variant katG S315T emerged 180 times but has been set to 90, equal to the next most homoplastic variant, to fit all data to the current scale. The proportion resistant has not been changed and shape of the curve of the mean for candidate-genes remains unaltered.