##############
### README ###
##############

This archive contains the raw data and scripts necessary to reproduce (and expand upon) the SNP calling evaluation available as:

Bush, et al. Genomic diversity affects the accuracy of bacterial SNP calling pipelines. https://www.biorxiv.org/content/10.1101/653774v1

First unzip all files in this dataset into one directory, varcall_evaluation. See "file_tree.txt" for its structure.

Perl scripts are to be used in numerical order. The purpose of, and prerequisites for, each script are detailed in the header of each and briefly recapitulated here. See also in-code comments for further information.
All scripts are provided as is, for reproducibility purposes only. No attempt has been made to optimise the code.

Script 1 is used to align each Nanopore/Illumina hybrid assembly (in "nanopore_illumina_hybrid_assemblies") to its representative genome (in "representative_genomes_for_nanopore_illumina_hybrid_assemblies"), and then create the nucmer & ParSnp SNP calls. This script is for reference only; its output (nucmer & ParSnp VCFs) is already included in "nanopore_illumina_hybrid_assemblies".
Script 2 generates a .sh that contains the bash commands necessary to run every pairwise combination of read aligner and variant caller. These bash commands populate an output directory ("vcfs") of which a small example is included in this archive, alongside the shell script that created it, run_SNPcalling_pipelines.sh.
Script 3 filters the contents of the "vcfs" directory, produced by script 2, in order to populate the directory "filtered_vcfs". A small example is included in this archive, alongside the shell script that created it, filter_VCFs.sh.
Script 4 parses the contents of the "filtered_vcfs" directory to generate two summary files, summary_statistics.tsv and pipeline_rankings.tsv, within the output directory "evaluationOutput" or "evaluationOutput-repeatmasked" (a parameter in script 4 determines which is created).

In addition, R commands for the original figures are detailed within the directory "commandsForOriginalFigures", provided alongside the relevant input tables.
These include:
1. The subdirectory "average_f_score_for_all_pipelines_per_strain_vs_strain_uniqueness", which contains tables comprising the (simulated) data used to create one of the central figures in the paper, figure 3 (alongside supplementary figures 1 and 2).
2. Most notably, the original "evaluationOutput" and "evaluationOutput-repeatmasked" directories, i.e. full output of the SNP calling evaluation: 209 pipelines evaluated on 18 genomes. Within these directories are the files summary_statistics.tsv and pipeline_rankings.tsv; these correspond to Supplementary Tables 9 and 10, respectively, with data from the former used to draw Figure 7.

Finally, I would like to draw your attention to:
Lampa, et al. (2013) Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data. GigaScience. 2(1):9. https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-9
Specifically, the eighth lesson learned - that getting users to share scripts is difficult.
"It was hoped that users encountering similar problems could benefit from sharing their experiences through the scripts they used to control NGS software. Unfortunately, getting users to share scripts is difficult. Even if stating explicitly that scripts are to be provided as is, users are reluctant to submit scripts for public usage. Interviews have revealed that users do not have time to add proper documentation or cleanup their code or fear that they will be harassed with questions so they avoid sharing scripts."
i.e. I'm explicitly stating that these scripts are provided as is, and I'm supporting that decision with citation, too.

Further information regarding the hybrid assemblies
***************************************************

The Nanopore/Illumina hybrid assemblies were generated using methods detailed in https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000294.
This paper employed four different assembly strategies.
The assemblies within (each subdirectory within) the directory "nanopore_illumina_hybrid_assemblies" are those for which the assembly strategy is "corrected". This strategy entailed read error-correction and subsampling, as follows:

ONT fast5 read files were base-called with Albacore (v2.0.2, https://github.com/JGI-Bioinformatics/albacore), with barcode demultiplexing and fastq output.
Adapter sequences were trimmed with Porechop (v0.2.2, https://github.com/rrwick/Porechop).
Read quality was calculated with nanostat (v0.22, https://github.com/wdecoster/nanostat).
Long reads were error-corrected and subsampled (preferentially selecting longest reads) to 3040 coverage using Canu (v1.5, https://github.com/marbl/canu).

For further methodological details, please refer to the original publication: https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000294.
The full set of assemblies, using multiple different assembly strategies, are available via FigShare: https://doi.org/10.6084/m9.figshare.7649051.

Note also that for the purpose of this archive, only the longest contiguous sequence in each assembly is included, under the assumption that this represents the core genome.

Each of these sequences is aligned to the representative genome using nucmer and ParSnp, so as to obtain consensus SNP calls.
The VCFs, of variants called relative to the representative genome, are also within the (subdirectories within) "nanopore_illumina_hybrid_assemblies", and follow the naming convention "X.varcall_relative_to_representative.Y.vcf" where X is assembly name and Y is aligner (nucmer or parsnp).

Two suggestions for expanding this work:
1. When using ParSnp to align to the longest contiguous sequence each representative genome, the following (fatal) error message may occur: "Parsnp requires 2 or more genomes to run, exiting".
This otherwise opaque error concerns a hard-coded filter within ParSnp: it requires that genomes be within +/- 30% (bp) of the reference (discussed here: https://github.com/marbl/parsnp/issues/18 and https://github.com/marbl/parsnp/issues/40). Assemblies that are not substantially complete may fall below this threshold - the longest contiguous sequence could be shorter than the representative genome.
2. SNP calls made using nucmer and ParSnp may be supplemented with those made by progressiveMauve (http://darlinglab.org/mauve/mauve.html). progressiveMauve aligns 2 genomes, generating three files as output: X.xmfa X.xmfa.backbone X.xmfa.bbcols, where X is a user-supplied prefix. SNPs may be exported from X.xmfa using the command, e.g., "java -cp ~/programs/mauve_snapshot_2015-02-13/Mauve.jar org.gel.mauve.analysis.SnpExporter -f X.xmfa -o X". However, we found that exporting SNPs via the command line was bugged, producing incomplete files - this has previously been reported at https://www.biostars.org/p/334459/. One solution is to import X.xmfa into the Mauve GUI instead (you will need the .backbone and .bbcols files in the same directory too), and then export the SNPs manually: Tools > Export > Export SNPs.
