2.5 Bioinformatics analysis
Samples were sequenced on the Oxford Nanopore GridION and PromethION instruments until a sufficient average depth of coverage (minimum 8x, with >20x preferred) was reached for variant calling. The few samples that did not reach this coverage threshold were reviewed individually and were included in the analysis if they contained informative variant calls that could be manually confirmed by inspection of the alignment files. Nanopore data were aligned to the ASFV Georgia 2007/1 reference genome (GenBank accession NC_044959.2) using Minimap2 (v2.18-r1015) with the options “-N 1000 -a --eqx -x map-ont” (Li, 2018). Illumina data were aligned to the same reference genome using the Burrows-Wheeler Aligner (v0.7.17) with the options “-a -h 2 -Y -M” (Li & Durbin, 2012). Insertions and deletions were called for the subset of samples characterized with Illumina data using Freebayes parallel (v1.3.4) with the option “--standard-filters” (Garrison & Marth, 2012). SNPs for the epidemiological analysis were called using a custom, open-source SNP caller (https://github.com/lakinsm/simple-snp). A variant was considered true only if it met the following thresholds: a minimum depth of 10 observed alleles at a given genomic location across the population of samples (DP > 10), a minimum observed alternate allele count of 7 at a given genomic location across the population of samples (AO > 7), and an alternate allele frequency greater than or equal to 70% at a given site within a given sample. Additionally, a subject matter expert visually verified that every single nucleotide polymorphism described in the data was present in the alignment files, and final variant calls were manually corrected to match the visual inspection where necessary. Low-quality SNPs located in the 5,000 base pairs flanking the 5’ and 3’ terminal regions of the genome were not included in the analysis.
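The variant thresholds above can be illustrated with a minimal Python sketch. It is not the simple-snp implementation: the `SiteCounts` structure, the function name `passing_samples`, and the per-sample count dictionaries are hypothetical, introduced only for illustration. The strict inequalities follow the parenthetical notation in the text (DP > 10, AO > 7). As a simplification, the sketch excludes every SNP in the 5,000 bp terminal windows, whereas the text excludes only low-quality SNPs there.

```python
from dataclasses import dataclass
from typing import Dict, List

# Thresholds taken from the text; strictness follows "DP > 10" and "AO > 7".
MIN_POP_DEPTH = 10          # population-wide observed alleles at the site
MIN_POP_ALT = 7             # population-wide alternate allele observations
MIN_SAMPLE_ALT_FREQ = 0.70  # within-sample alternate allele frequency
TERMINAL_MASK_BP = 5000     # window at each genome end (simplified here to a hard mask)


@dataclass
class SiteCounts:
    """Hypothetical per-site allele counts, keyed by sample name."""
    position: int                  # 1-based reference coordinate
    sample_depth: Dict[str, int]   # total observed alleles per sample
    sample_alt: Dict[str, int]     # alternate allele observations per sample


def passing_samples(site: SiteCounts, genome_length: int) -> List[str]:
    """Return the samples in which the alternate allele passes all thresholds."""
    # Simplification: drop all sites inside the terminal windows.
    if site.position <= TERMINAL_MASK_BP:
        return []
    if site.position > genome_length - TERMINAL_MASK_BP:
        return []

    # Population-level thresholds are evaluated on counts summed over samples.
    pop_depth = sum(site.sample_depth.values())
    pop_alt = sum(site.sample_alt.values())
    if not (pop_depth > MIN_POP_DEPTH and pop_alt > MIN_POP_ALT):
        return []

    # The frequency threshold is evaluated separately within each sample.
    passing = []
    for name, depth in site.sample_depth.items():
        alt = site.sample_alt.get(name, 0)
        if depth > 0 and alt / depth >= MIN_SAMPLE_ALT_FREQ:
            passing.append(name)
    return sorted(passing)
```

In this sketch the population-level counts act as a gate on the site as a whole, while the 70% frequency requirement decides which individual samples carry the variant.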
All publicly available raw data labelled as African Swine Fever Virus whole genome sequence were downloaded from the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA). Genome assemblies labelled as African Swine Fever Virus were downloaded from the NCBI GenBank repository. Genotype II ASFV sequences were selected from the NCBI SRA and GenBank data for comparison against samples from the DR. The selected SRA and GenBank data were evaluated for quality, and sequences of questionable quality, judged by the locations of their mutations and their degree of relatedness in a multiple sequence alignment, were removed. A total of 54 ASFV genomes from public databases were included in the final analysis (Supplementary File 1) (Farlow et al., 2018; Gallardo et al., 2015; Olesen et al., 2009; Kovalenko et al., 2019; Mazur-Panasiuk et al., 2020; Gilliaux et al., 2018; Xuexia et al., 2019; Olasz et al., 2019; Hakizimana et al., 2021; Mazloum et al., 2021; Jia et al., 2020; Xiong et al., 2019).
Raw data retrieved from the NCBI SRA were aligned to the ASFV Georgia 2007/1 reference genome (GenBank accession NC_044959.2) using either the Burrows-Wheeler Aligner (v0.7.17, short-read data) or Minimap2 (v2.18-r1015, long-read data), and variants were called as described above. Consensus sequences including the SNP variants were produced for all DR samples and external NCBI SRA data. The resulting consensus sequences were aligned with the whole genome sequences from NCBI GenBank in a multiple sequence alignment using MAFFT (v7.487) (Katoh et al., 2002). Phylogenetic tree construction was performed using RAxML (v8.2.12) with the GTRGAMMA model argument (as determined by model selection using likelihood maximization), and the tree was visualized using FigTree (v1.4.4) (Kozlov et al., 2019). A subset of 45 nodes was selected using the Treemmer software (v0.3) for display on the phylogenetic tree in Figure 2 (Menardo et al., 2018). SNP tables were visualized using the vSNP pipeline developed by the USDA.
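One way to picture the consensus step is as substitution of the accepted SNP alleles into the reference sequence. The Python sketch below assumes exactly that; the text does not state how the consensus sequences were actually generated, and the function name `build_consensus` and the SNP dictionary layout are hypothetical. Insertions and deletions are deliberately ignored, so every consensus keeps the reference coordinates.

```python
from typing import Dict, Tuple


def build_consensus(reference: str, snps: Dict[int, Tuple[str, str]]) -> str:
    """Substitute SNP alleles into a reference sequence.

    `snps` maps a 1-based reference position to a (reference base, alternate base)
    pair. This layout is an assumption made for the sketch.
    """
    bases = list(reference)
    for position, (ref_base, alt_base) in snps.items():
        index = position - 1  # convert to 0-based indexing
        if not 0 <= index < len(bases):
            raise ValueError(f"position {position} is outside the reference")
        # Guard against coordinate errors: the stated reference base must match.
        if bases[index].upper() != ref_base.upper():
            raise ValueError(f"reference base mismatch at position {position}")
        bases[index] = alt_base
    return "".join(bases)


# Toy example on an 8 bp sequence, not real ASFV data.
consensus = build_consensus("ACGTACGT", {2: ("C", "T"), 8: ("T", "A")})
print(consensus)
```

Because only substitutions are applied, the consensus sequences produced this way all have the same length as the reference, which keeps SNP positions comparable across samples before the multiple sequence alignment.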