MATERIALS AND METHODS
The Leipzig NDD research cohort
An ongoing study at the Institute of Human Genetics at Leipzig
University Medical Center investigates the genetics of rare diseases
using high-throughput sequencing (also commonly referred to as
next-generation sequencing or NGS). This study was approved by the ethics
committee of the University of Leipzig, Germany (224/16-ek and
402/16-ek) and all individuals or their parents or legal guardians
consented to genetic testing.
All individuals underwent high-throughput sequencing. Genomic DNA was
enriched using either a clinical exome target design (CES; TruSight One
v1 panel, 4811 genes, Illumina, Inc., San Diego, USA) or different exome
target designs (TWIST Human Core Exome Kit, TWIST Bioscience, San
Francisco, CA, USA; BGI Exome capture 59M kit, BGI, Shenzhen, China;
SureSelect Human All Exon V6, Agilent Technologies, Santa Clara, CA,
USA; TruSeq DNA Exome and Nextera Rapid Capture Exomes, Illumina,
Inc., San Diego, CA, USA). The resulting libraries were sequenced with
paired-end reads (100 bp or 150 bp) on either Illumina (NextSeq 550 or
NovaSeq 6000) or BGI (BGISEQ-500) sequencers. Most samples were
initially sequenced using an affected-only (“single”) approach. The
data were first analyzed in a diagnostic setting. In unsolved cases with
research consent and parental DNA samples, we subsequently complemented
the analysis with trio exome sequencing (ES) and re-evaluated the case.
For research re-evaluation of unsolved cases, specially trained staff at
our institute manually re-checked the data to identify candidate genes
and variants based on gene annotations and a review of the literature and
databases. Training and supervision were provided by principal
investigators with long-standing experience in molecular genetic
diagnostics and NDD research.
Establishing a comparable candidate score (CaSc)
To standardize this approach and make results comparable across genes
and between analysts, we established a candidate score (CaSc), a set of
criteria for evaluating variants in novel candidate genes for NDD. In
brief, the criteria covered variant attributes (missense or
loss-of-function (LoF) variants, in silico (computational) prediction
tools, conservation, minor allele frequency), inheritance aspects
(segregation and zygosity, and the relation to higher pLI scores (Exome
Aggregation Consortium et al., 2016) for LoF variants and missense
Z-scores (Genome Aggregation Database Consortium et al., 2020)), and the
gene’s plausibility for causing NDD.
Gene plausibility was assessed by reviewing OMIM and PubMed entries to
check whether the gene is involved in neurological processes, MGI (Bult
et al., 2019) and PubMed for animal models, STRING (Szklarczyk et al.,
2019) for protein interactions, GTEx (Consortium, 2020) for gene
expression in the central nervous system, and a list of resources to
determine whether variants in the gene have been described in association
with autism spectrum disorder or NDD (the DDD study, the Human Gene
Mutation Database (HGMD), ClinVar, cooperation partners, and entries in
GeneMatcher). An overview of the manual version of CaSc is provided in
Figure S1 and a detailed description in the corresponding preprint
(Büttner et al., 2019). Evaluators manually applied these rules to all
candidate variants and focused on following up the highest-scoring and
thus most promising genes through matchmaking. Evaluating a variant took
between 5 and 15 minutes. In moving to the automatic version of CaSc, we
made several changes to the scoring logic (see below and detailed in the
Supplementary notes).
Automatic candidate score (AutoCaSc)
The manual scoring process proved suboptimal in terms of time required
and inter-evaluator variability when applied without extensive prior
training and constant supervision, which are hard to maintain in a
university setting with continuous staff turnover. We therefore decided
to develop a software tool to remove subjectivity and enable lasting
comparability and continuous quick re-evaluation of older candidate
genes. AutoCaSc is written in Python and all code is publicly available
in a GitHub repository (https://github.com/JohannKaspar/AutoCaSc) under a
Creative Commons license (CC BY-NC-SA 4.0).
The three categories “Inheritance”, “Gene constraint” and “Variant
attributes” were implemented as simple decision trees based on an
updated CaSc logic (detailed in Table S2). For the “Gene plausibility”
category, we reviewed public databases providing information on gene
expression, annotated literature, animal models and reported variants.
Criteria for selecting a resource were the uniqueness of its data
sources, its completeness, and regular updates or sustained maintenance.
We selected the Genotype-Tissue Expression (GTEx (Consortium, 2020))
project for expression data, Mouse Genome Informatics (MGI (Bult et al.,
2019)) database for model organism data, STRING (Szklarczyk et al.,
2019) database for protein-protein interactions, PubTator Central (PTC)
(Wei et al., 2019) for annotations of PubMed abstracts, DisGeNET (Piñero
et al., 2017) for gene-disease associations (GDA) and PsyMuKB (Lin et
al., 2019) for reported de novo variants (DNVs). DisGeNET is a partially
redundant source because it derives GDA from published literature as well
as from various repositories and databases. However, we found that
integrating DisGeNET improves the gene plausibility score overall because
it includes data beyond the literature.
Genomic or transcript coding variants submitted to AutoCaSc are annotated
using the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016)
via the Ensembl REST (representational state transfer) API (application
programming interface). The latest allele counts are queried through the
GraphQL API of gnomAD (Genome Aggregation Database Consortium et al.,
2020). Gene constraint metrics such as pLI and Z-score (Exome Aggregation
Consortium et al., 2016), the predicted impact of the variant on protein
function, mode of inheritance, GERP++ (Davydov et al., 2010) rankscore
(RS), frequency in gnomAD (Genome Aggregation Database Consortium et al.,
2020), SIFT (Kumar et al., 2009) converted RS, MutationTaster (Schwarz et
al., 2014) converted RS and MutationAssessor (Reva et al., 2011) RS are
then used to calculate the variant score. The variant score is then added
to the precalculated gene plausibility score. Please refer to Figure 1b,
Table S2 and Table S3 for
an overview of all criteria and data sources used.
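As a minimal sketch of the annotation and summation steps, the following assumes the public Ensembl REST server and its `vep/:species/hgvs/:hgvs_notation` endpoint. The function names and the example transcript variant are hypothetical and are not taken from the AutoCaSc code base, and the score values passed to `candidate_score` would come from the decision trees and the precalculated gene table rather than from this snippet.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def vep_hgvs_url(hgvs: str, species: str = "human") -> str:
    """Build the VEP request URL for one HGVS-notated variant."""
    # Percent-encode ":" and ">" so the notation is a valid path segment.
    encoded = urllib.parse.quote(hgvs, safe="")
    return f"{ENSEMBL_REST}/vep/{species}/hgvs/{encoded}"


def fetch_vep(hgvs: str) -> list:
    """Query VEP and return the decoded JSON (needs network access)."""
    request = urllib.request.Request(
        vep_hgvs_url(hgvs), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def candidate_score(variant_score: float, gene_plausibility: float) -> float:
    """Add the variant score to the precalculated gene plausibility score."""
    return variant_score + gene_plausibility


if __name__ == "__main__":
    # Hypothetical example variant; only the URL is built, nothing is sent.
    print(vep_hgvs_url("ENST00000366667:c.803C>T"))
```

Keeping the URL construction separate from the network call makes the request logic checkable offline, which is convenient when a public API is rate-limited.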
We provide webAutoCaSc (https://autocasc.uni-leipzig.de/) as an
online implementation. The webtool (user interface mock in Figure 1d) is
intended for quick manual scoring of a few variants. It does not store
any query information on the server side. All server access logs are
deleted daily by a cron job.
The command line implementation of our algorithm (vcfAutoCaSc) uses
slivar (Pedersen et al., 2021) to pre-filter variants in multi-sample
VCF (Danecek et al., 2011) files and automatically scores all passing
variants. Family information in the standard PED pedigree format can be
provided as input. This allows AutoCaSc to be integrated into existing
pipelines for rare variant analysis, where it can be used to
automatically prioritize candidates in large datasets.
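To illustrate the pedigree input, the sketch below reads the six standard PED columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) and lists affected children for whom both parents are given. The helper name `trios_from_ped` and the sample identifiers are hypothetical; this is not how vcfAutoCaSc or slivar parse pedigrees internally.

```python
from typing import Iterable, List, NamedTuple


class Trio(NamedTuple):
    family: str
    child: str
    father: str
    mother: str


def trios_from_ped(lines: Iterable[str]) -> List[Trio]:
    """Return affected children that have both parents listed in the PED."""
    trios = []
    for line in lines:
        # Skip blank lines and optional header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        family, sample, father, mother, _sex, phenotype = line.split()[:6]
        # PED convention: parent ID "0" means missing, phenotype "2" means affected.
        if father != "0" and mother != "0" and phenotype == "2":
            trios.append(Trio(family, sample, father, mother))
    return trios


if __name__ == "__main__":
    example = [
        "#family_id sample_id paternal_id maternal_id sex phenotype",
        "F1 child dad mom 1 2",
        "F1 dad 0 0 1 1",
        "F1 mom 0 0 2 1",
    ]
    print(trios_from_ped(example))
```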
Synthetic trio benchmark
To evaluate AutoCaSc, we searched for recent publications in
high-quality genetics journals describing novel NDD-associated genes. We
curated a list of 79 variants in 11 novel NDD genes described in 9
publications from a total of 21 reviewed publications (File S1). We
reasoned that due to their novelty at the time of review (2021-01-18),
these would be comparable to candidate variants that a research lab
would evaluate. Next, we used two publicly available trios of healthy
individuals (the CEU and ASH trios from Genome In A Bottle (GIAB) (Zook
et al., 2016)) and incorporated the reviewed variants into their VCF
files according to their segregation. We then applied vcfAutoCaSc to these
synthetic benchmark trios and evaluated the performance by ranking the
passing variants by their CaSc. For the scope of our evaluation, we
defined two variants in compound heterozygous state as one candidate
finding.
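The ranking step can be sketched as follows. The field names (`id`, `casc`, `pair_id`) are hypothetical, and representing a compound heterozygous pair by its higher-scoring variant is an assumption made for this illustration only; the text above states only that such a pair counts as one candidate finding.

```python
from typing import Dict, List


def rank_candidates(variants: List[Dict]) -> List[Dict]:
    """Rank findings by score, counting each compound heterozygous pair once."""
    findings: Dict[str, Dict] = {}
    for variant in variants:
        # Variants sharing a pair_id form one finding; others stand alone.
        key = variant.get("pair_id") or variant["id"]
        best = findings.get(key)
        # Assumption: the higher-scoring variant represents the pair.
        if best is None or variant["casc"] > best["casc"]:
            findings[key] = variant
    return sorted(findings.values(), key=lambda v: v["casc"], reverse=True)


def rank_of(variant_id: str, variants: List[Dict]) -> int:
    """Return the 1-based rank of the finding that contains variant_id."""
    target = next(v for v in variants if v["id"] == variant_id)
    target_key = target.get("pair_id") or target["id"]
    for position, finding in enumerate(rank_candidates(variants), start=1):
        if (finding.get("pair_id") or finding["id"]) == target_key:
            return position
    raise ValueError(f"{variant_id} not found")


if __name__ == "__main__":
    # Made-up scores for illustration only.
    demo = [
        {"id": "a", "casc": 9.0},
        {"id": "b1", "casc": 7.0, "pair_id": "p1"},
        {"id": "b2", "casc": 8.0, "pair_id": "p1"},
        {"id": "c", "casc": 5.0},
    ]
    print([v["id"] for v in rank_candidates(demo)])
```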
Re-evaluation of trio ES
We used AutoCaSc to re-evaluate 93 recent ES trios (TWIST target design)
from our in-house NDD cohort. Evaluation in a diagnostic setting (Zacher
et al., 2021) using the browser-based “Varvis” genomics software
(Limbus Medical Technologies GmbH, Rostock, Germany) revealed a
pathogenic SNV in 16 trios (17.2%) and a pathogenic CNV in three trios
(3.2%). In the remaining cases, no diagnostic variants were reported
and manual analysis in a research setting including the application of
the manual version of CaSc was performed. To evaluate the performance of
vcfAutoCaSc, we produced a cohort multi-sample VCF file (see
Supplementary notes) and annotated it using the pipeline of slivar and
vcfAutoCaSc described above.