MATERIALS AND METHODS

The Leipzig NDD research cohort

An ongoing study at the Institute of Human Genetics at Leipzig University Medical Center investigates the genetics of rare diseases using high-throughput sequencing (also commonly referred to as next-generation sequencing, NGS). This study was approved by the ethics committee of the University of Leipzig, Germany (224/16-ek and 402/16-ek), and all individuals or their parents or legal guardians consented to genetic testing.
All individuals underwent high-throughput sequencing. Genomic DNA was enriched using either a clinical exome target design (CES; TruSight One v1 panel, 4811 genes, Illumina, Inc., San Diego, CA, USA) or one of several exome target designs (TWIST Human Core Exome Kit, TWIST Bioscience, San Francisco, CA, USA; BGI Exome capture 59M kit, BGI, Shenzhen, China; SureSelect Human All Exon V6, Agilent Technologies, Santa Clara, CA, USA; TruSeq DNA Exome and Nextera Rapid Capture Exomes, Illumina, Inc., San Diego, CA, USA). The resulting libraries were sequenced with paired-end reads (100 bp or 150 bp) on either Illumina (NextSeq 550 or NovaSeq 6000) or BGI (BGISEQ-500) sequencers. Most samples were initially sequenced using an affected-only (“single”) approach. The data were first analyzed in a diagnostic setting. In unsolved cases with research consent and available parental DNA samples, we subsequently complemented the analysis with trio ES and re-evaluated the case.
For research re-evaluation of unsolved cases, specially trained staff at our institute manually re-checked the data to identify candidate genes and variants based on gene annotations and literature/database review. Training and supervision were provided by principal investigators with long-standing experience in molecular genetic diagnostics and NDD research.

Establishing a comparable candidate score (CaSc)

To standardize this approach and make results comparable across genes and between analysts, we established a candidate score (CaSc), a set of criteria for evaluating variants in novel candidate genes for NDD. In brief, the criteria covered variant attributes (missense or loss-of-function (LoF) variants, in silico (computational) prediction tools, conservation, minor allele frequency), inheritance aspects (segregation and zygosity, and the relation to higher pLI scores (Exome Aggregation Consortium et al., 2016) for LoF variants and missense Z-scores (Genome Aggregation Database Consortium et al., 2020)), and the gene’s plausibility for causing NDD. Gene plausibility was assessed by reviewing OMIM and PubMed entries to check whether the gene is involved in neurological processes, using MGI (Bult et al., 2019) and PubMed for animal models, STRING (Szklarczyk et al., 2019) for protein interactions, and GTEx (Consortium, 2020) for gene expression in the central nervous system, as well as a list of resources to determine whether variants in the gene have been described in association with autism spectrum disorder or NDD (the DDD study, the Human Gene Mutation Database (HGMD), ClinVar, cooperation partners, and entries in GeneMatcher). An overview of the manual version of CaSc is provided in Figure S1 and a detailed description in the corresponding preprint (Büttner et al., 2019). Evaluators manually applied these rules to all candidate variants and followed up the highest-scoring, and thus most promising, genes through matchmaking. Evaluating a variant took between 5 and 15 minutes. In moving to the automatic version of CaSc, we made several changes to the scoring logic (see below; detailed in the Supplementary notes).

Automatic Candidate score (AutoCaSc)

The manual scoring process proved suboptimal in terms of time required and inter-evaluator variability when applied without extensive prior training and constant supervision, which are hard to maintain in a university setting with continuous staff turnover. We therefore decided to develop a software tool to remove subjectivity and to enable lasting comparability and quick, continuous re-evaluation of older candidate genes. AutoCaSc is written in the Python programming language, and all code is publicly available through a GitHub repository (https://github.com/JohannKaspar/AutoCaSc) under a Creative Commons license (CC BY-NC-SA 4.0).
The three categories “Inheritance”, “Gene constraint” and “Variant attributes” were implemented as simple decision trees based on an updated CaSc logic (detailed in Table S2). For the “Gene plausibility” category, we reviewed public databases providing information on gene expression, annotated literature, animal models and reported variants. The criteria for including a resource were the uniqueness of its underlying data sources, its completeness, and regular updates or sustained maintenance. We selected the Genotype-Tissue Expression (GTEx (Consortium, 2020)) project for expression data, the Mouse Genome Informatics (MGI (Bult et al., 2019)) database for model organism data, the STRING (Szklarczyk et al., 2019) database for protein-protein interactions, PubTator Central (PTC) (Wei et al., 2019) for annotations of PubMed abstracts, DisGeNET (Piñero et al., 2017) for gene-disease associations (GDA) and PsyMuKB (Lin et al., 2019) for reported de novo variants (DNVs). DisGeNET is a partially redundant source because it calculates GDA from published literature as well as from various repositories and databases. However, we found that integrating DisGeNET improves the gene plausibility score overall because it includes data beyond the literature.
Genomic or transcript coding variants submitted to AutoCaSc are annotated using the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016) via the Ensembl REST (representational state transfer) API (application programming interface). The latest allele counts are queried through the GraphQL API of gnomAD (Genome Aggregation Database Consortium et al., 2020). Gene constraint metrics such as pLI and Z-score (Exome Aggregation Consortium et al., 2016), the predicted impact of the variant on protein function, the mode of inheritance, the GERP++ (Davydov et al., 2010) rankscore (RS), the frequency in gnomAD (Genome Aggregation Database Consortium et al., 2020), the SIFT (Kumar et al., 2009) converted RS, the MutationTaster (Schwarz et al., 2014) converted RS and the MutationAssessor (Reva et al., 2011) RS are then used to calculate the variant score. The variant score is then added to the precalculated gene plausibility score. Please refer to Figure 1b, Table S2 and Table S3 for an overview of all criteria and data sources used.
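As an illustration of the annotation step, the following minimal Python sketch requests a VEP annotation for a single variant in HGVS notation from the public Ensembl REST service. It is not taken from the AutoCaSc code base: the function names, the choice of the HGVS endpoint and the example variant string are assumptions made for this sketch, and AutoCaSc may query the service differently.

```python
import json
import urllib.parse
import urllib.request

# Public Ensembl REST server; the helper names below are hypothetical.
ENSEMBL_REST = "https://rest.ensembl.org"


def vep_hgvs_url(hgvs, species="human"):
    """Build the VEP REST URL for one variant given in HGVS notation."""
    # Percent-encode characters such as '>' while keeping ':' readable.
    encoded = urllib.parse.quote(hgvs, safe=":")
    return f"{ENSEMBL_REST}/vep/{species}/hgvs/{encoded}"


def fetch_vep_annotation(hgvs):
    """Request the VEP annotation as parsed JSON (requires network access)."""
    request = urllib.request.Request(
        vep_hgvs_url(hgvs),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Illustrative variant string only; not a variant from this study.
    print(vep_hgvs_url("ENST00000366667:c.803C>T"))
```

Only the URL construction runs offline; `fetch_vep_annotation` would additionally need error handling and rate limiting before use in a pipeline.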
We provide webAutoCaSc (https://autocasc.uni-leipzig.de/) as an online implementation. The web tool (user interface mock in Figure 1d) is intended for quick manual scoring of a few variants. It does not store any query information on the server side. All server access logs are deleted daily by a cron job.
The command line implementation of our algorithm (vcfAutoCaSc) uses slivar (Pedersen et al., 2021) to pre-filter variants in multi-sample VCF (Danecek et al., 2011) files and automatically scores all passing variants. Family information in the standard PED pedigree format can be provided as input. This allows AutoCaSc to be implemented in existing pipelines for rare variant analysis where it can be used to automatically prioritize candidates in large datasets.
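To illustrate how family information in PED format can be used to identify trios, the following minimal Python sketch parses the six standard PED columns (family ID, individual ID, father ID, mother ID, sex, phenotype) and returns affected children whose parents are both listed. The names `Trio` and `find_trios` are hypothetical; this is not the parsing logic of slivar or vcfAutoCaSc.

```python
from typing import List, NamedTuple


class Trio(NamedTuple):
    family: str
    child: str
    father: str
    mother: str


def find_trios(ped_text: str) -> List[Trio]:
    """Return affected children whose two parents are both in the pedigree."""
    rows = []
    for line in ped_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        if len(fields) < 6:
            raise ValueError(f"expected 6 PED columns, got: {line!r}")
        rows.append(fields[:6])

    # Individuals are identified by (family ID, individual ID).
    known = {(fam, ind) for fam, ind, *_ in rows}

    trios = []
    for fam, ind, father, mother, _sex, phenotype in rows:
        if phenotype != "2":  # in PED, 2 = affected, 1 = unaffected
            continue
        if (fam, father) in known and (fam, mother) in known:
            trios.append(Trio(fam, ind, father, mother))
    return trios
```

A missing parent is conventionally coded as `0` in PED files, so affected singletons are skipped by this sketch as long as no individual is itself named `0`.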

Synthetic trio benchmark

To evaluate AutoCaSc, we searched for recent publications in high-quality genetics journals describing novel NDD-associated genes. We curated a list of 79 variants in 11 novel NDD genes described in 9 publications from a total of 21 reviewed publications (File S1). We reasoned that, due to their novelty at the time of review (2021-01-18), these would be comparable to candidate variants that a research lab would evaluate. Next, we used two publicly available trios of healthy individuals (CEU and ASH trio from Genome In A Bottle (GIAB) (Zook et al., 2016)) and incorporated the reviewed variants into these VCF files according to their segregation. We then applied vcfAutoCaSc to these synthetic benchmark trios and evaluated the performance by ranking the passing variants by their CaSc. For the scope of our evaluation, we defined two variants in compound heterozygous state as one candidate finding.
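The ranking step, in which a compound heterozygous pair counts as a single candidate finding, can be sketched in Python as follows. The data structure, the function names and the rule of representing a pair by the higher of its two scores are assumptions made for this illustration; the source does not state how vcfAutoCaSc assigns a score to a pair.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoredVariant:
    variant_id: str
    gene: str
    score: float
    # variant_id of the second allele if compound heterozygous, else None
    comphet_partner: Optional[str] = None


def rank_candidates(variants):
    """Rank candidate findings by descending score.

    Two variants in compound heterozygous state form one finding.
    """
    findings = {}
    for v in variants:
        if v.comphet_partner is None:
            key = (v.variant_id,)
        else:
            key = tuple(sorted((v.variant_id, v.comphet_partner)))
        # Assumption: a pair is represented by the higher of its two scores.
        findings[key] = max(findings.get(key, float("-inf")), v.score)
    return sorted(findings.items(), key=lambda item: item[1], reverse=True)


def rank_of(variant_id, ranking):
    """Return the 1-based rank of the finding containing variant_id, or None."""
    for position, (key, _score) in enumerate(ranking, start=1):
        if variant_id in key:
            return position
    return None
```

With this representation, the rank of a spiked-in benchmark variant among all passing variants of a trio can be read off directly with `rank_of`.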

Re-evaluation of trio ES

We used AutoCaSc to re-evaluate 93 recent ES trios (TWIST target design) from our in-house NDD cohort. Evaluation in a diagnostic setting (Zacher et al., 2021) using the browser-based “Varvis” genomics software (Limbus Medical Technologies GmbH, Rostock, Germany) revealed a pathogenic SNV in 16 trios (17.2%) and a pathogenic CNV in three trios (3.2%). In the remaining cases, no diagnostic variants were reported, and manual analysis was performed in a research setting, including application of the manual version of CaSc. To evaluate the performance of vcfAutoCaSc, we produced a cohort multi-sample VCF file (see Supplementary notes) and annotated it using the above-described pipeline of slivar and vcfAutoCaSc.