INTRODUCTION
Routine genetic analyses and exome sequencing (ES) can clarify
the cause of neurodevelopmental disorders (NDD) in about 30-50% of
affected individuals (Trujillano et al., 2017; Deciphering Developmental
Disorders Study et al., 2020; Shashi et al., 2014; Yavarna et al., 2015;
Chong et al., 2015). Due to the extreme heterogeneity of NDD (1,534
primary genes in the SysID (Kochinke et al., 2016) database as of
2021-12-06), this diagnostic yield is expected to increase with the
identification of novel genetic associations (Chong et al., 2015;
Eilbeck et al., 2017; Kaplanis et al., 2019).
Large screening studies have focused on the identification of de
novo variants (DNV) in NDD (Vissers et al., 2010). The Deciphering
Developmental Disorders (DDD) study (Deciphering Developmental Disorders
Study et al., 2020) evaluated 31,058 trio-exomes of individuals with NDD
with the aim of identifying causal DNVs. By mere statistical enrichment, 28
novel genes (10.0% of 281 total enriched genes) could be associated
with NDD. Cases from the DDD cohort subsequently contributed to
several descriptions of novel NDD associations based on clinically and
functionally guided matchmaking. This shows that a more distributed
approach is needed, one that goes beyond analyzing ever larger cohorts
at only a few centers and instead assesses gene and variant
characteristics in, ideally, all unsolved trio exomes to identify
further NDD entities.
In principle, any rare variant segregating with a disorder can cause a
disease (Najmabadi et al., 2011; Tarpey et al., 2009; Vissers et al.,
2010). Wet lab or model organism analyses of all candidate variants would
be slow and expensive. Therefore, a systematic procedure for
pre-selection of the most promising candidates is needed to close the
gap between the large number of candidates and the experimental proof of
their causality. Evaluators usually consider the predicted effect on
the protein, segregation in the family, minor allele frequency in
population databases such as gnomAD (Genome Aggregation Database
Consortium et al., 2020), functional aspects of the protein, and other
information from the literature and public sources (MacArthur et al.,
2014). This approach is highly evaluator-dependent, making it both
subjective and not scalable. Furthermore, it is time-consuming and requires
expert knowledge, thus deterring many labs from scientific evaluation of
their ES data. A standardized and quick evaluation in a measurable
framework is therefore needed.
Computational approaches have been developed to predict novel disease
genes based on confirmed gene associations (e.g. ToppGene Suite (Chen et
al., 2009), Endeavour (Tranchevent et al., 2016)). Other approaches use
evolutionary constraints to predict the importance of a gene for the
organism in general (e.g. pLI (Exome Aggregation Consortium et al.,
2016)) or the deleteriousness of single genetic variants (e.g. CADD
(Kircher et al., 2014)). Few projects (eXtasy (Sifrim et al., 2013),
PHIVE (Robinson et al., 2014), Phenolyzer (Yang et al., 2015), OMIM
explorer (James et al., 2016), DeepPVP (Boudellioua et al., 2019)) use
both variant- and gene-specific information for candidate prioritization.
A brief review and comparison of these tools showed that they are often
unmaintained, unusable, or lack a graphical user interface altogether
(compare Table S1).
Here we describe the development of a candidate scoring scheme (CaSc)
specifically intended for the NDD research field and based on commonly
used complementary data sources for genes and variants. The web tool
webAutoCaSc makes the scheme broadly accessible, while the command line
tool vcfAutoCaSc is designed for screening cohorts and pipeline integration.
We validated our approach using synthetic trios and real in-house trio
ES data. The successful application of these principles and tools at our
institute resulted in more than 40 descriptions of rare NDD entities.
Based on this experience, we recommend a workflow for
re-evaluation of sequencing data in unsolved NDD trios.