INTRODUCTION

Routine genetic analyses and exome sequencing (ES) clarify the cause of neurodevelopmental disorders (NDD) in about 30-50% of affected individuals (Trujillano et al., 2017; Deciphering Developmental Disorders Study et al., 2020; Shashi et al., 2014; Yavarna et al., 2015; Chong et al., 2015). Given the extreme heterogeneity of NDD (1,534 primary genes in the SysID (Kochinke et al., 2016) database as of 2021-12-06), this diagnostic yield is expected to increase as novel genetic associations are identified (Chong et al., 2015; Eilbeck et al., 2017; Kaplanis et al., 2019).
Large screening studies have focused on the identification of de novo variants (DNV) in NDD (Vissers et al., 2010). The Deciphering Developmental Disorders (DDD) study (Deciphering Developmental Disorders Study et al., 2020) evaluated 31,058 trio exomes of individuals with NDD to identify causal DNVs. By statistical enrichment alone, 28 novel genes (10.0% of the 281 enriched genes) could be associated with NDD. Cases from the DDD cohort subsequently contributed to several descriptions of novel NDD associations based on clinically and functionally guided matchmaking. A more distributed approach is therefore needed, one that goes beyond analyzing ever larger cohorts at only a few centers and instead assesses gene and variant characteristics in, ideally, all unsolved trio exomes to identify further NDD entities.
In principle, any rare variant segregating with a disorder can be disease-causing (Najmabadi et al., 2011; Tarpey et al., 2009; Vissers et al., 2010). Wet lab or model organism analyses of all candidate variants would be slow and expensive. A systematic procedure for pre-selecting the most promising candidates is therefore needed to close the gap between the large number of candidates and the experimental proof of their causality. Scientific evaluators usually consider the predicted effect on the protein, segregation in the family, minor allele frequency in population databases such as gnomAD (Genome Aggregation Database Consortium et al., 2020), functional aspects of the protein, and other information from the literature and public sources (MacArthur et al., 2014). This approach depends heavily on the evaluator, which makes it both subjective and difficult to scale. Furthermore, it is time-consuming and requires expert knowledge, which deters many labs from scientific evaluation of their ES data. A standardized and quick evaluation within a measurable framework is therefore needed.
Computational approaches have been developed to predict novel disease genes based on confirmed gene associations (e.g. ToppGene Suite (Chen et al., 2009), Endeavour (Tranchevent et al., 2016)). Other approaches use evolutionary constraints to predict the importance of a gene for the organism in general (e.g. pLI (Exome Aggregation Consortium et al., 2016)) or the deleteriousness of single genetic variants (e.g. CADD (Kircher et al., 2014)). Few projects (eXtasy (Sifrim et al., 2013), PHIVE (Robinson et al., 2014), Phenolyzer (Yang et al., 2015), OMIM explorer (James et al., 2016), DeepPVP (Boudellioua et al., 2019)) use both variant-specific and gene-specific information for candidate prioritization. A brief review and comparison of the tools that combine gene and variant data for candidate gene prediction showed that they are often unmaintained, unusable, or lack a visual user interface altogether (compare Table S1).
Here we describe the development of a candidate scoring scheme (CaSc) specifically intended for the NDD research field and based on commonly used, complementary data sources for genes and variants. The web tool webAutoCaSc enables broad applicability, while the command line tool vcfAutoCaSc is designed for screening cohorts and pipeline integration. We validated our approach using synthetic trios and real in-house trio ES data. The successful application of these principles and tools at our institute has resulted in more than 40 descriptions of rare NDD entities. Based on this experience, we recommend a workflow for re-evaluation of sequencing data in unsolved NDD trios.