GEM AI outperforms variant prioritization approaches
We benchmarked GEM, an AI-based eCDSS, using a cohort of 119 pediatric retrospective cases from Rady Children’s Institute for Genomic Medicine (RCIGM; benchmark cohort). Most of these were critically ill NICU infants who received genomic sequencing for diagnosis of genetic diseases. All had been diagnosed with one or more Mendelian conditions using a combination of manual filtering and variant prioritization approaches (“Methods”). To further validate performance, we also analyzed a second cohort comprising 60 non-NICU, rare disease patients from five different academic medical centers (validation cohort). Finally, we reanalyzed a set of 14 previously analyzed probands that had remained undiagnosed by WGS. Our goal was to evaluate the ability of GEM to identify new diagnoses in these previously unsolved cases, without providing false positive findings that would result in time-consuming case reviews. To provide context for our performance benchmarks, we also ran three commonly used variant prioritization tools: VAAST, Phevor, and Exomiser.
The benchmark and validation cohorts included singleton probands, parent-offspring trios, different modes of inheritance, and both small causal variants (SNVs, and small insertions or deletions, indels; Table 1; Additional file 1: Table S1) and large structural variants (SV), some of which were causative (Table 2). In these retrospective analyses, we considered the variants, disease genes, and conditions that were included as primary findings in the clinical report as the “gold standard” truth set.
GEM gene scores are Bayes factors (BF); these were used to rank gene candidates (Additional file 2: Figure S1). BFs are widely used in AI, as they concisely quantify the degree of support for a conclusion derived from diverse lines of evidence. In keeping with established practices, a BF of 0–0.69 was considered moderate support, 0.69–1.0 substantial support, 1.0–2.0 strong support, and above 2.0, decisive support. Scores less than 0 indicated support for the counter hypothesis, namely that variants in that gene were not causal for the proband’s disease. GEM outputs also include several annotations and metrics that provide additional, supportive guidance for subsequent expert case review (Additional file 2: Figure S1). Experience has shown that such guidance is critical for adoption by experts who wish to review the evidence supporting automated variant assertions. The metrics include VAAST, VVP, and Phevor posterior probabilities, conditioned upon proband sex, gene location, and ancestry. The annotations include variant consequence, ClinVar database pathogenicity assessments, and OMIM conditions associated with genes. This metadata enables expert users to review the major contributions underpinning a final GEM score. Moreover, GEM prioritizes diplotypes, rather than variants, which speeds interpretation of compound heterozygous variants in recessive diseases (Additional file 2: Figure S1B). To compare the diagnostic performance of GEM with that of variant prioritization methods, we used the rank of the correct diagnostic gene. We assumed that in the case of compound heterozygotes, variant prioritization methods such as Exomiser would rank one variant of the pair highly, leading to identification of the other upon manual review (“Methods”).
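The support categories above can be written as a small lookup. The following Python sketch is illustrative only and is not part of GEM; the function name is hypothetical, and because the text does not say which band a score exactly on a boundary belongs to, the handling of 0.69 and 1.0 is an assumption.

```python
def bf_support_category(bf: float) -> str:
    """Map a gene score (Bayes factor) to the support category named in the text.

    Assumption: a score exactly at 0.69 or 1.0 is placed in the higher band.
    A score of exactly 2.0 is "strong", because only scores above 2.0 are
    described as decisive.
    """
    if bf < 0:
        return "counter-hypothesis"  # support for the gene NOT being causal
    if bf < 0.69:
        return "moderate"
    if bf < 1.0:
        return "substantial"
    if bf <= 2.0:
        return "strong"
    return "decisive"


# Example scores chosen only to exercise each band.
for score in (-0.6, 0.5, 0.8, 1.5, 2.7):
    print(score, bf_support_category(score))
```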
GEM ranked 97% of previously reported causal gene(s) and variant(s) among the top 10 candidates in the 119 benchmark cohort cases. In 92% of cases, it ranked the correct gene and variant in the top 2 (Fig. 1A). By comparison, the next best algorithm, Phevor, identified 73% of causal variants in the top 10 candidates and 59% in the top 2. GEM, Phevor, and Exomiser prioritize results by patient phenotypes (provided as HPO terms) in addition to variant pathogenicity, whereas VAAST only utilizes genotype data, explaining its lower performance. Thus, these data also highlight that patient phenotypes improve the diagnostic performance of automated interpretation tools.
The benchmark cohort included 3 cases for which two genes were reported to contribute to the patient phenotype. This rate (2.5%) is consistent with previous reports for digenic inheritance. The statistics above use the top ranked genes in these cases, but Additional file 1: Table S3 shows that GEM also ranked the second causal gene among its top candidates, whereas Phevor reported poor ranks in one case, and Exomiser missed the second gene in two out of the three cases.
Next, we investigated whether the diagnostic performance of GEM extended to Mendelian diseases other than those of NICU infants, such as patients with later disease onset, less severe presentations, or with data produced by other variant calling pipelines or outpatient genetic clinics. For these analyses, we compiled a validation cohort largely consisting of WES cases from five different academic medical centers (Table 1; Additional file 1: Table S2). The diagnostic performance of GEM in the validation cohort was almost identical to that in the benchmark cohort (Fig. 1B). These data demonstrated that the diagnostic performance of GEM was not dependent on disease severity, age of onset, or genomic sequencing or variant detection methods.
An implication of these findings is that GEM achieved 97% recall (true positive rate) by review of 10 genes, whereas the other tools had < 78% recall by similar review (Fig. 1, Additional file 2: Figure S2). In part, this difference reflected the unique ability of GEM to prioritize SVs. Excluding SV cases, GEM, Phevor, and Exomiser achieved recall of 97%, 83%, and 76%, respectively, by review of 10 genes (Additional file 2: Figure S3A). Furthermore, VAAST and Exomiser failed to provide rankings for 4 and 18 true positive variants, respectively. Exclusion of false negatives and SV cases increased the top 10 recall of Exomiser to 93% (Additional file 2: Figure S3B), in agreement with previous reports. These data show the importance of including all types of cases and causal variants in benchmarking to avoid overestimation of diagnostic performance in real-world clinical applications.
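The effect of excluding unranked true positives on top-k recall can be illustrated with a short Python sketch. The function name and the toy ranks below are invented for illustration and are not data from this study; a rank of None stands for a case in which a tool returned no ranking for the causal gene.

```python
from typing import Optional, Sequence


def recall_at_k(ranks: Sequence[Optional[int]], k: int,
                drop_unranked: bool = False) -> float:
    """Fraction of cases whose causal gene is ranked within the top k.

    ranks holds the 1-based rank of the causal gene in each case, or None
    when the tool produced no ranking for it (a false negative).
    With drop_unranked=True the unranked cases are removed from the
    denominator, which is the practice the text warns against.
    """
    if drop_unranked:
        ranks = [r for r in ranks if r is not None]
    if not ranks:
        raise ValueError("no cases to evaluate")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Toy ranks (hypothetical): ten cases, two of them unranked.
toy_ranks = [1, 2, 1, 14, None, 3, None, 7, 1, 25]
all_cases = recall_at_k(toy_ranks, k=10)                       # 6 of 10 cases
ranked_only = recall_at_k(toy_ranks, k=10, drop_unranked=True)  # 6 of 8 cases
print(all_cases, ranked_only)
```

In this toy example, dropping the two unranked cases raises the apparent top 10 recall from 0.6 to 0.75 without any change in the tool's output.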
Scoring of structural variants increases diagnostic rate
A major barrier to the incorporation of SV calls into genome diagnostic interpretation, whether manual or using eCDSS, is their low precision (high false positive, FP, rates) using short read alignments, with typical FP rates of 20–30% [87, 88]. This leads to overwhelmingly time-consuming, manual assessment of event quality and significance for large numbers of SVs. GEM minimizes the effect of low precision by scoring SVs using SV calls provided in the proband’s input VCF file and/or by inferring their existence ab initio from metadata associated with SNV and indel calls (“Methods”; see below). The benchmark cohort included 20 cases in which SVs were reported to be causative, reflecting a similar incidence to that in real-world experience (Fig. 1A, Table 2) [20,21,22,23]. In 17 of these, the causative SV was ranked first by GEM. In two, it was ranked second, and in one, it was ranked fourth, demonstrating that GEM retains adequate diagnostic performance with imprecise SV calls. The disease-causing SVs in the benchmark set ranged from small (4 kb) to very large (e.g., entire chromosome arms). In three cases, the diagnosis was of an autosomal recessive disorder in which the SV was compound heterozygous with a SNV/indel. In each, GEM integrated the two variants correctly, automatically identifying the causative diplotypes (Additional file 2: Figure S5). With regard to the diagnostic specificity of GEM, the mean and median number of gene candidates for these probands with BF > 0 (any support) was 8.7 and 9.5, respectively, which was similar to probands whose VCF files contained no SVs, causative or otherwise.
Large SVs frequently affect more than one gene. For consistency with other variant classes, genes within multigenic SVs are grouped and sorted by GEM based upon the gene-centric Bayes factor score associated with the overlap of the proband phenotype and known Mendelian disorders (if any) associated with them (“Methods”). For example, Additional file 2: Figure S4 shows a case that highlights the practical utility of prioritizing genes harboring causative SVs together with SNVs and short indels in the same report, rather than separately cross-referencing with databases of microdeletion syndromes. While it is often unknown which genes harbored in a pathogenic SV are causal for microdeletion/microduplication syndromes, GEM’s gene-by-gene rankings typically agreed with causal gene candidates suggested by the literature (asterisks in Table 2).
By default, GEM evaluates every gene and transcript for the presence of overlapping SVs. Notably, four benchmark cases did not include externally called SVs in their input VCFs (these had been previously diagnosed by manual inspection and orthogonal confirmatory tests; Table 2). Nevertheless, GEM inferred the existence of these four SVs using its ab initio SV identification algorithm and evaluated them jointly with SNVs and indels (“Methods”). To further demonstrate this functionality, we removed all external SV calls from each input VCF file of the 14 WGS cases (as GEM’s ab initio SV imputation is currently limited to WGS data) and reran GEM. GEM re-identified 13 of the 14 causative SVs. Although GEM’s inferred SV termini were imprecise, an overlapping SV of the same class (duplication, deletion, or CNV) and ploidy as that in the original VCF was inferred, and the same high scoring gene and mode of inheritance/genotype (autosomal dominant, simple recessive, or compound heterozygote) was ranked first. SV recall within the top 1, 5, and 10 ranked GEM results was 71%, 86%, and 93%, respectively. The single false negative was a small (4 kb) homozygous deletion. GEM failed to identify this SV because it did not span sites with known variation in the gnomAD database, upon which ab initio SV inference is based (“Methods”). With regard to specificity, the mean and median number of genes with BF > 0 in these cases was 10.6 and 12.5, respectively. These values differed only slightly from the results obtained using external SV calls (8.7 and 9.5, respectively), despite the fact that every gene and transcript was evaluated for the presence of SVs.
Collectively, these results demonstrate the accuracy of GEM’s ab initio approach to identification and prioritization of SVs without recourse to external calls and databases of known causative SVs. Thus, GEM compensates, in part, for the low recall of SVs from short-read sequences. If an external SV calling pipeline fails to detect an SV, there is still the possibility that GEM will identify it via this ab initio approach. This capability, together with GEM’s ability to accurately prioritize SVs in the context of SNVs and short indels, addresses an unmet need for clinical applications. This characteristic also makes GEM well suited for reanalyses of older cases and/or pipelines lacking SV calling.
Leveraging automated phenotyping from clinical natural language processing
Ontology-based phenotype descriptions, using Human Phenotype Ontology (HPO) terms, are widely used to communicate the observed clinical features of disease in a machine-readable format. These lists of terms are usually derived by manual review of patient EHR data by trained personnel, a time-consuming, subjective process. A solution is automatic extraction of patient phenotypes from clinical notes using clinical natural language processing (CNLP) [28, 90]. One challenge has been that CNLP generates many more terms than manual extraction: manual curation yielded an average of 4 HPO terms (min = 1, max = 12) in the benchmark cohort, while CNLP yielded an average of 177 HPO terms (min = 2, max = 684). Some of the extra CNLP terms are hierarchical parent terms of those observed, raising the concern that their inclusion diminishes the average information content in a manner that could impede diagnosis. To investigate the effect of CNLP-derived HPO terms on GEM’s performance, we analyzed the benchmark cohort both with HPO terms extracted by commercial CNLP (“Methods”) and manually extracted HPO terms.
Figure 2 shows the distributions and medians for ranks and GEM gene scores of true positives, as well as the number of gene candidates with BF ≥ 0.69 (moderate support), for manual and CNLP terms. The median rank of the causal genes did not significantly differ between CNLP- and manually derived phenotype descriptions (Fig. 2A). The median GEM gene score of true positives was higher with CNLP-derived phenotypes than with manual phenotypes (Fig. 2B). The number of candidates above the BF threshold was higher with manual phenotypes than with CNLP-derived phenotypes (Fig. 2C). Compared to manual phenotype descriptions, CNLP rescued a few true positives that had low ranks and negative BF scores (Fig. 2A, B). These results demonstrate that GEM performs somewhat better with CNLP-derived phenotype descriptions, as part of an automated interpretation workflow, than with sparse manual phenotypes.
Resilience to mis-phenotyping and gaps in clinical knowledge
Given the potentially noisy nature of the CNLP phenotype descriptions, it was important to examine the sensitivity of GEM to mis-phenotyping. To address this question, we randomly permuted CNLP-extracted HPO terms between cases, weighting by term frequency within the cohort, so that every case maintained the same number of HPO terms as CNLP originally provided. Permuting HPO terms resulted in lower gene scores, and several cases would have been lost for review had the gene score threshold of BF ≥ 0 still been used, but ranks were unaffected (98% in top 10; Fig. 3). These results represent lower bound estimates of performance, as actual mis-phenotyping (short of data tracking issues) would be much less severe. It is also worth noting that even using randomly permuted phenotype descriptions, GEM’s performance still exceeded that of Phevor and Exomiser using the correct phenotypes (Additional file 2: Figure S2). We therefore conclude that GEM is resilient to mis-phenotyping.
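One way to realize a frequency-weighted permutation of this kind is sketched below in Python. This is an illustration under stated assumptions rather than the procedure used in the study: the function name is hypothetical, each case is assumed to carry a list of distinct HPO term identifiers, and every case is given a fresh draw from the cohort-wide term pool (weighted by how many cases carry each term, without replacement within a case) rather than an exchange of terms between specific cases. The HPO identifiers in the toy cohort are placeholders.

```python
import random
from collections import Counter
from typing import Dict, List


def permute_hpo_terms(cases: Dict[str, List[str]], seed: int = 0) -> Dict[str, List[str]]:
    """Give each case a random term set of its original size.

    Terms are drawn from the cohort-wide pool with probability proportional
    to the number of cases carrying them, without replacement within a case.
    """
    rng = random.Random(seed)
    # Cohort frequency: number of cases in which each term appears.
    freq = Counter(term for terms in cases.values() for term in set(terms))
    permuted: Dict[str, List[str]] = {}
    for case_id, terms in cases.items():
        pool = dict(freq)  # fresh copy so draws are independent per case
        drawn: List[str] = []
        for _ in range(len(set(terms))):
            candidates = list(pool)
            weights = [pool[t] for t in candidates]
            pick = rng.choices(candidates, weights=weights, k=1)[0]
            drawn.append(pick)
            del pool[pick]  # no repeated term within one case
        permuted[case_id] = drawn
    return permuted


# Toy cohort (hypothetical cases; identifiers are placeholders).
toy_cases = {
    "case1": ["HP:0001250", "HP:0001263", "HP:0000252"],
    "case2": ["HP:0001250", "HP:0004322"],
    "case3": ["HP:0001263", "HP:0001250", "HP:0001629", "HP:0000252"],
}
shuffled = permute_hpo_terms(toy_cases, seed=1)
print({case_id: len(terms) for case_id, terms in shuffled.items()})
```

Fixing the seed makes the permutation reproducible, which is convenient when the permuted cohort has to be rerun through several tools.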
We also evaluated the impact of gaps in clinical knowledge on GEM performance by withdrawing annotations from a key clinical database, ClinVar. Absence of ClinVar annotations had minimal impact in ranking, although it reduced median gene scores (1.1 vs. 2.7), resulting in 9 cases no longer meeting the minimum Bayes factor threshold ≥ 0 (any support; Fig. 3). Clearly, ClinVar provided GEM with valuable information. Nonetheless, without ClinVar, GEM’s top 10 maximal recall (88%) still exceeded that of Phevor (72%) and Exomiser (65%; Fig. 1). More broadly, these results show that integrating more datatypes in GEM improves diagnostic performance and results in greater algorithmic stability (Figs. 2 and 3).
About 70% (86/122) of the disease-causing variants in the benchmarking dataset are reported in ClinVar with pathogenic (P) or likely pathogenic (LP) clinical significance annotations. Moreover, each proband’s whole-genome variant set contained on average 1.9 variants with ClinVar P/LP annotations. These two facts underscore the importance of ClinVar annotations for assisting diagnosis. They also make clear that tools that leverage ClinVar information need to avoid false positives, which lead to longer candidate lists, because non-causal genes also contain ClinVar P/LP variants. Additional file 1: Table S4 breaks down results for the benchmark cohort with respect to ClinVar annotations of causal variants. Overall, mean and median ranks were slightly improved for diagnostic variants with ClinVar P/LP annotations vs. those without them (mean 1 vs. 3), with GEM showing the greatest improvement in ranks. Moreover, GEM maintained the same number of candidates with GEM gene score > 0 for both classes, demonstrating that GEM can use ClinVar status to improve diagnostic rates without increasing the number of candidates for review.
GEM performs equivalently on parent-offspring trios and single probands
Parent-offspring trios are widely used for molecular diagnosis of rare genetic disease. While a recent study showed that singleton proband sequencing returned a similar diagnostic yield to trios, interpretation of trio sequences is less labor-intensive. For example, trios enable facile identification of de novo variants, which are the leading mechanism of genetic disease in outbred populations. Likewise, in recessive disorders, proband compound heterozygosity can be automatically distinguished from two variants in cis. However, these benefits are associated with increased sequencing costs. Moreover, both parents are not always available for sequencing, or they may not wish to have their genomes sequenced.
To understand how GEM performs in the absence of parental data, we reanalyzed the 63 trio and duo cases from the benchmark cohort as singleton proband cases. Surprisingly, we observed no significant differences in the mean rank of the causal gene (Fig. 4A), GEM score of the causal gene (Fig. 4B), or number of candidates with BF ≥ 0.69 (Fig. 4C), using either manually or CNLP-extracted HPO terms. In contrast, this reanalysis was associated with a decline in the performance of Exomiser (Additional file 2: Figure S6). These analyses demonstrated that GEM was resilient to the absence of parental genotypes, a feature that could increase the cost effectiveness and adoption of WGS.
GEM scores optimize case review workflows
Conventional prioritization algorithms rank variants, enabling manual reviewers to start with the top ranked variants, and work their way down in the list until a convincing variant is identified for further curation, classification, and possible clinical reporting. This review process typically involves (a) assessing variant quality, deleteriousness, and prior clinical annotations; (b) evaluating whether there is a reasonable match between the phenotypes exhibited by the patient and those reported for condition(s) known to be associated with defects in the corresponding gene; and (c) considering the match in mode(s) of inheritance reported in the literature for the candidate disease and the patient’s diplotype.
GEM accelerates this process, because it intrinsically considers variant quality, deleteriousness, prior clinical annotations, and mode of inheritance. Furthermore, at manual review, GEM gene scores summarize the relative strength of evidence supporting the hypothesis that the gene is damaged and that this explains the proband’s phenotype.
GEM scores provide a logical framework for setting thresholds with regard to the optimal number of candidates that should be reviewed to achieve a desired diagnostic rate. This enables laboratory directors and clinicians to dynamically set optimal tradeoffs of interpretation time and diagnosis rate for specific patients, relative to their suspicion of a genetic etiology or results of other diagnostic tests.
We examined the effect of different BF thresholds on recall (true positive rate) and median number of gene candidates for review in the benchmark cohort (Fig. 5). In such analyses, it is germane to consider the concept of maximal true positive rate (or recall) to measure the theoretical proportion of true positive diagnoses recoverable by perfect interpretation when reviewing a set of N genes containing the true positive. For example, in the benchmark dataset, a GEM causal gene score threshold ≥ 0 would retain a median of ten candidates for review and provide a 99% maximal recall; whereas a threshold of ≥ 0.5 would retain a median of four candidates for review for a 97% maximal recall (Fig. 5).
These results illustrate how a tiered approach to case review using GEM gene scores could minimize the number of candidate genes to review and thereby the manual interpretation effort. For example, a first pass review of candidates with a gene BF ≥ 0.69 provided an expected 95% diagnostic rate (and a corresponding median of 3 genes to be manually reviewed). If no convincing candidates were found, a second pass using a threshold > 0 would recover an additional 4% of possible diagnoses, involving review of a median increment of seven genes. Application of this two-tiered approach to the benchmark dataset of 119 cases (Fig. 1) required manual final review of 395 candidate genes (3 genes in 115 cases and 10 genes in 5 cases), or an average of 3.3 candidate genes per case, for a maximal recall of 99%. Finally, review of candidates with BF < 0 recovered the last true positive in the benchmark cohort (COL4A4, ranked 40th in the GEM report with a BF = −0.6). This case was a phenotypically and genotypically atypical autosomal dominant presentation of Alport syndrome 2 (MIM 203780).
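The two-tiered review described above can be expressed as a short Python sketch that counts, for each case, how many candidate genes a reviewer would examine and whether the causal gene falls within the reviewed set. The function names, gene labels, and scores are hypothetical and are not benchmark data. The sketch assumes that a reviewer examines every candidate in a tier before moving on, and it uses the thresholds quoted in the text (≥ 0.69 for the first pass, > 0 for the second).

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, float]  # (gene label, gene score)


def tiered_review(candidates: List[Candidate], causal_gene: str,
                  first_pass: float = 0.69,
                  second_pass: float = 0.0) -> Tuple[int, bool]:
    """Return (number of genes reviewed, whether the causal gene was reviewed)."""
    tier1 = [gene for gene, bf in candidates if bf >= first_pass]
    if causal_gene in tier1:
        return len(tier1), True
    # Second pass: remaining candidates with a score above second_pass.
    tier2 = [gene for gene, bf in candidates if second_pass < bf < first_pass]
    return len(tier1) + len(tier2), causal_gene in tier2


def summarize(cohort: Dict[str, Tuple[List[Candidate], str]]) -> Tuple[float, float]:
    """Return (mean genes reviewed per case, maximal recall) for a cohort."""
    results = [tiered_review(cands, causal) for cands, causal in cohort.values()]
    mean_reviewed = sum(n for n, _ in results) / len(results)
    recall = sum(1 for _, found in results if found) / len(results)
    return mean_reviewed, recall


# Toy cohort (hypothetical): candidate list and causal gene per case.
toy_cohort = {
    "case1": ([("GENE_A", 2.1), ("GENE_B", 0.9), ("GENE_C", 0.3), ("GENE_D", -0.2)], "GENE_A"),
    "case2": ([("GENE_E", 0.8), ("GENE_F", 0.4), ("GENE_G", 0.1)], "GENE_F"),
    "case3": ([("GENE_H", 1.2), ("GENE_I", -0.6)], "GENE_I"),
}
print(summarize(toy_cohort))
```

In the toy cohort, the first case is resolved in the first pass, the second needs the second pass, and the third would be recovered only by reviewing candidates with negative scores, mirroring the three situations described in the text.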
Clinical decision support for diagnosis
Quantifying how well the observed phenotypes in a patient match the expected phenotypes of Mendelian conditions associated with a candidate gene is challenging for clinical reviewers and is a major interpretation bottleneck. In practice, clinicians look for patterns of phenotypes, biasing their observations. In addition, patient phenotypes evolve as their disease progresses. And there is considerable, disease-specific heterogeneity in the range of expected phenotypes. Simply comparing exact matches of the patient’s observed HPO terms with those expected for that disease is suboptimal, because the observed and expected HPO terms are often hierarchical neighbors, rather than exact matches. Missing terms, particularly those considered pathognomonic for a condition, and “contradictory” terms further complicate such comparisons by clinicians. Thus, generation of quantitative, standardized, unbiased models of disease similarity has proven elusive.
GEM can automate or provide clinical decision support for this process via a condition match (CM) score (“Methods”). The GEM CM score summarizes the match between observed and expected HPO phenotypes for genetic diseases and considers the known mode(s) of inheritance, associated gene(s), their genome location(s), proband sex, the pathogenicity of observed diplotypes, and ClinVar annotations. Importantly, CM scores reflect relationships between phenotype terms as expressed in the HPO ontology graph, enabling inclusion of imprecise matches in similarity comparisons. CM scores can be used in a wide variety of clinical settings to prioritize and quickly assess possible Mendelian conditions as candidate diagnoses, a process we term diagnostic nomination.
Specific, definitive genetic disease diagnosis remains a significant challenge for clinical reviewers, even with the short, highly informative candidate gene lists provided by tools such as GEM. This is because many genes are associated with more than one Mendelian disease. For example, application of a GEM causal gene score threshold ≥ 0.69 to the 119 probands in the benchmark cohort results in a median of 3 gene candidates per proband (cf. Fig. 5), associated with a maximal gene recall of 95%. However, because of these one-to-many gene-disease associations, clinical reviewers would actually need to consider around 12 candidate Mendelian conditions per proband (data not shown). This difficulty is exacerbated by the fact that most laboratory directors are not physicians and lack formal training in clinical diagnosis.
Determination of a specific, definitive genetic disease diagnosis among several candidates can be accomplished with a combination of GEM CM scores and causal gene scores (Fig. 6). Using the benchmark cohort’s true (reported) gene and disorder diagnoses as ground truth, we used a GEM gene score threshold ≥ 0.69 to recover gene candidates, and the associated CM scores to rank order the diseases associated with those gene candidates (Fig. 6A). Using CNLP-derived phenotypes, the true disease diagnosis was the top nomination by CM score in 75% of cases, within the top 5 in 91% of cases, and within the top 10 in 95% of cases. Performance was inferior with manually extracted phenotype terms. The areas under the receiver-operator characteristic (ROC) curves (AUCs) were 0.90 and 0.88 for CNLP and manual terms, respectively (Fig. 6B). This implied that the larger number of CNLP-extracted terms conveyed greater information content, permitting better discrimination of the correct diagnostic condition than sparse, manually extracted phenotypes.
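The nomination step described above, filtering gene candidates by gene score and then ordering their associated conditions by CM score, can be sketched in Python as follows. The record type, function name, gene and condition labels, and all scores are hypothetical; the sketch shows only the filter-then-rank logic and makes no claim about how CM scores themselves are computed.

```python
from typing import List, NamedTuple


class Nomination(NamedTuple):
    gene: str
    condition: str
    gene_bf: float   # gene score (Bayes factor) of the candidate gene
    cm_score: float  # condition match score for this gene-condition pair


def nominate_conditions(rows: List[Nomination],
                        bf_threshold: float = 0.69) -> List[Nomination]:
    """Keep conditions whose gene meets the score threshold, best CM score first."""
    kept = [row for row in rows if row.gene_bf >= bf_threshold]
    return sorted(kept, key=lambda row: row.cm_score, reverse=True)


# Toy records (hypothetical values); one gene maps to two conditions.
toy_rows = [
    Nomination("GENE_A", "Condition 1", 1.8, 0.41),
    Nomination("GENE_A", "Condition 2", 1.8, 0.87),
    Nomination("GENE_B", "Condition 3", 0.9, 0.62),
    Nomination("GENE_C", "Condition 4", 0.2, 0.95),  # gene score below threshold
]
ranked = nominate_conditions(toy_rows)
for row in ranked:
    print(row.condition, row.gene, row.cm_score)
```

Note that the condition with the highest CM score in the toy data is not nominated, because its gene does not pass the gene score threshold; the two scores are applied in sequence rather than combined into one value.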
In the benchmark cohort, 58 of the 100 candidate genes (excluding cases with causal, multigenic SVs) were associated with 2 or more disorders (median of 3 gene-disorder associations, maximum of 15; Additional file 2: Figure S7 shows the example of ERCC6). We measured how well the CM score distinguished between multiple, alternative disorders associated with the same gene (Fig. 6B). In these 58 cases, the AUC was lower than that for CNLP with the entire set of candidate genes in the benchmark cohort (0.68 vs. 0.90). This decrease can be at least partially explained by the high similarity (and in some cases identity) of the clinical features of different disorders associated with the same gene. Thus, a combination of GEM gene and CM scores can refine candidate disorders for clinical reporting, further reducing review times.
Reanalysis of previously unsolved cases
Recent reports show that reanalysis of older unsolved cases suspected of rare genetic disease can yield new diagnoses supported by incremental increases in knowledge of pathogenic variants, disease-gene discoveries, and reports of phenotype expansion for known disorders [93, 94]. While worthwhile, there are barriers to reanalysis, such as limited reimbursement and low incremental diagnostic yield, that limit use to physician requests. Ideally, all unsolved cases would be reanalyzed automatically at regular intervals, and a subset with high likelihood of new findings would be prioritized for manual review. The strong correlation between true positive rates and GEM gene scores (Fig. 5) suggested a strategy for triaging reanalyzed cases for manual review: only cases for which the recalculated GEM score had increased sufficiently to suggest a high probability of a new diagnosis would pass the threshold for manual review. Likewise, GEM condition match scores could be used to search all prior cases to identify the subset of unsolved cases with support for particular Mendelian conditions, aiding cohort assembly for targeted reanalysis based upon particular proband phenotypes, or for review by particular medical specialists. Of note, an advantage of CNLP is that it is possible to automatically generate a new clinical feature list at time of reanalysis. This is particularly important in disorders whose clinical features evolve with time and where the observed features may be nondescript at presentation.
To test the utility of GEM for reanalysis, we selected 14 unsolved cases that had rWGS performed by RCIGM. For these reanalyses, we used CNLP-derived HPO terms (Table 3) and a more stringent gene BF threshold ≥ 1.5 to restrict the search to very strongly supported candidates. Ten cases yielded no hits. Four cases returned a total of 7 candidate genes. Review of three cases did not return new diagnoses. In the remaining case, a new likely diagnosis was made of autosomal dominant Shwachman-Diamond Syndrome (MIM: 260400) or severe congenital neutropenia (MIM: 618752) [95, 96], both of which are associated with pathogenic variants in SRP54. The respective CM scores using 261 CNLP-derived terms were relatively high (0.893 and 0.672, respectively). The association of SRP54 and these disorders was first reported in November 2017 and entered in OMIM in January 2020, which explained why it was not identified as the diagnosis originally in July 2017. The identified candidate p.Gly108Glu variant has been classified as “uncertain significance” by ACMG guidelines. However, if we were able to confirm de novo origin with parental genotypes (which are currently lacking for this singleton proband case), the variant could be reclassified as “likely pathogenic” (meeting PM2, PM1, PP3, and PM6 of the ACMG guidelines). Confirmation is being pursued. Thus, GEM reanalysis of 14 unsolved cases led to 7 gene-disorder reviews (an average of 0.5 per case), and yielded one likely new diagnosis, which was consistent with prior reanalysis yields [93, 94].