Welcome to the Gene-wise Prediction Explorer!
About the Data
ClinGen SVI Calibration Dataset:
The training (ClinVar 2019) and test (ClinVar 2020) datasets used by the ClinGen SVI for predictor calibration were downloaded from the supplement to the ClinGen SVI publication. The two datasets were combined into a single dataset of 20,948 variants across 2,711 genes.
ClinVar 2023 Dataset:
To include the latest ClinVar variants in our analyses, we generated the ClinVar 2023 Dataset by downloading a GRCh37 VCF file containing all variants present in ClinVar as of August 23rd, 2023. The VCF file was then annotated with REVEL and BayesDel (version 1) predictor scores and gnomAD (version 2.1.1) allele frequencies using the filter-based annotation feature in ANNOVAR (version 2020_06_07), with the hg19 assembly, dbnsfp42c for predictor scores, and genome_211 and exome_211 for gnomAD allele frequencies. Like the ClinGen SVI Calibration Dataset, the ClinVar 2023 Dataset was filtered to retain missense variants with a review status of at least one star that were not classified as variants of uncertain significance (VUS). Genes without any pathogenic variants were excluded, as were variants with allele frequencies exceeding 0.01. Allele frequencies were taken from gnomAD exome data; when a variant was not found in the exome data, the allele frequency from whole genomes was used instead. We then mapped Entrez gene IDs to HGNC gene names for use in subsequent analyses. Next, we restricted the dataset to genes classified in the GenCC database (last accessed March 12th, 2024) as having a definitive, strong, or moderate association with disease, as performed in Stenton et al. The final filtered ClinVar 2023 Dataset consisted of 89,947 variants across 3,668 genes.
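The filtering steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline that produced the dataset: the `Variant` record, its field names, the `filter_variants` and `allele_frequency` helpers, and the simplified significance labels ("pathogenic", "benign", "vus") are all hypothetical. It also makes two assumptions the text does not state: a variant absent from both gnomAD exomes and genomes is treated as having an allele frequency of 0, and the "genes without any pathogenic variants" check is applied after the variant-level filters.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class Variant:
    """Hypothetical record for one annotated ClinVar variant."""
    gene: str                   # HGNC gene name
    consequence: str            # e.g. "missense"
    significance: str           # simplified label: "pathogenic", "benign", or "vus"
    review_stars: int           # ClinVar review status as a star count
    exome_af: Optional[float]   # gnomAD exome allele frequency, None if absent
    genome_af: Optional[float]  # gnomAD genome allele frequency, None if absent


def allele_frequency(v: Variant) -> float:
    """Prefer the exome allele frequency; fall back to the genome value."""
    if v.exome_af is not None:
        return v.exome_af
    if v.genome_af is not None:
        return v.genome_af
    return 0.0  # assumption: absent from gnomAD means frequency 0


def filter_variants(
    variants: Iterable[Variant],
    gencc_genes: Set[str],
    max_af: float = 0.01,
) -> List[Variant]:
    """Apply the variant-level filters, then the gene-level filters."""
    kept = [
        v for v in variants
        if v.review_stars >= 1
        and v.consequence == "missense"
        and v.significance != "vus"
        and allele_frequency(v) <= max_af  # exclude frequencies exceeding 0.01
    ]
    # Gene-level filters: at least one pathogenic variant, and listed in the
    # caller-supplied set of GenCC definitive/strong/moderate genes.
    genes_with_pathogenic = {v.gene for v in kept if v.significance == "pathogenic"}
    return [
        v for v in kept
        if v.gene in genes_with_pathogenic and v.gene in gencc_genes
    ]
```

Here `gencc_genes` stands in for the set of gene names with a definitive, strong, or moderate GenCC classification; building that set and mapping Entrez gene IDs to HGNC names are left out of the sketch.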
ClinVar 2023 Dataset without training variants:
To analyze REVEL and BayesDel performance while excluding training variants, we cross-referenced the ClinGen SVI Calibration Dataset with a ClinVar VCF file from December 2020. Because the ClinGen SVI Calibration Dataset was filtered to exclude REVEL and BayesDel training variants, we considered any variant present in the ClinVar December 2020 download but absent from the ClinGen SVI Calibration Dataset to be a training variant. The training variants identified by this analysis were removed from the ClinVar 2023 Dataset, creating the ClinVar 2023 Dataset without training variants, which consisted of 71,791 variants across 3,623 genes.
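The cross-referencing step amounts to two set differences. The sketch below assumes each variant is identified by a (chromosome, position, reference allele, alternate allele) tuple; that key and the `remove_training_variants` function name are hypothetical choices for illustration, not details taken from the original analysis.

```python
from typing import Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
VariantKey = Tuple[str, int, str, str]


def remove_training_variants(
    clinvar_2023: Set[VariantKey],
    clinvar_dec_2020: Set[VariantKey],
    calibration: Set[VariantKey],
) -> Tuple[Set[VariantKey], Set[VariantKey]]:
    """Return (ClinVar 2023 variants without training variants, inferred training variants)."""
    # Present in the December 2020 download but absent from the calibration
    # dataset: treated as a predictor training variant.
    training = clinvar_dec_2020 - calibration
    # Drop the inferred training variants from the 2023 dataset.
    return clinvar_2023 - training, training
```

Variants that first appeared in ClinVar after December 2020 are never flagged by this procedure, so they remain in the resulting dataset.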