Quantifying variant classifications with probabilistic reasoning

2025-05-21

I’m in Rome on holiday, and over coffee this morning, I started writing down some thoughts on the origins of “Quant”, how it began and where the first ideas came from.

In clinical genetics, the current paradigm focuses heavily on identifying and confirming true positives: known pathogenic variants observed in affected individuals. While this approach has served as the backbone of variant interpretation for decades, it underutilises the broader statistical landscape. Benign variants (true negatives), missed pathogenic variants (false negatives), and those of uncertain or moderate clinical significance all carry informative signals. These signals are rarely formalised or quantified, leaving major gaps in diagnostic interpretation and model calibration.

Over several years of considering this problem, it became clear that interpreting genetic data systematically requires more than a binary or quantitative classification of individual variants. It also demands a fundamentally Bayesian approach. I made a conscious decision to slow down on publishing and instead invest time in learning this side of statistics and simulation properly. Much of that time was spent working through textbooks. A major one was Bayesian Data Analysis by Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin.

Variant outcomes represent a continuum of evidence, uncertainty, and causality. When I began simulating these probabilities more rigorously, I noticed a layered structure: single nucleotide variants follow relatively well-understood frequency distributions based on population data and inheritance models, but larger structural events or ambiguous annotations introduce cascading uncertainty. Unlike discrete events that map cleanly to allele frequency, these more complex variants often require probabilistic treatment across multiple steps of biological and interpretive inference.

This prompted a decision to proceed incrementally. Rather than attempting to solve all aspects of variant interpretation at once, we focus first on calculating prior probabilities: estimates of how likely a variant with a given classification (e.g. pathogenic, uncertain, benign) is to be observed in a given gene relative to a phenotype (this causal condition is key), independent of the patient. These priors are based on known population allele frequencies, mode of inheritance, and curated variant classifications, forming a reference layer that can be precomputed and made widely accessible.

The resulting probabilistic model allows a shift from reactive confirmation toward expectation-driven inference. By quantifying what should be seen, even when it is not observed, this approach reframes variant interpretation in terms of calibrated genomic probabilities. It sets the foundation for integrating sequencing outcomes, variant uncertainty, and patient-specific data into a coherent Bayesian framework.

To formalise this, I define a framework for calculating prior probabilities of variant classifications across the genome. These priors are derived from whatever evidence sources are most reliable and at a minimum have to include population allele frequencies, classification (such as ClinVar), and the mode of inheritance (MOI). They are precomputed, gene- and classification-specific estimates of how likely a variant is to be observed in the population, regardless of any individual patient. These priors are then incorporated into a posterior model that updates based on observed sequencing evidence, enabling inference even in the absence of known pathogenic findings.

A quick side note: there are two simplifications I sometimes use when explaining this, though I don’t rely on them myself:

ClinVar-style labels like “benign” or “pathogenic” are just conveniences. Internally, I use the underlying evidence or a summarised statistic (e.g. quantified ACMG-style criteria).
“Pathogenic” as a label is only useful because it’s familiar to others. In my actual model, it carries no special status. Every classification contributes equally to inference. I use it in examples because most people struggle to picture the joint evidence distribution. If you can see that distribution instead, then great, just replace “pathogenic” with the full spectrum of evidence across variant-phenotype combinations.

At the variant level, we assign each variant $i$ a population allele frequency $p_i$, and define the total frequency across a gene as:

\[P_{\text{tot}} = \sum_i p_i\]

For dominant (autosomal or X-linked) conditions, where a single pathogenic allele is sufficient, the disease probability attributable to variant $i$ is:

\[p_{\text{disease},i} = p_i\]

For recessive conditions, in which two pathogenic alleles are required, the total probability of a disease-causing genotype is:

\[P_{\text{AR}} = P_{\text{tot}}^2 = \sum_i p_i^2 + 2\sum_{i < j} p_i p_j\]

In earlier models I made the error of counting compound heterozygotes twice, which I think is an easy mistake to make. To avoid this double-counting, we partition the risk proportionally across variants, so that the per-variant terms sum back to $P_{\text{tot}}^2$:

\[p_{\text{disease},i} = p_i \cdot P_{\text{tot}}\]
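As a minimal sketch of the dominant and recessive rules above, the following Python function computes the per-variant disease probability for one gene. The function name, the `"AD"`/`"XL"`/`"AR"` labels, and the example frequencies are illustrative assumptions, not part of the original model code.

```python
def disease_priors(freqs, moi):
    """Per-variant disease probability for one gene.

    freqs: population allele frequencies p_i for the gene's variants.
    moi:   "AD" or "XL" for dominant, "AR" for recessive (labels are assumptions).
    """
    p_tot = sum(freqs)
    if moi in ("AD", "XL"):
        # Dominant: one allele is sufficient, so p_disease_i = p_i.
        return list(freqs)
    if moi == "AR":
        # Recessive: share P_tot^2 across variants in proportion to p_i.
        return [p * p_tot for p in freqs]
    raise ValueError(f"unknown mode of inheritance: {moi}")


freqs = [0.001, 0.002, 0.0005]  # illustrative values, not real data
ar = disease_priors(freqs, "AR")
# The per-variant shares add back up to P_tot^2, so nothing is counted twice.
assert abs(sum(ar) - sum(freqs) ** 2) < 1e-15
```

The final assertion is the point of the partition: summing $p_i \cdot P_{\text{tot}}$ over all variants recovers $P_{\text{tot}}^2$ exactly once.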

If a variant is unobserved in the reference database ($p_i = 0$), I decided to assign a minimal occurrence probability. The idea is that if we’ve sequenced 100,000 individuals and never seen the variant, observing it in the next person is still possible. This minimum sets the lowest plausible frequency based on the available data, which is important when we want to avoid underestimating the chance of a rare but real variant.

\[p_i = \frac{1}{\max(\text{AN}) + 1}\]
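The floor can be sketched in a couple of lines of Python. Here `max_an` stands for $\max(\text{AN})$, the largest allele number in the reference database; the function name and the example value are assumptions for illustration (100,000 diploid individuals would give an allele number of 200,000 at an autosomal site).

```python
def floor_frequency(p, max_an):
    """Return p unchanged, or 1 / (max(AN) + 1) when the variant is unobserved."""
    return p if p > 0 else 1.0 / (max_an + 1)


# An unobserved variant gets a small but non-zero frequency (about 5e-6 here).
p_unseen = floor_frequency(0.0, max_an=200_000)
# An observed variant keeps its reported frequency.
p_seen = floor_frequency(0.001, max_an=200_000)
```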

Each variant is annotated with a classification score $S_i$. In an early model I used ClinVar classifications, converted to a normally distributed score range of roughly benign = -5, uncertain = 0, pathogenic = +5. This score is converted into a weight $W_i$ between 0 and 1. The weighted prior becomes:

\[p_i^\ast = W_i \cdot p_i\]
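The text does not say how the score is mapped to $W_i$, so the sketch below uses a plain linear rescaling of the -5 to +5 range onto 0 to 1 purely as a stand-in. That mapping, and both function names, are assumptions; any monotone map onto $[0, 1]$ would slot into the same place.

```python
def score_to_weight(score, lo=-5.0, hi=5.0):
    """ASSUMED mapping: linearly rescale a score in [lo, hi] to a weight in [0, 1]."""
    clipped = min(max(score, lo), hi)  # keep out-of-range scores inside [lo, hi]
    return (clipped - lo) / (hi - lo)


def weighted_prior(p, score):
    """Weighted prior p* = W * p for one variant."""
    return score_to_weight(score) * p


# Under this assumed mapping an uncertain variant (score 0) keeps half its frequency.
p_star = weighted_prior(0.002, 0)
```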

To account for uncertainty and inter-individual variability, we define a Beta prior distribution for the allele in a cohort of $N$ diploid individuals:

\[\pi_i \sim \text{Beta}(\alpha_i, \beta_i), \quad \alpha_i = \text{round}(2Np_i^\ast) + \tilde w_i,\; \beta_i = 2N - \text{round}(2Np_i^\ast) + 1\]

where $\tilde w_i = \max(1, S_i + 1)$ adds a pseudo-count based on classification strength.
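The Beta parameters follow directly from the definitions above. This is a small sketch with a hypothetical function name and made-up inputs, showing how the classification score only enters through the pseudo-count.

```python
def beta_params(p_star, n, score):
    """Beta(alpha, beta) parameters for one variant in a cohort of n diploid individuals."""
    k = round(2 * n * p_star)   # expected allele count among 2N chromosomes
    w = max(1, score + 1)       # pseudo-count from classification strength
    alpha = k + w
    beta = 2 * n - k + 1
    return alpha, beta


# Same weighted frequency, different scores: only alpha changes.
print(beta_params(0.001, 1000, 5))    # (8, 1999)
print(beta_params(0.001, 1000, -5))   # (3, 1999)
```

A strongly pathogenic score adds up to six pseudo-counts to $\alpha_i$, while any score of zero or below adds the minimum of one.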

Simulating $M$ draws from this prior, we compute normalised weights:

\[\tilde\pi_i^{(m)} = \frac{\pi_i^{(m)}}{\sum_j \pi_j^{(m)}}\]
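A sketch of the sampling and normalisation step, using only Python's standard library. The function name, the seed argument, and the example parameters are assumptions; a real implementation would likely vectorise this.

```python
import random


def normalised_draws(params, m, seed=0):
    """Draw m samples per variant from Beta(alpha, beta) and normalise each draw.

    params: list of (alpha, beta) pairs, one per variant in the gene.
    Returns m rows, each a list of weights that sums to 1.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(m):
        pis = [rng.betavariate(a, b) for a, b in params]
        total = sum(pis)
        rows.append([pi / total for pi in pis])
    return rows


rows = normalised_draws([(8, 1999), (3, 1999), (3, 1999)], m=5)
# Within every draw the normalised weights form a probability vector.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in rows)
```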

The final step integrates observed sequencing status. Let $G_i$ indicate whether variant $i$ was detected, missed, or confirmed absent:

$G_i = 1$ if the variant is observed in the sample
$G_i = 0$ if the site was sequenced and no variant was found
$G_i = p_i$ if the site is unsequenced or failed QC

The gene-level posterior probability that a pathogenic variant is present is:

\[P^{(m)} = \sum_{i:\,S_i > 3} \tilde\pi_i^{(m)} \cdot G_i^{(m)}\]

where $S_i > 3$ filters for likely or fully pathogenic classifications based on my simple ClinVar scoring system in this model. Taking the median and quantiles of $P^{(m)}$ over 10,000 simulations yields a credible interval that quantifies the confidence in a complete genetic diagnosis.
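Putting the pieces together, the sketch below estimates the median and a 95% credible interval of $P^{(m)}$ for one gene. It is an illustration under stated assumptions rather than the original implementation: the tuple layout, function name, and example values are hypothetical, $G_i$ is treated as a fixed number per variant, and the output is not expected to reproduce the figures reported for the scenarios in this post.

```python
import random
import statistics


def gene_posterior(variants, m=10_000, threshold=3, seed=0):
    """Median and 95% interval of P^(m) for one gene.

    variants: list of (alpha, beta, score, g) tuples, where g is 1 (observed),
    0 (sequenced, no variant found) or p_i (unsequenced or failed QC).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(m):
        pis = [rng.betavariate(a, b) for a, b, _s, _g in variants]
        total = sum(pis)
        # Sum normalised weights times G_i over variants with score > threshold.
        p = sum(pi / total * g
                for pi, (_a, _b, s, g) in zip(pis, variants) if s > threshold)
        samples.append(p)
    cuts = statistics.quantiles(samples, n=40)  # cut points every 2.5%
    return statistics.median(samples), cuts[0], cuts[-1]


variants = [
    (8, 1999, 5, 1),   # pathogenic, observed
    (3, 1999, 4, 0),   # likely pathogenic, sequenced and absent
    (3, 1999, -5, 0),  # benign, sequenced and absent
]
median, lower, upper = gene_posterior(variants, m=2000)
```

Because the weights are normalised over every variant in the gene, the benign variant in this example still dilutes the total even though it is excluded from the sum by the score filter.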

By precomputing variant-level priors across all classifications, and combining these with sequencing evidence, the framework generates a calibrated, evidence-aware estimate of genetic causality. This approach supports real-time clinical interpretation, streamlines inconclusive cases, and enables robust downstream use in AI, statistics, and decision-making pipelines.

To demonstrate the power of this method, I simulated three real-world diagnostic scenarios that show how a Bayesian, simulation-based framework yields not just results but structured, evidence-aware conclusions. In the first, a single known pathogenic variant was observed with full sequencing coverage, resulting in a posterior probability of 1 for a correct diagnosis. In the second, which is more representative of typical cases, a good candidate variant was found, but a likely pathogenic variant was also left unsequenced. The model distributed posterior probability across both observed and missing variants (along with all other trivial variants), yielding a probability of 0.54 (95% CrI: 0.26 to 0.80) that the observed result is conclusively the best explanation. If follow-up confirms that the missing variant is absent, the posterior rises to 1, making the interpretation definitive. In the third scenario, no variants were observed and all likely candidates were unsequenced, so the posterior correctly dropped to zero.

This framework does not return an opaque result. It provides a single summary of calibrated confidence, but this summary reflects a probability distribution with multiple layers of evidence, each contributing differently. The result makes it clear how and why a diagnostic conclusion was reached. It also shows exactly how much new evidence would improve certainty and when a result is already as complete as possible. This is the key shift: uncertainty becomes a measurable input into clinical and research decision-making.