Genetic Variants: Good or Bad?

Sarrah Rose
11 min read · Jul 3, 2023


This is my explanation of the paper: Large-scale clinical interpretation of genetic variants using evolutionary data and deep learning.

Photo by National Cancer Institute on Unsplash

Humans share more than 99% of their DNA with each other. Yet the small fraction that varies can trigger significant differences in our health, leading to a spectrum of diseases, many of which remain mysteries to us. To put this into perspective, in the protein-coding regions of our genome alone, we’ve discovered 6.5 million missense variants: changes in one DNA base pair that result in the substitution of one amino acid for another in the protein made by a gene. The impact of most of these variants is still unknown and unexplored.

So, how do scientists currently tackle this vast, uncharted territory? They use high-throughput assays like Deep Mutational Scans (DMSs) and Multiplexed Assays of Variant Effects (MAVEs), which probe the effects of a multitude of genetic variants simultaneously.

1. Deep Mutational Scanning (DMS): DMS is a high-throughput technique designed to examine the effects of a multitude of genetic mutations in parallel. This method leverages advanced sequencing techniques and involves creating a large library of genetic variants, usually by mutating one or more specific genes in a population of cells or viruses. Each variant is then subjected to a selection pressure, such as a drug, to observe which mutations confer survival advantages or disadvantages under that condition.

The changes in the frequency of each variant in the population before and after the selection pressure are then measured using next-generation sequencing. By comparing these changes, researchers can infer the fitness effect of each mutation — this gives them a comprehensive map of how each possible mutation of the gene under study might impact the function or survival of the organism.
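The before-and-after comparison can be sketched in a few lines of Python. This is only an illustration: the variant names, read counts and pseudocount are made up, and a log2 enrichment ratio normalised to wild type is one common scoring choice, not necessarily the one any particular DMS study uses.

```python
import math

def enrichment_scores(pre_counts, post_counts, wild_type, pseudocount=0.5):
    """Log2 enrichment of each variant relative to wild type.

    score = log2( (post_v / pre_v) / (post_wt / pre_wt) )
    A pseudocount keeps the ratio finite when a variant drops to zero reads.
    """
    def ratio(name):
        return (post_counts.get(name, 0) + pseudocount) / (pre_counts[name] + pseudocount)

    wt_ratio = ratio(wild_type)
    return {
        name: math.log2(ratio(name) / wt_ratio)
        for name in pre_counts
        if name != wild_type
    }

# Hypothetical sequencing read counts before and after selection.
pre = {"WT": 1000, "A23V": 1000, "G45D": 1000}
post = {"WT": 1000, "A23V": 2000, "G45D": 100}

scores = enrichment_scores(pre, post, wild_type="WT")
for variant, score in scores.items():
    print(variant, round(score, 2))
```

A positive score means the variant became more frequent than wild type under selection, and a negative score means it was depleted.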

2. Multiplexed Assays of Variant Effect (MAVEs): MAVEs are another high-throughput technique used to evaluate the functional impact of large numbers of genetic variants concurrently. MAVEs typically involve synthesising a large library of DNA variants that each carry a different mutation in a specific gene. This library is then introduced into cells, and the effects of the variants are assessed using a functional assay.

For example, if the gene under study is critical for cell growth, researchers may measure the growth rate of cells carrying each variant. This data can then be used to infer the functional impact of each mutation. One key advantage of MAVEs is the ability to measure the effects of different mutations in the same experimental conditions, reducing the impact of experimental noise and allowing for more direct comparisons between variants.

Yet these methods have important limitations.

  1. Their scalability to the entire human proteome is challenging due to the proteome’s immense complexity.
  2. Moreover, the efficacy of these technologies is tightly bound to the relevance of the functional assays used. If the chosen assay doesn’t align with the biological role of the protein, the results may say little that is meaningful about mutation impacts.
  3. This constraint is further complicated when proteins have multiple or poorly understood functions, or diseases stem from intricate multi-gene interactions.

The authors developed EVE, a two-step method, to determine the likelihood of human missense variants being pathogenic based on sequence variation across species. Firstly, they used a Bayesian Variational Autoencoder (VAE), a model adept at learning complex high-dimensional distributions, to analyse constraints from evolution in protein sequences from over 100,000 organisms. This process resulted in the “evolutionary index” for each single amino acid variant, indicating its relative likelihood compared to the wild type.

In the second step, they used unsupervised clustering methods. They applied a global-local mixture of Gaussian Mixture Models to the evolutionary indices, generating a continuous pathogenicity score, the EVE score, ranging from 0 (most benign) to 1 (most pathogenic), and assigned each variant to one of three categories: Benign, Uncertain or Pathogenic. This process allowed them to predict the propensity of a variant to be pathogenic without relying on previous clinical information.

Data Acquisition

The Bayesian VAE is trained on a multiple sequence alignment (MSA) rather than just a single protein sequence, to leverage the rich evolutionary information contained within the MSA. In an MSA, sequences from various species that are orthologous (equivalent) to the human protein are stacked so that the same positions are aligned vertically. This alignment provides valuable insights into the conservation or variation of individual residues across evolutionary timescales.

Conserved positions likely indicate residues that are essential for the protein’s structure or function, and any mutation in these positions may lead to pathogenic consequences. Conversely, variable positions might tolerate more mutations without negative effects. By training the Bayesian VAE on this multi-species perspective, the model can better understand which mutations are likely to be tolerated (benign) and which are likely to be harmful (pathogenic).
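To make the idea of conserved versus variable positions concrete, here is a small sketch that scores each column of a toy alignment by its Shannon entropy. The alignment is invented for illustration, and per-column entropy is a much simpler statistic than anything the VAE learns; it only shows the kind of signal an MSA carries.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conservation_profile(msa):
    """Entropy for every column; low entropy means a conserved position."""
    return [column_entropy(col) for col in zip(*msa)]

# Toy alignment (made up): each row is a species, each column an aligned position.
toy_msa = [
    "MKVL",
    "MKIL",
    "MRAL",
    "MKGL",
]

profile = conservation_profile(toy_msa)
for position, entropy in enumerate(profile, start=1):
    print(position, round(entropy, 2))
```

Positions 1 and 4 never change in this toy example, so their entropy is zero, while position 3 differs in every sequence and reaches the maximum for four sequences (2 bits).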

In contrast, training the model on just a single protein sequence would provide only a one-dimensional view of the protein, without the rich context provided by evolution and comparative analysis across species. The model could lose crucial information about the relative importance of each residue and its role in determining protein functionality or pathogenicity.

The authors aimed to work with about 1,000 genes that are associated with disease. For each selected gene, they identified a single associated protein, choosing the canonical (standard) transcript according to the UniProt/Swiss-Prot database. They then built an MSA for each protein family. As explained above, this process identifies related sequences across many species, providing an evolutionary perspective on the proteins.

They then selected clinical labels from the ClinVar database, which contains reports of relationships between human genetic variations and phenotypes (observable characteristics), along with supporting evidence. These served as one set of ground-truth labels for the data, classifying the genetic variants into “Benign”, “Pathogenic” and “Uncertain”.

VAE & Model Architecture

A Variational Autoencoder (VAE) is an unsupervised learning model that learns to represent data in a lower-dimensional (latent) space, capturing the most salient aspects of this data. Crucially, VAEs are a type of generative model, meaning that they can generate new data points by sampling points from this latent space and decoding them back into the original data space.

In this paper, they mention that data is generated from a latent variable z, referring to the idea that there is a lower-dimensional representation that captures the most important features of the data.

VAEs comprise two key components: the “encoder” and the “decoder”. The encoder takes in the data, in our context, amino acid sequences of a protein, and transforms it into a set of latent variables (z). The decoder then utilises these latent variables to reconstruct the original data.

The decoder network is captured by the conditional probability distribution p(s|z, θ_p), where ‘s’ stands for a fixed-length amino acid sequence and ‘θ_p’ denotes the model parameters learned for that specific protein family. The encoder, on the other hand, approximates the posterior distribution p(z|s, θ_p), which cannot be computed exactly. It does so by learning a set of means (μ) and standard deviations (σ) for each input sequence. Collectively, these parameters define a Gaussian distribution in the latent space.

The challenge in training a VAE is that we would like to choose the model’s parameters so as to maximise the likelihood of the observed sequences. However, computing that likelihood requires integrating over every possible value of the latent variable z, which is intractable when the decoder is a neural network. Hence, we train the VAE by maximising the Evidence Lower Bound (ELBO), a tractable quantity that never exceeds the log-likelihood. Pushing the ELBO up drives the model towards a distribution that mirrors the real distribution of the data as closely as possible.

The ELBO is composed of two components: a reconstruction term and a KL-divergence term. The former is the expected log-likelihood of the data given the latent variables (its negative is the familiar reconstruction loss), quantifying the decoder network’s ability to accurately reconstruct the original data from the latent variables. The latter is a measure of the disparity between two probability distributions. In the context of a VAE, it gauges how far the distribution of the latent variables departs from a specified prior distribution (usually a standard Gaussian), and it is subtracted from the reconstruction term. It serves as a regularisation term, prompting the model to effectively utilise the latent space and prevent overfitting to the training data.
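As a rough sketch of how the two terms fit together, the snippet below assumes a diagonal Gaussian over the latent variables and a standard Gaussian prior, for which the KL divergence has a closed form. The function names and the example numbers are illustrative, and the reconstruction term is passed in as a plain number rather than computed by a real decoder.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return sum(
        0.5 * (s * s + m * m - 1.0) - math.log(s)
        for m, s in zip(mu, sigma)
    )

def sample_latent(mu, sigma, rng):
    """Reparameterised sample: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def elbo(expected_log_likelihood, mu, sigma):
    """ELBO = E_q[log p(s|z)] - KL(q(z|s) || p(z))."""
    return expected_log_likelihood - kl_to_standard_normal(mu, sigma)

# Hypothetical encoder output for one sequence (two latent dimensions).
mu, sigma = [1.0, 0.0], [1.0, 1.0]
z = sample_latent(mu, sigma, random.Random(0))

# Hypothetical reconstruction term of -10.0 nats for this sequence.
print(elbo(-10.0, mu, sigma))
```

When the encoder’s output matches the prior exactly (mean 0, standard deviation 1), the KL term is zero and the ELBO reduces to the reconstruction term alone.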

In this study, the researchers evolved the VAE model into a Bayesian VAE, where the model learns a distribution over the weights instead of a single set. In particular, they train a VAE to model the distribution of protein sequences. Each protein sequence is represented as a fixed-length string of amino acids, enabling the VAE to learn a distribution over these sequences. This approach facilitates the generation of new protein sequences that bear similarities to the training sequences, and aids in estimating the functional impact of various mutations on the protein.
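A model like this needs numbers rather than letters, so each fixed-length sequence is typically turned into a one-hot matrix first. The sketch below shows that encoding plus a helper that builds a single amino acid variant. The 20-letter alphabet ordering, the all-zero treatment of gaps and the function names are assumptions made for illustration, not details taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return one row per residue, with a single 1 marking the amino acid.

    Gaps and unknown characters are left as all-zero rows (an assumption).
    """
    encoded = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        index = AA_INDEX.get(residue)
        if index is not None:
            row[index] = 1
        encoded.append(row)
    return encoded

def mutate(sequence, position, new_residue):
    """Single amino acid substitution at a 1-based position."""
    return sequence[:position - 1] + new_residue + sequence[position:]

wild_type = "MKVL"                      # toy sequence
variant = mutate(wild_type, 2, "R")     # K2R, giving "MRVL"
encoded_variant = one_hot_encode(variant)
print(variant, len(encoded_variant), len(encoded_variant[0]))
```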

Evolutionary Index

The evolutionary index of a protein variant is defined as its fitness relative to that of the wild-type sequence. Fitness is measured by comparing log-likelihoods: the index is the negative log of the ratio between the likelihood of the variant and the likelihood of the wild type, so a higher index means a less likely sequence. Because the exact log marginal likelihood is intractable, they approximated it with the ELBO, which is a lower bound on it. For each variant of interest, they then took 200,000 samples from the latent space (a lower-dimensional space into which the high-dimensional input data is projected), and calculated the ELBO score for each protein variant by averaging it over the 200,000 samples.

Once the VAE has learned the parameters (means and standard deviations) of the approximate posterior distribution from the input data, it can generate new points in the latent space by sampling from this distribution. These new points are the “samples”, which are points in the latent space that individually correspond to a protein variant s.
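Putting the pieces together, the index for one variant can be sketched as a difference of averaged ELBO estimates. The sketch assumes the sign convention in which a higher index means a less likely sequence, which matches the Pathogenic cluster having the higher mean index later on. The ELBO values below are invented, and a real run would average over far more samples.

```python
import statistics

def evolutionary_index(elbo_samples_variant, elbo_samples_wild_type):
    """Approximate -log( p(variant) / p(wild type) ) by a difference of mean ELBOs.

    Each argument is a list of per-sample ELBO estimates for that sequence.
    """
    mean_variant = statistics.fmean(elbo_samples_variant)
    mean_wild_type = statistics.fmean(elbo_samples_wild_type)
    return mean_wild_type - mean_variant

# Hypothetical ELBO estimates (in nats) from a handful of latent samples.
wild_type_elbos = [-100.0, -102.0, -101.0]
variant_elbos = [-106.0, -107.0, -105.0]

index = evolutionary_index(variant_elbos, wild_type_elbos)
print(index)
```

Here the variant is less likely than the wild type under the model, so its index is positive.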

Separate pathogenic & benign variants with GMM

The authors used a Gaussian Mixture Model (GMM), a probabilistic model that assumes data originates from a finite number of Gaussian distributions, to analyse the distribution of the evolutionary indices. By fitting a GMM directly to this distribution, they assign each variant a continuous pathogenicity score, the EVE score, and categorise it as Benign, Uncertain, or Pathogenic.

Suppose we have a data set, D = {x1, x2, …, xN}, with N data points. The objective of a GMM is to fit k Gaussian distributions to these data points. Each Gaussian distribution, denoted as ‘i’, is defined by a mean ‘μi’, standard deviation ‘σi’, and an associated weight ‘wi’, which indicates the distribution’s contribution to the overall model. The GMM parameters are thus expressed as {wi, μi, σi} for all i = 1, …, k.

In the GMM, the likelihood of a data point xj is computed as: p(xj) = Σi=1 to k [wi * N(xj | μi, σi)], where N(xj | μi, σi) is the likelihood of xj under a Gaussian distribution with mean ‘μi’ and standard deviation ‘σi’.

The aim of training a GMM is to discover the parameters {wi, μi, σi} that maximise the total likelihood of the data. Mathematically, we express this as: maximise Πj=1 to N [p(xj)]. Here, the symbol ‘Π’ represents the product operation, implying that we multiply together the likelihoods of the jth data point, p(xj), for every j from 1 to N.

To enhance numerical stability and avoid precision loss from the limits of floating-point representation, we often use the log likelihood instead of the likelihood in machine learning algorithms. Specifically, multiplying numerous probabilities can lead to a result so close to zero that it can no longer be represented by a floating-point number (underflow). This issue is mitigated by summing the log probabilities instead, since the log of a very small probability is just a negative number of moderate size. Because the logarithm is monotonically increasing, the parameters that maximise the log likelihood are the same ones that maximise the likelihood. Therefore, we strive to maximise the sum of log likelihoods, expressed as: maximise Σj=1 to N [log(p(xj))].
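The underflow problem is easy to reproduce. The sketch below first multiplies a hundred small probabilities and then sums their logs, and after that evaluates a GMM log likelihood with the log-sum-exp trick so that the mixture sum is also computed safely. The helper names are illustrative.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def log_sum_exp(values):
    """Stable log(sum(exp(v))) by factoring out the largest value."""
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))

def gmm_log_likelihood(data, weights, means, sigmas):
    """Sum over data points of log( sum_i w_i * N(x | mu_i, sigma_i) )."""
    total = 0.0
    for x in data:
        terms = [
            math.log(w) + log_gaussian(x, mu, s)
            for w, mu, s in zip(weights, means, sigmas)
        ]
        total += log_sum_exp(terms)
    return total

# Multiplying 100 probabilities of 1e-5 underflows to exactly 0.0 ...
probabilities = [1e-5] * 100
product = 1.0
for p in probabilities:
    product *= p

# ... while the sum of their logs is an ordinary number.
log_total = sum(math.log(p) for p in probabilities)
print(product, log_total)
```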

Initiating the process, the researchers trained a two-component GMM on the complete distribution of evolutionary indices for all single amino acid variants across 1,081 proteins. Comprising two Gaussian components, this model aims to best fit these Gaussians to the data to describe the underlying distribution accurately, typically employing the Expectation-Maximization (EM) algorithm for parameter determination.
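For readers who have not seen EM before, here is a minimal one-dimensional, two-component version written from scratch. It is a teaching sketch rather than the authors’ code: the initialisation, the fixed number of iterations and the synthetic data are all assumptions, and in practice one would normally reach for a library implementation.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_two_component_gmm(data, n_iter=100):
    """Fit weights, means and standard deviations of a 1-D, two-component GMM with EM."""
    lo, hi = min(data), max(data)
    spread = (hi - lo) or 1.0
    weights = [0.5, 0.5]
    means = [lo + 0.25 * spread, lo + 0.75 * spread]
    sigmas = [spread / 4, spread / 4]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        responsibilities = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], sigmas[k]) for k in range(2)]
            total = sum(joint) or 1e-300
            responsibilities.append([j / total for j in joint])

        # M-step: re-estimate parameters from the responsibility-weighted data.
        for k in range(2):
            nk = sum(r[k] for r in responsibilities)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(responsibilities, data)) / nk
            variance = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(responsibilities, data)) / nk
            sigmas[k] = max(math.sqrt(variance), 1e-3)

    return weights, means, sigmas

# Synthetic "evolutionary indices": a low cluster and a high cluster (made up).
rng = random.Random(0)
toy_indices = [rng.gauss(2.0, 1.0) for _ in range(300)] + [rng.gauss(9.0, 1.5) for _ in range(300)]

weights, means, sigmas = fit_two_component_gmm(toy_indices)
pathogenic_component = max(range(2), key=lambda k: means[k])
print(weights, means, sigmas, pathogenic_component)
```

The component with the higher mean is the one that would be labelled Pathogenic, following the rule described in the text.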

Once the primary GMM is trained, the learned parameters are used to train protein-specific GMMs — distinct models for each protein. These protein-centric GMMs, trained on the single amino acid variants unique to their respective proteins, organise the variants into two clusters via the two Gaussian components. The cluster bearing a higher mean evolutionary index is tagged as the Pathogenic cluster due to its lower sequence likelihood, with the alternate group labelled as the Benign cluster.

At this stage, we have two types of GMMs: an overarching model trained on the entirety of the data and individual models specific to each protein. Merging these two types of models, the researchers formulated a “global-local” mixture of GMMs, effectively creating an ensemble model that incorporates both universal and protein-specific predictions.

To achieve this, they computed a weighted average of the two probabilities, with the weight α designated as 0.3 for the protein-specific GMM, determined via a grid search. The resultant probability, expressed as p(Xs = 1 | Es) and known as the EVE score, denotes the pathogenic propensity of a specific variant.

The overarching GMM and the protein-specific GMM predictions for a particular variant ‘s’ with an evolutionary index ‘Es’ are unified using a weighted average. Given that p_o(Xs=1 | Es) and p_p(Xs=1 | Es) represent the pathogenic probabilities of ‘s’ according to the overarching and protein-specific GMMs respectively, the combined probability p(Xs=1 | Es), or the EVE score, is calculated as follows:

p(Xs=1 | Es) = α * p_p(Xs=1 | Es, θp) + (1-α) * p_o(Xs=1 | Es, θo)

In this formula, ‘α’ represents the weight assigned to the protein-specific GMM (set to 0.3), while ‘θo’ and ‘θp’ refer to the parameters of the overarching and protein-specific GMMs, respectively.
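The weighted average translates directly into code. In the sketch below each GMM is a small dictionary holding a (weight, mean, standard deviation) triple for its Benign and Pathogenic components. That data layout and every parameter value are hypothetical, chosen only to make the example run; the one number taken from the text is α = 0.3.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def pathogenic_posterior(index, gmm):
    """Posterior probability that an evolutionary index belongs to the Pathogenic component."""
    w_b, mu_b, s_b = gmm["benign"]
    w_p, mu_p, s_p = gmm["pathogenic"]
    benign = w_b * normal_pdf(index, mu_b, s_b)
    pathogenic = w_p * normal_pdf(index, mu_p, s_p)
    total = benign + pathogenic
    return pathogenic / total if total > 0 else 0.5

def eve_score(index, overarching_gmm, protein_gmm, alpha=0.3):
    """alpha * protein-specific posterior + (1 - alpha) * overarching posterior."""
    return (alpha * pathogenic_posterior(index, protein_gmm)
            + (1 - alpha) * pathogenic_posterior(index, overarching_gmm))

# Hypothetical fitted parameters: (weight, mean, standard deviation).
overarching = {"benign": (0.5, 2.0, 1.5), "pathogenic": (0.5, 8.0, 2.0)}
protein_specific = {"benign": (0.6, 3.0, 1.0), "pathogenic": (0.4, 9.0, 2.0)}

low_score = eve_score(0.0, overarching, protein_specific)
high_score = eve_score(10.0, overarching, protein_specific)
print(round(low_score, 3), round(high_score, 3))
```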

Quantifying uncertainty in the cluster assignment

This classification method assigns each variant a probability of belonging to each group rather than making a hard decision, so the assignment has a degree of uncertainty. The researchers use a metric called predictive entropy (PE) to quantify this uncertainty for each variant.

Predictive entropy is a measure of uncertainty in a probabilistic model. In this case, a higher predictive entropy means that the model is more unsure about whether a particular protein variant is pathogenic or benign.

By excluding the variants with the highest predictive entropy (i.e., the variants about which the model is most uncertain), the researchers can increase the accuracy of their classification. This is because they are leaving out the “trickiest” cases, where the model is essentially flipping a coin to decide.

In this study, the researchers chose to exclude the 25% of variants that had the highest predictive entropy. By doing so, they are focusing on the 75% of variants about which the model is most confident. This approach led to an expected accuracy rate of over 90% for the classification of these variants, according to available clinical labels. In other words, they created a method that is very accurate, but only if you don’t mind that it says “I don’t know” for the hardest 25% of cases.
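For a two-class problem, predictive entropy is the binary entropy of the score, which peaks at a score of 0.5. The sketch below ranks variants by that entropy, marks the top quarter as Uncertain and labels the rest by which side of 0.5 they fall on. The variant names and scores are invented, and the 0.5 threshold for the confident cases is an assumption made for this illustration.

```python
import math

def predictive_entropy(p):
    """Binary entropy (in nats) of a pathogenicity probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def classify(scores, uncertain_fraction=0.25):
    """Label the highest-entropy fraction as Uncertain, the rest as Benign or Pathogenic."""
    ranked = sorted(scores, key=lambda v: predictive_entropy(scores[v]), reverse=True)
    n_uncertain = int(uncertain_fraction * len(ranked))
    uncertain = set(ranked[:n_uncertain])

    labels = {}
    for variant, p in scores.items():
        if variant in uncertain:
            labels[variant] = "Uncertain"
        elif p > 0.5:
            labels[variant] = "Pathogenic"
        else:
            labels[variant] = "Benign"
    return labels

# Hypothetical EVE scores for four variants.
toy_scores = {"A23V": 0.02, "G45D": 0.55, "R90W": 0.97, "L12P": 0.10}
labels = classify(toy_scores)
print(labels)
```

With four variants and a 25% cut-off, exactly one variant (the one closest to 0.5) ends up in the Uncertain class.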


The researchers put the model to the test, comparing its predictions to already known clinical classifications of genetic variants. They gauged the model’s performance using two common yardsticks in machine learning: the Area Under the Receiver Operating Characteristic Curve (AUC) and accuracy. With an impressive overall AUC of 0.90 and an average AUC of 0.92 per protein, the EVE scores are highly predictive of clinical labels.
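AUC has a handy interpretation: it is the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The sketch below computes it by counting such pairs directly, which is fine for small examples (library routines do the same thing more efficiently). The labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Fraction of (pathogenic, benign) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# 1 = pathogenic, 0 = benign (hypothetical clinical labels and model scores).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(toy_labels, toy_scores)
print(auc)  # 3 of the 4 pathogenic/benign pairs are ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.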

The study also found that EVE performs as well as, if not better than, MAVEs in predicting the clinical significance of these genetic variants. For instance, both approaches achieved an AUC of 0.95 when evaluating the BRCA1 gene, and EVE outperformed the experimental data when predicting the clinical significance of variants in the MSH2 and P53 genes.