DNA sequence classification via an EM algorithm
Implementation of the SeqEM algorithm
Introduction
In the article , the authors describe a new approach to determine for an individual whether or not a genetic variation is found at a specific position in a genome. The authors propose a new method, based on the EM algorithm, called SeqEM, which is supposed to be more precise and accurate than the methods that have existed until now. The goal is to implement the algorithm by hand in R.
Here, we propose a novel genotype-calling algorithm that, in contrast to the other methods, estimates parameters underlying the posterior probabilities in an adaptive way rather than arbitrarily specifying them a priori. The algorithm, which we call SeqEM, applies the well-known Expectation-Maximization algorithm to an appropriate likelihood for a sample of unrelated individuals with next-generation sequence data, leveraging information from the sample to estimate genotype probabilities and the nucleotide-read error rate.
Context
The element that represents a position in the DNA is called a nucleotide. So to identify the genotype of an individual, the genome of the individual is compared to a reference genome. This is equivalent to comparing the nucleotides of the individual’s genome to the nucleotides of the reference genome. If the nucleotide is identical to the reference nucleotide it is said to be a reference nucleotide (R), otherwise it is said to be a variant nucleotide (V). However, it is not possible to compare the entire DNA to the reference DNA. We therefore choose a fragment of the DNA that interests us (genetic region of interest) and we “copy” (by amplification) this fragment several times. Afterwards, we can align all these fragments with the genetic region of interest of the reference genome. The number of overlaps of these DNA fragments is called reading depth, which we will note N. One is then able to count, line by line, for each nucleotide in this “overlap area” the number of reference and variant nucleotides. The number of variant nucleotides X, which can vary between 0 and N, is noted at a precise position. For a given position in the DNA, i.e. a certain nucleotide, 3 types (genotypes) of individuals can be distinguished:
- VV homozygotes
variant nucleotides - RR homozygotes
reference nucleotides - (RV) heterozygotes
mixture between variant and reference nucleotides
In a perfect world one would therefore, theoretically, be able to identify with certainty the genotype of an individual. However, we are faced with a certain “nucleotide reading error”, which may, for example, be due to errors in the coding of DNA fragments. This error then allows certain nucleotides to be incorrectly encoded and, for example, an RR homozygote presents a nucleotide varying at the position of interest in one of the aligned DNA fragments. Moreover, the number of observed variant nucleotides X does not only depend on the risk of this error, which we will note
We deduce that, for example if the individual is a RR homozygous, the probability of observing
The model
Let
We designate by
the number of observed variant nucletoides the true genotype of the individual
We note
is the reading error, i.e. the probability that a nucleotide is wrongly identified as and vice versa. and are the proportions of and genotypes in the population is the proportion of genotype in the population
So, we find that
Knowing that the pairs
Simulations
In order to better understand the problem and to recognize why an algorithm such as SeqEm is needed to determine the genotype, some simulations will be carried out to understand the motivations behind the development of such an algorithm.
To carry out this simulation we distinguish 4 steps:
- A number of individuals
is chosen, as well as the proportions of VV homogyzotes and heterogyzotes in the population ( and ). - A sample of size
of the genotypes (variable ) of the individuals is generated according to a multinomial distribution defined by the parameters and . - For the reading depth
and the reading error , we simulate for each individual according to the law of the number of observed variant nucleotides. - Steps 1 - 3 are repeated for different values of
and .
To simplify this procedure, steps 1-3 are integrated into a function called data_generation which, depending on the user’s choice, generates a graphical representation of the
## Step 1 :
S = 500 # number of individuals
pVV = 0.4 # proportion of VV homozygotes in the population
pRV = 0.3 # proportion of heterozygotes in the population
data_generation = function(S, pVV, pRV, N, alpha, getData=FALSE, plot=FALSE){
## Step 2 : Simulation of a sample of genotypes of all S individuals
Z = sample(x = c("VV", "RV", "RR"),
size = S,
replace=TRUE,
prob = c(pVV, pRV, 1-pVV-pRV))
N = rep(N,S) # vector of reading depths for all individuals
X = rep(NA,S) # vector of the number of variant nucleotides variants for all individuals
## Step 3 : Simulation of the number of variant nucleotides for each individual
# (according to his genotype)
for (i in 1:S){
if (Z[i]=="VV"){X[i]=rbinom(1,N[i],1-alpha)}
else if (Z[i]=="RV"){X[i]=rbinom(1,N[i],0.5)}
else {X[i]=rbinom(1,N[i],alpha)}
}
if (plot==TRUE){
# One-dimensional scatter plot and histogram
par(mfrow=c(1,2), mar=c(13,5,5,2), xpd=TRUE)
plot(1:S, X,
col=ifelse(Z=="RV","green",ifelse(Z=="VV","blue","red")), xlab="Individuals")
legend("topleft",
inset=c(0,-0.45),
legend=c("VV","RV","RR"),
col=c("blue","green","red"),
title="Genotypes",
pch=1, cex=0.35)
par(mar=c(13,2,5,2))
hist(X, breaks=30, main="")
}
if(getData==TRUE){return(list(X=X, Z=Z))}
}
Since the difficulty of determining an individual’s genotype depends neither on the number of individuals nor on the proportions of different genotypes in the population, our understanding of the problem would not be broadened by varying these parameters.
The reading error
set.seed(1)
## Step 4
data_generation(S, pVV, pRV, N=50, alpha=0.001, plot=TRUE)
data_generation(S, pVV, pRV, N=10, alpha=0.01, plot=TRUE)
data_generation(S, pVV, pRV, N=50, alpha=0.2, plot=TRUE)
ata_generation(S, pVV, pRV, N=10, alpha=0.2, plot=TRUE)
It is immediately noticeable that for
A posteriori distribution:
and its derivatives and formulas for the E/M steps
We know that if the pairs In order to be able to maximize
The partial derivatives are given by
We are therefore obliged to solve the following system of equations:
According to the calculations in the appendix we then find the expressions for updating the parameters :
Implementation of the EM algorithm
In order to simplify the implementation of the EM algorithm (or rather SeqEM) we define a function named SeqEM which takes as argument the vector
This function calculates
SeqEM = function(X, alpha_start, pVV_start, pRV_start, N, niter=500){
S=length(X)
N=rep(N,S)
alpha_new <- alpha_start
pVV_new <- pVV_start
pRV_new <- pRV_start
for(iter in niter){
alpha_old <- alpha_new
pVV_old <- pVV_new
pRV_old <- pRV_new
# Computation of the etas
etaVV = sapply(1:S, function(i){(dbinom(X[i],N[i],1-alpha_old)*pVV_old) /
(dbinom(X[i],N[i],1-alpha_old)*pVV_old +
dbinom(X[i],N[i],0.5)*pRV_old +
dbinom(X[i],N[i],alpha_old)*(1-pVV_old-pRV_old))})
etaRV = sapply(1:S, function(i){(dbinom(X[i],N[i],0.5)*pRV_old) /
(dbinom(X[i],N[i],1-alpha_old)*pVV_old +
dbinom(X[i],N[i],0.5)*pRV_old +
dbinom(X[i],N[i],alpha_old)*(1-pVV_old-pRV_old))})
etaRR = sapply(1:S, function(i){(dbinom(X[i],N[i],alpha_old)*(1-pVV_old-pRV_old)) /
(dbinom(X[i],N[i],1-alpha_old)*pVV_old +
dbinom(X[i],N[i],0.5)*pRV_old +
dbinom(X[i],N[i],alpha_old)*(1-pVV_old-pRV_old))})
# Updating the parameters
alpha_new <- sum(etaRR*X + etaVV*(N-X)) / sum(N*(etaRR+etaVV))
pVV_new <- mean(etaVV)
pRV_new <- mean(etaRV)
}
# Calling of genotypes according to estimated parameters
Z_EM = rep(NA,S) # vector of called genotypes
for (i in 1:S){
# Calculation of the likelihoods of the different genotypes for individual i
probs = c(dbinom(X[i],N[i],1-alpha_new)*pVV_new,
dbinom(X[i],N[i],0.5)*pRV_new,
dbinom(X[i],N[i],alpha_new)*(1-pVV-pRV_new))
# For individual i, the genotype with the highest likelihood of occurrence is chosen.
Z_EM[i] = ifelse(which.max(probs)==1,
"VV",
ifelse(which.max(probs)==2,
"RV",
"RR"))
}
return(list(alpha=alpha_new, pVV=pVV_new, pRV=pRV_new, Z=Z_EM))
}
Experimentations
In order to simplify the implementation of the experiments, we create a function named EM_accuracy which allows us to measure the proportion of genotypes correctly determined by the SeqEM algorithm, i.e. the classification accuracy. This function compares the vector
# Measuring the accuracy of the algorithm
EM_accuracy = function(Z, Z_EM){
return(mean(Z == Z_EM)) # Proportion of genotypes correctly called
}
Since we know that the difficulty of genotype calling depends strongly on the read error
S = 1000
pVV = 0.3
pRV = 0.2
alpha_values = c(0.4, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.02, 0.015, 0.01, 0.005, 0.002, 0.001)
N_values = c(3, 5, 10, 15, 20, 50, 100)
accuracies = c()
set.seed(1)
for (alpha in alpha_values){
for (N in N_values){
# Data simulation
data = data_generation(S, pVV, pRV, N, alpha, getData=TRUE)
# Implementation of the SeqEM algorithm
SeqEM_result = SeqEM(data$X, alpha_start=0.1, pVV_start=0.1, pRV_start=0.1, N)
accuracies = c(accuracies,EM_accuracy(data$Z, SeqEM_result$Z))
}
}
accuracy_matrix = matrix(accuracies,
nrow=length(alpha_values),
ncol=length(N_values),
byrow=TRUE,
dimnames = list(alpha_values, N_values))
The proportions of genotypes in the population were arbitrarily chosen:
# Display of the table containing the results
kbl(accuracy_matrix, format="latex", booktabs = TRUE, escape=FALSE) %>%
kable_styling() %>%
add_header_above(c(" ", "N" = length(N_values)), bold=TRUE) %>%
group_rows("$\\alpha$", 1, length(alpha_values)) %>%
row_spec(0, bold=TRUE) %>%
column_spec(1, bold=TRUE)
As expected, it is clear that the accuracy of the SeqEM algorithm depends on the reading depth
This can also be illustrated graphically.
plot(-1,-1,
xlim = c(0,max(alpha_values)), ylim = c(min(accuracy_matrix),max(accuracy_matrix)),
xlab = TeX('$\\alpha$'), ylab = "Accuracy")
colors = palette.colors(length(N_values))
for (i in 1:length(N_values)){
lines(rev(alpha_values), rev(accuracy_matrix[,i]), col=colors[i])
}
legend("bottomleft", legend=N_values, col=colors, title="N", lty=1, cex=0.6)
Graphically we can, obviously, see that the algorithm is the most accurate for large values of
In a second experiment we want to try to conclude on the influence of the choice of the starting values of the parameters in the SeqEM algorithm. For this reason we choose to vary the starting values of 2 of the 3 parameters at the same time (thus 3 cases in total) and to observe the variation in terms of the performance of the algorithm.
S = 1000
pVV = 0.3
pRV = 0.2
alpha = 0.2
N = 15
alpha_start_values = seq(0.05,0.4,length.out=10)
pVV_start_values = seq(0.05,0.4,length.out=10)
pRV_start_values = seq(0.05,0.4,length.out=10)
# Variation of alpha_start and pVV_start
accuracies = c()
set.seed(1)
for (alpha_start in alpha_start_values){
for (pVV_start in pVV_start_values){
# Data simulation
data = data_generation(S, pVV, pRV, N, alpha, getData=TRUE)
# Implementation of the SeqEM algorithm
SeqEM_result = SeqEM(data$X, alpha_start, pVV_start, pRV_start=0.1, N)
accuracies = c(accuracies,EM_accuracy(data$Z, SeqEM_result$Z))
}
}
accuracy_matrix_alpha_pVV = matrix(accuracies,
nrow=length(alpha_start_values),
ncol=length(pVV_start_values),
byrow=TRUE,
dimnames = list(alpha_start_values, pVV_start_values))
persp3D(alpha_start_values,pVV_start_values,accuracy_matrix_alpha_pVV,
theta=20, phi=20, expand=0.5,
xlab="alpha_start", ylab="pVV_start", zlab="Accuracy")
# Variation of alpha_start and pRV_start
accuracies = c()
set.seed(1)
for (alpha_start in alpha_start_values){
for (pRV_start in pRV_start_values){
# Simulation de données
data = data_generation(S, pVV, pRV, N, alpha, getData=TRUE)
# Implementation of the SeqEM algorithm
SeqEM_result = SeqEM(data$X, alpha_start, pRV_start=0.1, pRV_start, N)
accuracies = c(accuracies,EM_accuracy(data$Z, SeqEM_result$Z))
}
}
accuracy_matrix_alpha_pRV = matrix(accuracies,
nrow=length(alpha_start_values),
ncol=length(pRV_start_values),
byrow=TRUE,
dimnames = list(alpha_start_values, pRV_start_values))
persp3D(alpha_start_values,pRV_start_values,accuracy_matrix_alpha_pRV,
theta=20, phi=20, expand=0.5,
xlab="alpha_start", ylab="pRV_start", zlab="Accuracy")
# Variation of pVV_start and pRV_start
accuracies = c()
set.seed(1)
for (pVV_start in pVV_start_values){
for (pRV_start in pRV_start_values){
# Simulation de données
data = data_generation(S, pVV, pRV, N, alpha, getData=TRUE)
# Implementation dof the SeqEM algorithm
SeqEM_result = SeqEM(data$X, alpha_start=0.1, pVV_start, pRV_start, N)
accuracies = c(accuracies,EM_accuracy(data$Z, SeqEM_result$Z))
}
}
accuracy_matrix_pVV_pRV = matrix(accuracies,
nrow=length(pVV_start_values),
ncol=length(pRV_start_values),
byrow=TRUE,
dimnames = list(pVV_start_values, pRV_start_values))
persp3D(pVV_start_values,pRV_start_values,accuracy_matrix_pVV_pRV,
theta=20, phi=20, expand=0.5,
xlab="pVV_start", ylab="pRV_start", zlab="Accuracy")
Graphically, we can clearly see that the choice of the starting value of
Conclusions (review + perspectives)
The idea behind the SeqEM algorithm may indeed be an interesting basis for the development of new genotyping methods or technologies. Its ease of implementation and good comprehensibility are, without doubt, desirable characteristics. On the other hand, it is rather only an introduction to these new methods, because the SeqEM algorithm has some weaknesses that must necessarily be corrected and because the assumptions for the implementation of the algorithm have been extremely simplified in the article.
For example, the hypothesis of independence between individuals, which was very useful in formulating
Annexe
Since the first equation depends only on
We continue by solving the system of equations composed of the two other equations of the initial system.
Using this equality in the first system of equations (1), we find thatUsing this result in the second system of equations (2), we find that