BLAST, the Basic Local Alignment Search Tool, finds local alignments between two sequences, A and B, by identifying all pairs of w-words (ie. words of length w) from A and B which score above a given threshold, T. It then extends these w-words in both directions to give maximal segment pairs (MSTs), ie. local alignments whose score cannot be increased by extending or trimming them.

One problem in doing sequence alignments is that there are very many possible alignments; hence, even by chance alone, several local alignments with high scores must be expected. The question is whether the best alignments found are significantly better than what would be expected by chance alone.

The question of statistical significance is as follows:

Could the discovered alignments just as well have been found in two independent, random sequences?

Basically, if the alignment could just as well have appeared by chance alone, one cannot conclude that it is a sign of homology (common ancestry).

In statistics, one usually refers to the p-value when describing statistical significance. Basically, if, for two sequences A and B, we have found an MST with score S, the p-value may be defined as follows:

The p-value of an MST with score S is the likelihood that two random sequences (of the same lengths and letter compositions as A and B) will have an MST with score greater than or equal to S.

The p-value is not, however, the measure primarily used.

The E-value is similar to the p-value; but whereas the p-value gives the likelihood of getting equally high scores by chance, the E-value gives the expected number of MSTs to be found by chance.

The E-value of an MST with score S is the expected number of MSTs to be found with score S or higher in two random sequences (of same lengths and letter compositions).

If one assumes, as is natural to do for random sequences, that different MSTs are independent, one would expect that the number of MSTs scoring above a given threshold, S, would be Poisson distributed:

If E(S) is the expected number of MSTs
scoring S or higher, the likelihood of getting exactly
n such MSTs is
E(S)^{n}*e^{-E(S)}/n!. In particular, the
likelihood of getting at least one is 1-e^{-E(S)}.

Hence, p-values and E-values are just different ways of expressing the same thing. It is, however, more common to refer to the E-value.
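As an illustrative sketch of this relationship (plain Python; the function names are mine, not part of BLAST):

```python
import math

def poisson_pmf(n, e_value):
    # Probability of finding exactly n chance MSTs when E(S) = e_value,
    # under the Poisson assumption: E^n * e^(-E) / n!.
    return e_value ** n * math.exp(-e_value) / math.factorial(n)

def p_value(e_value):
    # Probability of finding at least one chance MST: 1 - e^(-E).
    return 1.0 - math.exp(-e_value)

print(p_value(0.001))  # ~0.0009995: for small E-values, p and E nearly coincide
print(p_value(5.0))    # ~0.9933: at least one chance MST is almost certain
```

For small E-values the two measures are numerically almost equal, which is one reason the distinction rarely matters in practice.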

Finding the E-value for a given score basically amounts to calculating the expected number of MSTs with score S or higher for random sequences. We may then ask if what we actually find is consistent with what might have been expected from alignments of random sequences.

Please note that this section aims at giving a few basic ideas of the mathematics behind this, with emphasis on simplicity rather than completeness or formal correctness, and without going into the mathematical details.

The basic assumption is that we pick random sequences
A and B from a given alphabet (eg. bases or
amino acids) with each letter drawn independently and with a given
probability: eg. every letter x in the alphabet has a
given probability p_{x}.

We then make a second simplifying assumption. If we make a complete
table of A versus B, ie. the sequence
A horizontally and B vertically making an
m*n-table, the MSTs are high-scoring segments along the
diagonals. We assume that we may analyse each diagonal
separately. This, rather than having an m*n-table, gives
us a list of the different alignments corresponding to the
diagonals. Hence, we may start off by analysing two sequences of the
same length, A=a_{1}a_{2}...a_{n}
and B=b_{1}b_{2}...b_{n}, where
a_{i} is matched against
b_{i}.
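To make the diagonal decomposition concrete, here is a small sketch (an illustrative helper of my own, not BLAST's actual implementation) that lists the gapless alignments corresponding to the diagonals of the m*n-table:

```python
def diagonals(A, B):
    # Yield, for each diagonal of the comparison table of A (horizontal)
    # versus B (vertical), the list of letter pairs matched against each other.
    m, n = len(A), len(B)
    for offset in range(-(m - 1), n):
        yield [(A[i], B[i + offset])
               for i in range(max(0, -offset), min(m, n - offset))]

# An m*n-table has m + n - 1 diagonals; for "ACG" vs "ACGT" that is 6.
diags = list(diagonals("ACG", "ACGT"))
```

Each diagonal is then analysed on its own as a sequence of matched letter pairs, as described above.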

Rather than think about a sequence of random pairs of letters,
a and b, being drawn randomly with
likelihood p_{a}*p_{b}, this second
assumption allows us to think of this as random scores
s=score(a,b) being drawn with given
probabilities. So, each diagonal is a sequence of random scores.

Please note that, for any analysis of local alignments to be meaningful, the expected random score must be negative. Otherwise, one could increase the total score simply by extending the alignment indefinitely.
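As a toy check of this condition (the alphabet, frequencies and scores below are made up for illustration; real analyses would use a scoring matrix such as BLOSUM62):

```python
letters = "ACGT"
p = {x: 0.25 for x in letters}  # uniform letter probabilities (an assumption)

def score(a, b):
    # Made-up scoring: +1 for a match, -2 for a mismatch.
    return 1 if a == b else -2

# Expected score of a random letter pair: sum over all pairs of
# p_a * p_b * score(a, b).
expected = sum(p[a] * p[b] * score(a, b) for a in letters for b in letters)
print(expected)  # -1.25: negative, as required for local alignment statistics
```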

Let's assume that we have a random sequence of scores:
s_{1}, s_{2}.... We may then find the
cumulative scores: S_{1}=s_{1},
S_{2}=s_{1}+s_{2}, ... and
ask the following question:

For a given threshold, T, what is the
likelihood that the cumulative score, S_{i}, will
exceed T for any i?

Now, let us assume that we have a function, Q(T), that gives this likelihood:

Let Q(T) be the likelihood that, for a random sequence of scores, the cumulative score exceeds the threshold T at some point.

Initially, the function Q(T) may seem very complex and impractical to calculate. However, we can make rather strong statements about it just by adding a few assumptions: well, approximations actually. Assume, as an approximation, that the scoring is a continuous process rather than a stepwise one; in particular, assume that in order for the score to exceed T, it must at some point equal T. We can then make the following argument:

The likelihood of reaching a score T,
is Q(T). Now, if
s_{1}+s_{2}+...+s_{k}=T, we may
use that same result on the sequence s_{k+1},
s_{k+2}, ... to argue that the likelihood of
gaining a further score of U from there, is
Q(U). Hence, the likelihood, Q(T+U), of
reaching a score of T+U is given by
Q(T)*Q(U). The only decreasing function which satisfies this
for all T and U is Q(T)=exp(-l*T) for some positive constant l.

Of course, all values here move in steps, and are not continuous. However, the continuity assumption serves well as an approximation.

To find the constant, l, we can again look at the
start of the sequence. If the first score,
s_{1}=s is known, the likelihood of reaching
T is Q(T-s). If possible scores are
s_{i} with likelihoods
p_{i}, we can sum over all these
possibilities:

Q(S) = p_{1}Q(S-s_{1})+
p_{2}Q(S-s_{2})+... writes out to give
exp(-l*S) = p_{1}exp[-l*(S-s_{1})]+...
=exp(-l*S)*[p_{1}exp(l*s_{1})+...]. We simplify
this to give the equation
p_{1}exp(l*s_{1})+p_{2}exp(l*s_{2})+...=1 which we may
solve to give l.

So, now we know that Q(T)=exp(-l*T) where we know the value of l.
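In practice l can be found numerically. The sketch below (my own helper, not BLAST code) solves p_{1}exp(l*s_{1})+...=1 by bisection for a made-up score distribution: +1 with probability 0.25 and -2 with probability 0.75, so the expected score is -1.25, negative as required:

```python
import math

def solve_lambda(scores, probs, lo=1e-9, hi=10.0, tol=1e-12):
    # Solve p_1*exp(l*s_1) + p_2*exp(l*s_2) + ... = 1 for the positive
    # root l by bisection.  This assumes a negative expected score and
    # at least one positive score, which guarantees a unique positive root:
    # f(l) dips below zero near l = 0 and rises through it exactly once.
    def f(l):
        return sum(p * math.exp(l * s) for s, p in zip(scores, probs)) - 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

l = solve_lambda([1, -2], [0.25, 0.75])
print(round(l, 4))  # 1.3327
```

For this toy distribution the equation 0.25*exp(l) + 0.75*exp(-2l) = 1 can also be solved exactly, giving exp(l) = (3 + sqrt(21))/2.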

Let's assume that we first find all subsequences which exceed a
threshold T_{0}, and that there are
N_{0} of these. As before, we can argue that if
we have a sequence whose score has reached
T_{0}, the likelihood that it will gain a further
score of U from where T_{0} was
reached, is Q(U). Hence, if we let
T=T_{0}+U, we find that the expected number of MSTs
exceeding a score T is
N_{0}*exp[-l*(T-T_{0})]. By setting
K=N_{0}*exp(l*T_{0}), and arguing that the
expected number should be proportional to the length of the sequence,
we get the following result:

The expected number of MSTs scoring at least T, the E-value, is given by E(T)=K*exp(-l*T), where K=k*n when searching a single alignment of length n, or K=k*m*n when making all alignments of two sequences of lengths m and n; here k is a constant depending only on the scores and their likelihoods (ie. the scoring matrix and the letter frequencies).

The calculation of k is somewhat more complicated, and will not be dealt with here. For ungapped alignments scored with BLOSUM62, l is approximately 0.318 and k approximately 0.13.
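Putting the pieces together, the E-value calculation can be sketched as follows, assuming the commonly quoted ungapped BLOSUM62 parameters (l ≈ 0.318, k ≈ 0.13); the function name and example lengths are mine:

```python
import math

def e_value(score, m, n, k=0.13, l=0.318):
    # Expected number of chance MSTs scoring >= score when comparing
    # random sequences of lengths m and n: E(T) = k*m*n*exp(-l*T).
    return k * m * n * math.exp(-l * score)

# Comparing two 300-residue sequences: a score of 50 is unlikely by chance.
print(round(e_value(50, 300, 300), 4))  # 0.0015
```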

We now have expressions for the expected number of MSTs with a given score for sequences of given lengths. We may then ask how high a score we should expect by chance, by rewriting the expression for E(T) slightly.

E(T) = k*m*n*exp(-l*T) = exp[-l*(T-t)] where t=ln(k*m*n)/l is roughly the highest score we should expect an MST to reach by chance (the score at which E(t)=1).
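As a numerical illustration of this threshold (again assuming the ungapped BLOSUM62 parameters l ≈ 0.318 and k ≈ 0.13, and example lengths of my choosing):

```python
import math

k, l = 0.13, 0.318   # assumed ungapped BLOSUM62 parameters
m = n = 300          # example sequence lengths

# Score at which E(t) = 1, ie. roughly the best score expected by chance.
t = math.log(k * m * n) / l
print(round(t, 1))  # 29.5
```

Scores well above this value are therefore unlikely to arise by chance for sequences of these lengths.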

The above did not look into alignments with gaps: insertions and deletions. Many of the arguments and approximations would, however, also be applicable to alignments with gaps. The constants l and k must then be modified to account for gaps; they will also depend on the gap opening and gap extension penalties (given an affine gap penalty function).

Another reason for assuming that the above results also apply to alignments with gaps, only with different constants, comes from extreme value theory: the statistics of extreme values (eg. the top alignment scores).

Last modified
June 21, 2007.