BLAST, the Basic Local Alignment Search Tool, finds local alignments between two sequences, A and B, by identifying all pairs of w-words (i.e. words of length w) from A and B that score above a given threshold, T. It then extends these w-words in both directions to give maximal sequence pairs (MSTs), i.e. pairs that cannot be extended without lowering the score.
One problem with sequence alignments is that the number of possible alignments is enormous; hence, even by chance alone, several local alignments with high scores must be expected. The question is whether the best alignments found are significantly better than what would be expected by chance alone.
The question of statistical significance is as follows:
Could the discovered alignments just as well have been found in two independent, random sequences?
Basically, if an alignment could well have appeared by chance alone, one cannot conclude that it is a sign of homology (common ancestry).
In statistics, one usually refers to the p-value when describing statistical significance. Basically, if we have found an MST with score S for two sequences, A and B, the p-value may be defined as follows:
The p-value of an MST with score S is the likelihood that two random sequences (of the same lengths and letter compositions as A and B) will have an MST with score greater than or equal to S.
The p-value is not, however, the measure primarily used.
The E-value is similar to the p-value; but whereas the p-value gives the likelihood of getting equally high scores by chance, the E-value gives the expected number of MSTs to be found by chance.
The E-value of an MST with score S is the expected number of MSTs with score S or higher in two random sequences (of the same lengths and letter compositions).
If one assumes, as is natural to do for random sequences, that different MSTs are independent, one would expect that the number of MSTs scoring above a given threshold, S, would be Poisson distributed:
If E(S) is the expected number of MSTs scoring S or higher, the likelihood of getting exactly n such MSTs is E(S)^n * exp(-E(S)) / n!. In particular, the likelihood of getting at least one is 1 - exp(-E(S)).
Hence, p-values and E-values are just different ways of expressing the same thing. It is, however, more common to refer to the E-value.
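The Poisson relation between the two measures can be written out directly. A minimal Python sketch (the function names are illustrative, not from any particular library):

```python
import math

def poisson_prob(n, E):
    # Likelihood of exactly n chance MSTs when the expected number is E:
    # E^n * exp(-E) / n!
    return E**n * math.exp(-E) / math.factorial(n)

def p_value(E):
    # Likelihood of at least one chance MST: 1 - P(exactly 0) = 1 - exp(-E)
    return 1.0 - math.exp(-E)
```

For small E-values the two measures nearly coincide (p is approximately E), while for large E-values the p-value saturates at 1.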
Finding the E-value for a given score, basically amounts to calculating the expected number of MSTs with score S or higher for random sequences. We may then ask if what we actually find is consistent with what might have been expected from alignments of random sequences.
Please note that this section aims at giving a few basic ideas of the mathematics behind this, with emphasis on simplicity rather than completeness or formal correctness, and without going into the mathematical details.
The basic assumption is that we pick random sequences A and B from a given alphabet (e.g. bases or amino acids) with each letter drawn independently and with a given probability: e.g. every letter x in the alphabet has a given probability px.
We then make a second simplifying assumption. If we make a complete table of A versus B, i.e. with the sequence A horizontally and B vertically, making an m*n-table, the MSTs are segments along the diagonals that provide high scores. We assume that we may analyse each diagonal separately. This, rather than an m*n-table, gives us a list of alignments corresponding to the diagonals. Hence, we may start off by analysing two sequences of the same length, A=a1a2...an and B=b1b2...bn, where ai is matched against bi.
Rather than thinking of a sequence of random pairs of letters, a and b, drawn with likelihood pa*pb, this second assumption allows us to think of random scores s=score(a,b) being drawn with given probabilities. So, each diagonal is a sequence of random scores.
Please note that, for any analysis of local alignments to be meaningful, the expected random score must be negative. Otherwise, one could expect to increase the total score simply by extending the alignment indefinitely.
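This requirement is easy to check for a given scoring scheme. A small Python sketch, using a hypothetical DNA scheme (uniform letter frequencies, match +1, mismatch -1) chosen purely for illustration:

```python
# Hypothetical scheme: each letter drawn with probability 0.25,
# match scores +1, mismatch scores -1.
letters = "ACGT"
p = {x: 0.25 for x in letters}

def score(a, b):
    return 1 if a == b else -1

# Expected score of a random pair: sum over all pairs of pa*pb*score(a,b)
expected = sum(p[a] * p[b] * score(a, b) for a in letters for b in letters)
# Here: 0.25*(+1) + 0.75*(-1) = -0.5, negative as required
```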
Let's assume that we have a random sequence of scores: s1, s2.... We may then find the cumulative scores: S1=s1, S2=s1+s2, ... and ask the following question:
For a given threshold, T, what is the likelihood that the cumulative score, Si, will exceed T for any i?
Now, let us assume that we have a function, Q(T), that gives this likelihood:
Let Q(T) be the likelihood that, for a random sequence of scores, the cumulative score exceeds the threshold T at some point.
Initially, the function Q(T) may seem very complex and impractical to calculate. However, we can make rather strong statements about it just by adding a few assumptions: well, approximations actually. If we assume, as an approximation, that the scoring is a continuous process rather than a stepwise one - in particular, that in order for the score to exceed T, it must at some point equal T - we can make the following argument:
The likelihood of reaching a score T is Q(T). Now, if s1+s2+...+sk=T, we may use the same result on the sequence sk+1, sk+2, ... to argue that the likelihood of gaining a further score of U from there is Q(U). Hence, the likelihood, Q(T+U), of reaching a score of T+U is given by Q(T)*Q(U). The only function that satisfies this is Q(T)=exp(-l*T) for some constant l.
Of course, all values here move in steps and are not continuous. However, the continuity assumption serves well as an approximation.
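The exponential form of Q(T) can be checked by simulation. A sketch under an assumed, hypothetical score distribution (+1 with probability 0.25, -1 with probability 0.75, so the expected score is -0.5); for this particular walk the constant works out to l = ln(3), so the theory predicts Q(T) = (1/3)^T:

```python
import math
import random

random.seed(1)

def max_cumulative(walk):
    # Highest cumulative score S1, S2, ... reached along the walk
    best = total = 0
    for s in walk:
        total += s
        if total > best:
            best = total
    return best

# Estimate Q(3): the likelihood that the cumulative score ever reaches 3
T = 3
trials = 20000
hits = 0
for _ in range(trials):
    walk = [1 if random.random() < 0.25 else -1 for _ in range(300)]
    if max_cumulative(walk) >= T:
        hits += 1

estimate = hits / trials
theory = math.exp(-math.log(3) * T)   # (1/3)**3, about 0.037
```

The simulated frequency should agree with exp(-l*T) up to sampling noise, despite the walk being stepwise rather than continuous.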
To find the constant, l, we can again look at the start of the sequence. If the first score, s1=s, is known, the likelihood of reaching T is Q(T-s). If the possible scores are si with likelihoods pi, we can sum over all these possibilities:
Q(S) = p1*Q(S-s1) + p2*Q(S-s2) + ... writes out to give exp(-l*S) = p1*exp[-l*(S-s1)] + p2*exp[-l*(S-s2)] + ... = exp(-l*S)*[p1*exp(l*s1) + p2*exp(l*s2) + ...]. Dividing both sides by exp(-l*S) gives the equation p1*exp(l*s1) + p2*exp(l*s2) + ... = 1, which we may solve to find l.
So, now we know that Q(T)=exp(-l*T) where we know the value of l.
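The equation p1*exp(l*s1) + p2*exp(l*s2) + ... = 1 rarely has a closed-form solution, but it is easy to solve numerically. A sketch in Python using bisection, for a hypothetical two-score distribution (+1 with probability 0.25, -1 with probability 0.75):

```python
import math

scores = [1, -1]        # possible scores si (hypothetical example)
probs = [0.25, 0.75]    # their likelihoods pi; expected score is -0.5 < 0

def f(l):
    # p1*exp(l*s1) + p2*exp(l*s2) + ... - 1
    return sum(p * math.exp(l * s) for p, s in zip(probs, scores)) - 1.0

# l = 0 is always a trivial root. Because the expected score is negative,
# f is negative just above 0 and eventually grows without bound, so there
# is a unique positive root; find it by bisection.
lo, hi = 1e-9, 10.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if f(mid) < 0:
        lo = mid
    else:
        hi = mid
l = 0.5 * (lo + hi)
# For this distribution, 0.25*x + 0.75/x = 1 with x = exp(l)
# has the nontrivial root x = 3, so l = ln(3)
```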
Let's assume that we first find all subsequences that exceed a threshold T0, and that there are N0 of these. As before, we can argue that if we have a sequence whose score has reached T0, the likelihood that it will gain a further score of U from the point where T0 was reached is Q(U). Hence, if we let T=T0+U, we find that the expected number of MSTs exceeding a score T is N0*exp[-l*(T-T0)]. By setting K=N0*exp(l*T0), and arguing that the expected number should be proportional to the length of the sequence, we get the following result:
The expected number of MSTs scoring at least T, the E-value, is given by E(T)=K*exp(-l*T), where K=k*n when searching a single alignment of length n, or K=k*m*n when making all alignments of two sequences of lengths m and n. Here, k is a constant depending only on the scores and their likelihoods (i.e. the scoring matrix and the letter frequencies).
The calculation of k is somewhat more complicated, and will not be dealt with here. For ungapped alignments with Blosum62 and standard amino acid frequencies, l is approximately 0.318 and k is approximately 0.13.
We now have expressions for the expected number of MSTs with a given score for sequences of given lengths. We may then find what MST scores should be expected by rewriting the expression for E(T) slightly.
E(T) = k*m*n*exp(-l*T) = exp[-l*(T-t)] where t=ln(k*m*n)/l is roughly the highest MST score we should expect to find by chance: it is the score at which the expected number of chance MSTs equals 1.
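Putting the pieces together, the E-value and the characteristic score t can be computed directly. A sketch with assumed, hypothetical numbers (k = 0.13 and l = 0.318, roughly the ungapped Blosum62 values, and two sequences of length 1000 each):

```python
import math

k, l = 0.13, 0.318   # assumed constants (approximate ungapped Blosum62 values)
m, n = 1000, 1000    # sequence lengths

def e_value(T):
    # Expected number of chance MSTs scoring T or higher
    return k * m * n * math.exp(-l * T)

t = math.log(k * m * n) / l
# By construction e_value(t) = 1: t is the score at which exactly one
# chance MST is expected, so only scores well above t are significant.
```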
The above did not look into alignments with gaps: insertions and deletions. Many of the arguments and approximations would, however, also be applicable to alignments with gaps. The constants l and k must then be modified to account for gaps; they will depend on the gap opening and gap extension penalties (given a linear gap penalty function).
Another reason for assuming that the above results also apply to alignments with gaps, only with different constants, comes from extreme value theory: the statistics of extreme values (e.g. the top alignment scores).