My part of the DUS project involved statistical analyses of the frequencies of DNA uptake signalling sequences (DUS) in the coding regions of various bacterial genomes. The DUS contain strongly preserved words of lengths 9 and 10, and one part of the analyses involved finding the most frequent words of these lengths: or, more precisely, the most abundant words, where abundance is defined as the number of words found divided by the expected number.
Analysing abundance of specific words in a DNA sequence requires a model for random DNA to give the expected number of occurrences. At present, Markov chains are the most popular for this kind of analyses, though a problem with this approach has been pointed out: the parameters of the Markov chain are not known and must be estimated from the DNA sequence, which adds addition variance. An alternative formulation of the problem is that the Markov model does not produce random sequences with the same composition as the original DNA sequence, but with a composition that may vary.
A better approach is to use shuffled sequences: i.e. the sample space for random sequences consists of all sequences with the same composition as the original DNA sequence. The problem with this approach is that little is known about word frequencies in shuffled sequences.
The most commonly used model for random DNA or amino acid sequences, is the Markov chain. This is a random sequence X1X2X3... where the transition probabilities P[Xn+1=s|Xn=t]=f(s|t) are given. The order k case, where transition probabilities depend on the last k letters, can be expressed as a sequence (x1...xk)(x2...xk+1)... of k-words, so the above formulation is sufficiently general.
A good approximation for most word counts in a Markov chain is a Poisson distribution. This was also the distribution I initially used in my analyses. However, I quickly found that the actual word counts did not follow a Poisson distribution. In general, the deviation from the expected value was much larger than in the Poisson distribution. This may be a natural consequence of variations in DNA composition along the sequence, which invalidates the Markov assumption. My first solution, as used in the DUS project, was a simple and pragmatic one: I modelled the word count frequency as an over-dispersed Poisson distribution where the expected value was as given by the Markov chain model, but estimated the dispersion parameter from the complete set of word counts. The standard deviation was often around 50% larger than for the Poisson distribution, so the correction was vital in order to get reasonable results.
In a few special cases, however, I found under-dispersion rather than over-dispersion. DNA heterogeneity can cause over-dispersion, but not under-dispersion, so this clearly contradicts the above assumptions.
The Markov chain has another problem, aside from that coming sequence heterogeneity, which explains the occasional under-dispersion: it assumes that the transition probabilities are known. This is normally solved by estimating the transition probabilities using the transition frequencies in the sequence being analysed. However, when random sequences are generated using a Markov model with these transition probabilities, the composition and transition frequencies in the random sequences are allowed to vary. The Markov chain thus allows too much variation, which explains why real data will be under-dispersed relative to the Markov chain when there is little sequence heterogeneity.
This problem was first solved by Fisher for 2×2 tables by fixing the marginal distribution. For random sequences, this amounts to restricting random sequences to sequences with the exact same composition and transition frequency as the sequence being analysed. These are the shuffled sequences.
If we start with a sequence x=x1...xn, we can count the number Nx(s→t) of transitions s→t (i.e. the number of i with xi=s and xi+1=t). We may either let the sequence be cyclic (i.e. xn→x1), or include end→x1 and xn→end in the counts. A shuffling of x is then a sequence y so that Nx(s→t)=Ny(s→t) for all s and t. A random shuffling simply consists of picking shuffled sequences with uniform probability; this is the same as restricting a Markov chain to sequences with the exact same transition counts as the original sequence.
Sequence shuffling resolves the problem of unknown transition probabilitie because the model no longer depends on them. The main problem with sequence shuffling, however, was that little was known about the distribution of the number of occurences of longer words in shuffled sequences, whereas there is a rich theory available for Markov chains.
The project consisted of two major steps. First, I found a way to calculate the word count distribution in shuffled sequences. Though these results are exact, I find that there are approximations that are good enough for most practical purposes, and more convenient to work with. These results are presented in the article Exact distribution of word counts in shuffled sequences. The solution is a generalization of hypergeometric distributions.
Secondly, DNA heterogeneity may again be compensated for by allowing over-dispersion as I did for the Poisson model.
The number of transitions s→t→u in a randomly shuffled sequence is essentially the same as Fisher's 2×2 table: we know the number of t and the number of transitions s→t and t→u, and if these transitions are independent, the resulting number of s→t→u has a hypergeometric distribution Hyp[N(s→t),N(t→u);N(t)]. Since it is a restriction that transitions be combined to produce one sequence rather than a set of sequences, the transitions are not entirely independent, but the dependency can be calculated and is generally very small except for very short sequences.
When counting occurences of the word w=w0...wp, we can iterate this process. Let N_1=N(w0→w1), and N_i=Hyp[Ni-1,N(wi→wi+1);N(wi)]. Then, the iterated hypergeometric distribution Np-1 gives the number of occurences of w. A correction must be introduced to account for the restriction that transitions are not entirely independent, and if some of the letters wi are identical, this also changes the form of the expression slightly.
I have a manuscript in progress dealing with the practical use including the problem of DNA heterogeneity, although it has been put aside for a while awaiting publication of the theoretical results. Also, I am planing to make a computer program, probably in Java, and perhaps also as a web service, to do the computations.