My part of the DUS project involved statistical analyses of the frequencies of DNA uptake signalling sequences (DUS) in the coding regions of various bacterial genomes. The DUS contain strongly conserved words of lengths 9 and 10, and one part of the analysis involved finding the most frequent words of these lengths: or, more precisely, the most abundant words, where abundance is defined as the observed number of occurrences divided by the expected number.

Analysing the abundance of specific words in a DNA sequence requires a model for random DNA to give the expected number of occurrences. At present, Markov chains are the most popular model for this kind of analysis, though a problem with this approach has been pointed out: the parameters of the Markov chain are not known and must be estimated from the DNA sequence, which adds additional variance. An alternative formulation of the problem is that the Markov model does not produce random sequences with the same composition as the original DNA sequence, but with a composition that may vary.

A better approach is to use shuffled sequences: i.e. the sample space for random sequences consists of all sequences with the same composition as the original DNA sequence. The problem with this approach is that little is known about word frequencies in shuffled sequences.

The most commonly used model for random DNA or amino acid
sequences is the *Markov chain*. This is a random
sequence `X_{1}X_{2}X_{3}...` where the distribution of each
`X_{i}` depends only on the preceding symbol `X_{i-1}`, through
a fixed set of transition probabilities.
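As a sketch of how the expected number of occurrences is obtained under this model, assuming a first-order chain with parameters estimated from the sequence itself (the function name and the use of plain empirical frequencies are illustrative choices, not a description of the project's actual code):

```python
from collections import Counter

def expected_word_count(seq, word):
    """Approximate expected number of occurrences of `word` in a random
    sequence of the same length, under a first-order Markov chain whose
    parameters are estimated from `seq` itself (empirical frequencies)."""
    n = len(seq)
    single = Counter(seq)               # N(s): symbol counts
    pairs = Counter(zip(seq, seq[1:]))  # N(s->t): transition counts
    prob = single[word[0]] / n          # start with empirical freq of w0
    for s, t in zip(word, word[1:]):
        if single[s] == 0:
            return 0.0
        prob *= pairs[(s, t)] / single[s]   # P(t | s) = N(s->t) / N(s)
    return (n - len(word) + 1) * prob       # number of possible positions
```

For example, `expected_word_count("AAAA", "AA")` gives 3 positions times 4/4 times 3/4, i.e. 2.25, against an actual count of 3.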

A good approximation for most word counts in a Markov chain is a Poisson distribution. This was also the distribution I initially used in my analyses. However, I quickly found that the actual word counts did not follow a Poisson distribution: in general, the deviation from the expected value was much larger than the Poisson distribution allows. This may be a natural consequence of variations in DNA composition along the sequence, which invalidate the Markov assumption. My first solution, as used in the DUS project, was a simple and pragmatic one: I modelled the word count frequency as an over-dispersed Poisson distribution, where the expected value was given by the Markov chain model, but the dispersion parameter was estimated from the complete set of word counts. The standard deviation was often around 50% larger than for the Poisson distribution, so the correction was vital in order to get reasonable results.
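A minimal sketch of such a dispersion correction; the moment-based estimator shown here (the mean squared Pearson residual) is one simple choice, not necessarily the exact estimator used in the project:

```python
import math

def estimate_dispersion(observed, expected):
    """Moment-based estimate of the dispersion parameter phi: the mean
    squared Pearson residual (obs - exp)^2 / exp over all word counts.
    phi = 1 for a plain Poisson model; phi > 1 indicates over-dispersion
    and phi < 1 under-dispersion."""
    return sum((o - e) ** 2 / e
               for o, e in zip(observed, expected)) / len(observed)

def z_score(obs, exp, phi):
    """Standardised deviation of one word count under the over-dispersed
    model: variance phi * exp instead of the Poisson variance exp."""
    return (obs - exp) / math.sqrt(phi * exp)
```

With phi around 2.25 the standard deviation is 50% larger than the Poisson one, which is the size of correction mentioned above.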

In a few special cases, however, I found under-dispersion rather than over-dispersion. DNA heterogeneity can cause over-dispersion, but not under-dispersion, so this clearly contradicts the above assumptions.

The Markov chain has another problem, aside from the one caused by sequence heterogeneity, which explains the occasional under-dispersion: it assumes that the transition probabilities are known. This is normally solved by estimating the transition probabilities from the transition frequencies in the sequence being analysed. However, when random sequences are generated using a Markov model with these transition probabilities, the composition and transition frequencies of the random sequences are allowed to vary. The Markov chain thus allows too much variation, which explains why real data will be under-dispersed relative to the Markov chain when there is little sequence heterogeneity.

This problem was first solved by Fisher for 2×2 tables by
fixing the marginal distribution. For random sequences, this amounts
to restricting random sequences to sequences with the exact same
composition and transition frequency as the sequence being
analysed. These are the *shuffled sequences*.
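Generating such shuffled sequences amounts to sampling a uniform random Eulerian path in the multigraph whose edges are the transitions of the original sequence. A rejection-sampling sketch (simple, though not the most efficient method; the Altschul–Erickson algorithm is the classical alternative):

```python
import random
from collections import defaultdict

def doublet_preserving_shuffle(seq, rng=None):
    """Return a shuffle of seq with exactly the same transition (doublet)
    counts, sampled uniformly among all such sequences.

    Each such sequence is an Eulerian path in the multigraph whose edges
    are the transitions of seq.  Randomly ordering every symbol's outgoing
    edges defines one walk; accepting only walks that use every edge
    yields a uniform sample over Eulerian paths (rejection sampling)."""
    rng = rng or random.Random()
    edges = defaultdict(list)              # edges[s] = successors of s
    for s, t in zip(seq, seq[1:]):
        edges[s].append(t)
    n = len(seq)
    while True:
        order = {s: rng.sample(ts, len(ts)) for s, ts in edges.items()}
        used = defaultdict(int)            # edges consumed per symbol
        walk, cur = [seq[0]], seq[0]
        while used[cur] < len(order.get(cur, ())):
            nxt = order[cur][used[cur]]
            used[cur] += 1
            walk.append(nxt)
            cur = nxt
        if len(walk) == n:                 # every transition used: accept
            return ''.join(walk)
```

Every accepted sequence starts and ends with the same symbols as the original and has identical composition and transition counts.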

If we start with a sequence `x=x_{1}...x_{n}`, we can count the
number `N(t)` of occurrences of each symbol `t` and the number
`N(s→t)` of each transition `s→t`. The shuffled sequences are then
exactly the sequences with these same counts.

Sequence shuffling resolves the problem of unknown transition probabilities because the model no longer depends on them. The main problem with sequence shuffling, however, was that little was known about the distribution of the number of occurrences of longer words in shuffled sequences, whereas there is a rich theory available for Markov chains.

The project consisted of two major steps. First, I found a way to calculate the word count distribution in shuffled sequences. Though these results are exact, I find that there are approximations that are good enough for most practical purposes, and more convenient to work with. These results are presented in the article *Exact distribution of word counts in shuffled sequences*. The solution is a generalization of hypergeometric distributions.

Secondly, DNA heterogeneity may again be compensated for by allowing over-dispersion, as I did for the Poisson model.

The number of transitions `s→t→u` in
a randomly shuffled sequence is essentially the same as Fisher's
2×2 table: we know the number of occurrences of `t` and the
numbers of transitions `s→t` and `t→u`, and if these transitions are independent,
the resulting number of `s→t→u` has a
hypergeometric distribution `Hyp[N(s→t),N(t→u);N(t)]`. The
requirement that the transitions combine into a single sequence,
rather than a set of sequences, is a restriction, so the transitions
are not entirely independent; however, the dependency can be
calculated and is generally very small except for very short sequences.
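A small sketch of this distribution, using the standard hypergeometric probability mass function and toy counts chosen purely for illustration:

```python
from math import comb

def hyp_pmf(k, Nst, Ntu, Nt):
    """P(N(s->t->u) = k) under Hyp[N(s->t), N(t->u); N(t)]: of the N(t)
    occurrences of t, N(s->t) are preceded by s; the N(t->u) occurrences
    followed by u act as a random draw among them, and k counts the
    overlap, i.e. the number of s->t->u transitions."""
    return comb(Nst, k) * comb(Nt - Nst, Ntu - k) / comb(Nt, Ntu)

# Toy counts (illustrative): N(t)=5, N(s->t)=2, N(t->u)=3.
mean = sum(k * hyp_pmf(k, 2, 3, 5) for k in range(3))
# The mean equals N(s->t)*N(t->u)/N(t) = 2*3/5 = 1.2.
```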

When counting occurrences of the word `w=w_{0}...w_{p}`, we can iterate this
process. Let `N(w_{0}...w_{k})` denote the number of occurrences of the
prefix `w_{0}...w_{k}`. If the transitions are treated as independent, then
given `N(w_{0}...w_{k})` and `N(w_{k}→w_{k+1})`, the count
`N(w_{0}...w_{k+1})` again follows a hypergeometric distribution, and
iterating over `k` gives the distribution of the full word count.
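The expected value under this iterated scheme follows by multiplying the hypergeometric means step by step; a sketch of that approximation (the function name is illustrative, and this gives the approximate mean only, not the exact generalized hypergeometric distribution of the paper):

```python
from collections import Counter

def approx_expected_count(seq, word):
    """Iterate the hypergeometric mean N(s->t)*N(t->u)/N(t): start from
    the transition count N(w0->w1), then multiply by
    N(w_k -> w_{k+1}) / N(w_k) for each further letter of the word."""
    N1 = Counter(seq)                    # N(t): symbol counts
    N2 = Counter(zip(seq, seq[1:]))      # N(s->t): transition counts
    e = N2[(word[0], word[1])]
    for k in range(1, len(word) - 1):
        e *= N2[(word[k], word[k + 1])] / N1[word[k]]
    return e
```

For a word of length 2 this is simply the observed transition count, as it must be, since that count is fixed by the shuffling.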

A paper containing the theoretical results appeared in Advances in Applied Probability in March 2006.

I have a manuscript in progress dealing with the practical use, including the problem of DNA heterogeneity, although it has been put aside for a while awaiting publication of the theoretical results. I am also planning to make a computer program, probably in Java, and perhaps also a web service, to do the computations.

Last modified
October 28, 2013.