The Blosum62 matrix is used for scoring protein alignments. It is derived from conserved, ungapped blocks of aligned sequences in which sequences sharing at least 62% identity were clustered together.
This page is a working document for information related to making protein alignments.
A random pair of amino acids, drawn with uniform probability rather than actual amino acid frequencies, has a mean score of -1.0675 (SD = 2.18127), so Mean/SD = -0.48939.
The score of a random perfect match, i.e. a diagonal entry of the scoring matrix, has a mean of 5.75 (SD = 1.84).
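As a sketch of how such summary statistics can be computed, the functions below take any square substitution matrix (a list of rows) and return the mean and standard deviation over all ordered residue pairs, and over the diagonal alone, with every residue weighted equally. The names `pair_stats` and `diagonal_stats` and the 3-letter toy matrix are illustrative assumptions; the toy matrix is not Blosum62. The text does not say whether the population or the sample standard deviation was used, so the sketch assumes the population form.

```python
from statistics import fmean, pstdev

def pair_stats(matrix):
    """Mean and population SD over all ordered residue pairs (uniform weights)."""
    scores = [s for row in matrix for s in row]
    return fmean(scores), pstdev(scores)

def diagonal_stats(matrix):
    """Mean and population SD of the perfect-match (diagonal) scores."""
    diag = [matrix[i][i] for i in range(len(matrix))]
    return fmean(diag), pstdev(diag)

# Hypothetical 3-letter toy matrix, for illustration only (not Blosum62).
toy = [
    [ 4, -1, -2],
    [-1,  5,  0],
    [-2,  0,  6],
]

mean_all, sd_all = pair_stats(toy)
mean_diag, sd_diag = diagonal_stats(toy)
print(mean_all, sd_all, mean_all / sd_all)  # mean, SD and Mean/SD over all pairs
print(mean_diag, sd_diag)                   # mean and SD of the diagonal
```

Passing a full 20 × 20 scoring matrix in place of `toy` would give the corresponding figures for that matrix.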
Heuristic methods for aligning protein sequences, e.g. BLAST, gain speed by first identifying segments with positive scores: high-scoring pairs (HSPs).
If two sequences have lengths n and m, the number of word pairs of a given length will be on the order of nm. If a fraction, p, of these are considered to be HSPs, the expected number of HSPs will be pnm.
There are 20 amino acids, hence 20^k k-words; this makes for 20^(2k) pairs of k-words. We may then find the Blosum62 scores of these pairs and count how many have high scores. For simplicity, we ignore amino acid frequencies.
If we have identified N HSPs among all pairs of k-words, then for a random word the expected number of matches is N/20^k; if we make a dot-plot of k-words, the fraction of pairs that are HSPs is N/20^(2k).
Average number of matches with score no less than:

| k | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5.60 | 3.10 | 1.90 | 1.20 | 1.00 | .700 | .450 | .250 | .150 | .100 |
| 2 | 89.3 | 63.4 | 46.2 | 34.1 | 24.5 | 16.9 | 11.4 | 7.52 | 5.07 | 3.22 |
| 3 | 1595 | 1210 | 910 | 674 | 489 | 350 | 247 | 172 | 118 | 78.8 |
| 4 | 28294 | 21848 | 16654 | 12524 | 9297 | 6816 | 4936 | 3529 | 2492 | 1740 |
| 5 | 4.96e5 | 3.88e5 | 2.99e5 | 2.28e5 | 1.72e5 | 1.29e5 | 95114 | 69581 | 50439 | 36262 |
| 6 | 8.69e6 | 6.85e6 | 5.35e6 | 4.13e6 | 3.16e6 | 2.39e6 | 1.80e6 | 1.34e6 | 9.90e5 | 7.26e5 |
| 7 | 1.52e8 | 1.21e8 | 9.53e7 | 7.44e7 | 5.75e7 | 4.41e7 | 3.36e7 | 2.54e7 | 1.90e7 | 1.42e7 |
| 8 | 2.68e9 | 2.14e9 | 1.70e9 | 1.34e9 | 1.04e9 | 8.08e8 | 6.21e8 | 4.75e8 | 3.60e8 | 2.71e8 |

| k | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | .050 | .050 | - | - | - | - | - | - | - | - |
| 2 | 2.01 | 1.15 | .648 | .388 | .222 | .148 | .088 | .050 | .025 | .013 |
| 3 | 51.9 | 33.8 | 22.0 | 14.3 | 9.31 | 5.95 | 3.71 | 2.28 | 1.36 | .824 |
| 4 | 1205 | 829 | 566 | 384 | 257 | 171 | 112 | 72.4 | 46.5 | 29.6 |
| 5 | 25870 | 18315 | 12861 | 8953 | 6177 | 4224 | 2865 | 1927 | 1286 | 852 |
| 6 | 5.28e5 | 3.82e5 | 2.74e5 | 1.95e5 | 1.37e5 | 96218 | 66935 | 46236 | 31715 | 21606 |
| 7 | 1.05e7 | 7.68e6 | 5.59e6 | 4.05e6 | 2.91e6 | 2.08e6 | 1.47e6 | 1.04e6 | 7.28e5 | 5.07e5 |
| 8 | 2.03e8 | 1.51e8 | 1.11e8 | 8.17e7 | 5.96e7 | 4.32e7 | 3.12e7 | 2.23e7 | 1.59e7 | 1.13e7 |

| k | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 60 | 70 | 80 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | - | - | - | - | - | - | - | - | - | - |
| 2 | .008 | - | - | - | - | - | - | - | - | - |
| 3 | .481 | .026 | 8.75e-4 | - | - | - | - | - | - | - |
| 4 | 18.6 | 1.61 | .105 | 4.91e-3 | 1.44e-4 | - | - | - | - | - |
| 5 | 559 | 61.1 | 5.41 | .386 | .021 | 8.48e-4 | 2.22e-5 | - | - | - |
| 6 | 14619 | 1874 | 202. | 18.4 | 1.39 | .085 | 4.13e-3 | 3.42e-6 | - | - |
| 7 | 3.51e5 | 50740 | 6324 | 680. | 62.7 | 4.94 | .327 | 7.75e-4 | 5.42e-7 | - |
| 8 | 7.93e6 | 1.26e6 | 1.76e5 | 21514 | 2302 | 215. | 17.5 | .073 | 1.42e-4 | 8.43e-8 |
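Counts like those tabulated above need not be obtained by enumerating every pair of k-words. One possible approach, sketched here without any claim that it is how the table was produced, uses the fact that the score of a k-word pair is a sum of k residue-pair scores, so its distribution is the k-fold convolution of the single-pair score counts. The function names and the 3-letter toy matrix are hypothetical; with a 20 × 20 matrix, `len(matrix) ** k` is the number of k-words.

```python
from collections import Counter

def word_pair_score_counts(matrix, k):
    """Count ordered pairs of k-words by total score (uniform residue weights)."""
    single = Counter(s for row in matrix for s in row)
    total = Counter({0: 1})  # the single empty word pair, with score 0
    for _ in range(k):
        nxt = Counter()
        # Extend every word pair by one residue pair and add the scores.
        for score_a, count_a in total.items():
            for score_b, count_b in single.items():
                nxt[score_a + score_b] += count_a * count_b
        total = nxt
    return total

def matches_per_word(matrix, k, threshold):
    """Average number of k-words scoring >= threshold against a random k-word."""
    counts = word_pair_score_counts(matrix, k)
    n_high = sum(c for s, c in counts.items() if s >= threshold)
    return n_high / len(matrix) ** k

# Hypothetical 3-letter toy matrix, for illustration only (not Blosum62).
toy = [
    [ 4, -1, -2],
    [-1,  5,  0],
    [-2,  0,  6],
]

print(matches_per_word(toy, 1, 0))  # single residues, score >= 0
print(matches_per_word(toy, 2, 9))  # 2-words, score >= 9
```

The work grows with k times the number of distinct scores rather than with the number of word pairs, which is what makes large k feasible.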
When matching two protein-coding genes, what matters is how often HSPs are found, and how many. The table below gives the inverse frequency, i.e. 20^(2k)/N. For proteins of around 300 residues, the number of word pairs will be approximately 100 000 (300 × 300 = 90 000); hence, if HSPs are expected more often than once per 100 000 word pairs, they are expected to be found in most pairs of genes.
Rate (i.e. 1 per N) of matches with score no less than:

| k | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.57 | 6.45 | 10.5 | 16.7 | 20.0 | 28.6 | 44.4 | 80.0 | 133 | 200 |
| 2 | 4.48 | 6.31 | 8.66 | 11.7 | 16.4 | 23.7 | 35.1 | 53.2 | 78.9 | 124 |
| 3 | 5.01 | 6.61 | 8.79 | 11.9 | 16.4 | 22.9 | 32.4 | 46.5 | 68.1 | 102 |
| 4 | 5.65 | 7.32 | 9.61 | 12.8 | 17.2 | 23.5 | 32.4 | 45.3 | 64.2 | 91.9 |
| 5 | 6.45 | 8.25 | 10.7 | 14.0 | 18.6 | 24.9 | 33.6 | 46.0 | 63.4 | 88.2 |
| 6 | 7.36 | 9.34 | 12.0 | 15.5 | 20.3 | 26.7 | 35.6 | 47.8 | 64.6 | 88.2 |
| 7 | 8.39 | 10.6 | 13.4 | 17.2 | 22.3 | 29.0 | 38.1 | 50.5 | 67.3 | 90.4 |
| 8 | 9.55 | 11.9 | 15.1 | 19.2 | 24.5 | 31.7 | 41.2 | 53.9 | 71.1 | 94.4 |

| k | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 400 | 400 | - | - | - | - | - | - | - | - |
| 2 | 199 | 349 | 618 | 1032 | 1798 | 2712 | 4571 | 8000 | 16000 | 32000 |
| 3 | 154 | 237 | 363 | 558 | 859 | 1344 | 2155 | 3514 | 5869 | 9712 |
| 4 | 133 | 193 | 283 | 417 | 621 | 938 | 1432 | 2209 | 3437 | 5402 |
| 5 | 124 | 175 | 249 | 357 | 518 | 758 | 1117 | 1660 | 2487 | 3757 |
| 6 | 121 | 168 | 234 | 329 | 466 | 665 | 956 | 1384 | 2018 | 2962 |
| 7 | 122 | 167 | 229 | 316 | 440 | 616 | 869 | 1232 | 1759 | 2527 |
| 8 | 126 | 170 | 230 | 313 | 429 | 592 | 822 | 1147 | 1610 | 2273 |

| k | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 60 | 70 | 80 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | - | - | - | - | - | - | - | - | - | - |
| 2 | 53333 | - | - | - | - | - | - | - | - | - |
| 3 | 16619 | 3.06e5 | 9.14e6 | - | - | - | - | - | - | - |
| 4 | 8580 | 99176 | 1.52e6 | 3.26e7 | 1.11e9 | - | - | - | - | - |
| 5 | 5720 | 52358 | 5.92e5 | 8.30e6 | 1.51e8 | 3.78e9 | 1.44e11 | - | - | - |
| 6 | 4378 | 34146 | 3.16e5 | 3.48e6 | 4.62e7 | 7.50e8 | 1.55e10 | 1.87e13 | - | - |
| 7 | 3652 | 25226 | 2.02e5 | 1.88e6 | 2.04e7 | 2.59e8 | 3.91e9 | 1.65e12 | 2.36e15 | - |
| 8 | 3227 | 20240 | 1.45e5 | 1.19e6 | 1.11e7 | 1.19e8 | 1.46e9 | 3.51e11 | 1.80e14 | 3.04e17 |
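To make the use of the rate table concrete, the sketch below turns a tabulated rate into an expected number of HSPs for a pair of sequences. The function name `expected_hsps` is an illustrative assumption. It counts overlapping word positions exactly as (n - k + 1)(m - k + 1), which is close to nm for sequences much longer than k, and like the tables it ignores amino acid frequencies.

```python
def expected_hsps(n, m, k, rate):
    """Expected number of k-word HSPs between sequences of lengths n and m.

    `rate` is the inverse frequency read from the table (one HSP per
    `rate` word pairs), so p = 1 / rate and the expectation is
    p times the number of word pairs.
    """
    word_pairs = (n - k + 1) * (m - k + 1)
    return word_pairs / rate

# Rates read from the table above for k = 3, two sequences of 300 residues.
at_score_11 = expected_hsps(300, 300, 3, 237)     # score no less than 11
at_score_25 = expected_hsps(300, 300, 3, 3.06e5)  # score no less than 25
print(at_score_11, at_score_25)
```

Under these assumptions a threshold of 11 gives several hundred chance HSPs per pair of 300-residue sequences, while a threshold of 25 gives fewer than one.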