Alignment scoring & Blosum62

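The Blosum62 matrix is used for scoring protein alignments. It is derived from blocks of aligned sequences in which sequences sharing at least 62% identity are clustered together and counted as one, so the scores reflect substitutions between more distantly related sequences.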

This page is a working document for information related to making protein alignments.

Average scores

A random pair of amino acids (drawn with uniform probability rather than actual amino acid frequencies) has a score with Mean=-1.0675 (SD=2.18127); this makes Mean/SD=-0.48939.

The random perfect match score, i.e. the diagonal of the scoring matrix, has Mean=5.75 (SD=1.84).
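The two averages above can be reproduced with a few lines of Python. The following is a minimal sketch, assuming the scoring matrix is available as a square list of lists and that the SD is the population standard deviation (both are assumptions). The function name uniform_score_stats and the 3-letter toy matrix are made up for illustration; the toy matrix is not Blosum62.

```python
from statistics import fmean, pstdev


def uniform_score_stats(matrix):
    """Return ((mean, sd) of all entries, (mean, sd) of the diagonal).

    Every ordered pair of letters is weighted equally, i.e. uniform
    letter probabilities; sd is the population standard deviation.
    """
    size = len(matrix)
    all_scores = [matrix[i][j] for i in range(size) for j in range(size)]
    diagonal = [matrix[i][i] for i in range(size)]
    return (
        (fmean(all_scores), pstdev(all_scores)),
        (fmean(diagonal), pstdev(diagonal)),
    )


# Toy symmetric matrix over a 3-letter alphabet (hypothetical, NOT Blosum62).
toy = [
    [4, -1, -2],
    [-1, 5, 0],
    [-2, 0, 6],
]

(all_mean, all_sd), (diag_mean, diag_sd) = uniform_score_stats(toy)
print(f"all pairs: Mean={all_mean:.4f} SD={all_sd:.4f}")
print(f"diagonal:  Mean={diag_mean:.4f} SD={diag_sd:.4f}")
```

Passing the full 20 x 20 matrix instead of the toy one gives the corresponding Blosum62 figures.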

Number of High Scoring Pairs

Heuristic methods for aligning protein sequences, e.g. BLAST, gain speed from identifying segments with positive scores: high scoring pairs (HSPs).

If two sequences have lengths n and m, the number of word pairs of a given length will be on the order of nm. If a fraction, p, of these are considered to be HSPs, the expected number of HSPs will be pnm.
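As a small worked example of the estimate pnm, the sketch below counts the word pairs exactly as (n-k+1)(m-k+1), which is close to nm when k is much smaller than n and m. The function names and the value p = 0.01 are hypothetical and chosen only for illustration.

```python
def word_pair_count(n, m, k):
    """Number of pairs of k-words between sequences of lengths n and m."""
    return max(n - k + 1, 0) * max(m - k + 1, 0)


def expected_hsps(n, m, k, p):
    """Expected number of HSPs if a fraction p of all word pairs are HSPs."""
    return p * word_pair_count(n, m, k)


# Two sequences of length 300, words of length 3, hypothetical p = 0.01.
pairs = word_pair_count(300, 300, 3)
print(pairs)                              # 298 * 298 word pairs
print(expected_hsps(300, 300, 3, 0.01))   # about 1% of them
```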

There are 20 amino acids and hence 20^k k-words; this makes for 20^(2k) pairs of k-words. We may then look up the Blosum62 scores of these pairs and count how many have high scores. For simplicity, we ignore amino acid frequencies.

If we have identified N HSPs of k-words, then for a random word the expected number of matches is N/20^k; if we make a dot-plot of k-words, the fraction of pairs that are HSPs is N/20^(2k).
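One way to obtain N without enumerating all 20^(2k) word pairs is sketched below. It assumes that the score of a k-word pair is the sum of the per-position matrix scores (the page does not state how word scores were computed, so this is an assumption). Under that assumption the score histogram for k-words is the k-fold convolution of the single-letter score histogram. The function names and the toy 3-letter matrix are hypothetical; the toy matrix is not Blosum62.

```python
from collections import Counter


def pair_score_histogram(matrix, k):
    """Map score -> number of ordered k-word pairs with that score."""
    single = Counter(score for row in matrix for score in row)
    hist = Counter({0: 1})  # the empty word pair has score 0
    for _ in range(k):
        longer = Counter()
        for s1, c1 in hist.items():
            for s2, c2 in single.items():
                longer[s1 + s2] += c1 * c2
        hist = longer
    return hist


def hsp_statistics(matrix, k, threshold):
    """Return (N, N/a^k, N/a^(2k)) for alphabet size a = len(matrix)."""
    a = len(matrix)
    hist = pair_score_histogram(matrix, k)
    n_hsp = sum(count for score, count in hist.items() if score >= threshold)
    return n_hsp, n_hsp / a**k, n_hsp / a ** (2 * k)


# Toy symmetric matrix over a 3-letter alphabet (hypothetical, NOT Blosum62).
toy = [
    [4, -1, -2],
    [-1, 5, 0],
    [-2, 0, 6],
]

n_hsp, per_word, fraction = hsp_statistics(toy, k=2, threshold=8)
print(n_hsp, per_word, fraction)
```

With a 20 x 20 matrix the same functions give N, N/20^k and N/20^(2k) for any k and threshold; the work grows with the number of distinct scores rather than with 20^(2k).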

Average number of matches with score no less than

k        0        1        2        3        4        5        6        7        8        9
1     5.60     3.10     1.90     1.20     1.00     .700     .450     .250     .150     .100
2     89.3     63.4     46.2     34.1     24.5     16.9     11.4     7.52     5.07     3.22
3     1595     1210      910      674      489      350      247      172      118     78.8
4    28294    21848    16654    12524     9297     6816     4936     3529     2492     1740
5   4.96e5   3.88e5   2.99e5   2.28e5   1.72e5   1.29e5    95114    69581    50439    36262
6   8.69e6   6.85e6   5.35e6   4.13e6   3.16e6   2.39e6   1.80e6   1.34e6   9.90e5   7.26e5
7   1.52e8   1.21e8   9.53e7   7.44e7   5.75e7   4.41e7   3.36e7   2.54e7   1.90e7   1.42e7
8   2.68e9   2.14e9   1.70e9   1.34e9   1.04e9   8.08e8   6.21e8   4.75e8   3.60e8   2.71e8

k       10       11       12       13       14       15       16       17       18       19
1    .0500     .050        -        -        -        -        -        -        -        -
2     2.01     1.15     .648     .388     .222     .148     .088     .050     .025     .013
3     51.9     33.8     22.0     14.3     9.31     5.95     3.71     2.28     1.36     .824
4     1205      829      566      384      257      171      112     72.4     46.5     29.6
5    25870    18315    12861     8953     6177     4224     2865     1927     1286      852
6   5.28e5   3.82e5   2.74e5   1.95e5   1.37e5    96218    66935    46236    31715    21606
7   1.05e7   7.68e6   5.59e6   4.05e6   2.91e6   2.08e6   1.47e6   1.04e6   7.28e5   5.07e5
8   2.03e8   1.51e8   1.11e8   8.17e7   5.96e7   4.32e7   3.12e7   2.23e7   1.59e7   1.13e7

k       20       25       30       35       40       45       50       60       70       80
1        -        -        -        -        -        -        -        -        -        -
2     .008        -        -        -        -        -        -        -        -        -
3     .481     .026  8.75e-4        -        -        -        -        -        -        -
4     18.6     1.61     .105  4.91e-3  1.44e-4        -        -        -        -        -
5      559     61.1     5.41     .386     .021  8.48e-4  2.22e-5        -        -        -
6    14619     1874      202     18.4     1.39     .085  4.13e-3  3.42e-6        -        -
7   3.51e5    50740     6324      680     62.7     4.94     .327  7.75e-4  5.42e-7        -
8   7.93e6   1.26e6   1.76e5    21514     2302      215     17.5     .073  1.42e-4  8.43e-8

When matching two genes, what matters is how often matching HSPs are found, and how many. The table below gives the inverse frequency, i.e. 20^(2k)/N. For genes encoding proteins of around 300 residues, the number of word pairs will be approximately 100 000 (300 x 300 = 90 000); hence, if HSPs are expected more often than once per 100 000 word pairs, they are expected to be found in most pairs of genes.
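The inverse frequency follows directly from the per-word averages in the first table, since 20^(2k)/N = 20^k / (N/20^k). The sketch below shows that conversion and the comparison against the number of word pairs. The function names are hypothetical; the inputs 33.8 (k=3, score no less than 11) and 17.5 (k=8, score no less than 50) are entries from the first table, and the sequence length 300 is the example from the paragraph above.

```python
def rate_from_average(avg_matches_per_word, k, alphabet_size=20):
    """Inverse frequency 20^(2k)/N, given the per-word average N/20^k."""
    return alphabet_size**k / avg_matches_per_word


def expected_in_pair(rate, n, m):
    """Expected HSPs between sequences of lengths n and m (about nm word pairs)."""
    return n * m / rate


# k = 3, score no less than 11: 33.8 matches per word on average.
rate = rate_from_average(33.8, k=3)
expected = expected_in_pair(rate, 300, 300)
print(f"one HSP per {rate:.0f} word pairs, {expected:.0f} expected")

# k = 8, score no less than 50: 17.5 matches per word on average.
rare_rate = rate_from_average(17.5, k=8)
rare_expected = expected_in_pair(rare_rate, 300, 300)
print(f"one HSP per {rare_rate:.3g} word pairs, {rare_expected:.2g} expected")
```

In the first case the rate is far below the 90 000 word pairs, so such HSPs turn up between almost any two sequences of this length; in the second the rate is far above it, so a hit of that kind is unlikely to occur by chance.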

Rate (i.e. one match per this many word pairs) of matches with score no less than

k        0        1        2        3        4        5        6        7        8        9
1     3.57     6.45     10.5     16.7     20.0     28.6     44.4     80.0      133      200
2     4.48     6.31     8.66     11.7     16.4     23.7     35.1     53.2     78.9      124
3     5.01     6.61     8.79     11.9     16.4     22.9     32.4     46.5     68.1      102
4     5.65     7.32     9.61     12.8     17.2     23.5     32.4     45.3     64.2     91.9
5     6.45     8.25     10.7     14.0     18.6     24.9     33.6     46.0     63.4     88.2
6     7.36     9.34     12.0     15.5     20.3     26.7     35.6     47.8     64.6     88.2
7     8.39     10.6     13.4     17.2     22.3     29.0     38.1     50.5     67.3     90.4
8     9.55     11.9     15.1     19.2     24.5     31.7     41.2     53.9     71.1     94.4

k       10       11       12       13       14       15       16       17       18       19
1      400      400        -        -        -        -        -        -        -        -
2      199      349      618     1032     1798     2712     4571     8000    16000    32000
3      154      237      363      558      859     1344     2155     3514     5869     9712
4      133      193      283      417      621      938     1432     2209     3437     5402
5      124      175      249      357      518      758     1117     1660     2487     3757
6      121      168      234      329      466      665      956     1384     2018     2962
7      122      167      229      316      440      616      869     1232     1759     2527
8      126      170      230      313      429      592      822     1147     1610     2273

k       20       25       30       35       40       45       50       60       70       80
1        -        -        -        -        -        -        -        -        -        -
2    53333        -        -        -        -        -        -        -        -        -
3    16619   3.06e5   9.14e6        -        -        -        -        -        -        -
4     8580    99176   1.52e6   3.26e7   1.11e9        -        -        -        -        -
5     5720    52358   5.92e5   8.30e6   1.51e8   3.78e9  1.44e11        -        -        -
6     4378    34146   3.16e5   3.48e6   4.62e7   7.50e8  1.55e10  1.87e13        -        -
7     3652    25226   2.02e5   1.88e6   2.04e7   2.59e8   3.91e9  1.65e12  2.36e15        -
8     3227    20240   1.45e5   1.19e6   1.11e7   1.19e8   1.46e9  3.51e11  1.80e14  3.04e17
Last modified June 21, 2007.