Normalisation of microarray gene expression data
The purpose of data normalisation in general is to remove systematic differences between data from different sources so as to make them comparable. For microarrays in particular, the aim is to remove differences that are due to sample handling, laboratory procedures, technical differences between arrays, between batches of arrays or platforms, between centres and laboratories, and any other systematic differences that do not reflect actual biological differences between the samples.
Microarray technology and sources of errors
A microarray consists of an array with a large number of spots (or features): e.g. 60000 spots per array in recent Agilent arrays. In addition, several arrays may be placed on a single slide: e.g. 8 arrays on a slide for the 2×4 60k Agilent array. Bound in each spot is a particular type of molecule, referred to as the probe. For gene expression arrays, this is typically a short DNA sequence. These probes can bind RNA or DNA fragments in the sample, which we refer to as the targets, e.g. mRNA from a sample. Typically, the mRNA from the sample material is labelled with a fluorescent dye so that the amount of mRNA bound to the array can be estimated from a scanned image of the array. The brightness of each spot in the scanned image is referred to as the signal intensity and indicates the amount of target molecules bound to that spot.
Gene expression refers to the quantities of mRNA in a sample. Ideally, one might hope to get a measure of the actual quantity of each mRNA, or at least the proportion of each mRNA within the sample. Microarray technologies do not quite achieve this, owing to several sources of error and bias.
The spike-in reagent contains known target sequences at known concentrations and is used to assess the microarray data. Ideally, the signal should be proportional to the concentration. At high concentrations this is approximately true, with a little variation between replicates and some between probes. At low concentrations, however, the signal levels off as the background signal becomes dominant. Attempts to remove the background signal may give better estimates of the actual concentration, but at the cost of substantially inflating the variation between spots.
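The behaviour described above can be illustrated with a small simulation. This is only a sketch under assumed values: the background level, noise levels, gain, and the clipping floor used before taking logarithms are all hypothetical and are not taken from any real array. It shows the raw signal flattening at low concentrations, and the spread of the log signal growing once the background is subtracted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, chosen only for illustration.
BACKGROUND = 50.0     # mean background signal per spot
NOISE_SD = 5.0        # additive noise on the background
GAIN = 1.0            # signal units per unit of concentration
N_REPLICATES = 2000   # number of simulated replicate spots


def simulate_spots(concentration):
    """Raw signal: proportional part with multiplicative error, plus noisy background."""
    specific = GAIN * concentration * rng.lognormal(0.0, 0.1, N_REPLICATES)
    background = rng.normal(BACKGROUND, NOISE_SD, N_REPLICATES)
    return specific + background


def log2_sd(values, floor=1.0):
    """Standard deviation of log2 signal; values below `floor` are clipped (an assumption)."""
    return float(np.std(np.log2(np.maximum(values, floor))))


for conc in (1.0, 10.0, 1000.0):
    raw = simulate_spots(conc)
    corrected = raw - BACKGROUND  # naive background subtraction
    print(
        f"conc={conc:7.1f}  median raw={np.median(raw):8.1f}  "
        f"log2 SD raw={log2_sd(raw):.2f}  log2 SD corrected={log2_sd(corrected):.2f}"
    )
```

In this toy model, a tenfold change at the low end barely moves the raw signal, while subtracting the background restores the dependence on concentration only on average and makes individual spots far more variable.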
- Uneven amplification of mRNAs or other laboratory induced biases
- Often, the original sample contains insufficient amounts of mRNA, and the amount is amplified using PCR. This amplification may amplify some mRNAs more efficiently than others, thus skewing the proportions of mRNAs compared with those in the original sample. Other laboratory procedures, e.g. for extracting sample sequences, or uneven degradation of mRNAs in the original sample, may induce other biases.
- Variations in hybridisation affinity
- The binding affinity between the probe and the target varies. For probes with high binding affinity, the signal intensity will exaggerate the gene expression relative to probes with low binding affinity.
- Background noise and signal, false/unspecific hybridisation
- The array is not completely black to start with, so even with no hybridisation the signal will not be completely zero: this is often referred to as the background noise and is visible as a non-zero signal outside the spots. More critically, within the spots there will often be a distinct signal which may come from false hybridisation: i.e. binding of molecules other than the target. This can be due to similar RNA/DNA fragments from other parts of the genome, or short fragments from degraded RNA/DNA which can bind in a non-specific manner and induce a background signal across most spots.
- Other technical biases
- A spot has a limited number of probe molecules, and so there is an upper bound to the amount of target molecules it can bind. The image scanning of the array also has an upper bound on how strong the signal can get. This may result in saturation of the spot: once the signal reaches this upper bound, more mRNA no longer results in a stronger signal. There may also be spatial variations on the array, either in the form of regions of the array which may be damaged, or the dome effect, in which the centre of the array differs from the edges because the array is unevenly covered by sample material during the hybridisation phase.
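Several of the effects listed above can be combined into a crude toy model of the expected signal of a single spot. The function name, the parameters and their default values are hypothetical, and a hard cap is only a rough stand-in for real saturation behaviour; the ceiling of 65535 assumes a 16-bit scanner.

```python
import numpy as np


def expected_signal(concentration, affinity=1.0, background=50.0, ceiling=65535.0):
    """Toy model: background plus affinity-scaled signal, capped at the ceiling.

    All parameters are illustrative assumptions, not measured quantities.
    """
    concentration = np.asarray(concentration, dtype=float)
    return np.minimum(background + affinity * concentration, ceiling)


# Same target concentration, different probe affinities: different signals.
print(expected_signal(1000.0, affinity=2.0), expected_signal(1000.0, affinity=0.5))

# Above the ceiling, doubling the concentration no longer changes the signal.
print(expected_signal(1e5), expected_signal(2e5))

# With no target at all, the signal is still the background, not zero.
print(expected_signal(0.0))
```

Even in this simplified form, the signal is not proportional to concentration: the affinity rescales it probe by probe, the background adds an offset, and the ceiling cuts it off at the top.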
Some envision that normalisation could retrieve a fair and unbiased estimate of the gene expression level: i.e. the relative amount of each mRNA in the original sample material. This is, however, unrealistic and misguided. In particular, it is impossible to distinguish between unexpressed genes and expression levels that fall below the background signal. Instead, the aim of normalisation is to remove, or at least reduce, systematic differences between arrays so that they can be compared.
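As a minimal illustration of removing a systematic difference between arrays, the sketch below centres each array on its own median log intensity. Median centring is used here only as one simple example and is not a method prescribed by this text. It assumes that most genes are expressed at similar levels across arrays, so that a shift in an array's median is technical rather than biological. The simulated data, the per-array offsets and the function name are hypothetical.

```python
import numpy as np


def median_centre(log_intensities):
    """Subtract each array's median log intensity (rows = genes, columns = arrays)."""
    x = np.asarray(log_intensities, dtype=float)
    return x - np.median(x, axis=0, keepdims=True)


rng = np.random.default_rng(1)
n_genes = 1000

# Simulated log2 expression shared by all arrays (hypothetical data).
true_log_expression = rng.normal(8.0, 2.0, size=(n_genes, 1))

# Each array gets its own systematic offset plus independent noise.
array_offsets = np.array([0.0, 0.8, -0.5])
raw = true_log_expression + array_offsets + rng.normal(0.0, 0.2, size=(n_genes, 3))

normalised = median_centre(raw)

print("median per array before:", np.round(np.median(raw, axis=0), 2))
print("median per array after: ", np.round(np.median(normalised, axis=0), 2))
```

The centred values are comparable between arrays but carry no absolute meaning, in line with the point above that normalisation does not recover the actual expression level.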