I have used a fairly standard procedure for normalising microarray data, chosen based on an evaluation of the data I have been asked to normalise: Agilent one-colour oligonucleotide arrays for gene expression. For other technologies or usages, these choices may not be appropriate. In particular, these arrays have fairly long DNA probes, which behave rather differently from arrays with shorter probe sequences such as those produced by Affymetrix.
The normalisation steps are summarised below. Prior to these, quality controls have been performed in the laboratory and by the Feature Extraction program, which interprets the scanned image of the array. Note that the first steps are performed separately for each array, while the main part of the normalisation is done for a series of arrays: e.g. a study cohort.
I only include the spots that actually target the sample material; control spots and spots with spike-in probes (for control samples of known concentration) are excluded. The first quality control step is to exclude individual spots (per array) that are deemed unreliable. The Feature Extraction software may have marked spots as non-uniform, meaning the spot was not circular or the intensity within it varied substantially; these I exclude. For replicated probes, i.e. when multiple spots on the array carry the same probe, individual spots may be marked as population outliers if they differ substantially from the rest; these I also exclude.
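The spot filtering above amounts to combining a few boolean masks. A minimal numpy sketch (the flag names below are illustrative placeholders, not the exact column names in the Feature Extraction export):

```python
import numpy as np

# Hypothetical per-spot annotations for one array (illustrative values).
control_type = np.array([0, 0, 1, 0, 0, 0])   # non-zero = control or spike-in spot
non_uniform  = np.array([0, 1, 0, 0, 0, 0])   # flagged as non-uniform
popn_outlier = np.array([0, 0, 0, 0, 1, 0])   # flagged as population outlier
signal       = np.array([5.2, 9.9, 3.1, 6.4, 8.8, 6.0])

# Keep only spots that target the sample and passed both quality flags.
keep = (control_type == 0) & (non_uniform == 0) & (popn_outlier == 0)
filtered_signal = signal[keep]   # -> [5.2, 6.4, 6.0]
```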
The Feature Extraction program gives multiple alternatives for the signal intensities. First, the average signal within the spot is computed, and the program reports both the mean and the median signal. The difference between the two is so minor as to make no practical difference: I have used the mean signal. The Feature Extraction program also estimates the dome effect and the background noise and performs a correction which results in a processed signal. The removal of background noise has severe side effects in that it tends to make the low signal values more noisy, and is generally a bad idea. The removal of the dome effect, however, is useful, and so I use the dome-corrected value meanSignal/multDetrend.
The raw signal intensities cover a very broad range of values. Most values are moderate, while others are fairly extreme. It is standard to logarithmically transform the signal intensities, and the base 2 logarithm has been used, as it is the most common choice. I do this before any further processing of the data.
Some probes are replicated: i.e. there are several spots with the same probe. These spots should in principle give the same signal, and in general they do, with only minor differences, the few deviating spots having been excluded in the filtering step. I have averaged the spots for replicated probes, using the mean value (after log2-transformation) to summarise the signal per probe. This results in log-transformed signals per probe instead of per spot.
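The two steps above, log2 transformation followed by averaging replicate spots per probe, can be sketched as follows (the analysis described here was done in R; this is an illustrative numpy version with made-up values):

```python
import numpy as np

# Raw spot signals, with replicated probes identified by probe id.
probe_ids = np.array(["A", "B", "A", "C", "B"])
raw_sig   = np.array([256.0, 1024.0, 294.1, 64.0, 891.4])

# Log2-transform first, then average replicate spots per probe.
log_sig = np.log2(raw_sig)
per_probe = {p: log_sig[probe_ids == p].mean() for p in np.unique(probe_ids)}
# per_probe["C"] == 6.0 (log2 of 64), "A" and "B" are means of two replicates
```

Note that the order matters: averaging on the log scale corresponds to a geometric mean of the raw intensities, which is less sensitive to one replicate being extreme.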
If you make a histogram or density plot (see figure to the right) of the number of probes with different signal intensities, you may find systematic differences between the arrays. Two of the main causes of this variation are the level of background signal and the general strength of the signals above the background. However, there may be multiple other differences, often affecting probes at the extremes of the signal intensity range: mostly the low-intensity probes in my experience. A simple yet robust way to remove these differences between arrays is quantile normalisation: this computes the average quantiles, i.e. a sort of standard density curve, and transforms the signals within each array to fit it.
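Quantile normalisation is straightforward to implement: sort each array, average across arrays at each rank, then map the averaged values back via each value's original rank. A minimal numpy sketch (ties are broken arbitrarily here; production implementations average tied ranks):

```python
import numpy as np

def quantile_normalise(X):
    """Quantile-normalise the columns (arrays) of a probes-x-arrays matrix.

    Every array's values are replaced by the mean quantile profile, so all
    columns end up with an identical distribution of values.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value per column
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)   # the "standard" distribution
    return mean_quantiles[ranks]

X = np.array([[2.0, 4.0],
              [5.0, 7.0],
              [3.0, 6.0]])
Xn = quantile_normalise(X)
# Both columns now hold the values [3.0, 4.5, 6.0], rearranged by rank.
```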
After exclusion of low quality spots, some probes may have missing values in individual arrays. For some analyses, this is not a problem. However, a number of analysis methods require complete data, or else the entire probe or sample will be excluded from the analysis. To avoid this problem, missing values are imputed. I have used local least squares imputation (llsImpute from the R package pcaMethods). For a probe with a missing value, it identifies the k other probes that are most strongly correlated with it, and then performs a linear regression using these k probes to predict the missing value. I have set k=20 based on simulations done earlier on similar data: these indicated that 20 was a good and safe pick for reasonably well-sized data sets (100 or more arrays). For much smaller cohorts, e.g. less than 50 arrays, you should probably reduce k somewhat, and it should normally not exceed half the number of arrays.
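The core idea of local least squares imputation can be sketched in a few lines. This is not the pcaMethods implementation, just a simplified numpy illustration for a single missing value, assuming the rest of the matrix is complete:

```python
import numpy as np

def lls_impute_one(X, i, j, k=2):
    """Impute the missing value X[i, j] (probe i, array j) by local least squares.

    Finds the k probes most correlated with probe i over the observed arrays,
    then regresses probe i on them to predict the value in array j.
    """
    obs = np.ones(X.shape[1], dtype=bool)
    obs[j] = False
    target = X[i, obs]
    others = np.delete(np.arange(X.shape[0]), i)
    # Correlation of each other probe with the target probe (observed arrays only).
    cors = np.array([np.corrcoef(target, X[r, obs])[0, 1] for r in others])
    nearest = others[np.argsort(-np.abs(cors))[:k]]
    # Least-squares regression with an intercept term.
    A = np.column_stack([np.ones(obs.sum()), X[nearest][:, obs].T])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef @ np.concatenate([[1.0], X[nearest, j]])

X = np.array([
    [1.5, 4.0, 4.0, np.nan],   # probe to impute: equals row 1 + 0.5 * row 2
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 4.0, 2.0, 8.0],
    [0.5, 0.1, 0.9, 0.2],      # weakly correlated noise probe, not selected
])
imputed = lls_impute_one(X, 0, 3, k=2)   # close to the true value 8.0
```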
Despite the preceding normalisation steps, one must be cautious in considering the normalised signals as absolute measures of gene expression. Due to variation in hybridisation affinities between probes, there will be systematic differences even for probes within the same gene or exon. Differences in sample handling and laboratory procedures may also induce systematic differences at the individual probe level. Even more severe are the differences between platforms, where more than just the probes differ. These problems can often be reduced by centring the data: e.g. for each probe, subtracting the mean signal across all arrays (or within each batch/centre). What remains is a centred signal that indicates whether the signal is high or low relative to other arrays in the series. This allows comparisons across probes: be they different probes for the same gene, different batches or centres, or even different microarray platforms.
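Centring is a one-line operation on a probes-by-arrays matrix; a small numpy illustration:

```python
import numpy as np

# log2 signals, probes x arrays (illustrative values)
X = np.array([[8.0, 9.0, 10.0],
              [5.0, 5.5,  6.0]])

# Subtract each probe's mean across arrays; per-batch centring would apply
# the same operation separately to each batch's columns.
centred = X - X.mean(axis=1, keepdims=True)
# centred[0] == [-1.0, 0.0, 1.0]; centred[1] == [-0.5, 0.0, 0.5]
```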
There may be multiple probes for some genes. These may target the same exon, different exons, alternative splice variants, or non-coding regions of the genome. For probes targeting the same gene, it is convenient to have a single signal per gene rather than multiple values. I have decided on the simple approach of using the mean value. Since there will be systematic differences between probes targeting the same gene, this averaging is only performed on complete data after missing values have been imputed. The data need not be centred, although for uncentred data the same reservations apply as for uncentred probe values.
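Given a probe-to-gene mapping, the averaging is a simple group-by; a numpy sketch with made-up probe and gene names:

```python
import numpy as np

# Hypothetical probe-to-gene mapping and per-probe signals over two arrays.
probe_gene = {"p1": "TP53", "p2": "TP53", "p3": "EGFR"}
signals = {"p1": np.array([8.0, 9.0]),
           "p2": np.array([8.4, 9.2]),
           "p3": np.array([6.0, 6.5])}

# Group probes by gene, then take the mean signal per gene and array.
groups = {}
for p, g in probe_gene.items():
    groups.setdefault(g, []).append(signals[p])
gene_signal = {g: np.mean(rows, axis=0) for g, rows in groups.items()}
# gene_signal["TP53"] is approximately [8.2, 9.1]
```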
The choice of using the mean signal across all probes targeting a gene may not be optimal. However, it is a simple and robust solution. Also, upon checking I have found that, within each gene, probes that show substantial variation between samples tend to correlate well, while probes that show little variation usually do not correlate with the other probes. This is natural if we assume that probes that vary actually capture variations in gene expression, while probes that vary little mostly consist of random noise. The average will be dominated by the probes that vary more, and since these tend to correlate with each other, the average will capture the shared signal; probes that vary little will not correlate well with this average, but probably consist mostly of noise anyway.
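This argument is easy to demonstrate numerically. In the contrived example below, two probes track an underlying expression pattern (with different scale and offset) while a third is low-variance noise; the per-gene mean still correlates almost perfectly with the underlying pattern:

```python
import numpy as np

# Hypothetical underlying expression pattern across five samples.
s = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# Two probes follow the pattern; one low-variance probe is mostly noise.
probe1 = s
probe2 = 1.5 * s + 1.0
probe3 = np.array([0.1, -0.1, 0.0, 0.1, -0.1])

gene = np.mean([probe1, probe2, probe3], axis=0)

# The varying, correlated probes dominate the mean, which tracks s closely.
r = np.corrcoef(gene, s)[0, 1]   # > 0.99
```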