Analysis of microarray gene expression data

This section is not meant to be an introduction to the analysis of gene expression data: that would be a huge topic, well beyond the scope of these few pages. However, some general recommendations relevant to analysis follow from the normalisation and quality control.

Selecting samples and arrays to include

The normalisation procedure is very stable with respect to which samples and arrays are included in the normalisation steps. In my experience, there is very little difference between data sets where subset selection is done first and normalisation afterwards, and data sets where normalisation is done first and subset selection afterwards.

My generally preferred approach has been to perform quality controls to determine which arrays to include in the main data set (one array per sample), but to include all sample types (normal samples, tumour types) in one normalisation. This approach is convenient and robust. If a subgroup of samples, e.g. one of the tumour types, is to be published separately, it may be convenient for presentation purposes to normalise it separately; this should not have any substantial impact on the results. Conversely, arrays that are to be compared should be normalised together.

Later analyses may then be performed on subsets of samples.

Filtering probes for further analyses

A large portion of the probes display little variation between samples. This is particularly true for probes whose signal values are around or below the background noise level. A good first step is to exclude such probes from further analysis since these are unlikely to contribute anything but noise.

Figure: Mean probe signal and standard deviation in normalised data. The majority of probes display little variation. There is a lower limit to the variation which is due to random variation and background signal, and probes around this level are unlikely to carry much information about the individual sample, and may instead contribute noise to any analyses if included.

Probes with signal levels generally close to the background noise level are especially important to filter out, since they are particularly prone to technical artifacts. Even arrays of fairly bad quality may give fairly good expression data for high-expression probes, while low-expression probes primarily capture the background signal, which is wholly technical and may well induce spurious results.
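A filter on signal level could look roughly like the sketch below. It is only an illustration under stated assumptions: the data are assumed to be normalised log-scale signals held in a plain dictionary, and the names `filter_low_signal`, `background_level` and `margin` are hypothetical, not part of any particular package. How the background level is estimated is left open here.

```python
from statistics import mean


def filter_low_signal(expr, background_level, margin=0.5):
    """Keep probes whose mean signal exceeds background_level + margin.

    expr maps probe ID -> list of normalised (log-scale) signal values,
    one value per sample. background_level and margin are hypothetical,
    user-chosen values on the same scale as the data.
    """
    threshold = background_level + margin
    return {
        probe: values
        for probe, values in expr.items()
        if mean(values) > threshold
    }


# Made-up toy data: three probes measured on four samples.
expr = {
    "probe_a": [6.1, 6.0, 6.2, 5.9],    # hovering around background
    "probe_b": [9.5, 11.2, 8.7, 10.4],  # clearly expressed
    "probe_c": [6.4, 6.3, 6.6, 6.2],    # only just above background
}

kept = filter_low_signal(expr, background_level=6.0, margin=1.0)
print(sorted(kept))  # ['probe_b']
```

The margin is a judgment call: a larger value removes more of the probes dominated by background signal, at the cost of discarding some weakly expressed genes.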

I have often used a simple cutoff on the probe standard deviation across samples to filter the probes. In some analyses, this cutoff may be applied strictly to ensure that only the most informative probes are included, while in other cases a more relaxed criterion may be applied, which allows more probes to enter the analysis but places higher demands on avoiding problems of model overfitting and multiple testing.
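The standard deviation cutoff can be sketched as follows. This is a minimal illustration, assuming normalised log-scale data in a plain dictionary; the function name `filter_by_sd` and the two cutoff values are hypothetical choices for the toy data, not recommended settings.

```python
from statistics import stdev


def filter_by_sd(expr, sd_cutoff):
    """Keep probes whose sample standard deviation is at least sd_cutoff.

    expr maps probe ID -> list of normalised (log-scale) signal values,
    one value per sample (at least two samples are needed).
    """
    return {
        probe: values
        for probe, values in expr.items()
        if stdev(values) >= sd_cutoff
    }


# Made-up toy data: three probes measured on four samples.
expr = {
    "probe_a": [6.1, 6.0, 6.2, 5.9],    # SD about 0.13
    "probe_b": [9.5, 11.2, 8.7, 10.4],  # SD about 1.08
    "probe_c": [8.0, 8.4, 7.7, 8.3],    # SD about 0.32
}

strict = filter_by_sd(expr, sd_cutoff=1.0)   # only the most variable probes
relaxed = filter_by_sd(expr, sd_cutoff=0.3)  # more probes, more tests

print(sorted(strict))   # ['probe_b']
print(sorted(relaxed))  # ['probe_b', 'probe_c']
```

With the relaxed cutoff, more probes reach the downstream analysis, so any correction for multiple testing has to account for the larger number of tests.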

Last modified February 18, 2015.