Quality Control

There are several steps in quality control, and the earliest are not covered here: quality control of sample preparation, microarray hybridisation, and feature extraction is done before the raw microarray expression data are compiled. The quality control steps covered here assume a cohort of microarray expression data produced on a single array design, or at least using the same probes, with some degree of comparability across samples. The main aim is to identify arrays that are unreliable or have produced data of low quality.

Quality assessment here is based on two different versions of the data: probe-level raw data from before normalisation, and normalised data in which quantile normalisation (or similar methods) has been used to remove systematic differences between arrays. One reason for using probe-level data rather than expression values aggregated to gene level is that differing numbers of probes per gene may introduce biases that could confuse quality control; it is also simpler and more straightforward. As always, all data have been log-transformed.

I have generally done the spot (feature) level filtering, i.e. excluded low-quality spots, before quality control. This choice is one of convenience and will generally have little impact on the quality control of the arrays as a whole, since few spots are filtered away. I have then made an initial normalisation of the data series for quality control, which includes arrays later excluded as unreliable or of low quality, as well as sample replicate arrays where the better array is to be selected to represent the sample. After deciding which arrays to include in the final data series, a final normalisation is run on these only. Generally, the difference between the initial and final normalisation should be minor.

There are a number of quality controls that can be made, some of which may depend on the type of samples included in the series. Thus, what I outline here is not intended to be comprehensive, but some standard tests that I would perform by default.

Comparison of each array against mean array

Use of housekeeping genes for quality control is well established. These are genes expected to be consistently expressed across samples, so large deviations may be taken as an indication of quality problems with the array data (whether from the sample, the sample preparation, or the array itself). Instead of selecting a small number of genes for this purpose, I have included all probes for a more comprehensive assessment, but the underlying idea is the same: for some probes/genes there may be considerable variation between samples, and some probes may deviate in individual samples, but most probes/genes will tend to have a consistent expression level across most samples. I define the mean array to reflect the general expression level across samples, and compare each array against this mean array to identify deviations. Let me explain in greater detail.

Mean array

For each probe, we compute the mean expression value across all arrays. I.e., if x_ik is the log-transformed signal from array i and probe k, we compute the mean signal μ_k by averaging x_ik over all arrays i. We may also compute the standard deviation σ_k to differentiate between probes that display stable expression (e.g. housekeeping genes) and those that vary substantially between samples.
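As a sketch, the mean array and per-probe standard deviations can be computed in a few lines of numpy (the matrix shape and variable names here are illustrative, not from any particular data set):

```python
import numpy as np

# Hypothetical data: rows are arrays, columns are probes, values are
# log-transformed signals. Real data would be loaded from file.
rng = np.random.default_rng(0)
x = rng.normal(loc=8.0, scale=1.0, size=(20, 1000))  # 20 arrays, 1000 probes

mu = x.mean(axis=0)             # mean array: mean signal mu_k per probe
sigma = x.std(axis=0, ddof=1)   # per-probe standard deviation sigma_k

# Stable probes (small sigma_k) behave like housekeeping genes;
# variable probes (large sigma_k) differ substantially between samples.
```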

The mean array can be computed both for raw data and normalised data, but these will generally be very similar. I have generally used the mean array based on raw data when assessing raw array data, and mean array based on normalised data when assessing normalised array data, as is natural. I have used the raw array data for visual quality assessments as these are more easily interpreted. For numerical quality parameters, however, normalised array data are used.

Array deviation score

The array deviation score is intended as a measure of how strongly a particular array deviates from the mean array. It might be defined in any number of ways, but the simplest definition that takes into account both each probe's deviation from the mean array and each probe's stability/variability is D_i = mean_k{(x_ik − μ_k)² / σ_k²}, based on normalised signal values. A high array deviation score is an indication that the array may be of low quality. As a rule of thumb, if an array has an array deviation score more than twice the average (i.e. the average deviation score across included arrays), it will add more noise and variability than information to subsequent analyses.
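A minimal sketch of this score and the twice-the-average rule, computed on simulated normalised data (variable names are illustrative):

```python
import numpy as np

# Simulated normalised data: 20 arrays x 1000 probes, with one array
# (index 3) given extra noise to mimic a low-quality array.
rng = np.random.default_rng(1)
x = rng.normal(loc=8.0, scale=1.0, size=(20, 1000))
x[3] += rng.normal(scale=3.0, size=1000)

mu = x.mean(axis=0)
sigma = x.std(axis=0, ddof=1)

# D_i = mean over probes k of (x_ik - mu_k)^2 / sigma_k^2
D = ((x - mu) ** 2 / sigma ** 2).mean(axis=1)

# Rule of thumb: flag arrays scoring more than twice the average.
flagged = np.flatnonzero(D > 2 * D.mean())
print(flagged)  # the noisy array should stand out
```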

Plots of array signals against mean array

Plotting the probe signals of an individual array against the mean signal of the mean array can give a good indication of the data quality from that particular array. For stable probes, i.e. probes that vary little between arrays, one could expect the signal in each individual array to correspond to that in the mean array.

The probe signal is for a large part a combination of two signals: one that is proportional to the gene expression, although with both noise and biases, and one that is background noise and which places an effective lower boundary of the detectable gene expression. The strength of the gene expression signal relative to the background noise may vary.

The plots show array signals per probe plotted against the mean signal across arrays. Colours indicate the standard deviation across arrays, with blue used for probes with stable signal values and light green for probes that tend to vary more. Ideally, the stable probes (blue) should lie on a narrow curve. Quantile normalisation will straighten out this curve to a diagonal line, while variation around the curve will lead to variation in the normalised signal.
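A plot of this kind can be sketched with matplotlib as follows (simulated data; the colour map, point size, and file name are arbitrary choices):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

# Simulated log signals: 20 arrays x 1000 probes.
rng = np.random.default_rng(2)
x = rng.normal(loc=8.0, scale=1.0, size=(20, 1000))
mu = x.mean(axis=0)              # mean array
sigma = x.std(axis=0, ddof=1)    # per-probe standard deviation

# One array plotted against the mean array, coloured by probe variability.
fig, ax = plt.subplots()
sc = ax.scatter(mu, x[0], c=sigma, cmap="winter", s=5)
ax.set_xlabel("Mean signal across arrays")
ax.set_ylabel("Signal in array 0")
fig.colorbar(sc, label="Standard deviation across arrays")
fig.savefig("array_vs_mean.png")
```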

Plot of array signal against mean probe signal across arrays for an "average" array of good quality.

An array of good quality, although with a somewhat weaker signal-to-background ratio than average, hence the lower slope of the curve at low signal values as the background signal limits the ability to detect weaker signals.

An array of good quality with a stronger signal-to-background ratio than is typical, hence the somewhat steeper slope of the curve at low signal values: the mean signal across arrays is more strongly influenced by background noise than this particular array.

Quantile normalisation will effectively straighten out the curve in the above plots to fit the diagonal. Hence, the correspondence between array signal and mean array signal need not be linear. However, if the signal-to-noise ratio is low, the range of the low signal values will be stretched, and the noise will thus be increased in the normalised data relative to what it was in the raw data. If the signal-to-noise ratio is high, the effect is the opposite: the noise is reduced.
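For reference, quantile normalisation itself can be sketched in a few lines of numpy (this minimal version ignores tied values):

```python
import numpy as np

# Simulated raw log signals: 5 arrays x 100 probes.
rng = np.random.default_rng(3)
x = rng.normal(loc=8.0, scale=1.0, size=(5, 100))

ranks = np.argsort(np.argsort(x, axis=1), axis=1)  # rank of each probe within its array
reference = np.sort(x, axis=1).mean(axis=0)        # mean of the sorted arrays
x_norm = reference[ranks]                          # map each probe to its reference quantile

# Every array now has exactly the same signal distribution.
print(np.allclose(np.sort(x_norm, axis=1), reference))  # True
```

Because each array's values are replaced by the reference quantiles, any curvature in the array-versus-mean plot is forced onto the diagonal, which is why a stretched low-signal range inflates (or deflates) the noise as described above.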

There are a number of effects that may cause noise, biases, and deviations in the data. Here are a few.

In addition to a low signal-to-background ratio, the points are more spread out, indicating stronger signal variation, most of which is likely noise (biological or technical). It is quite common for this variation to be more prominent at low signal values, while at high signal values (i.e. high expression) the expression signal dominates and the noise is less prominent.

A substantial number of probes have near-maximal signal, which most likely indicates a technical problem with the array or the hybridisation. Other arrays may have large numbers of probes with exceptionally low signal values, or substantial general noise/variation.

The very low signal values, which are dominated by the background signal, show considerable variation in this array. A common reason for this is a side-effect of the dome-effect correction: there may be a systematic difference in hybridisation between the centre of the array and the edges, which is adjusted for in the normalisation. When the signal values are increased in regions with low signal, the background signal is also increased. What was a fairly constant background signal level before the dome-effect adjustment then results in a variable background signal, as seen in this array.

Principal component analysis and array clustering

Principal component analysis (PCA) and clustering methods are good at capturing systematic variation between arrays that affects a large number of probes. These methods can be used for quality control in a few different respects. PCA plots, e.g. plotting PC1 against PC2 (or higher principal components), are good at capturing differences affecting a large number of probes in the same directions. Clustering, on the other hand, is good at identifying similar arrays or groups of arrays, where the differences between subgroups may be less systematic and cannot easily be reduced to a few dimensions (like principal components).

Clustering of different tissue/tumour types, or tumour versus control
If the data set contains multiple tissue or tumour types, or contains tumour and normal controls, these should be expected to cluster together.
Centre/batch differences
If samples come from different centres (e.g. different hospitals) or from multiple batches, or if systematic differences are introduced by other factors (lab techniques or technicians), these may be visible in PCA plots.

Principal component analysis (PCA) is good at capturing systematic effects and differences that affect a large number of probes. In this PCA plot of PC1 versus PC2, we can see the difference between normal and tumour samples in PC2.

PCA plot of PC2 versus PC3 gives a good clustering of the sample types: normals and three tumour types.

In this data set, PC1 largely captures technical effects and correlates strongly with array quality as measured by the deviation from the mean array.
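A PCA of this kind can be sketched via the singular value decomposition of the centred data matrix (simulated data; the group shift is artificial and only illustrates how a systematic difference shows up in the scores):

```python
import numpy as np

# Simulated data: 30 arrays x 500 probes, with the first 15 arrays
# shifted to mimic a systematic group difference (e.g. tumour vs normal).
rng = np.random.default_rng(4)
x = rng.normal(loc=8.0, scale=1.0, size=(30, 500))
x[:15] += 1.0

xc = x - x.mean(axis=0)                        # centre each probe
u, s, vt = np.linalg.svd(xc, full_matrices=False)
scores = u * s                                 # array coordinates; scores[:, 0] is PC1

# Plotting scores[:, 0] against scores[:, 1] gives the PC1-vs-PC2 plot;
# a shift this large separates the two groups along PC1.
print(scores.shape)  # (30, 30)
```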

Generally, you would expect normal samples and tumours of the same type to form clusters, and if individual samples cluster "incorrectly" it may be a sign of problems: e.g. samples or arrays may have been mislabelled, or a tumour sample containing almost no tumour cells may cluster with the normal samples.

Last modified February 18, 2015.