Normalisation of microarray gene expression data

Probe, array and sample identifiers

Most of the following is specific to Agilent gene expression arrays, in particular everything related to the annotation of arrays and probes.

Array design and identifiers

The array design refers not only to the physical shape and size of the arrays, but to the layout of probes on the array. The probes are oligonucleotides that are printed onto the array one nucleotide at a time, and this layout of probes may change even if the array is otherwise physically identical: in this case the DesignID will change.

At the time of writing, the latest design used for human gene expression is SurePrint G3 Human Gene Expression 8×60K v2 Microarray: 8×60K indicates that the slide has 8 arrays (here in a 2×4 layout), each containing approximately 60000 spots. This array has Agilent product number G4851B and the probe layout has DesignID 039494; it supersedes G4851A, which used the same physical slide of arrays but had design 028004. See the Agilent overview for more details.

Each array has a unique ArrayID: e.g. 252800411780_1_1. The pattern of this identifier is 25dddddsssss_r_c where ddddd are the last 5 digits of the DesignID (e.g. 28004 for design 028004), sssss (e.g. 11780) identifies the slide, and r_c (or only c on some slide layouts) identifies the array on the slide.
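The identifier pattern described above can be split into its parts with a regular expression. The following is a minimal sketch in Python; the names parse_array_id and ArrayIdParts are made up for illustration, and it assumes that every ArrayID follows the 25dddddsssss_r_c pattern (or 25dddddsssss_c) exactly.

```python
import re
from typing import NamedTuple


class ArrayIdParts(NamedTuple):
    """Components of an ArrayID (hypothetical helper type)."""
    design_suffix: str  # last 5 digits of the DesignID
    slide: str          # 5 digits identifying the slide
    position: str       # "r_c", or only "c" on some slide layouts


# 25, then 5 design digits, 5 slide digits, then _r_c or _c
ARRAY_ID_PATTERN = re.compile(r"^25(\d{5})(\d{5})_(\d+(?:_\d+)?)$")


def parse_array_id(array_id: str) -> ArrayIdParts:
    """Split an ArrayID into design suffix, slide and array position."""
    match = ARRAY_ID_PATTERN.match(array_id)
    if match is None:
        raise ValueError(f"Unexpected ArrayID format: {array_id!r}")
    return ArrayIdParts(*match.groups())


parts = parse_array_id("252800411780_1_1")
print(parts.design_suffix, parts.slide, parts.position)  # 28004 11780 1_1
```

Raising an error on identifiers that do not match, rather than guessing, makes it harder for arrays from an unexpected source to slip through unnoticed.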

Spot, probe and gene annotation

In the feature extraction files, as well as in the array annotation file, the probes are identified and annotated using a few central variables:

FeatureNum, Row:Col
The FeatureNum identifies the spot (feature), and simply enumerates the spots on the array. These could also be identified by Row and Column. This is an identification of the physical spot location on the array, but says nothing about the probe found in that spot.
ProbeUID
This is a numerical identifier of the probes within a particular array design, but it is not persistent across different designs. Hence, it should not be used, since there is a risk of mismatching probes at some later stage or when arrays of different designs are combined. In my earlier scripts, I used this identifier in the initial processing of arrays and added persistent probe or gene identifiers later, but this was a bad solution and failed once I got a series of arrays from two different array designs.
ProbeName
This is a persistent probe identifier: a short text string (e.g. A_33_P3337485) which uniquely identifies the probe. Unlike the ProbeUID, which just enumerates the probes in a particular design and thus does not identify probes across different designs, the ProbeName can be used to identify probes across all array designs. This is the chosen probe identifier!
GeneSymbol, GeneName, etc.
For probes that target a protein-coding gene, this will be the corresponding gene symbol in the design annotation file. Other probes use other identifiers that still identify the target of the probe, and the more general column GeneName found in the feature extraction files uses these instead when a gene symbol is not available. In addition, further annotation of the targets (proteins, genes, etc.) is available.
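To illustrate why the ProbeName is the safer key, the sketch below re-keys signals from two array designs by ProbeName before combining them. It is only an illustration in Python: the function rekey_by_probe_name, the ProbeUID numbers and the signal values are all made up, and it assumes each ProbeUID maps to exactly one ProbeName within a design.

```python
def rekey_by_probe_name(signal_by_uid, name_by_uid):
    """Replace design-specific ProbeUID keys with persistent ProbeName keys."""
    return {name_by_uid[uid]: signal for uid, signal in signal_by_uid.items()}


# Hypothetical annotation tables: the same ProbeUID numbers point to
# different probes in the two designs.
design_a = {1: "A_23_P215419", 2: "A_24_P66027"}
design_b = {1: "A_24_P66027", 2: "A_23_P215419"}

# Made-up signals keyed by ProbeUID, one array from each design.
array_a = {1: 11.9, 2: 7.6}
array_b = {1: 8.8, 2: 12.3}

by_name_a = rekey_by_probe_name(array_a, design_a)
by_name_b = rekey_by_probe_name(array_b, design_b)

# Combine only the probes present in both designs, matched by ProbeName.
shared = sorted(by_name_a.keys() & by_name_b.keys())
merged = {name: (by_name_a[name], by_name_b[name]) for name in shared}
print(merged)
```

Matching array_a and array_b directly on ProbeUID would have paired signals from two different probes without any error being raised.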

File formats

Microarray signal data matrix file

In essence, the signal intensities at any stage of the normalisation procedure form a matrix of numerical values: one signal intensity per array and probe (or spot or gene). The file format I use, which is fairly standard, is a tab-separated text file with a header line containing the array or sample identifiers, one column containing the probe, spot or gene identifiers, and the rest of the matrix filled with the signal intensities. Missing values are indicated by an empty cell. In the first column of the header row, I normally put a string of the form [row identifier]*[column identifier]=[value], e.g. ProbeID*ArrayID=gDetrendedSignal, which describes the three.
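The layout described above could be produced roughly as follows. This is a sketch in Python under stated assumptions: the function name format_signal_matrix is hypothetical, missing values are represented by None, and the second ArrayID and all signal values in the usage example are made up.

```python
def format_signal_matrix(row_id, col_id, value_name, columns, rows):
    """Return the matrix as tab-separated text with a descriptive header cell.

    columns: list of array (or array and sample) identifiers
    rows:    dict mapping probe identifier -> list of signals (None = missing)
    """
    header = [f"{row_id}*{col_id}={value_name}", *columns]
    lines = ["\t".join(header)]
    for probe, values in rows.items():
        if len(values) != len(columns):
            raise ValueError(f"Wrong number of values for {probe}")
        # A missing value is written as an empty cell.
        cells = ["" if v is None else repr(v) for v in values]
        lines.append("\t".join([probe, *cells]))
    return "\n".join(lines) + "\n"


text = format_signal_matrix(
    "ProbeID", "ArrayID", "gDetrendedSignal",
    ["252800411780_1_1", "252800411780_1_2"],
    {"A_23_P215419": [11.5, None], "A_24_P66027": [7.25, 8.75]},
)
print(text, end="")
```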

The array identifier may be the ArrayID, which identifies the array, but normally I prefer to have it identify the sample as well, so that the data file can be tied to the sample without any further array-sample matching: this is not only convenient, but also reduces the risk of mismatches occurring at a later stage. I typically use an array and sample identifier of the form [SampleID]:[ArrayID], where I may trim the design identifier from the ArrayID to shorten it when only one design is in use. Below is an example illustrating the file format used.

ProbeID*SampleArray=detrendedSignal	MA002:18459-3	MA003:18446-1	MA014:27854-1	MA015:18433-2
A_23_P215419	11.9564281	12.3133295	10.840037	12.3858758
A_24_P66027	7.60438573	8.82334107	7.63959262	8.76406732
A_23_P145874	10.079337	10.5011904	10.9547281	11.0769737
A_32_P77178	6.73769706	6.47212716	6.2394798	7.39850138
A_23_P212522	9.59698928	8.39066235	7.88939615	8.72842292
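A file in this format can be read back with a few lines of Python. The sketch below is one possible reader, not the script used to produce the data: the function name read_signal_matrix is hypothetical, it assumes tab-separated cells throughout (including the header line), and the sample text reuses two columns from the example above with one value blanked out to show the handling of missing values.

```python
import io
import re

# Header cell of the form [row identifier]*[column identifier]=[value]
HEADER_CELL = re.compile(r"^(?P<row>[^*]+)\*(?P<col>[^=]+)=(?P<value>.+)$")


def read_signal_matrix(handle):
    """Read a signal matrix; returns (header description, columns, matrix)."""
    header = handle.readline().rstrip("\r\n").split("\t")
    match = HEADER_CELL.match(header[0])
    if match is None:
        raise ValueError(f"Unexpected header cell: {header[0]!r}")
    columns = header[1:]
    matrix = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        probe, *cells = line.split("\t")
        if len(cells) != len(columns):
            raise ValueError(f"Wrong number of cells for {probe}")
        # An empty cell denotes a missing value.
        matrix[probe] = [float(c) if c else None for c in cells]
    return match.groupdict(), columns, matrix


sample_text = "\n".join([
    "\t".join(["ProbeID*SampleArray=detrendedSignal",
               "MA002:18459-3", "MA003:18446-1"]),
    "\t".join(["A_23_P215419", "11.9564281", "12.3133295"]),
    "\t".join(["A_24_P66027", "7.60438573", ""]),  # second value missing
]) + "\n"

description, columns, matrix = read_signal_matrix(io.StringIO(sample_text))

# Split each [SampleID]:[ArrayID] label into its two parts.
samples = [label.partition(":")[0] for label in columns]
arrays = [label.partition(":")[2] for label in columns]
print(description["value"], samples, arrays)
```

Checking the number of cells on every row catches truncated or misaligned lines early, before they can shift signals into the wrong array column.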
Last modified March 03, 2014.