Most of the following is specific to Agilent gene expression arrays: in particular everything that is related to annotation of the array and probes.
At the time of writing, the latest design used for human gene expression is the SurePrint G3 Human Gene Expression 8×60K v2 Microarray: 8×60K indicates that the slide has 8 arrays (here in a 2×4 layout), each containing approximately 60,000 spots. This array has Agilent product number G4851B and the probe layout has DesignID 039494; it supersedes G4851A, which used the same physical slide of arrays but had design 028004. See the Agilent overview for more details.
Each array has a unique ArrayID, e.g. 252800411780_1_1. The pattern of this identifier is 25dddddsssss_r_c, where
ddddd are the last 5 digits of the DesignID (e.g. 28004 for design 028004),
sssss (e.g. 11780) identifies the slide, and
r_c (or only c on some slide layouts) identifies the array on the slide.
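The ArrayID structure described above can be unpacked with a regular expression. The following is a minimal sketch in Python, assuming the identifier always starts with the two digits 25, as in the example 252800411780_1_1. The function name parse_array_id and the dictionary keys are made up for illustration.

```python
import re

# Assumed layout, inferred from the example 252800411780_1_1:
# "25" + ddddd (design) + sssss (slide) + "_" + r_c (or only c)
ARRAY_ID_PATTERN = re.compile(r"^25(\d{5})(\d{5})_(\d+)(?:_(\d+))?$")


def parse_array_id(array_id):
    """Split an ArrayID into design digits, slide digits and array position."""
    match = ARRAY_ID_PATTERN.match(array_id)
    if match is None:
        raise ValueError(f"Unrecognised ArrayID: {array_id!r}")
    design, slide, first, second = match.groups()
    if second is None:
        # Only c is given on some slide layouts
        r, c = None, int(first)
    else:
        r, c = int(first), int(second)
    return {"design": design, "slide": slide, "r": r, "c": c}


print(parse_array_id("252800411780_1_1"))
# {'design': '28004', 'slide': '11780', 'r': 1, 'c': 1}
```

Identifiers that do not fit the assumed layout raise a ValueError rather than being silently misparsed, which seems the safer choice when array and sample bookkeeping depends on them.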
In the feature extraction files, as well as in the array annotation file, the probes are identified and annotated using a few central variables:
ProbeName (e.g. A_33_P3337485), which uniquely identifies the probe. Unlike the ProbeUID, which just enumerates the probes in a particular design and thus does not identify probes across different designs, the ProbeName can be used to identify probes across all array designs. This is the chosen probe identifier!
Basically, the signal intensities at any stage of the normalisation procedure consist of a matrix of numerical values: one signal intensity per array and probe (or spot or gene). The file format I use, which is fairly standard, is a tab-separated text file with a header line containing the array-sample identifiers, one column containing the probe, spot or gene identifiers, and the rest of the matrix filled with the signal intensities. Missing values are indicated by an empty cell. In the first column of the header row, I normally put a string of the form
[row identifier]*[column identifier]=[value], e.g.
ProbeID*ArrayID=gDetrendedSignal, which states what the rows, the columns and the values represent.
The array identifier may be the ArrayID, which identifies the array, but normally I prefer to have it also identify the sample, so that the data file can be tied to the sample without any further array-sample matching: this is not only convenient, but also reduces the risk of mismatches occurring at a later stage. I typically use an array and sample identifier of the form
[SampleID]:[ArrayID] where I may trim the design identifier from the ArrayID to shorten it when only one design is in use. Below is an example illustrating the file format used.
ProbeID*SampleArray=detrendedSignal  MA002:18459-3  MA003:18446-1  MA014:27854-1  MA015:18433-2
A_23_P215419                         11.9564281     12.3133295     10.840037      12.3858758
A_24_P66027                          7.60438573     8.82334107     7.63959262     8.76406732
A_23_P145874                         10.079337      10.5011904     10.9547281     11.0769737
A_32_P77178                          6.73769706     6.47212716     6.2394798      7.39850138
A_23_P212522                         9.59698928     8.39066235     7.88939615     8.72842292
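A file in this format can be read with a few lines of Python. The sketch below is only one possible approach, using the standard csv module: it splits the descriptor in the first header cell, separates each column label into SampleID and ArrayID, and turns empty cells into None. The function name read_signal_matrix and the returned structure are hypothetical. The inline data is a cut-down copy of the example, with one cell blanked out purely to illustrate a missing value.

```python
import csv
import io


def read_signal_matrix(handle):
    """Read a tab-separated signal matrix into (descriptor, columns, rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # First header cell: [row identifier]*[column identifier]=[value]
    ids, _, value_name = header[0].partition("=")
    row_id, _, column_id = ids.partition("*")
    descriptor = {"row": row_id, "column": column_id, "value": value_name}
    # Column labels of the form [SampleID]:[ArrayID]
    columns = [tuple(label.split(":", 1)) for label in header[1:]]
    rows = {}
    for record in reader:
        if not record:
            continue  # skip blank lines
        # An empty cell marks a missing value
        rows[record[0]] = [float(cell) if cell else None for cell in record[1:]]
    return descriptor, columns, rows


# Cut-down example; the last cell is deliberately left empty (illustration only)
example = (
    "ProbeID*SampleArray=detrendedSignal\tMA002:18459-3\tMA003:18446-1\n"
    "A_23_P215419\t11.9564281\t12.3133295\n"
    "A_24_P66027\t7.60438573\t\n"
)
descriptor, columns, rows = read_signal_matrix(io.StringIO(example))
print(descriptor)
print(columns)
print(rows["A_24_P66027"])
```

Keeping the sample and array parts of each column label as a pair makes it easy to look up either one later without re-parsing the header.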