Originally, I did most of the data processing in SAS: I exported the raw data, after filtering and log-transformation, to a text file, used R to run the normalisation steps at the probe level, and then used SAS to add probe annotation and aggregate signal values to the gene level. However, I have since expanded the R code to handle all the steps, from reading in the raw text files produced by the Agilent Feature Extraction program to producing the gene-level files.
The R scripts and other files are available from the download page.
The normalisation requires the following files apart from the R scripts:
The script requires two R packages: limma for reading the raw feature extraction files and performing quantile normalisation, and pcaMethods for missing value imputation (llsImpute). These may be installed from Bioconductor using the following R code:
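With current Bioconductor releases this is done via the BiocManager package (older releases used the biocLite function instead):

```r
## Install BiocManager from CRAN if necessary, then use it to
## install the two Bioconductor packages used by the scripts.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("limma", "pcaMethods"))
```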
The core part of the script consists of the following R instructions:
## Import definitions and functions
source("MicroArray.r")

# ... Set up paths and annotations ...

## Read feature extraction files (uses read.maimages from limma)
rawdata=read.agilent(targets$Filename,path=rawdatadir,names=targets$SampleArrayID);

## Perform filtering, log-transformation, and aggregate probe data
rawdata=preprocess.agilent(rawdata);

## Set up ArrayData "object" and add raw data matrix to it
arrays=ArrayData.new('raw',rawdata$log2.probe,label='ProbeID*ArrayID=gDetrendedSignal');

## Write "raw" (non-normalised) data to file
ArrayData.write(arrays,file.path(Array.normalize.default.path,pastename(arrays.name,'raw_log2.dat')));

## Run full normalisation and write normalised results to files
arrays=Array.normalize(arrays.name,arraydata=arrays);

## Aggregate normalised data per gene and write to file
arrays.gene=ArrayData.aggregate(arrays,map=annotation,to='GeneSymbol',filename=arrays.gene.name);
The MicroArray.r file, and the additional R files it imports, define a simple data structure for storing the data through the different normalisation steps, together with functions for managing the data and for performing the actual normalisation steps. Most of the functions are simple methods encapsulating a few steps each, while the main script consists mostly of path definitions that ensure data is read from and written to the right place, that sample and array identifiers are properly matched to the normalised data, and that the appropriate set of arrays is included in the data set.
The full script is process_arrays.r which sets up paths and filenames, imports sample annotation and matches this to arrays, selects the set of arrays to be included in the final data set, and executes the normalisation methods defined in MicroArray.r. In addition, MicroArray.r contains some path definitions which are referred to in the script, and which you may want to change if you are to use this script to normalise your own data.
The main file is MicroArray.r, whose main function is to import the remaining R files. It defines a few paths, the only important one being rprojroot, which points to the folder that contains the MicroArray R script files. It also defines ensure.path for constructing a path and ensuring that it gets created if it does not already exist. SubsetSelection is used to set up the variables dataname, which will be used as the main name of the data files; exportdir, the folder in which they will be stored; and taglist, which, if defined, will be used later to select a subset of the arrays.
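As an illustration, a minimal ensure.path could look like this (a sketch only; the actual implementation may differ):

```r
## Create the directory if it does not already exist, then return the
## path so the call can be used directly where a path is expected.
ensure.path <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  path
}

## Example: guarantee the export folder exists before writing to it.
exportdir <- ensure.path(file.path(tempdir(), "normalised"))
```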
Another of the imported files contains two important functions: read.agilent, a wrapper for read.maimages (from the limma package) which reads in the feature extraction files, and preprocess.agilent, which extracts the desired signal values and returns a raw data matrix.
The preprocessing step computes log2(meanSignal/multDetrendSignal) as the log-transformed, dome-effect-corrected signal, excludes flagged spots and control probes, and then computes the average per probe to return a data matrix with one column per sample and one row per probe. If you wish to preprocess the signal intensities differently, you will have to change this function: the above choices are all hard-coded into it.
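To make the hard-coded choices concrete, here is a sketch of the per-array computation, assuming a data frame of Feature Extraction spot data with illustrative column names and a precomputed logical flag column (the real preprocess.agilent works on the object returned by read.maimages and builds a matrix with one column per sample):

```r
## Sketch: compute the dome-effect-corrected log2 signal for one array.
## Assumes `fe` has columns ProbeName, ControlType (0 = regular probe),
## gMeanSignal, gMultDetrendSignal, and a logical `flagged` column
## marking spots to exclude (column names are illustrative).
preprocess.sketch <- function(fe) {
  keep <- fe$ControlType == 0 & !fe$flagged
  fe <- fe[keep, ]
  ## Log-transformed, dome-effect-corrected signal per spot
  logsig <- log2(fe$gMeanSignal / fe$gMultDetrendSignal)
  ## Average replicate spots to get one value per probe
  tapply(logsig, fe$ProbeName, mean)
}
```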
A separate file defines the ArrayData "object" data structure I use to hold the array data throughout the normalisation steps. I say "object" because it is actually just an environment which holds the values and is essentially used as a list of named values; still, it functions as an object. The ArrayData object stores the data matrices: for example, raw is used as the name of the initial raw data matrix. In addition, it keeps track of the list of matrices added, as well as the last one added. The methods defined in this file are mostly simple methods for adding, retrieving, reading, and writing these data; no real heavy lifting goes on here.
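The environment-as-object pattern can be sketched as follows (names and details are illustrative, not the actual implementation):

```r
## A minimal ArrayData "object": an environment holding named data
## matrices, the list of matrices added, and the last one added.
ArrayData.new <- function(name, data, label = NULL) {
  obj <- new.env()
  obj$matrices <- list()   # named list of data matrices
  obj$labels <- list()     # optional label per matrix
  obj$last <- NULL         # name of the most recently added matrix
  ArrayData.add(obj, name, data, label)
  obj
}

ArrayData.add <- function(obj, name, data, label = NULL) {
  obj$matrices[[name]] <- data
  obj$labels[[name]] <- label
  obj$last <- name
  invisible(obj)
}

## Retrieve a matrix by name; defaults to the last one added.
ArrayData.get <- function(obj, name = obj$last) obj$matrices[[name]]
```

Because environments are passed by reference, ArrayData.add modifies the object in place rather than returning a modified copy, which is what makes it behave like an object rather than a plain list.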
The Array.normalize function performs all the data normalisation steps. It can either take an ArrayData object (typically one with the raw data matrix added), or it will try to read one from file based on Array.normalize.default.path and a provided filename, to which _log2.dat is automatically appended. It then does quantile normalisation using normalizeBetweenArrays from the limma package, missing value imputation using llsImpute from the pcaMethods package with parameter k (the default is k=20), and finally row and column centering. After each step, the resulting data matrix is saved to file: the suffix raw is first replaced with norm, and then further suffixes, ending with cent, are added to indicate the subsequent steps.
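The quantile normalisation and imputation steps are the limma and pcaMethods calls named above; the final centering step can be sketched in base R as (an illustration, not the actual code):

```r
## Sketch of row and column centering: subtract row (probe) means,
## then column (array) means. For a complete matrix, both row and
## column means are 0 afterwards, since the column means of the
## row-centred matrix themselves average to 0.
centre.rows.cols <- function(x) {
  x <- sweep(x, 1, rowMeans(x, na.rm = TRUE))
  sweep(x, 2, colMeans(x, na.rm = TRUE))
}
```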
A further file defines the function ArrayData.aggregate, which aggregates signal matrices from probe-level to gene-level signal intensities by taking the mean per gene, and writes them to file. It will only process matrices with no missing values.
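A sketch of such probe-to-gene averaging in base R, assuming map is a vector with one gene symbol per probe row (illustrative, not the actual ArrayData.aggregate):

```r
## Average all probe rows that map to the same gene symbol.
## `x` is a probes-by-samples matrix; `map` gives the gene symbol
## for each row of `x`. rowsum() sums rows per group in sorted group
## order, matching the order of table(map), so dividing by the group
## sizes yields per-gene means.
aggregate.to.gene <- function(x, map) {
  stopifnot(!anyNA(x))   # only complete matrices are processed
  rowsum(x, group = map) / as.vector(table(map))
}
```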
If you have microarray data from different array designs, the regular
script will not be sufficient. However, with modest modifications, it
can be adapted for that use. One solution, used in a particular case, is the
There are a number of quality controls that can be run on the data. The script qc.r provides some plots that I have routinely used to make an initial quality assessment of microarray data.
The version of the R scripts made available here has been set up to process the GSE44666 array series. Download the GSE44666_RAW.tar file from GEO, unpack the files into a folder of your choice, and modify the paths in the script to fit the location of the files. You then need to download the sample-array annotation file, which matches samples to arrays, and then you can do a step-by-step run of the process_arrays.r script: check the intermediate results to make sure you understand what is going on and that nothing is going wrong.
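For example, the download and unpacking can be done from within R; the URL below follows GEO's standard supplementary-file layout for this series, so verify it against the GSE44666 page if the download fails:

```r
## Download and unpack the raw Agilent files for GSE44666
## (URL assumed from GEO's standard supplementary-file layout).
rawdatadir <- "GSE44666_raw"
dir.create(rawdatadir, showWarnings = FALSE)
url <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE44nnn/GSE44666/suppl/GSE44666_RAW.tar"
download.file(url, "GSE44666_RAW.tar", mode = "wb")
untar("GSE44666_RAW.tar", exdir = rawdatadir)
## The unpacked feature extraction files are gzipped (.txt.gz) and
## may need to be decompressed before reading.
```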