Normalisation of microarray gene expression data

R scripts for performing the normalisation

Originally, I did most of the data processing in SAS, exported the raw data after filtering and log-transformation to a text file, used R to run the normalisation steps at the probe level, and then used SAS to add probe annotation and aggregate signal values to gene level. However, I have since expanded the R scripts to handle all the normalisation steps: from reading in the raw text files produced by the Agilent Feature Extraction program to writing the gene level files.

The R scripts and other files are available from the download page.

The normalisation requires the following files apart from the R scripts:

Array data from the Feature Extraction program
The Feature Extraction program produces one text file per array, with the ArrayID as part of the file name; these files are the main input. Each file contains different signal intensities, background noise estimates, and different quality flags for each spot, in addition to various metadata about the array. Note that Feature Extraction can export this data in a compact format with a reduced set of data columns, but this format does not contain the multiplicative detrending factor which I need to remove the dome effect, and so should not be selected.
File and sample annotation file(s)
The script requires a list of feature extraction files to read. This may, of course, be generated by R from the contents of one or more provided directories, but I have generally kept this list in a file that I read into R. In addition, since I have wanted to identify arrays by a combination of SampleID and ArrayID, I have merged this list with sample annotation data and generated the final sample-array identifiers. This has also let me specify which arrays to include and which to exclude: doing this through a sample-array annotation file is more flexible and maintainable than organising the feature extraction files into different folders by inclusion criteria.
Array design annotation file
Information about the probes is included in the feature extraction files, but this relies on the array annotation file installed in the Feature Extraction program when the microarray was run and interpreted. I generally prefer to make sure I have a recent version of the annotation file, and use this to ensure consistent and updated gene symbols for all probes. This is only used for aggregating the data to gene level.
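The file-list and sample-annotation handling described above can be sketched in base R as follows. The column names used here (ArrayID, SampleID, Filename, Include) are illustrative assumptions, not necessarily those used in the actual scripts:

```r
## Sketch: merge a file list with sample annotation and filter arrays.
## Column names are hypothetical; adapt to your own annotation files.
filelist <- data.frame(ArrayID  = c("A1", "A2", "A3"),
                       Filename = c("A1.txt", "A2.txt", "A3.txt"),
                       stringsAsFactors = FALSE)
samples <- data.frame(ArrayID  = c("A1", "A2", "A3"),
                      SampleID = c("S1", "S2", "S3"),
                      Include  = c(TRUE, TRUE, FALSE),
                      stringsAsFactors = FALSE)
targets <- merge(filelist, samples, by = "ArrayID")
## Combined sample-array identifier
targets$SampleArrayID <- paste(targets$SampleID, targets$ArrayID, sep = "_")
## Keep only the arrays flagged for inclusion
targets <- targets[targets$Include, ]
```

The resulting targets data frame plays the role of the sample-array annotation: it drives both which files are read and how the columns of the data matrix are named.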

The script requires two R packages: limma for reading the raw feature extraction files and performing quantile normalisation, and pcaMethods for missing value imputation (llsImpute). These may be installed from Bioconductor using the following R code:

source("http://bioconductor.org/biocLite.R")
biocLite(c("pcaMethods","limma"))

The core part of the script consists of the following R instructions:

## Import definitions and functions
source("MicroArray.r")
# ... Set up paths and annotations ...
## Read feature extraction files (uses read.maimages from limma)
rawdata=read.agilent(targets$Filename,path=rawdatadir,names=targets$SampleArrayID);
## Perform filtering, log-transformation, and aggregate probe data
rawdata=preprocess.agilent(rawdata);
## Set up ArrayData "object" and add raw data matrix to it
arrays=ArrayData.new('raw',rawdata$log2.probe,label='ProbeID*ArrayID=gDetrendedSignal');
## Write "raw" (non-normalised) data to file
ArrayData.write(arrays,file.path(Array.normalize.default.path,pastename(arrays.name,'raw_log2.dat')));
## Run full normalisation and write normalised results to files
arrays=Array.normalize(arrays.name,arraydata=arrays);
## Aggregate normalised data per gene and write to file
arrays.gene=ArrayData.aggregate(arrays,map=annotation,to='GeneSymbol',filename=arrays.gene.name);

The MicroArray.r file, and the additional R files it imports, define a simple data structure for storing the data through the different normalisation steps, along with functions for managing the data and performing the actual normalisation. Most of the functions are simple methods that encapsulate a few steps each, while the main script consists mostly of path definitions: these ensure that data is read from and written to the right places, that sample and array identifiers are properly matched to the normalised data, and that the appropriate set of arrays is included in the data set.

The full script is process_arrays.r, which sets up paths and filenames, imports sample annotation and matches it to arrays, selects the set of arrays to be included in the final data set, and executes the normalisation methods defined in MicroArray.r. In addition, MicroArray.r contains some path definitions which are referred to in the script, and which you may want to change if you want to use these scripts to normalise your own data.

Summary of MicroArray elements

The main file is MicroArray.r, whose main function is to import the remaining R files. It defines a few paths, the only important one being rprojroot, which points to the folder that contains the MicroArray R script files. It also defines ensure.path for constructing a path and ensuring that it gets created if it does not already exist.
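A minimal version of such a path helper might look like the sketch below; the actual ensure.path in MicroArray.r may differ in detail:

```r
## Sketch of a path helper: build a path and create the folder if needed.
ensure.path <- function(...) {
  path <- file.path(...)
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  path
}

## Example usage: build (and create) an export folder under tempdir()
exportdir <- ensure.path(tempdir(), "export")
```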

The function SubsetSelection sets up the variables dataname, which will be used as the main name of the data files; exportdir, the folder in which they will be stored; and taglist, which, if defined, will be used later to select a subset of the arrays.

MicroArrayReader.r

MicroArrayReader.r contains two important functions: read.agilent, a wrapper for read.maimages (from the limma package) which reads in the feature extraction files, and preprocess.agilent, which extracts the desired signal values and returns a raw data matrix.

The preprocessing step computes log2(meanSignal/multDetrendSignal) as the log-transformed, dome-effect-corrected signal, excludes flagged spots and control probes, and then computes the average per probe to return a data matrix with one column per sample and one row per probe. If you wish to preprocess the signal intensities differently, you will have to change this function: the above choices are all hard-coded into the preprocess.agilent function.
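The per-spot computation and per-probe averaging can be sketched in base R like this. The column names in the example data frame are illustrative, not the actual Feature Extraction field names:

```r
## Sketch: dome-effect-corrected log signal and per-probe averaging.
## spots stands in for one array's per-spot data; columns are illustrative.
spots <- data.frame(ProbeID           = c("P1", "P1", "P2", "P2"),
                    MeanSignal        = c(100, 120, 200, 180),
                    MultDetrendSignal = c(1, 1.2, 1, 0.9),
                    IsFlagged         = c(FALSE, FALSE, FALSE, TRUE))
spots <- spots[!spots$IsFlagged, ]                               # drop flagged spots
spots$log2signal <- log2(spots$MeanSignal / spots$MultDetrendSignal)
## Average replicate spots per probe
probe.means <- tapply(spots$log2signal, spots$ProbeID, mean)
```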

MicroArrayData.r

MicroArrayData.r defines the ArrayData "object" data structure I use to hold the array data throughout the normalisation steps. I say "object" because it is actually just an environment that holds named values, essentially used as a list; however, it functions as an object. The ArrayData object stores the data matrices, e.g. raw is used as the name of the initial raw data matrix. In addition, it keeps track of the list of matrices added, as well as the last one added.

The methods defined in this file are mostly simple methods for adding, retrieving, reading, and writing these data. No real heavy-lifting goes on here.
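The environment-as-object pattern can be illustrated in a few lines. This is a simplified sketch, not the actual ArrayData implementation, which stores more bookkeeping (labels, file names, and so on):

```r
## Sketch of an environment used as a simple mutable "object".
ArrayData.new <- function(name, data) {
  obj <- new.env()
  obj$matrices <- list()         # named list of data matrices
  obj$matrices[[name]] <- data
  obj$last <- name               # track the most recently added matrix
  obj
}
ArrayData.add <- function(obj, name, data) {
  obj$matrices[[name]] <- data   # environments are modified in place
  obj$last <- name
  invisible(obj)
}
```

Because environments have reference semantics, ArrayData.add modifies the object in place without the copy-on-assignment behaviour of ordinary R lists.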

MicroArrayNormalize.r

MicroArrayNormalize.r defines the Array.normalize function which performs all the data normalisation steps. It can either take an ArrayData object (typically one with the raw data matrix added), or it will try to read one from file based on Array.normalize.default.path and a provided filename to which _log2.dat is automatically appended. It then performs quantile normalisation using normalizeBetweenArrays from the limma package, missing value imputation using llsImpute from pcaMethods with parameter k (default k=20), and finally row and column centering. After each step, the resulting data matrix is saved to file; the suffix raw is first replaced with norm, and then the suffixes missimp and cent are added to indicate the subsequent steps.
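The quantile normalisation and imputation steps require limma and pcaMethods, but the final centering step is plain matrix arithmetic and can be sketched in base R:

```r
## Sketch: centre a matrix by subtracting column means, then row means.
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
m <- sweep(m, 2, colMeans(m))   # column centering
m <- sweep(m, 1, rowMeans(m))   # row centering
```

After column centering every column mean is zero; the subsequent row centering then makes every row mean zero as well without disturbing the column means.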

MicroArrayAggregate.r

MicroArrayAggregate.r defines the function ArrayData.aggregate, which aggregates signal matrices from probe-level to gene-level signal intensities by taking the mean, and writes them to file. It will only process matrices with no missing values.
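Aggregating probe-level rows to gene level by the mean can be sketched with base R's rowsum; the probe-to-gene map and all names here are illustrative:

```r
## Sketch: aggregate a probe x array matrix to gene level by mean.
probes <- matrix(c(1, 3, 5,
                   2, 4, 6,
                   10, 10, 10),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("P1", "P2", "P3"), c("A1", "A2", "A3")))
gene.of <- c(P1 = "G1", P2 = "G1", P3 = "G2")   # probe -> gene map
groups <- gene.of[rownames(probes)]
sums <- rowsum(probes, group = groups)           # per-gene column sums
counts <- as.vector(table(groups))               # probes per gene (same order)
gene.matrix <- sums / counts                     # per-gene means
```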

Additional scripts

Merging data sets from different array designs

If you have microarray data from different array designs, the regular process_arrays.r script will not be sufficient. However, with modest modifications it can be adapted for that use. One such adaptation, made for a particular case, is the merge.r script.
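One simple merging strategy, sketched below, is to restrict both gene-level matrices to their common genes before combining them. This is only an illustration of the idea, not necessarily what merge.r does:

```r
## Sketch: merge two gene-level matrices on their common genes.
a <- matrix(1:4, nrow = 2, dimnames = list(c("G1", "G2"), c("S1", "S2")))
b <- matrix(5:8, nrow = 2, dimnames = list(c("G2", "G3"), c("S3", "S4")))
common <- intersect(rownames(a), rownames(b))    # genes present on both designs
merged <- cbind(a[common, , drop = FALSE], b[common, , drop = FALSE])
```

Note that merging across designs usually also requires renormalising or batch-correcting the combined data, since the two designs will generally have different intensity distributions.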

Quality control script

There are a number of quality controls that can be run on the data. The script qc.r provides some plots that I have routinely used to make initial quality assessment of microarray data.

Test data

The version of the R scripts made available here has been set up to process the GSE44666 array series. Download the GSE44666_RAW.tar file from GEO, unpack the files into a folder of your choice, and modify the paths in the script to fit the location of the files. You then need to download the sample-array annotation file which matches samples to arrays, and then you can do a step-by-step run of the process_arrays.r script to see the results: check the intermediate results to make sure you understand what is going on and that nothing is going wrong.

The sample-array annotation file contains a column which indicates whether the array should be included or not, and the process_arrays.r script applies this when selecting the arrays for inclusion.

Last modified November 05, 2014.