PCA Module for Python


Contents

  1. Introduction
  2. Download
  3. Documentation
  4. Installation
  5. Usage
  6. Testing
  7. Updates

1. Introduction

Implementation of a PCA module in python using numpy, scipy and python extensions (here, in C). The module carries out Principal Component Analysis (PCA) using either Singular Value Decomposition (SVD) or the NIPALS algorithm.

I chose to implement the NIPALS algorithm in C, because it is supposed to be faster on larger data sets. The user can choose the number of PCs that are to be calculated. And the scipy package already comes with a SVD method.


2. Download

The PCA Module includes a PCA made for numpy and scipy, and also a limited PCA only made for Numeric. You can choose which one you want to install by editing the setup.py file.

PCA Module 1.1.02 (source code w/ distutils setup)

PCA Module 1.1 (source code w/ distutils setup)

PCA Module 1.0 (source code w/ distutils setup)


3. Documentation

PCA can be used to reduce multidimensional data to fewer dimensions, while preserving the most important information during the process. After that it can be used for exploratory data analysis or to make predictive models.

See pca_nipals.pdf in the doc folder for more information about PCA, Nipals and Correlation Loadings: pca_nipals.pdf

Go to the API pages for details about each function in the PCA Module:
API Numpy version
API Numeric version


4. Installation

Standard distutils build and install:
$ python setup.py install

There are two variables that can be adjusted in setup.py. The first "add_ext" sets extension to be compiled and included (include: add_ext = True). If you set it to False, C python extension will not be included and you cannot access c_nipals. PCA can be calculated without the C python extension. The other variable "old_numeric" sets which version to use. Either old numeric version or the numpy version (use numpy version: old_numeric = False).

The old numeric version is more limited when it comes to functions. If possible the PCA module for scipy and numpy should be used.

To test that installation did not fail, try:
$ python
...
>>> import pca_module
>>>

No errors or exceptions should appear.


5. Usage

The most difficult part about PCA is using it well. This involves: formulating the problem well, choosing the best possible variables, scores/values must be gathered, and finally, after PCA calculation, explore the data and analyse plots you can make. You should know a little about PCA before you start using this module. I will not go far into these areas of PCA here, but show how you can get the calculated data.

Assume you have a 2-dimensional data matrix ( X ) that holds your data you want to analyze. This matrix is of size n x p with n = number of objects and p = number of variables. Each row holds the values of an object and each column holds the value for a variable. Here you will see how to run the PCA on such a data matrix ( X ):

Examples usage (with explained variance for each PC):
>>> from pca_module import *
>>> T, P, explained_var = PCA_svd(X, standardize=True)
>>>

Examples usage 2 (with E-matrix for each PC):
>>> from pca_module import *
>>> T, P, E = PCA_nipals2(X, standardize=True, E_matrices=True)
>>>

Now both mean centering (always done) and standardization (standardize=True) of X has been done before PCA calculation. The PCA method returns two matrices and one array. The first matrix, T, are the so called PCA Scores, the second matrix, P, are the PCA Loadings, and the third and last element is an array of explained variance for each PC. You can read more about this under: 3. Documentation.

Here I have made some plots of the results, and short explanations about the content: plot example

Go to the functions list for details about each function in the PCA Module:
Functions list


6. Testing

After installation is complete, you can unit-test the module. With a python testing script called testing.py. There is also a method for time measurements in the testing script.

Running the testing script:
$ python testing.py

Or for Numeric install:
$ python testing_numeric.py

errors - problems with module (e.g. import error)
failures - functions returns wrong results


7. Updates

PCA Module 1.1.01 - february 2008
- Changed to using numpy.linalg instead of scipy.linalg.
- Added test in nipals. Before t was just set to first column of E. Now t will be set to a non-zero vector if possible.
- And some other minor fixes.

PCA Module 1.1 - oktober 2007
- You can now get all E-matrices after a PCA. See 'Usage' on how to do it. Updated for all nipals methods including python c extension.
- PCA_svd still only returns the explained variances.

PCA Module 1.0 - may 2007
- PCA using different packages and methods.


by: Henning Risvik, risvik@gmail.com, february 2008, University of Oslo