## 1. Introduction

This is an implementation of a PCA module in Python using numpy, scipy and Python extensions (here, in C). The module carries out Principal Component Analysis (PCA) using either Singular Value Decomposition (SVD) or the NIPALS algorithm. I chose to implement the NIPALS algorithm in C because it is supposed to be faster on larger data sets, and the user can choose the number of PCs that are to be calculated. The scipy package already comes with an SVD method.

## 2. Download

The PCA Module includes a PCA made for numpy and scipy, and also a limited PCA made only for Numeric. You can choose which one you want to install by editing the setup.py file.

- PCA Module 1.1.02 (source code w/ distutils setup)
- PCA Module 1.1 (source code w/ distutils setup)
- PCA Module 1.0 (source code w/ distutils setup)

## 3. Documentation

PCA can be used to reduce multidimensional data to fewer dimensions while preserving the most important information in the process. The result can then be used for exploratory data analysis or to build predictive models. See pca_nipals.pdf in the doc folder for more information about PCA, NIPALS and correlation loadings.
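For readers who want a feel for NIPALS before opening the PDF, here is a minimal numpy sketch of the iteration. It is not the C code shipped in this module: the function name `nipals_pca`, its parameters and the convergence test are assumptions chosen for illustration. It extracts one PC at a time and removes it from the data before computing the next, which is why the number of PCs can be chosen up front.

```python
import numpy as np


def nipals_pca(X, n_pcs=2, tol=1e-10, max_iter=500):
    """Illustrative NIPALS PCA (hypothetical helper, not the module's API).

    Returns scores T (n x n_pcs), loadings P (p x n_pcs) and the
    residual matrix E that remains after removing n_pcs components.
    """
    E = np.asarray(X, dtype=float)
    E = E - E.mean(axis=0)                  # mean-center each variable (column)
    n, p = E.shape
    T = np.zeros((n, n_pcs))
    P = np.zeros((p, n_pcs))

    for a in range(n_pcs):
        # Start from the column of E with the largest variance
        t = E[:, np.argmax(E.var(axis=0))].copy()
        for _ in range(max_iter):
            p_a = E.T @ t / (t @ t)         # regress columns of E on t -> loading
            p_a /= np.linalg.norm(p_a)      # normalise the loading to unit length
            t_new = E @ p_a                 # project E on the loading -> score
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a] = t
        P[:, a] = p_a
        E = E - np.outer(t, p_a)            # deflate: remove this PC from the data

    return T, P, E
```

The returned E is what is left of the mean-centered data after the requested PCs have been removed, so the centered X equals T times the transpose of P, plus E.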
Go to the API pages for details about each function in the PCA Module.

## 4. Installation
Standard distutils build and install (typically `python setup.py build` followed by `python setup.py install`).

There are two variables that can be adjusted in setup.py. The first, "add_ext", controls whether the C extension is compiled and included (to include it: add_ext = True). If you set it to False, the C Python extension will not be included and you cannot access c_nipals; PCA can still be calculated without it. The second, "old_numeric", selects which version to install: the old Numeric version or the numpy version (to use the numpy version: old_numeric = False). The old Numeric version is more limited in the functions it offers. If possible, the PCA module for scipy and numpy should be used.
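As a sketch, the two switches described above would look something like this in setup.py when installing the numpy/scipy version together with the C extension. Only the variable names and values come from the description above; where they sit in the actual setup.py is an assumption.

```python
# Sketch of the two switches in setup.py (their placement in the file is assumed)
add_ext = True       # True: compile and include the C extension, so c_nipals is available
old_numeric = False  # False: install the numpy/scipy version; True: the old Numeric version
```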
To test that installation did not fail, try:

No errors or exceptions should appear.

## 5. Usage

The most difficult part of PCA is using it well. This involves formulating the problem well, choosing the best possible variables, gathering the scores/values, and finally, after the PCA calculation, exploring the data and analysing the plots you can make. You should know a little about PCA before you start using this module. I will not go far into these areas of PCA here, but I will show how you can get the calculated data.

Assume you have a 2-dimensional data matrix (X) that holds the data you want to analyze. This matrix is of size n x p, with n = number of objects and p = number of variables. Each row holds the values of one object and each column holds the values of one variable. The examples in this section show how to run the PCA on such a data matrix (X).
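The layout of X described above can be illustrated with a small numpy array. The numbers are made up and nothing here calls the PCA module; it only shows which axis holds objects and which holds variables.

```python
import numpy as np

# Hypothetical data: 4 objects (rows) measured on 3 variables (columns)
X = np.array([
    [2.0, 10.0, 0.5],
    [3.0, 12.0, 0.7],
    [4.0, 11.0, 0.6],
    [5.0, 15.0, 0.9],
])

n, p = X.shape             # n = number of objects, p = number of variables
first_object = X[0, :]     # the p values measured on the first object
first_variable = X[:, 0]   # the n values of the first variable
```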
Example usage (with explained variance for each PC):
Example usage 2 (with E-matrix for each PC):

Both mean centering (always done) and standardization (standardize=True) of X have now been done before the PCA calculation. The PCA method returns two matrices and one array. The first matrix, T, holds the so-called PCA scores; the second matrix, P, holds the PCA loadings; and the third and last element is an array of the explained variance for each PC. You can read more about this under 3. Documentation.

Here I have made some plots of the results, with short explanations of the content: plot example
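To make the three returned quantities concrete, here is a minimal numpy-only sketch of PCA via SVD with mean centering and optional standardization. It is not the module's own function: the names `pca_svd` and `residual_after` are hypothetical, the explained variance is expressed as a fraction of the total (the module may report it differently), and I assume the E-matrix means the usual residual left after a given number of PCs.

```python
import numpy as np


def pca_svd(X, standardize=False):
    """Illustrative PCA via SVD (hypothetical helper, not the module's API)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # mean centering, always done
    if standardize:
        # Scale each variable to unit variance (fails for constant columns)
        Xc = Xc / Xc.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U * s                                # scores: one row per object
    P = Vt.T                                 # loadings: one row per variable
    explained = s**2 / np.sum(s**2)          # fraction of total variance per PC
    return T, P, explained


def residual_after(T, P, a):
    """Residual (E) matrix left once the first a PCs are removed."""
    return T[:, a:] @ P[:, a:].T
```

The columns of T and P are ordered by decreasing explained variance, so the first two columns of T are what you would normally put in a score plot.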
Go to the functions list for details about each function in the PCA Module.

## 6. Testing

After installation is complete, you can unit-test the module with a Python testing script called testing.py. The testing script also contains a method for time measurements.
Running the testing script:
Or for Numeric install:
errors - problems with module (e.g. import error)

## 7. Updates
PCA Module 1.1.01 - February 2008
PCA Module 1.1 - October 2007
PCA Module 1.0 - May 2007