Principal Components Analysis 

ClustanPCA ... offers a very flexible principal components analysis, allowing for continuous variables or mixed variable types, with or without missing values.  It is also very efficient, being capable of handling hundreds of variables.

Principal Components Analysis is a method for transforming the variables in a sample dataset into new variables which are uncorrelated with each other and account for decreasing proportions of the total variance of the original variables.  Each new variable is a linear combination of the original variables.

The first principal component is that linear combination of the original variables which accounts for the maximum amount of variance in a single line.  It is the line of best fit through the data, and the residual variance about this line is then a minimum for the data set.

Figure: Dialogue for ClustanPCA

The second principal component is that line which is orthogonal to the first principal component and accounts for the maximum amount of the remaining variance in the data, subject to being uncorrelated with the first principal component.  The first two components therefore define the plane of best fit through the data.  A scatter diagram of the data on this plane is the one that most closely reflects the disposition of the points in the full p-dimensional space.

All remaining principal components are defined similarly, so that the last components normally account for very little variance and can usually be ignored.
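The definition above can be illustrated with a minimal NumPy sketch.  It assumes complete, continuous data and uses an eigen-decomposition of the correlation matrix; it is not ClustanPCA's own implementation, and the function name pca_correlation and the random test data are hypothetical.

```python
import numpy as np

def pca_correlation(data):
    """Illustrative PCA on standardized variables (hypothetical helper)."""
    data = np.asarray(data, dtype=float)
    # Standardize each variable to mean 0 and sample variance 1.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(data, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]  # reorder to decreasing variance
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Each new variable is a linear combination of the standardized originals.
    scores = z @ eigenvectors
    return eigenvalues, eigenvectors, scores

# Made-up data: 200 cases, 4 variables, two of them strongly related.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
data = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.3 * base[:, 1],
    base[:, 1],
    rng.normal(size=200),
])

eigenvalues, eigenvectors, scores = pca_correlation(data)
score_covariance = np.cov(scores, rowvar=False)
print(eigenvalues)                    # variances in decreasing order
print(np.round(score_covariance, 6))  # diagonal matrix: components are uncorrelated
```

The covariance matrix of the scores is diagonal, with the eigenvalues on the diagonal, which is the sense in which the new variables are uncorrelated and account for decreasing proportions of the total variance.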

The eigenvalues obtained from Principal Components Analysis are equal to the variance explained by each of the principal components, in decreasing order of importance.  As the original variables are standardized for PCA, they will each have a variance of 1.  It follows that any component with an eigenvalue of at least 1 explains at least as much of the variance as any single original variable.  A rule-of-thumb is, therefore, to select that number of principal components having an eigenvalue of at least 1.
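As a small sketch of this rule-of-thumb, the following counts the components with an eigenvalue of at least 1 and reports the proportion of variance they explain.  The eigenvalues are made up for illustration (they sum to 4, as they would for four standardized variables), and components_to_keep is a hypothetical helper, not part of ClustanPCA.

```python
import numpy as np

def components_to_keep(eigenvalues, threshold=1.0):
    """Count the components whose eigenvalue is at least the threshold."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return int(np.sum(eigenvalues >= threshold))

# Hypothetical eigenvalues for p = 4 standardized variables, in decreasing order.
eigenvalues = np.array([2.6, 1.1, 0.2, 0.1])

kept = components_to_keep(eigenvalues)
# With standardized variables the total variance equals the number of variables.
proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)

print(kept)                  # number of components retained by the rule
print(cumulative[kept - 1])  # share of total variance they explain
```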

Figure: Scatterplot showing the distribution of the first two principal components which together account for 95% of the variance in the Mammals case study.  This indicates a very good fit to the data.  The cases have been colour-coded according to the 7-cluster partition, for which Rabbit and Elephant are outliers.

The eigenvectors are weightings which, when applied to the original data, obtain principal component scores for the observations.  A large positive or negative value indicates a variable that is correlated, either in a positive or a negative way, with the component.
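The role of the eigenvectors as weightings can be sketched as follows, assuming standardized, complete data.  The two correlated variables are made up, and the variable names are hypothetical.  For standardized data, the correlation between a variable and a component equals the eigenvector weight multiplied by the square root of the eigenvalue, which is why a large weight signals a strong correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 300 cases on two positively correlated variables.
x = rng.normal(size=300)
y = 0.8 * x + 0.6 * rng.normal(size=300)
data = np.column_stack([x, y])

# Standardize each variable to mean 0 and sample variance 1.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Applying the weightings: each score is a weighted sum of the standardized variables.
scores = z @ eigenvectors

# Correlation of every variable (rows) with every component (columns),
# predicted from the weights and measured directly from the scores.
predicted = eigenvectors * np.sqrt(eigenvalues)
observed = np.array([
    [np.corrcoef(z[:, j], scores[:, k])[0, 1] for k in range(2)]
    for j in range(2)
])

print(np.round(eigenvectors, 3))
print(np.round(observed, 3))
```

The sign of each eigenvector is arbitrary, so a component and its weights may appear with all signs reversed without changing the interpretation.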

It may be helpful to plot a scatter diagram of principal component scores for the first two principal components (shown left).  This diagram constitutes the best two-dimensional representation of the data, i.e. it is the plane of best fit through the scatter in a high dimensional space, having the minimum residual variance about the plane.  It is, in some sense, the best two-dimensional "view" of the data.

How to run ClustanPCA

Your data set can contain continuous variables or mixed variables, and missing values may be present.  We believe this degree of flexibility for principal components analysis to be unique.

Having read your data, click View/PCA to open the PCA dialogue (top).  This shows the number of cases and number of variables in your data set.

Figure: Results from ClustanPCA for the Mammals dataset

You can select the number of dimensions you wish to save on completion.  However, if you leave this blank, ClustanPCA will save the number of components having an eigenvalue of at least 1.

Click "Start" and Principal Components Analysis will run.  On completion, click "Results" to obtain the eigenvalues and eigenvectors (right).  These can also be copied to the Clipboard for incorporation into a report or another program.  If you wish to view the principal component scores, click "Scores" in this dialogue.

Click "Finish" and the first two principal components will be plotted as a scatter diagram.  To obtain this plot you should also complete a hierarchical cluster analysis on your data, so that the partition highlighted on the tree will be colour coded by cluster for the first two principal components.

To view the principal component scores, click View/Data.  They can be found under the columns "Scatter1" to "ScatterN", to the right of the data, and can be selected on their own by de-selecting the box labelled "Display Data".  The PCA scores can also be copied to the Clipboard by clicking "Copy".

ClustanPCA Limitations

There are no fixed restrictions to ClustanPCA, but when the number of variables is large the analysis may take some time to run or may run out of internal memory.  As a rough guide, ClustanPCA requires about 10p² bytes of memory, where p is the number of variables; hence if there are 100 variables, it will require about 100 KB of internal storage.  A data set with 1000 variables will require about 10 MB of storage, and the analysis will take some time to complete.
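The rough guide of 10p² bytes can be turned into a quick estimate.  This is only a sketch of the arithmetic above; estimated_memory_bytes is a hypothetical helper, and the factor of 10 is the approximate figure quoted in the text.

```python
def estimated_memory_bytes(num_variables, bytes_per_squared_variable=10):
    """Rough memory estimate from the 10 * p**2 guide (p = number of variables)."""
    return bytes_per_squared_variable * num_variables ** 2

for p in (100, 1000):
    print(p, "variables:", estimated_memory_bytes(p), "bytes")
```

Because the estimate grows with the square of p, a tenfold increase in the number of variables needs about a hundred times the memory.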

The number of cases is not a major limitation for ClustanPCA.

ClustanPCA is distributed with ClustanGraphics 7.04 onwards.

Clustan - A Class Act © 1998 Clustan Ltd.