ClustanPCA ...
offers a very flexible principal components analysis, allowing for continuous variables or mixed variable types, with or without missing values. It is also very efficient, being capable of handling hundreds of variables.

Principal Components Analysis is a method for transforming the variables in a sample dataset into new variables which are uncorrelated with each other and account for decreasing
proportions of the total variance of the original variables. Each new variable is a linear combination of the original variables. The first principal component is that linear combination of the
original variables which accounts for the maximum amount of variance in a single line. It is the line of best fit through the data, and the residual variance about this line is then a minimum for the data set. The second principal component is that line which is orthogonal to the first principal component and
accounts for the maximum amount of the remaining variance in the data, subject to being uncorrelated with the first principal component. The first two components therefore represent the plane of best fit through
the data. Projecting the points onto this plane gives the scatter diagram that most closely reflects the disposition of the points in the full p-dimensional space. All remaining principal components are defined similarly, such that the lowest order
components normally account for very little variance and can usually be ignored. The eigenvalues obtained from Principal Components Analysis are equal to the variance explained by each of the principal
components, in decreasing order of importance. As the original variables are standardized for PCA, they will each have a
variance of 1. It follows that any component with an eigenvalue of at least 1 explains at least as much of the variance as
any single original variable. A rule of thumb is, therefore, to select the principal components having an eigenvalue of at least 1. The eigenvectors are
weightings which, when applied to the standardized variables, yield principal component scores for the observations. A weighting that is large in absolute value indicates a variable that is strongly correlated, either positively or negatively,
with the component. It may be helpful to plot a scatter diagram of principal component scores for the first two principal components (shown left). This diagram constitutes the best two-dimensional
representation of the data, i.e. it is the plane of best fit through the scatter in a high-dimensional space, having the minimum residual
variance about the plane. It is, in some sense, the best two-dimensional "view" of the data.

How to run ClustanPCA
Your data set can contain continuous variables or mixed variable types, and missing values can be present. We
believe this degree of flexibility for principal components analysis to be unique. Having read your data, click View/PCA to open the PCA dialogue (top). This shows the number of cases and
number of variables in your data set.
You can select the number of dimensions you wish to save on completion. However, if you leave this blank, ClustanPCA will save the number of components having an eigenvalue of at least 1.
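
The default described above (keeping the components whose eigenvalue is at least 1) can be illustrated with a minimal Python sketch. This is not ClustanPCA's own code: it assumes NumPy, a data matrix of continuous variables with no missing values, and a hypothetical function name pca_correlation. It does not attempt the mixed-variable or missing-value handling that ClustanPCA offers.

```python
import numpy as np


def pca_correlation(data, n_components=None):
    """PCA on standardized variables (hypothetical helper, not ClustanPCA).

    data: (n_cases, p) array-like of continuous values, no missing entries.
    n_components: number of components to keep; if None, keep those with
    an eigenvalue of at least 1.
    Returns (eigenvalues, kept eigenvectors, scores).
    """
    x = np.asarray(data, dtype=float)
    n_cases, p = x.shape

    # Standardize each variable to mean 0 and variance 1.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

    # Correlation matrix of the original variables.
    corr = (z.T @ z) / (n_cases - 1)

    # eigh returns eigenvalues in ascending order; sort them descending.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    if n_components is None:
        # Eigenvalue-at-least-1 rule, with a small tolerance for rounding.
        n_components = int(np.sum(eigenvalues >= 1.0 - 1e-9))
    n_components = min(n_components, p)

    # Scores: eigenvector weightings applied to the standardized data.
    kept = eigenvectors[:, :n_components]
    scores = z @ kept
    return eigenvalues, kept, scores
```

Calling pca_correlation(data) applies the eigenvalue rule, while pca_correlation(data, n_components=2) keeps exactly two components, for example for a scatter diagram of the first two sets of scores.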
Click "Start" and Principal Components Analysis will run. On completion, click "Results" to obtain the eigenvalues and eigenvectors (right). These can also be copied to the Clipboard for incorporation into a
report or another program. If you wish to view the principal component scores, click "Scores" in this dialogue. Click "Finish" and the first two principal components will
be plotted as a scatter diagram. To obtain this plot you should also complete a hierarchical cluster analysis on your data, so that the partition highlighted on the tree
will be colour coded by cluster for the first two principal components. To view the principal component scores, click View/Data. They can be found under the columns
"Scatter1" to "ScatterN", to the right of the data, and can be shown on their own by deselecting the box labelled "Display Data". The PCA scores can also be copied to the Clipboard by clicking "Copy".

ClustanPCA Limitations
There are no fixed limits in ClustanPCA, but when the number of variables is large the analysis may take some time to run or may run out of internal memory. As a rough guide, ClustanPCA requires about 10p^2
bytes of memory, where p is the number of variables; hence if there are 100 variables, it will require about 100 KB of internal
storage. A data matrix of 1000 variables will require about 10 MB of storage and take some time to complete a PCA. The number of cases is not a major limitation for ClustanPCA.
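
The memory rule of thumb above is simple arithmetic, and a short Python sketch makes it easy to check for other values of p. The function name estimated_pca_memory_bytes is hypothetical, and the figure is only the rough guide quoted in the text, not a measurement.

```python
def estimated_pca_memory_bytes(n_variables):
    """Rough estimate from the 10 * p**2 bytes rule of thumb (hypothetical helper)."""
    if n_variables < 0:
        raise ValueError("n_variables must be non-negative")
    return 10 * n_variables ** 2


# 100 variables -> 100,000 bytes; 1000 variables -> 10,000,000 bytes.
for p in (100, 1000):
    print(p, estimated_pca_memory_bytes(p))
```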
ClustanPCA is distributed with ClustanGraphics 7.04 onwards.

Clustan - A Class Act © 1998 Clustan Ltd.
