Supposing that the data have been transformed to standard scores, or z-scores, then a typical selection of five cases taken from the Mammals Case Study is as follows:
Standardization to z-scores has the effect that each column has a mean of zero and a standard deviation of 1. See Data Transformations for further details. The next step is to calculate proximities between all pairs of cases. In ClustanGraphics it's simple: just click Compute on the Proximities menu, and select a proximity coefficient from the drop-down list. In this example, we have selected Squared Euclidean Distance. The result is a proximity matrix of order 25.
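The two steps described above can be sketched in Python with NumPy. This is a minimal illustration, not the ClustanGraphics implementation: the function names and the small data matrix are made up for the example, the population standard deviation (ddof=0) is an assumption, and the distance is a plain sum of squared differences, so any scaling ClustanGraphics applies (for example, dividing by the number of variables) is not reproduced here.

```python
import numpy as np

def zscore(data):
    """Standardize each column to mean 0 and standard deviation 1.

    Assumption: population standard deviation (ddof=0).
    """
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0)

def squared_euclidean(z):
    """Return the n x n matrix of squared Euclidean distances between rows."""
    diff = z[:, None, :] - z[None, :, :]   # shape (n, n, variables)
    return (diff ** 2).sum(axis=2)

# Hypothetical data: 4 cases (rows) measured on 2 variables (columns).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0],
                 [6.0, 30.0]])

z = zscore(data)
d2 = squared_euclidean(z)
print(np.round(d2, 3))
```

The resulting matrix is symmetric with zeros on the diagonal, so only the entries below (or above) the diagonal need to be stored or inspected.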
Just by examining the proximities we can determine interesting details, such as the fact that donkey and zebra are quite similar, with a squared distance of 0.186, whereas donkey and seal are the two most dissimilar cases, with a squared distance of 8.731. Of course it's not very practicable to examine all 300 proximities (25 × 24 / 2 pairs) by inspecting the proximity matrix for 25 cases, and it certainly would not be practicable with 10,000 cases. However, we can use Nearest Neighbour analysis to find each case's nearest neighbours, and clustering the proximity matrix is the main way to group the cases into clusters and thus describe the structure and diversity of the data. See the ClustanGraphics Preview, where the Mammals Case Study is taken further.
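As a rough illustration of why inspecting the matrix by eye does not scale, and of what a nearest-neighbour search over a proximity matrix involves, here is a short Python sketch. The function names are hypothetical and this is not how ClustanGraphics implements Nearest Neighbour analysis. The donkey-zebra and donkey-seal values are the ones quoted above; the zebra-seal value is a placeholder invented for the example.

```python
import numpy as np

def n_pairs(n_cases):
    """Number of distinct pairs of cases, n(n-1)/2."""
    return n_cases * (n_cases - 1) // 2

def nearest_neighbours(d):
    """For each case, return the index of its nearest other case."""
    d = np.array(d, dtype=float)        # work on a copy
    np.fill_diagonal(d, np.inf)         # a case is not its own neighbour
    return d.argmin(axis=1)

labels = ["donkey", "zebra", "seal"]
d2 = np.array([[0.0,   0.186, 8.731],
               [0.186, 0.0,   8.0],     # 8.0 is a made-up placeholder
               [8.731, 8.0,   0.0]])

print(n_pairs(25))                      # 300
for i, j in enumerate(nearest_neighbours(d2)):
    print(labels[i], "->", labels[j])
```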
For continuous variables, the proximity coefficients are as follows:
Squared Euclidean Distance
For binary variables, you can select from the following binary proximity coefficients:
Binary Euclidean Distance (B+C)/M
These coefficients compare any two cases i and j across all M unmasked binary variables, as follows:
If a variable is "missing" for either case i or case j, then it is not considered for the computation of the coefficient. In this case M is the number of variables that are
Binary Euclidean Distance (B+C)/M is a dissimilarity coefficient.
Use Binary Euclidean Distance if you intend to cluster by minimizing the Euclidean Sum of Squares (Ward's Method).
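A minimal Python sketch of the Binary Euclidean Distance (B+C)/M follows. It assumes the conventional 2 × 2 notation, in which B counts the variables present in case i but absent in case j, and C counts the reverse; the definitions are not reproduced on this page, so treat that reading as an assumption. The function name and the use of None for a missing value are also choices made for this example. Variables missing in either case are skipped, so M counts only the variables actually compared.

```python
def binary_euclidean(x, y):
    """Binary Euclidean Distance (B+C)/M between two cases.

    x, y: sequences of 1 (present), 0 (absent) or None (missing).
    Assumption: B and C are the two kinds of mismatch.
    """
    b = c = m = 0
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue                    # missing in either case: skip
        m += 1
        if xi == 1 and yi == 0:
            b += 1
        elif xi == 0 and yi == 1:
            c += 1
    if m == 0:
        raise ValueError("no variables are present for both cases")
    return (b + c) / m

case_i = [1, 0, 1, None, 0]
case_j = [1, 1, 0, 1, 0]
print(binary_euclidean(case_i, case_j))   # 0.5
```

For 0/1 data each mismatch contributes exactly 1 to the squared Euclidean distance, so B+C is the squared Euclidean distance between the two cases and (B+C)/M is that distance averaged over the M variables compared.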
For mixed data types, you can compute a proximity matrix from the following measures of proximity:
Squared Euclidean Distance
There is no program limit to the size of proximity matrix which can be computed; the limit is determined by the memory and disk resources available on the user's PC. As a rough guide only, a reasonable Pentium PC is capable of computing proximities for up to about 10,000 cases with ClustanGraphics. If you have a larger data matrix, we recommend that you use Direct Data Clustering, which can produce a hierarchical cluster analysis for 100,000 cases or more.
Note that ClustanGraphics can compute proximities from incomplete data - see missing values for details. For further definitions and other details, please refer to the ClustanGraphics Primer and the ClustanGraphics Help file.
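The size guidance above follows from the quadratic growth of the proximity matrix: n cases give n(n-1)/2 distinct proximities. The short Python sketch below estimates the storage needed. The function name is hypothetical, and the 8 bytes per proximity is an assumption made for the example, not a statement about how ClustanGraphics stores its matrix.

```python
def proximity_matrix_bytes(n_cases, bytes_per_value=8):
    """Storage for the lower triangle of an n x n proximity matrix.

    Assumption: one value of bytes_per_value bytes per pair of cases.
    """
    return n_cases * (n_cases - 1) // 2 * bytes_per_value

for n in (25, 10_000, 100_000):
    size_mb = proximity_matrix_bytes(n) / 1e6
    print(f"{n:>7} cases: {size_mb:,.1f} MB")
```

Under these assumptions, 10,000 cases need roughly 400 MB, while 100,000 cases would need roughly 40 GB, which is why an approach that avoids computing the full proximity matrix is recommended for larger data sets.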