This section discusses aspects of proximity analysis, or the measurement of the proximity between cases and between clusters.
If you are going to attempt a cluster analysis on data, then you should address the issue of what proximity measure to use at an early stage. Questions to be resolved include:
What are the variables' types?
How to represent any missing values?
Should the data be transformed?
How to compute the proximity between cases?
How to measure proximity between clusters?
What criterion should be optimized when clustering?
Types of Variables
In a simple cluster analysis, all the variables would be on a continuous scale. For example, in the Mammals Case Study, the variables are the percentages of Water, Protein, Fat, Lactose and Trace Elements (Ash) in the milk of different mammals. More complex data types are possible in ClustanGraphics, where the variables can be any combination of binary (presence/absence), nominal, ordinal and continuous. If your data has this complexity, click here.
Missing Values
With ClustanGraphics you can handle an incomplete data matrix or proximity matrix. The effect of missing values is to ignore any comparison between two cases or two clusters where either one value or both values are missing. Click here for details.
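As a rough illustration of ignoring comparisons that involve a missing value, the sketch below averages squared differences over only those variables present in both cases. The function name `pairwise_distance`, the use of `None` as the missing-value marker, and the choice to average over the surviving comparisons are all assumptions made for this example; they are not taken from ClustanGraphics.

```python
def pairwise_distance(case_a, case_b):
    """Mean squared difference over variables present in both cases.

    None marks a missing value (an assumption of this sketch).
    Returns None when the two cases share no observed variable.
    """
    squared_diffs = [
        (x - y) ** 2
        for x, y in zip(case_a, case_b)
        if x is not None and y is not None  # skip any comparison with a missing value
    ]
    if not squared_diffs:
        return None
    return sum(squared_diffs) / len(squared_diffs)


# Only the first variable is observed in both cases, so only it is compared.
example = pairwise_distance([1.0, None, 3.0], [2.0, 5.0, None])
```

Averaging, rather than summing, keeps distances comparable between pairs of cases that have different numbers of usable variables.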
If you have ordinal or continuous variables, you may wish to transform your data, for example by dividing each variable by its range of values or standard deviation. Details of data transformations can be found here.
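The two transformations mentioned above can be sketched for a single variable as follows. The function names are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption of this example.

```python
import statistics


def scale_by_range(values):
    """Divide each value of one variable by that variable's range."""
    span = max(values) - min(values)
    if span == 0:
        raise ValueError("variable is constant; its range is zero")
    return [v / span for v in values]


def scale_by_std(values):
    """Divide each value of one variable by its sample standard deviation."""
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("variable is constant; its standard deviation is zero")
    return [v / sd for v in values]


fat_percent = [2.0, 4.0, 6.0]  # made-up values for illustration
by_range = scale_by_range(fat_percent)
by_std = scale_by_std(fat_percent)
```

Either transformation puts variables measured on different scales on a comparable footing before distances are computed.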
The next decision is how to compute the proximities between cases. ClustanGraphics provides a range of
proximity measures, which differ according to whether your variables are all continuous, all binary, or mixed. For a worked example, and a list of current proximity measures, click here.
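To show how the choice of measure depends on the variable type, here is a minimal sketch of two widely used coefficients: Euclidean distance for continuous variables and the simple matching coefficient for binary variables. These are generic textbook measures chosen for illustration; the function names are hypothetical and the sketch does not claim to reproduce the list of measures in ClustanGraphics.

```python
import math


def euclidean_distance(case_a, case_b):
    """Distance between two cases described by continuous variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(case_a, case_b)))


def simple_matching(case_a, case_b):
    """Similarity between two cases described by binary (0/1) variables:
    the proportion of variables on which the two cases agree."""
    matches = sum(1 for x, y in zip(case_a, case_b) if x == y)
    return matches / len(case_a)


d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
s = simple_matching([1, 0, 1, 1], [1, 1, 1, 0])
```

Note that the first function returns a distance (smaller means closer) while the second returns a similarity (larger means closer).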
Reading Proximities
It may be that your input observations are themselves proximities. An example is the Proteins Case Study, where the distance between two species is the number of positions in the protein cytochrome-c molecule where the proteins for the two species have different amino acids. Click here to find out how to read proximities.
Another example is where you may wish to cluster your variables, and the input is a correlation matrix between your variables. Click here for a discussion on clustering variables.
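A distance defined as the number of positions at which two sequences differ can be sketched as below. The sequences are made-up fragments for illustration only, not real cytochrome-c data, and the names `count_differences` and `distance_matrix` are hypothetical.

```python
def count_differences(seq_a, seq_b):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


# Made-up aligned fragments; NOT real protein data.
sequences = {
    "species_a": "GDVEKGKK",
    "species_b": "GDVEKGKK",
    "species_c": "GDAEKGKT",
}

# Full proximity matrix keyed by (species, species) pairs.
names = sorted(sequences)
distance_matrix = {
    (p, q): count_differences(sequences[p], sequences[q])
    for p in names
    for q in names
}
```

A matrix built this way is already a set of proximities, so it can be clustered directly without any underlying case-by-variable data.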
Proximities Between Clusters
Your choice of clustering method or clustering criterion will determine the way in which the proximity between two clusters is measured. For example, using single linkage, the proximity between two clusters is the highest
similarity (or smallest distance) between any two cases, one from each of the clusters; that is, the proximity between their nearest neighbours.
By contrast, with average linkage, the similarity between two clusters is the average of the proximities between all pairs of cases, one from each of the two clusters.
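The difference between the two linkage rules can be sketched as follows, working with distances rather than similarities. The function names and the one-dimensional example data are assumptions made for illustration.

```python
def single_linkage(cluster_a, cluster_b, dist):
    """Smallest distance between any two cases, one from each cluster."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def average_linkage(cluster_a, cluster_b, dist):
    """Average distance over all pairs of cases, one from each cluster."""
    pair_distances = [dist(p, q) for p in cluster_a for q in cluster_b]
    return sum(pair_distances) / len(pair_distances)


def abs_diff(p, q):
    """Distance between two cases measured on a single variable."""
    return abs(p - q)


cluster_1 = [0.0, 1.0]  # made-up one-variable cases
cluster_2 = [3.0, 5.0]

nearest = single_linkage(cluster_1, cluster_2, abs_diff)
average = average_linkage(cluster_1, cluster_2, abs_diff)
```

Here the four between-cluster distances are 3, 5, 2 and 4, so single linkage reports the nearest-neighbour distance of 2 while average linkage reports 3.5.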
We often recommend optimizing the Euclidean Sum of Squares. This involves finding the mean of each cluster and the distance between each case and the mean of its cluster, then squaring these distances and
summing the squared distances for all the cases in all the clusters. It is a measure of the within-cluster variance, or the diversity of a particular classification or cluster model.
If we obtain a small value for the Euclidean Sum of Squares, then the cases within the clusters are all tightly grouped around the cluster means, so each cluster mean is representative of all the cases it contains. For a
definition of the Euclidean Sum of Squares and how we optimize it in hierarchical cluster analysis or k-means analysis, click here.
Don't forget that if you have mixed data types, ClustanGraphics computes general distances and statistics without warping your data to fit the method.
Please refer to the ClustanGraphics Primer for definitions. Further details about computing proximity coefficients can be found here.