Proximity Analysis 

This section discusses aspects of proximity analysis, or the measurement of the proximity between cases and between clusters.

If you are going to attempt a cluster analysis on data, then you should address the issue of what proximity measure to use at an early stage.  Questions to be resolved include:

    What are the variables' types?
    How to represent any missing values?
    Should the data be transformed?
    How to compute the proximity between cases?
    How to measure proximity between clusters?
    What criterion should be optimized when clustering?
     

Types of Variables
In a simple cluster analysis, all the variables would be on a continuous scale.  For example, in the Mammals Case Study, the variables are the percentages of Water, Protein, Fat, Lactose and Trace Elements (Ash) in the milk of different mammals.  More complex data types are possible in ClustanGraphics, where the variables can be any combination of binary (presence/absence), nominal, ordinal and continuous.  If your data has this complexity, click here.

Missing Values
With ClustanGraphics you can handle an incomplete data matrix or proximity matrix.  Click here for details.  Where a value is missing for either or both of two cases (or two clusters), the comparison on that value is ignored.
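
The rule above can be sketched in a few lines of Python.  This is only an illustration: the function name is hypothetical, None is assumed to mark a missing value, and averaging over the variables that remain is an assumption rather than the documented ClustanGraphics formula.

```python
def squared_distance_skipping_missing(a, b):
    """Mean squared difference over the variables observed in both cases.

    None marks a missing value (an assumption for this sketch).
    Returns None when the two cases share no observed variable.
    """
    diffs = [(x - y) ** 2 for x, y in zip(a, b)
             if x is not None and y is not None]
    if not diffs:
        return None
    return sum(diffs) / len(diffs)


# Invented values, for illustration only.
case_a = [1.0, None, 3.0]
case_b = [2.0, 5.0, None]
case_c = [4.0, 6.0, 7.0]

d_ab = squared_distance_skipping_missing(case_a, case_b)  # only the first variable is comparable
d_ac = squared_distance_skipping_missing(case_a, case_c)  # first and third variables are comparable
```

Dividing by the number of comparisons actually made keeps distances comparable between pairs of cases that have different numbers of missing values.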

Data Transformations
If you have ordinal or continuous variables, you may wish to transform your data, for example by dividing each variable by its range of values or standard deviation.  Details of data transformations are here.
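
As a minimal sketch of the two transformations mentioned above, the following Python divides one variable by its range or by its standard deviation.  The function names are hypothetical, and the use of the population standard deviation is an assumption; the transformations offered by ClustanGraphics may differ in detail.

```python
import statistics


def scale_by_range(column):
    """Divide every value of one variable by that variable's range."""
    span = max(column) - min(column)
    if span == 0:
        raise ValueError("variable is constant, so its range is zero")
    return [x / span for x in column]


def scale_by_std(column):
    """Divide every value of one variable by its standard deviation.

    Uses the population standard deviation (an assumption).
    """
    sd = statistics.pstdev(column)
    if sd == 0:
        raise ValueError("variable is constant, so its standard deviation is zero")
    return [x / sd for x in column]


# Invented values for a single continuous variable.
values = [2.0, 4.0, 6.0, 10.0]
by_range = scale_by_range(values)
by_std = scale_by_std(values)
```

After either transformation, variables measured in different units contribute on a comparable scale to the proximities.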

Computing Proximities
The next decision is how to compute the proximities between cases.  ClustanGraphics provides a range of proximity measures, which differ according to whether your variables are all continuous, all binary, or mixed.  For a worked example, and a list of current proximity measures, click here.
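
To make the distinction between continuous and binary variables concrete, here is a small Python sketch of two textbook measures: Euclidean distance for continuous variables and the simple matching coefficient for binary variables.  These are chosen for illustration only and are not claimed to be the measures ClustanGraphics uses; all names and data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two cases with continuous variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def simple_matching(a, b):
    """Proportion of binary variables on which two cases agree (a similarity)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def proximity_matrix(cases, measure):
    """Full symmetric matrix of proximities between all pairs of cases."""
    n = len(cases)
    return [[measure(cases[i], cases[j]) for j in range(n)] for i in range(n)]


# Invented data, for illustration only.
continuous_cases = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dist = proximity_matrix(continuous_cases, euclidean)

binary_cases = [[1, 0, 1, 1], [1, 1, 1, 0]]
sim = simple_matching(binary_cases[0], binary_cases[1])
```

Note that the first measure is a distance (smaller means closer) while the second is a similarity (larger means closer), so the two should not be mixed without conversion.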

Reading Proximities
It may be that your input observations are themselves proximities.  An example is the Proteins Case Study, where the distance between two species is the number of positions in the cytochrome-c protein molecule at which the two species have different amino acids.  Click here to find out how to read proximities.
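
The kind of distance described above, a count of differing positions, can be sketched as follows.  The sequences are short invented strings, not real cytochrome-c data, and the function name is hypothetical; the sketch also assumes the sequences are already aligned to equal length.

```python
def count_differences(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Invented sequences, one letter per amino acid position.
sequences = {
    "species_1": "GDVEKGKK",
    "species_2": "GDVAKGKK",
    "species_3": "GDIEKGRK",
}

names = list(sequences)
distance = {
    (p, q): count_differences(sequences[p], sequences[q])
    for p in names for q in names
}
```

A table like `distance` is already a proximity matrix, so it could be supplied to a clustering procedure directly instead of a cases-by-variables data matrix.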

Another example is where you may wish to cluster your variables, and the input is a correlation matrix between your variables.  Click here for a discussion on clustering variables.

Proximities Between Clusters
Your choice of clustering method or clustering criterion will determine the way in which the proximity between two clusters is measured.  For example, using single linkage, the proximity between two clusters is the highest similarity (or smallest distance) between any two cases, one from each of the clusters.  In other words, it is the proximity between their nearest neighbours.

By contrast, with average linkage, the similarity between two clusters is the average of the proximities between all pairs of cases, one from each of the two clusters.
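
The difference between the two definitions can be seen in a short Python sketch working from a distance matrix.  The matrix values and function names are invented for illustration, and clusters are assumed to be given as lists of case indices.

```python
def single_linkage(dist, cluster_a, cluster_b):
    """Smallest distance between any two cases, one from each cluster."""
    return min(dist[i][j] for i in cluster_a for j in cluster_b)


def average_linkage(dist, cluster_a, cluster_b):
    """Average distance over all pairs of cases, one from each cluster."""
    pairs = [dist[i][j] for i in cluster_a for j in cluster_b]
    return sum(pairs) / len(pairs)


# Invented symmetric distance matrix for four cases.
dist = [
    [0.0, 1.0, 4.0, 6.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 3.0],
    [6.0, 5.0, 3.0, 0.0],
]
left, right = [0, 1], [2, 3]

nearest = single_linkage(dist, left, right)   # decided by cases 1 and 2 alone
average = average_linkage(dist, left, right)  # uses all four cross-cluster pairs
```

Single linkage depends on one pair of cases only, whereas average linkage takes every cross-cluster pair into account.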

We often recommend optimizing the Euclidean Sum of Squares.  This involves finding the mean of each cluster and the distance between each case and the mean of its cluster, then squaring these distances and summing the squared distances over all the cases in all the clusters.  It is a measure of the within-cluster variance, or the diversity of a particular classification or cluster model.

If we obtain a small value for the Euclidean Sum of Squares, then the cases within the clusters are all tightly grouped around the cluster means, so each cluster mean is representative of all the cases it contains.  For a definition of the Euclidean Sum of Squares and how we optimize it in hierarchical cluster analysis or k-means analysis, click here.
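
The calculation described above can be sketched in Python as follows.  This is a plain, unweighted version that assumes complete continuous data; the formal Clustan definition may include weights or scaling not shown here, and the function names and data points are invented for illustration.

```python
def cluster_mean(cases):
    """Mean of each variable over the cases in one cluster."""
    n = len(cases)
    return [sum(column) / n for column in zip(*cases)]


def euclidean_sum_of_squares(clusters):
    """Sum, over all clusters, of squared distances from each case to its cluster mean."""
    total = 0.0
    for cases in clusters:
        mean = cluster_mean(cases)
        for case in cases:
            total += sum((x - m) ** 2 for x, m in zip(case, mean))
    return total


# Four invented cases forming two obvious groups.
points = [[1.0, 1.0], [1.0, 3.0], [9.0, 1.0], [9.0, 3.0]]

tight = [[points[0], points[1]], [points[2], points[3]]]  # neighbours grouped together
loose = [[points[0], points[2]], [points[1], points[3]]]  # distant cases grouped together

ess_tight = euclidean_sum_of_squares(tight)
ess_loose = euclidean_sum_of_squares(loose)
```

The grouping that keeps neighbouring cases together gives the much smaller value, which is the sense in which a small Euclidean Sum of Squares indicates cluster means that represent their cases well.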

Don't forget that if you have mixed data types, ClustanGraphics computes general distances and cluster statistics without warping your data to fit the method.

Please refer to the ClustanGraphics Primer for definitions.  Further details about computing proximity coefficients can be found here.