Hierarchical Cluster Analysis 

Hierarchical cluster analysis is a statistical method for finding relatively homogeneous clusters of cases based on measured characteristics.  It starts with each case in a separate cluster and then combines the clusters sequentially, reducing the number of clusters at each step until only one cluster is left.  When there are N cases, this involves N-1 clustering steps, or fusions.
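The fusion sequence described above can be sketched in a few lines of Python. This is only an illustration, not the ClustanGraphics implementation: the function name `agglomerate`, the toy distance matrix and the choice of single linkage are all assumptions made for the example.

```python
from itertools import combinations

def agglomerate(dist):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Single linkage (smallest distance between any two members) is used
    purely for illustration. Returns one (cluster_a, cluster_b, value)
    tuple per fusion.
    """
    clusters = [(i,) for i in range(len(dist))]  # each case starts alone
    fusions = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(dist[i][j] for i in pair[0] for j in pair[1]),
        )
        value = min(dist[i][j] for i in a for j in b)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)  # the combined cluster replaces both
        fusions.append((a, b, value))
    return fusions

# Hypothetical distances between N = 4 cases.
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
fusions = agglomerate(dist)
for a, b, value in fusions:
    print(a, "+", b, "at", value)
```

With four cases the loop runs three times, giving the N-1 fusions, here at the values 2, 4 and 5.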

This hierarchical clustering process can be represented as a tree, or dendrogram, where each step in the clustering process is illustrated by a join of the tree.

The horizontal scale of the tree corresponds to the fusion values obtained from the hierarchical cluster analysis.  For example, if cluster A and cluster B are combined at the fusion value x, their join is drawn at x on the horizontal axis; the axis reflects the fusion values of all the fusions, drawn to scale.
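Because every join sits at its fusion value, drawing a vertical line through the tree at some level shows which clusters exist at that level. The sketch below illustrates the idea; the function `clusters_at` and the `(case_a, case_b, fusion_value)` tuple layout are hypothetical, not a ClustanGraphics file format.

```python
def clusters_at(n_cases, fusions, level):
    """Return the clusters present when the tree is cut at `level`.

    `fusions` is a list of (case_a, case_b, fusion_value) tuples, where
    case_a and case_b are any members of the two clusters being joined.
    """
    parent = list(range(n_cases))  # every case starts as its own root

    def find(i):
        # Follow parent links up to the root of the cluster holding case i.
        while parent[i] != i:
            i = parent[i]
        return i

    for a, b, value in fusions:
        if value <= level:  # only joins at or below the cut are applied
            parent[find(b)] = find(a)

    groups = {}
    for case in range(n_cases):
        groups.setdefault(find(case), []).append(case)
    return sorted(groups.values())

# Hypothetical tree for four cases with fusion values 2, 4 and 5.
fusions = [(0, 1, 2.0), (2, 3, 4.0), (0, 2, 5.0)]
print(clusters_at(4, fusions, 3.0))
print(clusters_at(4, fusions, 4.5))
```

Cutting at 3.0 leaves cases 0 and 1 joined and the other two separate; cutting at 4.5 gives two clusters of two cases each.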

An example from our Mammals Case Study is shown below; it is featured throughout the ClustanGraphics Primer.  The underlined cases are cluster exemplars, that is, the most typical member of each cluster at the selected 5-cluster level.

[Figure: tree from the Mammals Case Study (MammalsTree.gif)]
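One simple way to pick a "most typical" member is to take the case with the smallest total distance to the other members of its cluster (a medoid). That definition is an assumption made for this sketch; the ClustanGraphics Primer should be consulted for how exemplars are actually chosen. The function name and the data are hypothetical.

```python
def exemplar(members, dist):
    """Return the member with the smallest summed distance to the others."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))

# Hypothetical distances between four cases and a two-cluster solution.
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
clusters = [[0, 1, 2], [3]]
exemplars = [exemplar(members, dist) for members in clusters]
print(exemplars)
```

In the first cluster, case 1 has the smallest summed distance (2 + 5 = 7) and is chosen; a single-member cluster is its own exemplar.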

ClustanGraphics provides 11 methods of hierarchical cluster analysis: 

    Single Linkage (or Minimum Method, Nearest Neighbor)
    Complete Linkage (or Maximum Method, Furthest Neighbor)
    Average Linkage (UPGMA)
    Weighted Average Linkage (WPGMA)
    Mean Proximity
    Centroid (UPGMC)
    Median (WPGMC)
    Increase in Sum of Squares (Ward's Method)
    Sum of Squares
    Flexible (β space distortion parameter)
    Density (or k-linkage, density-seeking mode analysis)
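Several of the methods listed above are commonly written as special cases of the Lance-Williams update, which gives the distance from another cluster k to the newly combined cluster (i, j) as alpha_i*d(k,i) + alpha_j*d(k,j) + beta*d(i,j) + gamma*|d(k,i) - d(k,j)|. The sketch below uses the standard textbook coefficients; it is not taken from ClustanGraphics, the function names are hypothetical, and Mean Proximity, Sum of Squares and Density are not covered.

```python
def lance_williams(method, n_i, n_j, n_k, beta=-0.25):
    """Return (alpha_i, alpha_j, beta, gamma) for the named method.

    n_i, n_j and n_k are the sizes of clusters i, j and k.
    """
    n_ij = n_i + n_j
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":    # UPGMA
        return n_i / n_ij, n_j / n_ij, 0.0, 0.0
    if method == "weighted":   # WPGMA
        return 0.5, 0.5, 0.0, 0.0
    if method == "centroid":   # UPGMC, for squared Euclidean distances
        return n_i / n_ij, n_j / n_ij, -n_i * n_j / n_ij ** 2, 0.0
    if method == "median":     # WPGMC, for squared Euclidean distances
        return 0.5, 0.5, -0.25, 0.0
    if method == "ward":       # for squared Euclidean distances
        n = n_ij + n_k
        return (n_i + n_k) / n, (n_j + n_k) / n, -n_k / n, 0.0
    if method == "flexible":   # beta < 1 controls the space distortion
        return (1 - beta) / 2, (1 - beta) / 2, beta, 0.0
    raise ValueError(f"unknown method: {method}")

def updated_distance(method, d_ki, d_kj, d_ij, n_i=1, n_j=1, n_k=1, beta=-0.25):
    """Distance from cluster k to the cluster formed by combining i and j."""
    a_i, a_j, b, g = lance_williams(method, n_i, n_j, n_k, beta)
    return a_i * d_ki + a_j * d_kj + b * d_ij + g * abs(d_ki - d_kj)

# With d(k,i) = 6, d(k,j) = 5 and d(i,j) = 2:
print(updated_distance("single", 6.0, 5.0, 2.0))    # the smaller of 6 and 5
print(updated_distance("complete", 6.0, 5.0, 2.0))  # the larger of 6 and 5
print(updated_distance("average", 6.0, 5.0, 2.0, n_i=3, n_j=1))
```

The single and complete linkage coefficients reduce to taking the minimum and maximum of the two old distances, which is why those methods are also called the Minimum and Maximum Methods.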

The methods are fully described in the ClustanGraphics Primer, which is supplied with ClustanGraphics.  To use these methods you need to compute a proximity matrix such as squared Euclidean distances, Pearson product-moment correlations or Jaccard similarity coefficients.  This can be done in ClustanGraphics, for up to about 10,000 cases.
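For readers unfamiliar with the three proximity measures just named, the following sketch computes each of them for a pair of cases and assembles a full matrix. The function names and sample data are hypothetical, and this is not how ClustanGraphics computes them internally.

```python
from math import sqrt

def squared_euclidean(x, y):
    """Sum of squared differences over the variables."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def pearson(x, y):
    """Product-moment correlation between two cases across their variables.

    Undefined (division by zero) if either case is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def jaccard(x, y):
    """Jaccard similarity for binary (presence/absence) variables."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Returning 0.0 when neither case has any presences is a convention
    # chosen for this sketch.
    return both / either if either else 0.0

def proximity_matrix(cases, measure):
    """Full symmetric matrix of `measure` between every pair of cases."""
    return [[measure(x, y) for y in cases] for x in cases]

cases = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
print(proximity_matrix(cases, squared_euclidean))
print(pearson([1, 2, 3], [2, 4, 6]))
print(jaccard([1, 1, 0, 0], [1, 0, 1, 0]))
```

Squared Euclidean distance is a dissimilarity (larger means further apart), whereas the Pearson and Jaccard coefficients are similarities (larger means more alike), so a clustering routine needs to know which kind of matrix it has been given.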

Alternatively, you can read a similarity or dissimilarity matrix computed or observed elsewhere; an example is given in our Proteins Case Study.

It is also possible to display a tree for a hierarchical cluster analysis produced by other packages.  In this case the cluster analysis results are read using File|Open in ClustanGraphics, which can accept and display trees of 120,000 cases or more.

You could also copy a tree from a book or article, measure the coefficient values for each fusion and paste the results into ClustanGraphics using the File|Open dialogue. There are Word and Excel examples distributed with ClustanGraphics which show how this can be done.

If you have a very large survey or corporate database to cluster, as in data mining applications, calculating a proximity matrix may not be practicable, because a full matrix holds N(N-1)/2 proximities for N cases (about 7.2 billion for 120,000 cases).  One option is to use direct data clustering in ClustanGraphics, which can produce a hierarchical cluster analysis for 120,000 cases or more using Ward's Method or Average Distance (UPGMA).

Alternatively, consider using k-means analysis on large datasets.  You can use ClustanGraphics to cluster a million records, then truncate to cluster centres, optimally order the resulting clusters, find cluster exemplars and summarize the relationships between them hierarchically.
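The two-stage idea, k-means first and then a hierarchy built on the resulting cluster centres, can be sketched as follows. Everything here is an assumption made for illustration: the function names, the seeding with the first k points, and the use of the centroid method for the second stage. It does not cover the optimal ordering or exemplar steps, and it is not the ClustanGraphics procedure.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centres = [list(p) for p in points[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # an emptied cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres, groups

def fuse_centres(centres, sizes):
    """Repeatedly combine the two closest centres (centroid method)."""
    items = [(list(c), n, str(i)) for i, (c, n) in enumerate(zip(centres, sizes))]
    fusions = []
    while len(items) > 1:
        i, j = min(
            ((a, b) for a in range(len(items)) for b in range(a + 1, len(items))),
            key=lambda ab: sq_dist(items[ab[0]][0], items[ab[1]][0]),
        )
        (ci, ni, li), (cj, nj, lj) = items[i], items[j]
        value = sq_dist(ci, cj)
        # The combined centre is the size-weighted mean of the two centres.
        merged = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        fusions.append((li, lj, value))
        items = [it for t, it in enumerate(items) if t not in (i, j)]
        items.append((merged, ni + nj, f"({li},{lj})"))
    return fusions

# Hypothetical records: three tight pairs along a line.
points = [(0, 0), (4, 0), (20, 0), (0, 1), (4, 1), (20, 1)]
centres, groups = k_means(points, k=3)
fusions = fuse_centres(centres, [len(g) for g in groups])
print(centres)
print(fusions)
```

The second stage works on k centres rather than on the original records, so its cost depends on the number of clusters kept and not on the size of the dataset.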

If you are involved in data mining or analyzing large social surveys, remember that our k-means analysis and hierarchical cluster analysis can handle different types of variables, such as those that occur in survey questionnaires and database records.  You are not likely to find similar flexibility in other clustering or neural network software.

If you're using decision trees for segmentation, you might like to have a look at our critique Clustering versus Decision Trees, where we show how the decision tree approach can produce very simplistic segmentation compared with hierarchical cluster analysis.