Astronomers will be familiar with the Hertzsprung-Russell diagram above, which shows stars plotted by their temperature and luminosity. They called the stars in the long, straggly main sequence "dwarfs" and the stars in the clump at the top right "giants". Sub-divisions to either side of these sequences are referred to as "super-giants", "sub-giants", "sub-dwarfs" and "white dwarfs". The H-R diagram provided the initial focus for theories of stellar evolution throughout the 20th century.
A k-means analysis of the H-R diagram clusters closely associated points together and consigns "outliers" to a residue set of unclassified cases. Optimizing the Euclidean sum of squares yields a "tessellation" of roughly spherically-shaped clusters covering the densest part of the scatter, which looks rather like this:
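The k-means step can be sketched as follows. This is a minimal illustrative implementation, not Clustan's actual algorithm, and the two-sequence scatter is invented data standing in for the H-R diagram:

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means minimising the Euclidean sum of squares.

    Illustrative sketch only -- the text does not show Clustan's
    exact procedure.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids on k randomly chosen points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Synthetic H-R-like scatter: an elongated "dwarf" sequence plus a
# "giant" clump in (temperature, luminosity) space -- invented data.
rng = np.random.default_rng(1)
t = rng.uniform(0, 1, 300)
dwarfs = np.column_stack([t, t + rng.normal(0, 0.05, 300)])
giants = np.column_stack([rng.normal(0.2, 0.05, 100),
                          rng.normal(1.5, 0.1, 100)])
points = np.vstack([dwarfs, giants])

centroids, labels = kmeans(points, k=8)
```

With many more clusters than natural groups (e.g. k around 40 for the full diagram), the compact clusters tile the dense sequences rather than trying to capture each sequence whole.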
The two principal sequences are thus mapped by about 40 dense, compact clusters. The clusters are not actually spheres but roughly spherical convex polyhedra - circles are used in the diagram to illustrate the cluster tessellation. Note that there are many points outside the tessellation - these are the points flagged as outliers by k-means. Specifying an outlier threshold is important so that stray points do not distort the shape of the main densely-populated clusters.
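The outlier threshold amounts to excluding any point that lies too far from every centroid. A minimal sketch, with an illustrative threshold value and plain Euclidean distance assumed:

```python
import numpy as np

def flag_outliers(points, centroids, threshold):
    """Mark points farther than `threshold` from every centroid as
    outliers, so they are held back from the cluster tessellation.

    The threshold value and distance measure here are assumptions
    for illustration, not Clustan's settings.
    """
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest = d.min(axis=1)          # distance to the closest centroid
    return nearest > threshold       # True = outlier

# Toy example: two tight clusters plus one stray point.
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
points = np.array([[0.1, 0.0], [5.1, 4.9], [10.0, 0.0]])
print(flag_outliers(points, centroids, threshold=1.0))
# → [False False  True]
```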
The tessellation now "describes" the main sequences quite accurately, and the two principal clusters ("giants" and "dwarfs") can be readily identified and separated using Single Linkage (see reference paper for full details).
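Single linkage merges groups whose closest members are nearest, so it readily chains the compact k-means clusters along each sequence. The following is an illustrative sketch of that idea applied to cluster centroids - the referenced paper describes the actual procedure:

```python
import numpy as np

def single_linkage_groups(centroids, n_groups):
    """Agglomerate cluster centroids by single linkage until only
    `n_groups` groups remain. Illustrative sketch, O(n^3) and fine
    for a few dozen centroids.
    """
    groups = [[i] for i in range(len(centroids))]
    while len(groups) > n_groups:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Single-linkage distance: closest pair across groups.
                d = min(np.linalg.norm(centroids[i] - centroids[j])
                        for i in groups[a] for j in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] += groups.pop(b)
    return groups

# Toy centroids: two separated chains, as the two H-R sequences
# might yield after k-means (invented coordinates).
chain1 = [(x, x) for x in range(5)]
chain2 = [(x, x + 10.0) for x in range(5)]
centroids = np.array(chain1 + chain2, dtype=float)
groups = single_linkage_groups(centroids, n_groups=2)
```

Because the within-chain gaps are smaller than the gap between chains, the two sequences are recovered as two groups even though each is elongated rather than spherical.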
We are now in a position to separate the two main cluster sequences and carry out further statistical analyses on them. For example, we could calculate linear regressions on each cluster subset separately, as illustrated above, to identify the sub-structure and the inter-correlations between the variables for the giant and dwarf star sequences.
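Fitting a separate regression to each sequence can be sketched like this, on invented data where the "dwarf" subset has luminosity strongly correlated with temperature and the "giant" subset is roughly flat:

```python
import numpy as np

def fit_line(temp, lum):
    """Least-squares line for luminosity regressed on temperature."""
    slope, intercept = np.polyfit(temp, lum, deg=1)
    return slope, intercept

rng = np.random.default_rng(0)

# "Dwarf" subset: luminosity rises with temperature (invented slope 2).
t_dwarf = rng.uniform(0, 1, 200)
l_dwarf = 2.0 * t_dwarf + rng.normal(0, 0.05, 200)

# "Giant" subset: roughly constant luminosity (invented level 3).
t_giant = rng.uniform(0, 0.3, 100)
l_giant = 3.0 + rng.normal(0, 0.05, 100)

slope_d, _ = fit_line(t_dwarf, l_dwarf)
slope_g, _ = fit_line(t_giant, l_giant)
```

Fitting each cluster subset separately exposes the within-sequence correlation that a single regression over the pooled scatter would blur together.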
By comparison, a decision tree obtained by splitting first on Luminosity (cut value A) and second by Temperature (cut value B), or vice-versa, achieves little more than crudely splitting the scatter into 4 quarter segments. The decision tree looks something like this:
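The two-cut tree reduces to a pair of nested axis-aligned tests. A minimal sketch, with placeholder cut values and segment labels of my own choosing:

```python
def tree_segment(temperature, luminosity, cut_a, cut_b):
    """Classify a star into one of the 4 quarter segments produced
    by two orthogonal cuts: first on luminosity (cut value A), then
    on temperature (cut value B). Cut values are placeholders.
    """
    if luminosity >= cut_a:
        return "high-L, hot" if temperature >= cut_b else "high-L, cool"
    else:
        return "low-L, hot" if temperature >= cut_b else "low-L, cool"

print(tree_segment(temperature=0.8, luminosity=0.2, cut_a=0.5, cut_b=0.5))
# → low-L, hot
```

Every decision boundary is parallel to a co-ordinate axis, which is exactly why the tree cannot trace a diagonal sequence like the dwarf stars.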
It's true that the decision tree provides a simple key to identifying 4 groups of stars according to 4 quarter segments of the H-R diagram. But in terms of stellar evolution this is crudely simplistic. It certainly doesn't find the main dwarf star sequence as a single segment. The cut values are, by definition, orthogonal to the co-ordinate axes. Any shape that the clusters may have - such as arises where variables are correlated, as is evident in the dwarf star sequence - can thus go undetected by a decision tree.
The cluster analysis approach is multivariate, by definition, whereas the decision tree is univariate at each split. In other words, clusters are formed by cluster analysis in terms of associations between all the active variables, not by splitting at each node on a single variable as in a decision tree.
It's little wonder that data miners resort to "boosting", "bagging" and cross-validation of different samples to try to find a "consensus" decision tree. If the initial split at the first node cuts through a natural cluster (e.g. at luminosity = A), there's no hope of recovering the shape of the split cluster by a "top down" approach.
The reason data miners haven't used hierarchical cluster analysis to any substantive extent is that they don't have an efficient method that can handle thousands of cases. But with Clustan, it's available now. To find out how, see clustering large datasets. Another factor is the ability to handle mixed data types - most clustering methods don't offer this flexibility. But at Clustan, we recognize the need to deal with real-world problems. Why should you have to force categorical data to behave like continuous variables, or categorize continuous variables to fit a χ² (chi-squared) analytical framework? We aim to represent your data exactly as you present it to the program, and compile clusters without warping your data to fit the method. There's a good case study in banking.
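One common way to handle mixed variable types is a Gower-style dissimilarity: simple matching on categorical variables, scaled absolute difference on continuous ones. The sketch below is that generic idea, not Clustan's actual coefficient, and the records are invented:

```python
def mixed_distance(x, y, is_categorical):
    """Gower-style dissimilarity over mixed variable types: simple
    matching (0/1) for categorical variables, absolute difference
    for continuous ones assumed pre-scaled to [0, 1].

    A generic textbook sketch -- not Clustan's coefficient.
    """
    total = 0.0
    for xi, yi, cat in zip(x, y, is_categorical):
        if cat:
            total += 0.0 if xi == yi else 1.0
        else:
            total += abs(xi - yi)
    return total / len(x)

# Two records: one categorical variable and two continuous ones.
a = ("giant", 0.2, 0.9)
b = ("dwarf", 0.2, 0.4)
print(mixed_distance(a, b, is_categorical=(True, False, False)))
# → 0.5
```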
How we did it
This paper was first presented at Interface '98 and has been published in: Wishart, D (1998), Efficient hierarchical cluster analysis for data mining and knowledge discovery, Computing Science and Statistics, 30, 257-264. The paper is available for download as a zip file (65k).