Clustering Versus Decision Trees 

We are often asked why we favour the "bottom up" hierarchical clustering approach over "top down" splitting as used in decision tree segmentation.  We therefore offer this shortened critique of decision trees, to stimulate discussion.  For the full published article, see the reference at the end.

Astronomers will be familiar with the Hertzsprung-Russell diagram above, which shows visible stars plotted by their temperature and luminosity.  Astronomers called the stars in the main straggly sequence "dwarfs" and the stars in the clump at the top right "giants".  Sub-divisions to either side of these sequences are referred to as "super-giants", "sub-giants", "sub-dwarfs" and "white dwarfs".  The H-R diagram provided the initial focus for theories of stellar evolution throughout the 20th century.

A k-means analysis of the H-R diagram clusters closely associated points together and consigns "outliers" to a residue set of unclassified cases.  When optimizing the Euclidean Sum of Squares, we obtain a "tessellation" of roughly spherically-shaped clusters covering the densest part of the scatter, which looks rather like this:

The two principal sequences are thus mapped by about 40 dense, compact clusters.  They are not actually spherical but are convex polyhedra that are roughly spherical in shape - we used circles in the diagram to illustrate the cluster tessellation.  Note that there are many points outside the tessellation - these are the outliers identified by k-means, an important reason for specifying an outlier threshold so as not to distort the shape of the main densely-populated clusters.
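
For readers who wish to experiment, the idea can be sketched in a few lines of Python with scikit-learn.  This is not Clustan's routine: the synthetic (temperature, luminosity) points, the cluster count and the 95th-percentile outlier threshold are all illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Crude stand-ins for the two sequences: a long correlated band (dwarfs)
    # and a compact clump (giants).
    t = rng.uniform(0.0, 10.0, 600)
    dwarfs = np.column_stack([t, t + rng.normal(0.0, 0.5, 600)])
    giants = rng.normal(loc=[2.0, 9.0], scale=0.7, size=(200, 2))
    X = np.vstack([dwarfs, giants])

    km = KMeans(n_clusters=40, n_init=10, random_state=0).fit(X)

    # Distance of each point to its assigned cluster centre.
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

    # Points beyond the threshold are set aside as outliers, then k-means is
    # refitted so stragglers do not distort the dense clusters.
    core = dist <= np.percentile(dist, 95)
    km = KMeans(n_clusters=40, n_init=10, random_state=0).fit(X[core])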

The tessellation now "describes" the main sequences quite accurately, and the two principal clusters ("giants" and "dwarfs") can be readily identified and separated using Single Linkage (see reference paper for full details).

We are now in a position to separate the two main cluster sequences and carry out further statistical analyses on them.  For example, we could calculate linear regressions on each cluster subset separately, as illustrated above, to identify the sub-structure and the inter-correlations between the variables for the giant and dwarf star sequences.
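
As an illustration only, the per-subset regressions might be sketched as follows; the array names dwarfs and giants are assumptions standing for the two separated subsets, with temperature in column 0 and luminosity in column 1.

    from sklearn.linear_model import LinearRegression

    # `dwarfs` and `giants` are assumed to hold the two separated subsets.
    for name, pts in (("dwarfs", dwarfs), ("giants", giants)):
        temperature = pts[:, [0]]      # predictor, as a column vector
        luminosity = pts[:, 1]         # response
        reg = LinearRegression().fit(temperature, luminosity)
        print(f"{name}: slope = {reg.coef_[0]:.3f}, intercept = {reg.intercept_:.3f}")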

By comparison, a decision tree obtained by splitting first on Luminosity (cut value A) and second by Temperature (cut value B), or vice-versa, achieves little more than crudely splitting the scatter into 4 quarter segments.  The decision tree looks something like this:

It's true that the decision tree provides a simple key for identifying 4 groups of stars, one per quarter segment of the H-R diagram.  But in terms of stellar evolution this is crudely simplistic.  It certainly doesn't find the main dwarf star sequence as a single segment.  The cuts are, by definition, orthogonal to the co-ordinate axes.  Any shape that the clusters may have - such as arises where variables are correlated, as is evident in the dwarf star sequence - can thus go undetected by a decision tree.
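
To make the geometry concrete, here is a small illustrative sketch (not taken from the published paper) of such a depth-2 tree written as a plain Python function; the cut values A and B are arbitrary assumptions.

    # Two axis-parallel cuts: luminosity at A, then temperature at B.
    A = 10.0        # luminosity cut value (illustrative)
    B = 5000.0      # temperature cut value (illustrative)

    def segment(temperature, luminosity):
        """Assign a star to one of the four quarter segments made by the two cuts."""
        if luminosity >= A:
            return "bright & hot" if temperature >= B else "bright & cool"
        return "faint & hot" if temperature >= B else "faint & cool"

    # The dwarf sequence runs diagonally, so its hot, luminous end and its cool,
    # faint end fall into different segments even though they form one cluster.
    print(segment(9000.0, 50.0))    # upper end of the dwarf sequence -> "bright & hot"
    print(segment(4000.0, 0.05))    # lower end of the dwarf sequence -> "faint & cool"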

The cluster analysis approach is multivariate, by definition, whereas the decision tree is univariate at each split.  In other words, clusters are formed by cluster analysis in terms of associations between all the active variables, not by splitting at each node on a single variable as in a decision tree.

It's little wonder that data miners resort to "boosting", "bagging" and cross-validation of different samples to try to find a "consensus" decision tree.  If the initial split at the first node cuts through a natural cluster (e.g. at luminosity = A), there's no hope of recovering the shape of the split cluster by a "top down" approach.

The reason data miners haven't used hierarchical cluster analysis to any substantive extent is that they don't have an efficient method that can handle thousands of cases.  But with Clustan, it's available now.  To find out how, see clustering large datasets.  Another factor is the ability to handle mixed data types - most clustering methods don't offer this flexibility.  But at Clustan, we recognize the need to deal with real-world problems.  Why should you have to force categorical data to behave like continuous variables, or categorize continuous variables to fit a χ² (chi-squared) analytical framework?  We aim to represent your data exactly as you present it to the program, and compile clusters without warping your data to fit the method.  There's a good case study in banking.

How we did it
The data for the H-R diagram were read into ClustanGraphics and clustered using k-means analysis with outlier deletion.  The cluster means for the resulting 80 clusters were saved, and a distance matrix was computed on them.  Single Linkage was then used to form a tree which separates the long straggly dwarf star sequence from the giant star sequence.  The tree obtained with ClustanGraphics at this final stage is shown at the right.  Regression analysis on the two final clusters could then be carried out separately, using a suitable regression package.
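
For those without ClustanGraphics, the workflow can be approximated roughly with scikit-learn and SciPy.  The sketch below continues from the k-means example earlier on this page (the array X and the outlier mask core are assumed to be defined there) and is an approximation, not Clustan's algorithm.

    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist
    from sklearn.cluster import KMeans

    # Step 1: k-means with outlier deletion, keeping the cluster means
    # (80 clusters, as described above).
    km = KMeans(n_clusters=80, n_init=10, random_state=0).fit(X[core])
    means = km.cluster_centers_

    # Step 2: distance matrix computed on the cluster means.
    d = pdist(means)

    # Step 3: Single Linkage tree on the cluster means, cut into two groups to
    # separate the long straggly dwarf sequence from the giant clump.
    tree = linkage(d, method="single")
    group = fcluster(tree, t=2, criterion="maxclust")   # one label per cluster mean

    # Step 4: map each retained point back to its group via its k-means cluster;
    # the two subsets can then be passed to a regression package separately.
    point_group = group[km.labels_]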

This paper was first presented at Interface '98 and has been published in: Wishart, D. (1998), Efficient hierarchical cluster analysis for data mining and knowledge discovery, Computing Science and Statistics, 30, 257-264.  The full paper is available for download as a zip file (65k).