k-Means Analysis in Data Mining 

We were recently asked whether ClustanGraphics could handle a data mining application involving 10 million cases, each with 12 variables. Storing a data matrix this large would require about 480MB of memory, so it is technically feasible on a PC with at least 512MB.
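
By way of illustration, here is the back-of-envelope arithmetic behind that figure as a small Python sketch. The 4 bytes per value (single-precision storage) is an assumption that happens to reproduce the 480MB quoted above; it is not a statement about ClustanGraphics' internal storage format.

```python
# Back-of-envelope check of the memory needed to hold the full data matrix in RAM.
# Assumes 4 bytes per value (single precision), which matches the 480MB quoted above.
cases = 10_000_000
variables = 12
bytes_per_value = 4

total_bytes = cases * variables * bytes_per_value
print(f"{total_bytes / 1e6:.0f} MB")   # -> 480 MB
```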

However, we didn't recommend this "sledgehammer" approach; instead we proposed the following alternative. First, select, say, 10% of the cases to produce a sample data matrix of a million cases by 12 variables. Even this is too large for most comparable software, yet ClustanGraphics reads the data in under a minute.
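
For readers who want to reproduce the sampling step outside ClustanGraphics, a minimal Python sketch follows. The file name "cases.dat", the flat float32 layout and the random seed are purely illustrative assumptions.

```python
import numpy as np

# Draw a 10% simple random sample of the cases without loading the whole file into RAM.
# The file name, the binary float32 layout and the seed are illustrative assumptions.
rng = np.random.default_rng(seed=42)

data = np.memmap("cases.dat", dtype=np.float32, mode="r").reshape(-1, 12)
n_cases = data.shape[0]

sample_idx = np.sort(rng.choice(n_cases, size=n_cases // 10, replace=False))
sample = np.asarray(data[sample_idx])          # ~1,000,000 x 12 sample matrix
```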

Next, we used k-means analysis to assign the million cases to 25 clusters. The analysis converged in 252 iterations, having made over 3 million case moves to find a stable minimum for the Euclidean Sum of Squares (ESS). Recall that our exact relocation test for k-means analysis only reassigns a case if the move strictly reduces the Euclidean Sum of Squares. Since the Euclidean Sum of Squares cannot be reduced indefinitely, convergence is fast and assured, unlike other software packages that use the simpler test on Euclidean distances, which can cause solutions to oscillate.
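
To make the exact relocation test concrete, here is a minimal Python sketch of the idea. The change-in-ESS formula for moving a single case between clusters is standard, but everything else, from the function name to the initialisation, is our illustrative assumption and not ClustanGraphics' actual implementation.

```python
import numpy as np

def kmeans_exact_relocation(X, k, max_iter=300, seed=0):
    """Sketch of k-means with an exact relocation test: a case is moved only if
    the move strictly reduces the Euclidean Sum of Squares (ESS)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.permutation(n) % k                      # start from a random partition
    means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    sizes = np.bincount(labels, minlength=k).astype(float)

    for _ in range(max_iter):
        moves = 0
        for i in range(n):
            a = labels[i]
            if sizes[a] <= 1:
                continue                                  # never empty a cluster
            d2 = ((X[i] - means) ** 2).sum(axis=1)
            # Exact change in ESS if case i is moved from cluster a to cluster b:
            #   delta[b] = n_b/(n_b+1) * ||x - m_b||^2  -  n_a/(n_a-1) * ||x - m_a||^2
            delta = sizes / (sizes + 1) * d2 - sizes[a] / (sizes[a] - 1) * d2[a]
            delta[a] = 0.0                                # staying put changes nothing
            b = int(delta.argmin())
            if delta[b] < 0:                              # move only if the ESS strictly falls
                means[a] = (means[a] * sizes[a] - X[i]) / (sizes[a] - 1)
                means[b] = (means[b] * sizes[b] + X[i]) / (sizes[b] + 1)
                sizes[a] -= 1.0
                sizes[b] += 1.0
                labels[i] = b
                moves += 1
        if moves == 0:        # the ESS cannot fall forever, so this terminates
            break
    return labels, means
```

Calling `kmeans_exact_relocation(sample, k=25)` on the sample matrix from the previous sketch would return the 25 cluster labels and their means.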

This was, admittedly, quite a long job, taking 60 minutes on a Pentium II PC with 128MB of memory - one to run in the background or over lunch. Once found, the resulting 25-cluster model was saved for further analysis (see below).

[Figure: kMeansMillion.jpg]

Next we truncated the data matrix of a million cases to the 25-cluster model and computed exemplars for the clusters. Then we completed a hierarchical cluster analysis from 25 clusters down to 1, using Increase in Sum of Squares (Ward's Method). We could have used any of 11 clustering methods at this point, but we chose Ward's Method to be consistent with our k-means analysis, which also minimizes the Euclidean Sum of Squares. Details of how this can be done in one click are given here.
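
The same step can be approximated outside ClustanGraphics, for example with SciPy. The sketch below agglomerates the 25 cluster means with Ward's method; note that it treats each mean as a single unweighted point, whereas ClustanGraphics clusters the truncated (size-weighted) model, so the results will not be identical. The variables `sample` and `labels` come from the earlier sketches.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Recompute the 25 cluster means and fuse them hierarchically with Ward's method.
cluster_means = np.array([sample[labels == j].mean(axis=0) for j in range(25)])

Z = linkage(cluster_means, method="ward")                 # 25 clusters fused down to 1
five_clusters = fcluster(Z, t=5, criterion="maxclust")    # e.g. cut the tree at 5 clusters
```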

In the k-means tree, the identifier for each cluster is the case ID of its exemplar, with the cluster's size in parentheses. These cluster identifiers can be easily changed to more descriptive labels. 

We have serially ordered the tree, so that the clusters are arranged meaningfully from top to bottom. Our "upper tail" significance test for the best number of clusters indicated the 5-cluster level, which is shaded yellow. Working with this diagram interactively, we clicked on cluster 5 to obtain some diagnostic information about it, e.g. its size (325,319 cases) and its exemplar (case 39500). We could also select other options at this point, e.g. cluster membership and profiles.
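
As a very rough stand-in for an upper-tail rule (in the spirit of Mojena's criterion, and not Clustan's exact test), one can flag fusions whose heights lie well above the mean fusion height; the 1.25 multiplier below is an arbitrary assumption.

```python
import numpy as np

# Flag the first fusion whose height jumps well above the mean fusion height.
heights = Z[:, 2]                                   # fusion values from the Ward tree above
flagged = np.where(heights > heights.mean() + 1.25 * heights.std())[0]
if flagged.size:
    suggested_k = len(heights) + 1 - flagged[0]     # clusters remaining before that fusion
    print("suggested number of clusters:", suggested_k)
```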

[Figure: kMeansMillionTree.gif]

The 25-cluster model can now be saved as a ClustanGraphics file so that it can be used again. For example, we could now cluster by single linkage to test whether there are clearly separated, elongated clusters in the model. For a discussion of this approach, see Clustering versus Decision Trees.
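
A hedged sketch of that check, reusing the 25 cluster means from the earlier sketch: a few long, late fusions in a single-linkage tree would point to clearly separated (possibly elongated) groups, while early chaining would suggest the opposite.

```python
from scipy.cluster.hierarchy import linkage

# Re-cluster the same 25 cluster means with single linkage and inspect the
# largest fusion distances for evidence of well-separated groups.
Z_single = linkage(cluster_means, method="single")
print(Z_single[-5:, 2])    # the five largest fusion distances
```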

We can use any cluster level in our truncated model, from 25 clusters down to 2, as a reference for classifying new cases. In the following example we used the 4-cluster summary partition as the reference model for Classify Cases.

[Figure: kMeansMillionClassify.jpg]

Having tested our cluster model on some sample cases to check it was working (as above), we next ran it against all our data from file. This puts the cluster of best fit and the associated distance for each case onto an output file for scrutiny or further analysis. Running our 4-cluster summary model against the database took about 4 minutes to classify a million cases, or 40 minutes for the full 10 million.
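
The essence of this classification step is nearest-centroid assignment. Below is a minimal Python sketch of it; the file names, the 4 x 12 reference-means array and the chunked CSV output are all illustrative assumptions, not Clustan's file formats.

```python
import numpy as np

# Assign every case in the full database to its nearest reference-cluster mean
# and write the best-fitting cluster and its distance to an output file.
reference_means = np.load("reference_means.npy")            # assumed 4 x 12 summary model
data = np.memmap("all_cases.dat", dtype=np.float32, mode="r").reshape(-1, 12)

with open("classified.csv", "w") as out:
    out.write("case,cluster,distance\n")
    chunk = 100_000
    for start in range(0, data.shape[0], chunk):
        X = np.asarray(data[start:start + chunk])
        d = np.linalg.norm(X[:, None, :] - reference_means[None, :, :], axis=2)
        best = d.argmin(axis=1)
        best_dist = d[np.arange(len(best)), best]
        for i, (b, dist) in enumerate(zip(best, best_dist)):
            out.write(f"{start + i},{b},{dist:.4f}\n")
```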

[Figure: kMeansMillionClassifyRun.jpg]

We could now look at the individual case assignments, though checking 10 million cases would indeed be a mammoth task in itself.  More productively, we could check the k-means cluster statistics which include means for each of the 12 variables, t-tests which identify the extreme values, distances between the cluster means, and so on.  An examination of outliers might be valuable, depending on the application, as rare and distinctly different cases, hitherto concealed by the bulk of the conforming data, are often interesting.
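
A sketch of the kind of diagnostics meant here: per-cluster variable means, a simple t-like statistic comparing each cluster mean with the overall mean, and pairwise distances between cluster means. It reuses `sample` and `labels` from the earlier sketches and is not necessarily the exact t-test ClustanGraphics reports.

```python
import numpy as np

# Per-cluster means, a standardised difference from the overall mean, and
# Euclidean distances between every pair of cluster means.
k = labels.max() + 1
overall_mean = sample.mean(axis=0)
overall_sd = sample.std(axis=0, ddof=1)

cluster_means = np.array([sample[labels == j].mean(axis=0) for j in range(k)])
cluster_sizes = np.bincount(labels, minlength=k)

for j in range(k):
    t_stat = (cluster_means[j] - overall_mean) / (overall_sd / np.sqrt(cluster_sizes[j]))
    print(f"cluster {j}: size={cluster_sizes[j]}, most extreme variable={np.abs(t_stat).argmax()}")

between = np.linalg.norm(cluster_means[:, None, :] - cluster_means[None, :, :], axis=2)
```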

For further details of our unique approach to using cluster analysis in data mining, please download an abstract of a technical paper which we presented in 1999 at the International Statistical Institute, Helsinki (section on Statistical Aspects of Data Mining and Knowledge Discovery in Databases).