k-Means Analysis 

A unique ClustanGraphics feature is its ability to complete hierarchical cluster analysis on very large datasets by clustering directly on your data.  In the following example, we clustered a large sample of 16,000 cases by Ward's Method in under 3 minutes, using a basic PC.  For an example of clustering a million cases from a random start, see k-means analysis in data mining.

However, a tree this large may only be partially optimal. Using k-means analysis in ClustanGraphics you can further optimize a cluster model and thus clarify your classification profile.  In this example, we ran k-means using the 40-cluster tree partition from Cluster Data as initial cluster centres.

We opted to tighten the 40 clusters by removing outliers.  These are cases remote from the clusters - in this example, cases at distances of more than 3.00 from any cluster centre.  Of the 16,000 cases in the original sample 1,203 outliers were removed, thus leaving 14,797 cases grouped into 40 tight clusters each having a radius (in 5 dimensions) of at most 3.
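The outlier rule described above can be sketched in a few lines of Python. This is only an illustration, not the ClustanGraphics implementation: the function name `split_outliers` and the toy data are hypothetical, and the sketch assumes plain Euclidean distance and a single pass over fixed centres, whereas the program may remove outliers while the centres are still being updated.

```python
import math

def split_outliers(cases, centres, threshold=3.0):
    """Split cases into (kept, outliers).

    A case counts as an outlier when its distance to the NEAREST
    centre exceeds `threshold` (3.00 in the example in the text).
    """
    kept, outliers = [], []
    for case in cases:
        nearest = min(math.dist(case, centre) for centre in centres)
        if nearest > threshold:
            outliers.append(case)
        else:
            kept.append(case)
    return kept, outliers

# Toy data in 2 dimensions (the example in the text uses 5).
centres = [(0.0, 0.0), (10.0, 10.0)]
cases = [(1.0, 1.0), (9.0, 10.0), (5.0, 5.0), (0.0, 3.0)]
kept, outliers = split_outliers(cases, centres)
print(len(kept), "kept;", len(outliers), "removed as outliers")
```

A case lying exactly at the threshold distance is kept here; whether the program treats the boundary the same way is an assumption.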

ClustanGraphics k-means converged in 30 iterations, taking just 8 seconds on a Pentium II PC.  Cluster exemplars were computed - for each of the 40 clusters, the case nearest to its cluster centre.  They can be extracted, if you wish, as specimens which exemplify the full diversity of your data.  Just select exemplars as your final cluster centres before truncation.
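As a rough illustration of what an exemplar is, the sketch below picks, for each cluster, the member case closest to the cluster centre. The function name `find_exemplars` and the data layout (a list of cases, a parallel list of cluster labels, and a list of centres) are assumptions made for this example only.

```python
import math

def find_exemplars(cases, labels, centres):
    """Return {cluster: index of the member case nearest its centre}."""
    best = {}  # cluster -> (distance, case index)
    for index, (case, cluster) in enumerate(zip(cases, labels)):
        distance = math.dist(case, centres[cluster])
        if cluster not in best or distance < best[cluster][0]:
            best[cluster] = (distance, index)
    return {cluster: index for cluster, (_, index) in best.items()}

cases = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (5.0, 5.0)]
labels = [0, 0, 1, 1]
centres = [(0.9, 0.0), (4.1, 4.1)]
print(find_exemplars(cases, labels, centres))
```

Because an exemplar is a real case rather than a computed mean, it can be inspected or quoted directly when describing a cluster.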

Click k-Means Statistics, and the following choices are offered.  They allow you to check the size, location (in 5 dimensions) and exemplar of each cluster, its contribution to the Euclidean Sum of Squares and how it compares with the other clusters.

With t-tests on variables, you can easily explore the contributions of your variables to each cluster profile.  These illustrate large positive or negative departures from the sample mean on key discriminating variables.
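The exact statistic reported by ClustanGraphics is not stated here, so the following is only one plausible reading: a t-value per variable that measures how far a cluster mean departs from the overall sample mean, in units of the cluster's own standard error. The function name `t_profile` is hypothetical, and the sketch assumes each cluster has at least two cases and non-zero spread on every variable.

```python
import math
import statistics

def t_profile(cases, labels, cluster):
    """Per-variable t-values for one cluster against the sample mean.

    Assumed form: t = (cluster mean - sample mean) / (s / sqrt(n)),
    where s and n are the cluster's standard deviation and size.
    """
    members = [case for case, label in zip(cases, labels) if label == cluster]
    n = len(members)
    t_values = []
    for j in range(len(cases[0])):
        sample_mean = statistics.fmean(case[j] for case in cases)
        column = [member[j] for member in members]
        standard_error = statistics.stdev(column) / math.sqrt(n)
        t_values.append((statistics.fmean(column) - sample_mean) / standard_error)
    return t_values

cases = [(1.0, 5.0), (3.0, 7.0), (9.0, 5.0), (11.0, 7.0)]
labels = [0, 0, 1, 1]
print(t_profile(cases, labels, cluster=0))
```

In this toy data the first variable separates the two clusters and gets a large negative t-value for cluster 0, while the second variable does not discriminate and gets a value of zero.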

You can also examine the classification of each case individually - the cluster to which it has been assigned, the distance from its centre, or its distances from all of the clusters.  This may be useful, for example, to sort all the cases according to their nearest cluster distances, or to investigate outliers - sometimes the rare nuggets are more interesting than the bulk data obscuring them (cf. mining fraudulent insurance claims).
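A per-case listing of this kind might be produced as follows. The sketch is illustrative only: `case_table` is a hypothetical name, and plain Euclidean distance to each centre is assumed.

```python
import math

def case_table(cases, centres):
    """Return (case index, nearest cluster, distance) rows,
    sorted so the cases farthest from their cluster come first."""
    rows = []
    for index, case in enumerate(cases):
        distances = [math.dist(case, centre) for centre in centres]
        nearest = min(range(len(centres)), key=distances.__getitem__)
        rows.append((index, nearest, distances[nearest]))
    return sorted(rows, key=lambda row: row[2], reverse=True)

centres = [(0.0, 0.0), (10.0, 0.0)]
cases = [(1.0, 0.0), (6.0, 0.0), (10.0, 0.0)]
for index, cluster, distance in case_table(cases, centres):
    print(f"case {index}: cluster {cluster}, distance {distance:.2f}")
```

Sorting in descending order puts candidate outliers at the top of the list, which is convenient when the unusual cases are the ones of interest.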

All k-means results can be displayed, output to a file, copied to a document for publication or copied to a spreadsheet for further analysis.

Next we truncated our model to the 40 final clusters found by k-means.  This had the beneficial effect of substantially reducing the size of the model - from 16,000 cases to 40 clusters - making it much more amenable to publication (see below).


A proximity matrix of Euclidean Sum of Squares was computed between clusters, and a summary tree obtained using Increase in Sum of Squares (Ward's Method), as shown below.  This allows any cluster level from 2 to 40 to be studied in detail, using the full panoply of ClustanGraphics features applicable to small datasets.  In the example, the 5-cluster level has been highlighted and the clusters are labelled by their exemplars and sizes [in brackets].  You can, of course, substitute more descriptive labels when you profile your classification or publish it.
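The formula used for the between-cluster proximities is not given here. A common choice, and the one assumed in this sketch, is the standard Ward criterion: the increase in the Euclidean Sum of Squares that would result from fusing two clusters, which depends only on their sizes and centres. The function names are hypothetical.

```python
def ward_increase(centre_a, size_a, centre_b, size_b):
    """Increase in Euclidean Sum of Squares if clusters a and b are fused:
    (n_a * n_b) / (n_a + n_b) * squared distance between the centres."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(centre_a, centre_b))
    return size_a * size_b / (size_a + size_b) * squared_distance

def ward_matrix(centres, sizes):
    """Symmetric matrix of fusion costs between every pair of clusters."""
    k = len(centres)
    matrix = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            cost = ward_increase(centres[i], sizes[i], centres[j], sizes[j])
            matrix[i][j] = matrix[j][i] = cost
    return matrix

centres = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
sizes = [2, 2, 6]
for row in ward_matrix(centres, sizes):
    print(row)
```

Because only 40 centres and sizes are needed, a matrix like this stays small however many cases the original sample contained.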

Our k-means procedure can also be used on very large datasets, without first calculating a tree.  For an example, see k-means analysis in data mining where we found a 25-cluster model for 1 million cases by 12 variables. A dataset of this size is not amenable to hierarchical cluster analysis, even using our super-fast direct data clustering method.

With very large datasets, it's necessary to specify the initial cluster centres in other ways.  ClustanGraphics lets you choose from k initial cases; a random assignment of all the cases to k initial clusters; reading k cluster centres from a file; or pasting cluster centres from another application, such as a spreadsheet.
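Two of these starting options can be sketched as below. The function names are hypothetical, and the details are assumptions: the first picks k cases at random (the program may let you choose them yourself), and the second deals the shuffled cases round-robin into k groups so that no initial cluster is empty.

```python
import random

def centres_from_cases(cases, k, seed=0):
    """Use k distinct, randomly chosen cases as the initial centres."""
    rng = random.Random(seed)
    return [list(case) for case in rng.sample(cases, k)]

def centres_from_random_partition(cases, k, seed=0):
    """Assign every case to one of k clusters at random, then use the
    cluster means as initial centres (requires len(cases) >= k)."""
    rng = random.Random(seed)
    order = list(range(len(cases)))
    rng.shuffle(order)
    dim = len(cases[0])
    sums = [[0.0] * dim for _ in range(k)]
    sizes = [0] * k
    for position, index in enumerate(order):
        cluster = position % k  # round-robin keeps every cluster non-empty
        sizes[cluster] += 1
        for d in range(dim):
            sums[cluster][d] += cases[index][d]
    return [[total / sizes[j] for total in sums[j]] for j in range(k)]

cases = [(0.0, 0.0), (2.0, 0.0), (4.0, 2.0), (6.0, 2.0), (8.0, 4.0), (10.0, 4.0)]
print(centres_from_cases(cases, 3))
print(centres_from_random_partition(cases, 3))
```

Centres read from a file or pasted from a spreadsheet would simply replace the return value of either function.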

Note that, unlike some k-means programs, our algorithm is certain to converge.  For details, see the technical explanation of why we implemented an exact relocation test for the Euclidean Sum of Squares.  This can be very important when dealing with very large datasets.
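To show why an exact relocation test guarantees convergence, here is a minimal sketch of the standard test for the Euclidean Sum of Squares. It is not the ClustanGraphics code. A case is moved from cluster p to cluster q only if n_q/(n_q+1) times its squared distance to q's centre is less than n_p/(n_p-1) times its squared distance to p's centre. Every accepted move then strictly reduces the sum of squares, and since there are finitely many partitions the procedure must stop. The function names, the small tolerance, and the rule that a cluster's last case is never moved are assumptions of this sketch.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def relocate_exact(cases, labels, k, max_passes=100):
    """k-means by exact relocation; assumes every cluster starts non-empty."""
    labels = list(labels)
    dim = len(cases[0])
    sizes = [labels.count(j) for j in range(k)]
    sums = [[0.0] * dim for _ in range(k)]
    for case, j in zip(cases, labels):
        for d in range(dim):
            sums[j][d] += case[d]

    def centre(j):
        return [total / sizes[j] for total in sums[j]]

    for _ in range(max_passes):
        moved = False
        for i, case in enumerate(cases):
            p = labels[i]
            if sizes[p] <= 1:
                continue  # never empty a cluster
            # Reduction in sum of squares from removing the case from p.
            best_q = p
            best = sizes[p] / (sizes[p] - 1) * sq_dist(case, centre(p))
            for q in range(k):
                if q == p:
                    continue
                # Increase in sum of squares from adding the case to q.
                cost = sizes[q] / (sizes[q] + 1) * sq_dist(case, centre(q))
                if cost < best - 1e-12:
                    best_q, best = q, cost
            if best_q != p:
                # Update both clusters immediately, before the next case.
                for d in range(dim):
                    sums[p][d] -= case[d]
                    sums[best_q][d] += case[d]
                sizes[p] -= 1
                sizes[best_q] += 1
                labels[i] = best_q
                moved = True
        if not moved:
            break
    return labels, [centre(j) for j in range(k)]

cases = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
labels, centres = relocate_exact(cases, [0, 0, 1, 1, 1], k=2)
print(labels, centres)
```

The size factors are what make the test exact: they account for the shift in both cluster centres caused by the move, which a simple nearest-centre rule ignores.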

Of course, you can also apply your k-means model to classifying new data.  Just save it, and run Classify.  This allows an unlimited number of new cases to be compared with a cluster model and assigned to their nearest clusters, at any level in the final tree.
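Conceptually, classifying new data means applying the saved transformation to each new case and assigning it to the nearest cluster centre. The sketch below assumes a z-score transformation and a model held as a plain dictionary with hypothetical keys `means`, `sds` and `centres`; the actual saved model format is not described here.

```python
import math

def classify(new_cases, model):
    """Assign each new case to its nearest cluster centre.

    `model` is a hypothetical dict holding the per-variable means and
    standard deviations used to standardise the original data, plus the
    cluster centres expressed in that standardised space.
    """
    assignments = []
    for case in new_cases:
        z = [(x - m) / s for x, m, s in zip(case, model["means"], model["sds"])]
        distances = [math.dist(z, centre) for centre in model["centres"]]
        assignments.append(min(range(len(distances)), key=distances.__getitem__))
    return assignments

model = {
    "means": [10.0, 100.0],
    "sds": [2.0, 10.0],
    "centres": [[-1.0, -1.0], [1.0, 1.0]],
}
print(classify([(8.0, 90.0), (12.0, 115.0)], model))
```

Storing the transformation parameters with the centres matters: new cases must be put on the same scale as the data the model was built from before distances are compared.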

Don't forget that our k-means analysis offers additional functionality not available in other software.  For example, there's a choice of 4 criterion functions to optimize.  If you wish, you can assign differential weights to your variables; or differential weights to your cases, e.g. in stratified samples.  Most important, k-means handles incomplete data; and because cluster statistics are computed from the complete observations on each variable in each cluster, our k-means analysis derives more accurate estimates of the criterion function than can be obtained by, for example, imputing missing values as a prior analytical step.  Finally, where else can you cluster a million cases in minutes, on a PC?
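The idea of computing cluster statistics from the complete observations on each variable can be illustrated as follows. This is a sketch under stated assumptions, not the program's method: missing values are represented by `None`, each cluster mean on a variable uses only the cases observed on that variable, and the distance between a case and a centre is taken as the mean squared difference over the variables present in both. The function names are hypothetical.

```python
def cluster_means(cases, labels, k):
    """Per-cluster, per-variable means using only observed values."""
    dim = len(cases[0])
    means = []
    for j in range(k):
        row = []
        for d in range(dim):
            observed = [case[d] for case, label in zip(cases, labels)
                        if label == j and case[d] is not None]
            # A variable with no observed values in a cluster stays missing.
            row.append(sum(observed) / len(observed) if observed else None)
        means.append(row)
    return means

def mean_sq_distance(case, centre):
    """Mean squared difference over variables observed in both."""
    pairs = [(x, m) for x, m in zip(case, centre)
             if x is not None and m is not None]
    if not pairs:
        return float("inf")  # nothing to compare on
    return sum((x - m) ** 2 for x, m in pairs) / len(pairs)

cases = [(1.0, None), (3.0, 4.0), (None, 6.0)]
means = cluster_means(cases, [0, 0, 0], k=1)
print(means)
print(mean_sq_distance(cases[0], means[0]))
```

No value is imputed at any point: a missing entry simply contributes nothing to the mean or to the distance, so the estimates rest only on data that were actually observed.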

You can distribute your model to co-workers - just upload it to a website, or attach it to your e-mail messages.  It's compact and self-contained, with all the truncated model data, case labelling and transformation parameters they need to run your model on their own data.  This presupposes, of course, that they too have ClustanGraphics.

This methodology was presented to the International Statistical Institute in Helsinki, 1999.  The abstract of our paper can be downloaded here.  Our k-means analysis methodology was further developed to allow for mixed data types, now available with ClustanGraphics5.  You can combine binary, nominal, ordinal and continuous variables in a k-means analysis, with incomplete data, variable transformations and differential case or variable weighting.  This unique k-means method, which is essential in data mining of databases and analysing survey questionnaires, will be presented at GfKl 2001 in Munich, 14-16 March 2001.

To find out more, you and your co-workers will just have to become users - so ORDER ClustanGraphics on-line now.