k-Means Technical Critique 

Some of our users have expressed surprise and concern that k-means clustering can produce different classifications from the same data if the starting conditions are varied.  They point to the fact that this is not evidently the case with other software, and hence query the added complexity involved in deciding which cluster solution to choose from a range of different solutions.

If you are currently using other k-means software, we invite you to check its documentation for a discussion of the important issues covered below: different stable cluster solutions, criterion values, outlier deletion, failure to converge, case order dependence, and completely randomized trials.

The chances are that few if any of these issues will be discussed.  Most of our competitors present one classification for a given dataset and number of clusters, as if it were the definitive, or best, cluster solution that can be found by the k-means procedure.  However, this is unlikely to be the case.

k-Means Procedure
The k-means clustering procedure is implemented in ClustanGraphics as follows:

    1. Choose an initial partition of the cases into k clusters (this can be a tree section, k selected cases, or a random assignment to k clusters).
    2. In one iteration, consider all of the cases in relation to each of the clusters.
    3. For each case, compute the change in the Euclidean Sum of Squares if the case is moved from its present cluster to any other cluster.
    4. Move any case for which the change at step 3 is negative, i.e. the change to a different cluster reduces the Euclidean Sum of Squares.
    5. Re-compute the cluster means following any change of cluster membership at step 4.
    6. Repeat steps 3-5 until no further changes of cluster membership occur in one complete iteration.  The procedure has converged to a stable cluster solution.

The cluster means are re-computed at step 5 when any change in cluster membership occurs.  We call this "migrating" or "drifting" means, and it will be evident that the final cluster solution will be dependent to some extent on the order in which the cases are considered for relocation (see below).
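
For concreteness, the following short Python sketch shows one way steps 1-6 can be implemented with migrating means.  It is our own illustration, assuming the data are held in a NumPy array and every initial cluster is non-empty; it is not the ClustanGraphics code itself:

    import numpy as np

    def exact_kmeans(X, labels, max_iter=100, rng=None):
        """Exact-relocation k-means with 'migrating' means (illustrative sketch).

        X      : (n, p) array of cases
        labels : initial assignment of each case to one of k clusters (0..k-1),
                 with every cluster non-empty
        rng    : optional NumPy Generator used to randomize the case order
        """
        X = np.asarray(X, dtype=float)
        labels = np.asarray(labels).copy()
        k = int(labels.max()) + 1
        if rng is None:
            rng = np.random.default_rng()

        for _ in range(max_iter):
            moved = False
            for i in rng.permutation(len(X)):          # randomized case order
                c = labels[i]
                sizes = np.bincount(labels, minlength=k)
                if sizes[c] == 1:                      # never empty a cluster
                    continue
                # Current cluster means ("migrating": recomputed after every move)
                means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
                d2 = ((X[i] - means) ** 2).sum(axis=1)
                # Exact change in the Euclidean Sum of Squares if case i is moved
                # from its present cluster c to cluster j (standard ESS identity)
                delta = sizes / (sizes + 1) * d2 - sizes[c] / (sizes[c] - 1) * d2[c]
                delta[c] = 0.0                         # staying put changes nothing
                best = int(np.argmin(delta))
                if delta[best] < 0.0:
                    labels[i] = best                   # relocate; the means drift
                    moved = True
            if not moved:                              # step 6: converged to a stable solution
                break
        return labels

Each relocation is accepted only when it reduces the Euclidean Sum of Squares, and the cluster means are re-computed immediately afterwards, as at steps 4 and 5 above.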

With ClustanGraphics you can choose to randomize the case order, so that different cluster solutions will typically be obtained from different k-means runs.  Our FocalPoint procedure keeps note of the different solutions obtained, and reports those with the smallest Euclidean Sum of Squares - the "top" solutions.

Different Cluster Solutions
How easily several different stable cluster solutions can arise is illustrated by the following diagram.  The data consist of 4 points placed at the vertices of a square, and the k-means clustering procedure is run for k=2 clusters.  There are six different "stable" cluster solutions, as shown below.

Six different stable cluster solutions for 4 points

The letters A and B (contoured in red and blue, respectively) denote the cluster to which each point has been assigned, and the crosses denote the cluster means where a cluster has 2 or 3 points.  It should be evident, from inspection, that in all six cluster solutions each case is closer to the mean of the cluster to which it is assigned than to the mean of the other cluster.  A little high school mathematics can readily prove the point.

However, the two solutions at the left have a smaller Euclidean Sum of Squares than the four solutions at the right - to be exact, if the square has a side of length 1, the Euclidean Sum of Squares is 1 for the two clusters of 2 points, and 4/3 (approximately 1.33) for a triangular cluster of 3 points.  Hence the two solutions at the left represent globally optimal stable cluster solutions, while the four solutions at the right are sub-optimal stable solutions.  Again, a little high school math can prove the point.
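
These figures are easily checked.  The following short snippet (our own, for illustration only) computes the Euclidean Sum of Squares for a 2-2 split and a 3-1 split of the corners of a unit square:

    import numpy as np

    # Corners of a unit square
    pts = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)

    def ess(points, labels):
        """Euclidean Sum of Squares: squared distances of cases from their cluster means."""
        return sum(((points[labels == c] - points[labels == c].mean(axis=0)) ** 2).sum()
                   for c in np.unique(labels))

    print(ess(pts, np.array([0, 0, 1, 1])))   # 2-2 split of adjacent corners: 1.0
    print(ess(pts, np.array([0, 0, 0, 1])))   # 3-1 split: 1.333... (= 4/3)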

The moral of this exercise is - don't trust a k-means procedure that gives you only one cluster solution.  You need to explore the possible alternatives in more depth.

Criterion Values
Varying the starting conditions can produce different stable cluster solutions, as illustrated above.  This prompts the question: which of several cluster solutions obtained from the same data is the best?  The question can be partly answered by examining the value of the clustering criterion being minimised.  For the vertices-of-a-square example, we found that the 2-2 solutions have a lower ESS than the 3-1 solutions.  This is often replicated in real applications.

For example, when FocalPoint is used with the Mammals case study, 7 different cluster solutions are obtained from 500 random trials with ESS values and cluster sizes as follows:

    Solution   Frequency   ESS       Cluster sizes
    1            336       2.58086   8-5-6-4-2
    2             49       2.80894   10-2-7-5-1
    3              4       2.92945   10-4-7-2-2
    4            101       2.94606   9-1-7-2-6
    5              2       2.95241   6-4-11-2-2
    6              3       3.38873   10-8-1-5-1
    7              5       3.45002   11-1-6-6-1

Solution 1 has the smallest ESS value and reassuringly occurs in two-thirds of the random trials.  This is the cluster solution that we would normally use to characterize mammals' milk.  However, it is interesting to note that solutions 2 and 4 assign elephant to a singleton cluster, and solutions 6 and 7 assign dolphin and seal to singleton clusters.

This section therefore illustrates the importance of investigating the criterion function values for the classifications, to establish which of several cluster solutions is most appropriate to use as a model, and also the influence of outliers on a k-means classification.  If you are using one of our competitors' k-means clustering tools, try running it with different starting configurations (if you can) and check the criterion function values (if they are reported) - then ask your vendor's representative to explain the differences.

Outlier Deletion
Many of our competitors make no provision for deleting outliers in their k-means clustering procedures.  Every case is forced to join a cluster no matter how atypical or remote it might be, with the result that the classification can be substantially distorted.  The following example illustrates the point.  The cluster solution for the H-R classification of visual stars encloses all the classified points in a tessellation of about 40 tight, spherically-shaped clusters.  Any points lying outside a red circle are treated as unclassified outliers.  It is thus relatively easy to identify the main dwarf and giant star sequences from the cluster solution - for details of the final model click here.

By comparison, if every point is forced to join a cluster the classification looks like this:

The clusters are typically larger when outliers are not removed, and the resulting cluster solution is more confused.  We generally advise that outliers should be deleted when clustering a large dataset with ClustanGraphics, thus placing all remote or atypical cases in a residue subset of unclassified cases.  When the cluster model has been saved, the distance from each outlier to its nearest cluster centre can be obtained and this can be used to classify the outliers, if desired.
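
As a rough illustration of this idea (our own sketch, not ClustanGraphics code), outliers can be separated using the distance from each case to its nearest saved cluster centre; the distance threshold here is a hypothetical parameter the user would choose:

    import numpy as np

    def split_outliers(X, centres, threshold):
        """Label each case with its nearest cluster centre, but mark any case
        further than `threshold` from that centre as an unclassified outlier (-1)."""
        X = np.asarray(X, dtype=float)
        centres = np.asarray(centres, dtype=float)
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        nearest = d.argmin(axis=1)                     # nearest centre for each case
        nearest_d = d[np.arange(len(X)), nearest]
        labels = np.where(nearest_d > threshold, -1, nearest)
        return labels, nearest, nearest_d

    # If desired, the outliers can be classified afterwards exactly as described
    # above, simply by assigning each one to its nearest centre:
    # labels[labels == -1] = nearest[labels == -1]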

Failure To Converge
Other software packages implement naïve k-means, in which simple Euclidean distances are computed between cases and cluster means; with this approach it is quite possible for the procedure to oscillate indefinitely between two or more partitions and never converge.  This defect was recognised by the originator of the k-means method (MacQueen, Berkeley Symposium, 1967, p. 288).  Our competitors employ a "convergence criterion" by which their procedures terminate when a complete iteration fails to move a cluster centre by more than a percentage of the smallest distance between any of the centres.  A detailed critique of naïve k-means can be found here.

Tests show that the number of cases moved in each k-means iteration does not necessarily decline sequentially, and therefore this ad hoc stopping rule may actually prevent the procedure from even getting close to the global optimum solution.

With ClustanGraphics we do not provide such ad hoc convergence criteria, because our implementation of k-means is guaranteed to converge and they are therefore not needed.  We apply an exact relocation test: a case is moved from one cluster to another only if the move reduces the Euclidean Sum of Squares, computed exactly.  Since the Euclidean Sum of Squares is a sum of squared distances about k cluster means, it cannot be reduced indefinitely and is bounded below by zero, hence our exact k-means procedure must converge.  Furthermore, we assert that our procedure achieves greater accuracy in its assignment decisions and is more likely to find the global optimum solution than those offered by our competitors.  Details here.
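
For reference, the exact change in the Euclidean Sum of Squares when a case x is moved from its present cluster c (with n_c cases and mean x_c) to another cluster j (with n_j cases and mean x_j) is the standard identity

    \Delta \mathrm{ESS} = \frac{n_j}{n_j + 1}\,\lVert x - \bar{x}_j \rVert^2 - \frac{n_c}{n_c - 1}\,\lVert x - \bar{x}_c \rVert^2

(we state this textbook result here for completeness; it is not quoted from any vendor's documentation).  A case is relocated only when this quantity is negative, so the criterion decreases strictly at every move and the procedure cannot oscillate.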

Case Order Dependence
As previously noted, the cluster means are re-computed at step 5 of our exact k-means procedure after each change in cluster membership - the "migrating" or "drifting" means - so the final cluster solution depends to some extent on the order in which the cases are considered for relocation.  With ClustanGraphics you can randomize the case order, in which case different runs will typically produce different cluster solutions; FocalPoint keeps note of the different solutions obtained and reports those with the smallest Euclidean Sum of Squares - the "top" solutions.

Completely Randomized Trials
As discussed above, our k-means procedure can be started by assigning the cases to k initial clusters at random.  We have also seen that the cluster solution can be influenced by the order of the cases.  A completely randomized trial therefore involves randomizing both the initial k clusters and the case order.  With FocalPoint, you can run any number of randomized trials on a dataset, and the procedure will record the top solutions, their frequency, reproducibility and ESS criterion values.  An example for the Mammals Case Study involving 500 randomized trials was given earlier.
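
A completely randomized trial loop along these lines can be sketched as follows.  This is our own illustration in Python, not the FocalPoint code, and it re-uses the exact_kmeans and ess functions sketched earlier:

    from collections import Counter
    import numpy as np

    def randomized_trials(X, k, n_trials=500, seed=0):
        """Run repeated k-means trials with a random initial partition and a
        randomized case order, tallying the distinct stable solutions found."""
        rng = np.random.default_rng(seed)
        tally = Counter()
        for _ in range(n_trials):
            init = rng.permutation(np.arange(len(X)) % k)   # random balanced start
            labels = exact_kmeans(X, init, rng=rng)          # case order randomized inside
            sizes = tuple(sorted(np.bincount(labels, minlength=k).tolist(), reverse=True))
            tally[(round(float(ess(X, labels)), 5), sizes)] += 1
        # Report the "top" solutions: smallest ESS first, with their frequencies
        for (crit, sizes), freq in sorted(tally.items()):
            print(f"ESS={crit}  cluster sizes={sizes}  frequency={freq}")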

Conclusion
If your present clustering procedure does not report the criterion function being optimized, or fails to offer an option to randomize the case order, then using it is equivalent to firing a blunderbuss blindfolded with a single grain of shot - you may hit the target, but only if you're very lucky.  Please reflect on this analogy!  As discussed above, different stable k-means cluster solutions can be obtained under different starting conditions or simply by changing the case order, so you need to explore a range of classifications produced by the procedure and examine their criterion values.

If you wish to explore this further, please download a paper that we presented at the German Classification Society in March 2001 - here - and a more recent presentation at the International Sociological Association's sixth conference on Logic and Methodology (RC33) in August 2004 - here (pdf format).

Thank you for your interest in the ClustanGraphics k-means procedure.