Some of our users have expressed surprise and concern that k-means clustering can produce different classifications from the same data if the starting conditions are varied. They point out that this is not evidently the case with other software, and hence query the added complexity of deciding which cluster solution to choose from a range of different solutions. If you are currently using other k-means software, we invite you to check its specification for a discussion of the following important issues:

- Different cluster solutions
- Criterion values
- Outlier deletion
- Failure to converge
- Case order dependence
- Completely randomised trials
The chances are that few if any of the above issues will be discussed. Most of our competitors present one classification for a given dataset and number of clusters, as if it were the definitive, or only, solution. In outline, the k-means relocation procedure works as follows:
1. Choose an initial partition of the cases into k clusters (this can be a tree section, k selected cases, or a random assignment to k clusters).
2. In one iteration, consider all of the cases in relation to each of the clusters.
3. For each case, compute the change in the Euclidean Sum of Squares if the case is moved from its present cluster to any other cluster.
4. Move any case for which the change at step 3 is negative, i.e. the move to a different cluster *reduces* the Euclidean Sum of Squares.
5. Re-compute the cluster means following any change of cluster membership at step 4.
6. Repeat steps 3-5 until no further changes of cluster membership occur in one complete iteration. The procedure has then converged to a stable cluster solution.
The cluster means are re-computed at step 5 when any change in cluster membership occurs. We call this "migrating" or "drifting" means, and it will be evident that the final cluster solution will be dependent to some extent on the order in which the cases are considered for relocation (see below). With ClustanGraphics you can choose to randomize the case order, so that different cluster solutions will typically be obtained from different k-means runs. Our FocalPoint procedure keeps note of the different solutions obtained, and reports those with the smallest Euclidean Sum of Squares - the "top" solutions.
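As a sketch of the ideas above (not ClustanGraphics' actual code), the relocation steps with drifting means and a randomised case order might look like this in Python. The closed-form ESS changes for removing a case from, or adding a case to, a cluster are standard results; everything else here is an illustrative assumption:

```python
import numpy as np

def ess(X, labels):
    """Total Euclidean Sum of Squares about the cluster means."""
    return sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
               for c in np.unique(labels))

def exact_relocation_kmeans(X, labels, max_iter=100, rng=None):
    """Steps 2-6 above: relocate a case only when the move reduces
    the total ESS, with means 'drifting' after every relocation."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).copy()
    k = int(labels.max()) + 1
    rng = rng or np.random.default_rng()
    for _ in range(max_iter):
        moved = False
        for i in rng.permutation(len(X)):      # randomised case order
            a = labels[i]
            na = np.sum(labels == a)
            if na <= 1:
                continue                        # never empty a cluster
            mu_a = X[labels == a].mean(axis=0)
            # exact ESS decrease from removing case i from cluster a
            loss = na / (na - 1) * np.sum((X[i] - mu_a) ** 2)
            best_b, best_gain = -1, loss
            for b in range(k):
                if b == a:
                    continue
                nb = np.sum(labels == b)
                mu_b = X[labels == b].mean(axis=0)
                # exact ESS increase from adding case i to cluster b
                gain = nb / (nb + 1) * np.sum((X[i] - mu_b) ** 2)
                if gain < best_gain:            # step 4: move only if ESS falls
                    best_b, best_gain = b, gain
            if best_b >= 0:
                labels[i] = best_b              # means drift: recomputed on next lookup
                moved = True
        if not moved:
            break                               # step 6: converged
    return labels
```

Note that on the square example discussed below, a nearest-mean test would leave a 3+1 partition of the corners unchanged, whereas the exact ESS test relocates one case and reaches the optimal 2+2 split.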
The letters A and B (contoured in red and blue, respectively) denote the cluster to which each point has been assigned, and the crosses denote the cluster means where a cluster has 2 or 3 points. It should be evident, from inspection, that in all six cluster solutions each case is closer to the mean of the cluster to which it is assigned than to the mean of the other cluster. A little high school mathematics can readily prove the point. However, the two solutions at the left have a smaller Euclidean Sum of Squares than the four solutions at the right - to be exact, if the square has a side of length 1, the Euclidean Sum of Squares is 1 for the two clusters of 2 points, and 1.33 for a triangular cluster of 3 points. Hence the two solutions at the left are optimal, while the four solutions at the right are stable but *sub-optimal*. Again, a little high school math can prove the point. The moral of this exercise is - don't trust a k-means procedure that gives you only one cluster solution. You need to explore the possible alternatives in more depth.
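The arithmetic can be checked directly. This short sketch computes the ESS for one representative of each type of partition of a unit square's corners (the labellings are illustrative choices):

```python
import numpy as np

# Corners of a unit square with side 1
square = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])

def ess(X, labels):
    """Euclidean Sum of Squares about each cluster's mean."""
    return sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
               for c in np.unique(labels))

two_plus_two = np.array([0, 0, 1, 1])    # two clusters of 2 points each
three_plus_one = np.array([0, 0, 0, 1])  # a triangular cluster of 3 plus a singleton

print(ess(square, two_plus_two))    # 1.0
print(ess(square, three_plus_one))  # 1.3333...
```

Each 2-point cluster contributes 2 x 0.5^2 = 0.5 to the ESS, giving 1 in total; the triangular cluster contributes 4/3 (the singleton contributes nothing), matching the figures quoted above.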
For example, when FocalPoint is used with the Mammals case study, 7 different cluster solutions are obtained from 500 random trials with ESS values and cluster sizes as follows:
Solution 1 has the smallest ESS value and reassuringly occurs in two-thirds of the random trials. This is the cluster solution that we would normally use to characterize mammals' milk. However, it is interesting to note that solutions 2 and 4 assign elephant to a singleton cluster, and solutions 6 and 7 assign dolphin and seal to singleton clusters. This section therefore illustrates the importance of investigating the criterion function values for the classifications, to establish which of several cluster solutions is most appropriate to use as a model, and also the influence of outliers on a k-means classification. If you are using one of our competitors' k-means clustering tools, try running it with different starting configurations (if you can) and check the criterion function values (if they are reported) - then ask your vendor's representative to explain the differences.
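FocalPoint itself is proprietary, but the bookkeeping idea can be sketched on the unit-square example rather than the Mammals data: run plain nearest-mean (Lloyd) k-means from many random starts and tally the distinct final ESS values. The helper names here are illustrative assumptions:

```python
import numpy as np
from collections import Counter

def lloyd_kmeans(X, centers, max_iter=50):
    """Plain nearest-mean k-means from the given initial centers.
    (A robust version would also handle clusters that become empty.)"""
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new = d2.argmin(axis=1)                  # assign each case to nearest mean
        if labels is not None and np.array_equal(new, labels):
            break                                # no case changed cluster: stable
        labels = new
        centers = np.array([X[labels == c].mean(axis=0)
                            for c in range(len(centers))])
    return labels

def ess(X, labels):
    return sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
               for c in np.unique(labels))

X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])  # unit-square corners
rng = np.random.default_rng(0)
tally = Counter()
for _ in range(200):                             # randomised trials, tallied
    seeds = rng.choice(len(X), size=2, replace=False)  # two random initial centers
    labels = lloyd_kmeans(X, X[seeds].copy())
    tally[round(ess(X, labels), 3)] += 1
print(sorted(tally))                             # distinct stable ESS values found
```

With adjacent corners as seeds the trials converge to the optimal 2+2 split (ESS 1.0); with diagonal corners they settle on a stable 3+1 split (ESS 1.333), so the tally shows both solutions, just as FocalPoint reports top solutions with their frequencies.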
By comparison, if every point is forced to join a cluster, the classification looks like this:

The clusters are typically larger when outliers are not removed, and the resulting cluster solution is more confused. We generally advise that outliers should be deleted when clustering a large dataset with ClustanGraphics, thus placing all remote or atypical cases in a residue subset of unclassified cases. When the cluster model has been saved, the distance from each outlier to its nearest cluster centre can be obtained and this can be used to classify the outliers, if desired.
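The last step, classifying saved outliers by their distance to the nearest cluster centre, might be sketched as follows (the function name, threshold and data are illustrative assumptions, not ClustanGraphics' API):

```python
import numpy as np

def classify_outliers(outliers, centers, max_dist=None):
    """Assign each held-out outlier to its nearest cluster centre;
    cases farther than max_dist (if given) stay unclassified (-1)."""
    outliers = np.asarray(outliers, dtype=float)
    # Euclidean distance from every outlier to every centre
    d = np.sqrt(((outliers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    nearest = d.argmin(axis=1)
    if max_dist is not None:
        nearest = np.where(d.min(axis=1) <= max_dist, nearest, -1)
    return nearest, d.min(axis=1)

# Hypothetical saved cluster centres and two remote cases
centers = np.array([[0., 0.], [10., 0.]])
outliers = np.array([[1., 1.], [100., 100.]])
labels, dist = classify_outliers(outliers, centers, max_dist=5.0)
print(labels)  # the second case is too remote and stays in the unclassified residue
```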
Tests show that the number of cases moved in each k-means iteration does not necessarily decline sequentially, and therefore this ad hoc stopping rule may actually prevent the procedure from even getting close to the global optimum solution. With ClustanGraphics, we do not provide such a stopping rule. Instead, we provide an exact relocation test for the Euclidean Sum of Squares, by which a case is moved from one cluster to another only if the move reduces the Euclidean Sum of Squares, *exactly* computed. Since the Euclidean Sum of Squares is a sum of squared distances about k cluster means, it cannot be reduced indefinitely and is bounded below by zero, hence our exact k-means procedure must converge. Furthermore, we assert that our procedure achieves greater accuracy in its assignment decisions and is more likely to find the global optimum solution than those offered by our competitors. Details here.
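The exact relocation test rests on standard closed-form ESS updates: removing a case x from a cluster of n cases with mean mu lowers the ESS by n/(n-1) * ||x - mu||^2, and adding it to a cluster of n cases raises it by n/(n+1) * ||x - mu||^2. A quick sketch (with made-up data) verifying the closed form against brute-force recomputation:

```python
import numpy as np

def ess(X, labels):
    """Euclidean Sum of Squares about each cluster's mean."""
    return sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
               for c in np.unique(labels))

rng = np.random.default_rng(42)
X = rng.normal(size=(12, 3))
labels = np.repeat([0, 1, 2], 4)  # three clusters of four cases

i, a, b = 0, 0, 1                 # move case 0 from cluster 0 to cluster 1
na, nb = 4, 4
mu_a = X[labels == a].mean(axis=0)
mu_b = X[labels == b].mean(axis=0)

# Exact change in total ESS for the move, in closed form
delta = (nb / (nb + 1) * np.sum((X[i] - mu_b) ** 2)
         - na / (na - 1) * np.sum((X[i] - mu_a) ** 2))

# Brute-force check: recompute the ESS before and after the move
before = ess(X, labels)
moved = labels.copy()
moved[i] = b
after = ess(X, moved)
print(np.isclose(delta, after - before))  # True
```

Because the test uses these exact quantities rather than an approximation, every accepted move strictly reduces the ESS, which is what guarantees convergence.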
A stable k-means cluster solution can depend on *both* the initial k clusters and the case order. With FocalPoint, you can run any number of randomized trials on a dataset, and the procedure will record the top solutions, their frequency, reproducibility and ESS criterion values. An example for the Mammals Case Study involving 500 randomized trials was given earlier.
As discussed above, it is *very* possible for different stable k-means cluster solutions to be obtained under different starting conditions or simply by changing the case order, so you need to explore a range of classifications produced by the procedure and examine their criterion values. If you wish to explore this further, please download the paper that we presented at the German Classification Society in March 2001 - here - and a more recent presentation at the International Sociological Association's sixth conference on Logic and Methodology (RC33) in August 2004 - here (pdf format). Thank you for your interest in the ClustanGraphics k-means procedure.