Naïve k-Means


Did you ever see a Professional Golf Association (PGA) championship where the players stop when their ball gets within 2 feet of the hole?

How about a 100m race where they decide who has won at 98 metres?

Or the concert conductor who stops his orchestra 20 bars from the end of the score?

These may be somewhat absurd analogues, but they accurately reflect the way many "naïve" k-means programs work - they stop when they get within 2% of their "goal", and their "goal" is not even a criterion function to be minimized.  For example, here is the "convergence" criterion for one widely available k-means program (see if you can guess the product from the syntax):

    The SAS k-means convergence criterion terminates when all cluster means cease moving by less than 2% of their starting distances.

Now, ask yourself why this test should be necessary ... it's because these procedures won't necessarily converge without such ad hockery!  We call them "naïve k-means" because the relocation test does not relate to any convergence criterion and hence does not guarantee convergence, particularly with large data sets such as social science surveys.

Standard software packages have in fact been running "naïve k-means" for decades, and they don't actually optimize anything.  A case i is moved from cluster p to cluster q if d(i,p) > d(i,q).  This naïve test considers only the distance of case i to the means of clusters p and q before the move occurs.  It takes no account of the effect of the move on the other cases in clusters p and q, nor of the changed values of the means of p and q after the move has occurred.
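To make the naïve relocation pass concrete, here is a minimal sketch of one iteration as described above.  The function and variable names are ours, not taken from any particular package; note that the test compares case i against the means as they stood before any move, and the means are only recomputed after the whole pass:

```python
import numpy as np

def naive_kmeans_pass(X, means, labels):
    """One naive pass: move case i from cluster p to cluster q
    whenever d(i,p) > d(i,q), judging all distances against the
    means computed *before* any move in this pass."""
    moved = 0
    for i, x in enumerate(X):
        d = np.linalg.norm(means - x, axis=1)  # distance of case i to every mean
        q = int(np.argmin(d))
        if d[labels[i]] > d[q]:                # the naive relocation test
            labels[i] = q
            moved += 1
    # Means are recomputed only after the full pass, so the test above
    # ignored the effect each move has on the means of p and q.
    for k in range(len(means)):
        if np.any(labels == k):
            means[k] = X[labels == k].mean(axis=0)
    return moved
```

Because nothing in this loop measures a criterion such as the Euclidean Sum of Squares, there is nothing here that is guaranteed to decrease from one pass to the next.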

As we have explained elsewhere, the naïve test can actually result in an increase in the Euclidean Sum of Squares, or trace(W), for clusters p and q, so the procedure can continue to run indefinitely without ever achieving convergence.  Hence the originators had to invent an ad hoc stopping rule, as instanced above, because it doesn't look good if their procedures always finish on the maximum number of iterations specified by the user.  This crucially means the resulting cluster solutions are indeterminate where true convergence is not achieved.  The final value of a convergence criterion, such as E, is never reported or available for comparison with other solutions.
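One way to see why the naïve test can increase the criterion: the exact change in the total Euclidean Sum of Squares E when case i moves from cluster p (size n_p) to cluster q (size n_q) has a known closed form (the notation here is ours):

```latex
\Delta E \;=\; \frac{n_q\, d_{iq}^2}{n_q + 1} \;-\; \frac{n_p\, d_{ip}^2}{n_p - 1}
```

The move improves the solution only when ΔE < 0.  Since the naïve test compares the unweighted distances d(i,p) and d(i,q), it can sanction a move with ΔE > 0 - for instance when p is small and q is large, so the second term shrinks relative to the first.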

How did we discover this simple fact?  We programmed naïve k-means and were puzzled when it failed to converge with large data sets.  Then we checked the changes in E after each iteration and found that E was actually increasing at the end of some iterations.  Furthermore, we noticed that some iterations relocated only a small number of cases, whereas later iterations relocated more.  So stopping the procedure when the changes in the means have become small does not necessarily mean that the reported solution is anywhere near an optimum value for E.  These programs simply hand you an arbitrary classification and say "here's a result, it's not optimum but it's the best we can do"!
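The check described above - recomputing the criterion E after every iteration - is easy to bolt onto any k-means loop.  A minimal sketch, with names of our own choosing:

```python
import numpy as np

def total_ess(X, means, labels):
    """Criterion E: the total within-cluster Euclidean Sum of Squares,
    i.e. the sum of squared distances of each case to its cluster mean."""
    return float(sum(np.sum((X[labels == k] - means[k]) ** 2)
                     for k in range(len(means))))
```

Logging this value once per iteration immediately exposes a run in which E rises - a run that a naïve "means stopped moving" test would still happily report as converged.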

Returning to our opening analogy, we never discover who actually won the 100m sprint because the race stopped at 98m.  Try that at the next Olympic Games.

It also means that, unlike our implementation of k-means, they cannot count the frequency of each final cluster solution to test its reproducibility, because they never finally converge.

You can read our most recent white paper on this, presented at the ISA in August 2004, here (pdf file).

To find out more and test exact k-means with your data, ORDER ClustanGraphics on-line now.