The Euclidean Sum of Squares (ESS) for a cluster p is the sum, over the cases i assigned to p, of the weighted squared Euclidean distance of each case from the cluster mean:

E_p = Σ_{i∈p} w_i d²_{ip}

where w_i is the weight of case i and d_{ip} is the Euclidean distance from case i to the mean of cluster p. The total ESS for all clusters p is thus:

E = Σ_p E_p

In ClustanGraphics we have implemented k-means to minimize the total Euclidean Sum of Squares E. This is equivalent to minimizing the sum of the weighted squared Euclidean distances d²_{ip} of every case i from the mean of its cluster p.

Other k-means programs relocate any case to the cluster with the nearest mean. Given the above definitions, this simplistic approach may appear to minimize E. However, with large datasets it does not necessarily converge quickly, or at all, because such relocations may not actually reduce E.
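The total weighted ESS defined above can be computed directly from the data, the case weights and the cluster assignments. The following is a minimal sketch, not ClustanGraphics code; the function and argument names are ours, and NumPy arrays are assumed:

```python
import numpy as np

def total_ess(X, w, labels, k):
    """Total weighted Euclidean Sum of Squares E = sum_p E_p.

    Each cluster mean is the weighted mean of its cases, and each
    case i contributes w_i times its squared Euclidean distance
    to the mean of its own cluster.
    """
    E = 0.0
    for p in range(k):
        mask = labels == p
        if not mask.any():
            continue  # empty cluster contributes nothing
        mean_p = np.average(X[mask], axis=0, weights=w[mask])
        E += np.sum(w[mask] * np.sum((X[mask] - mean_p) ** 2, axis=1))
    return E
```

With unit weights this reduces to the familiar unweighted within-cluster sum of squares.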
To minimize E we must only relocate a case i from cluster p to cluster q when E is thereby reduced, that is, when the increase in E_q is smaller than the decrease in E_p.
Put another way, relocating a case i from cluster p to cluster q pulls the mean of q towards it and pushes the mean of p away from it. This can increase the distances of other cases in clusters p and q from their cluster means, such that E is increased. With large datasets, this can cause boundary cases to oscillate between two or more clusters on successive iterations. Since E is a sum of squares, and therefore bounded below by zero, relocating only those cases which yield a reduction in E must converge, because E cannot be reduced indefinitely.
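This relocation criterion can be evaluated exactly without recomputing E from scratch, using the standard update formulas for the change in a cluster's ESS when a weighted case is removed or added. The sketch below assumes those formulas; the function name and calling conventions are ours, not ClustanGraphics': W_p and W_q are the clusters' total case weights, mean_p is computed with case i included, and mean_q with case i excluded:

```python
import numpy as np

def should_relocate(x_i, w_i, mean_p, W_p, mean_q, W_q):
    """Move case i from cluster p to cluster q only if E decreases.

    Removing i from p reduces E_p by  W_p * w_i / (W_p - w_i) * d(i,p)^2.
    Adding i to q increases E_q by    W_q * w_i / (W_q + w_i) * d(i,q)^2.
    Relocate only when the reduction exceeds the increase.
    """
    d2_p = np.sum((x_i - mean_p) ** 2)
    d2_q = np.sum((x_i - mean_q) ** 2)
    decrease_p = W_p * w_i / (W_p - w_i) * d2_p
    increase_q = W_q * w_i / (W_q + w_i) * d2_q
    return increase_q < decrease_p
```

Note that the test is not simply "which mean is nearer": the correction factors W/(W ± w_i) account for the means of p and q shifting when the case moves, which is exactly why nearest-mean relocation can increase E.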
How did we discover this discrepancy? When we programmed the simplistic k-means algorithm we were surprised that it failed to converge with very large datasets. Our data mining example involving k-means on a million cases was still running after 750 iterations, with oscillating values of E. But when we restricted relocations to those that actually reduce E, the analysis converged.

What does this mean to the user? Firstly, you can be confident that our k-means analysis will converge, if allowed enough iterations. Each iteration reduces the ESS, and the reduction in ESS can be monitored.

Secondly, our k-means analysis generally requires fewer iterations than other programs to reach a stable minimum ESS, so we don't need to resort to special convergence criteria, such as specifying a maximum percentage movement in cluster centres to prevent indefinite oscillation.

Thirdly, the use of differential weights provides extra flexibility. For example, in a customer segmentation study you can weight customers by their transaction volumes, thus emphasising those who contribute most to business turnover. Such case weighting also provides a surrogate cluster analysis of individual business transactions.

Our k-means analysis is just another example of cool programming which sets ClustanGraphics apart from the competition.
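The effect of case weighting on cluster centres can be seen in a small illustration. The figures below are hypothetical, not taken from any ClustanGraphics study: three customers with one spend feature, where one heavy trader carries most of the transaction volume:

```python
import numpy as np

# Hypothetical customer spend profiles (one feature, for brevity)
spend = np.array([[1.0], [2.0], [9.0]])

# Transaction volumes used as case weights (illustrative values)
volume = np.array([1.0, 1.0, 8.0])

# Unweighted mean treats every customer equally
unweighted_mean = spend.mean(axis=0)

# Weighted mean is pulled towards the customer with most transactions
weighted_mean = np.average(spend, axis=0, weights=volume)
```

Here the unweighted centre is 4.0, while the volume-weighted centre is 7.5, close to the heavy trader; within k-means, the same weighting shifts every cluster mean towards the cases that matter most to turnover.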