Euclidean Sum of Squares 


The Euclidean Sum of Squares (ESS) $E_p$ for a cluster $p$ is given by:

$$E_p = \frac{\sum_{i \in p} c_i \sum_j w_j \,(x_{ij} - m_{pj})^2}{\sum_j w_j}$$

where:  $x_{ij}$ is the value of variable $j$ in case $i$ within cluster $p$
        $c_i$ is an optional differential weight for case $i$
        $w_j$ is an optional differential weight for variable $j$
        $m_{pj}$ is the mean of variable $j$ for cluster $p$
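
To make the definition concrete, here is a minimal sketch in Python (our own illustration, not Clustan code; the function name and the use of NumPy are assumptions). It computes $E_p$ directly from the formula above; we assume $m_{pj}$ is the case-weighted mean when case weights are supplied, since that is the mean which minimises the weighted sum of squares.

```python
import numpy as np

def cluster_ess(X, c=None, w=None):
    """Euclidean Sum of Squares E_p for one cluster.

    X : (n_cases, n_vars) array of the values x_ij for the cases in cluster p
    c : optional differential case weights c_i (default 1)
    w : optional differential variable weights w_j (default 1)
    """
    n, m = X.shape
    if n == 0:
        return 0.0                        # an empty cluster contributes nothing
    c = np.ones(n) if c is None else np.asarray(c, dtype=float)
    w = np.ones(m) if w is None else np.asarray(w, dtype=float)

    mp = np.average(X, axis=0, weights=c)  # cluster mean m_pj (case-weighted)

    # Sum_{i in p} c_i Sum_j w_j (x_ij - m_pj)^2, normalised by Sum_j w_j
    sq = (X - mp) ** 2
    return float(c @ (sq @ w) / w.sum())
```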

The total ESS for all clusters $p$ is thus $E = \sum_p E_p$, and the increase in the Euclidean Sum of Squares $I_{pq}$ at the union of two clusters $p$ and $q$ is:

$$I_{pq} = E_{p \cup q} - E_p - E_q \;\rightarrow\; \text{Minimum}$$
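
As a sketch of how $I_{pq}$ could be evaluated, reusing the hypothetical cluster_ess above: compute the ESS of the pooled cases and subtract the two separate cluster terms.

```python
def merge_increase(Xp, Xq, cp=None, cq=None, w=None):
    """Increase in ESS at the union of clusters p and q:
    I_pq = E_{p u q} - E_p - E_q, the quantity minimised at each fusion."""
    Xpq = np.vstack([Xp, Xq])
    cpq = None
    if cp is not None or cq is not None:
        cp = np.ones(len(Xp)) if cp is None else np.asarray(cp, dtype=float)
        cq = np.ones(len(Xq)) if cq is None else np.asarray(cq, dtype=float)
        cpq = np.concatenate([cp, cq])
    return cluster_ess(Xpq, cpq, w) - cluster_ess(Xp, cp, w) - cluster_ess(Xq, cq, w)
```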

In ClustanGraphics we have implemented k-means to minimize the total Euclidean Sum of Squares $E$. This is equivalent to minimizing the sum of the weighted squared Euclidean distances $d_{ip}^2$ between each case $i$ and the mean $m_p$ of the cluster $p$ to which it belongs. If you have mixed data types, ClustanGraphics forms the means for all states within binary and nominal variables, and computes distances using our unique general distance functions.

Other k-means programs relocate any case to the cluster with the nearest mean.  Given the above definitions, this simplistic approach may appear to minimize E.  However, with large datasets it does not necessarily converge quickly, or at all, because such relocations may not actually reduce E.
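
For contrast, the simplistic rule is easy to state in code. A minimal sketch (again our own illustration, not any vendor's implementation):

```python
def nearest_mean(x_i, means):
    """Simplistic relocation rule: send case x_i to the cluster whose mean
    is nearest.  This ignores the shift the move causes in the donor and
    recipient means, which is why it may fail to reduce E."""
    d2 = ((means - x_i) ** 2).sum(axis=1)   # squared Euclidean distances
    return int(d2.argmin())                 # index of the nearest cluster
```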

To minimize $E$ we must relocate a case $i$ from cluster $p$ to cluster $q$ only when:

$$E_p + E_q > E_{p-i} + E_{q+i}$$

where $E_{p-i}$ is the ESS of cluster $p$ without case $i$, and $E_{q+i}$ is the ESS of cluster $q$ with case $i$ added.

We call this the exact relocation test for minimum E.  It is not the same as relocating case i to its nearest cluster mean, because any relocation from cluster p to cluster q causes consequential changes to the means of p and q; and, in certain circumstances, these changes may actually increase E. 
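
A minimal sketch of the exact relocation test, reusing the hypothetical cluster_ess above: it compares the ESS of the two affected clusters before and after the candidate move (case weights omitted for brevity).

```python
def exact_relocation_gain(X, labels, i, q, w=None):
    """Change in E if case i leaves its current cluster p for cluster q.

    Returns (E_{p-i} + E_{q+i}) - (E_p + E_q).  The exact test relocates
    case i only when this value is negative, i.e. when E strictly falls.
    """
    p = labels[i]
    in_p, in_q = labels == p, labels == q
    before = cluster_ess(X[in_p], w=w) + cluster_ess(X[in_q], w=w)

    in_p_after, in_q_after = in_p.copy(), in_q.copy()
    in_p_after[i] = False                   # cluster p without case i
    in_q_after[i] = True                    # cluster q with case i added
    after = cluster_ess(X[in_p_after], w=w) + cluster_ess(X[in_q_after], w=w)
    return after - before
```

Applying only moves with a negative gain guarantees that $E$ falls at every relocation, which is exactly the convergence argument given below.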

Put another way, relocating a case i from cluster p to cluster q pulls the mean of q towards it and pushes the mean of p away from it.  This can increase the distances of other cases in clusters p and q from their cluster means, so that E increases overall.  With large datasets, the result can be boundary cases that oscillate between two or more clusters on successive iterations.

Since E is a sum of squares, it is bounded below by zero; so relocating only those cases which yield a strict reduction in E must converge, because E cannot be reduced indefinitely.

How did we discover this discrepancy?  When we programmed the simplistic k-means algorithm we were surprised that it failed to converge with very large datasets.  Our data mining example involving k-means on a million cases was still running after 750 iterations, with oscillating values of E.   But when we implemented the exact relocation test for minimum E, ClustanGraphics k-means converged in 236 iterations.

What does this mean to the user?   Firstly, you can be confident that our k-means analysis will converge, if allowed enough iterations.  Each iteration reduces the ESS, and the reduction in ESS can be monitored.
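
A sketch of the whole iteration, built on the hypothetical helpers above, shows how $E$ can be monitored falling at each pass (labels is assumed to be a NumPy integer array of initial cluster assignments; for brevity the sketch does not guard against a cluster being emptied).

```python
def kmeans_exact(X, labels, k, w=None, max_iter=100):
    """Relocation k-means applying only moves that pass the exact test."""
    for it in range(max_iter):
        moved = False
        for i in range(len(X)):
            p = labels[i]
            # Best destination for case i under the exact relocation test
            gain, q = min((exact_relocation_gain(X, labels, i, q, w), q)
                          for q in range(k) if q != p)
            if gain < 0:                  # relocate only when E strictly falls
                labels[i] = q
                moved = True
        E = sum(cluster_ess(X[labels == p], w=w) for p in range(k))
        print(f"iteration {it + 1}: E = {E:.4f}")   # monitor the ESS reduction
        if not moved:
            break                         # converged: no move can reduce E
    return labels
```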

Secondly, our k-means analysis generally requires fewer iterations than other programs to reach a stable minimum ESS.  So we don't need to resort to special convergence criteria, such as specifying a maximum percentage movement in cluster centres to prevent indefinite oscillation.

Thirdly, the use of differential weights provides extra flexibility.  For example, in a customer segmentation study you can weight customers by their transaction volumes, thus emphasising those who contribute most to business turnover.  Such case weighting also provides a surrogate cluster analysis of individual business transactions.
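
For instance, with the hypothetical cluster_ess sketch above, weighting customers by transaction volume is just a matter of passing the volumes as case weights (the numbers here are invented for illustration):

```python
# Five customers described by two variables (invented data)
X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.5], [4.8, 5.1], [5.2, 4.9]])
volumes = np.array([10.0, 3.0, 50.0, 7.0, 30.0])   # transaction volumes

# ESS of the five customers as one cluster, weighted by volume:
# high-volume customers dominate the cluster mean and hence the ESS
print(cluster_ess(X, c=volumes))
```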

Our k-means analysis is just another example of cool programming which sets ClustanGraphics apart from the competition.  To order ClustanGraphics on-line click ORDER now.