Navigate k-Means 

This example illustrates the use of Navigate Tree with k-means analysis on a large survey of 16,000 cases. From a random start, ClustanGraphics converged to a 40-cluster solution in 20 iterations, taking just 4 seconds! Outlier deletion was used to tighten the final cluster model; it removed 1,007 outliers, leaving 14,993 cases in the core model. Cluster exemplars were computed, and the means for the final 40-cluster model were saved.
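The general idea of k-means with outlier deletion can be sketched in a few lines of Python. This is a minimal illustration and not the ClustanGraphics algorithm: it assumes a simple batch (Lloyd-style) update and a fixed distance threshold beyond which a case is set aside as an outlier. The function name, the `threshold` parameter and the `-1` outlier label are all hypothetical choices made for this sketch.

```python
import numpy as np

def kmeans_with_outlier_deletion(X, k, threshold, max_iter=100, seed=0):
    """Batch k-means from a random start; cases farther than `threshold`
    from every cluster centre are labelled -1 and excluded from the means.
    (Hypothetical sketch; the threshold rule is an assumption.)"""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random start: pick k distinct cases as the initial centres.
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -2)  # sentinel so the first pass never "converges"
    for iteration in range(1, max_iter + 1):
        # Squared Euclidean distance from every case to every centre: shape (n, k).
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Outlier deletion: set aside cases that are too far from all centres.
        new_labels[d2.min(axis=1) > threshold ** 2] = -1
        if np.array_equal(new_labels, labels):
            break  # no case changed cluster, so the solution has converged
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    # Euclidean Sum of Squares over the retained (non-outlier) cases.
    ess = sum(((X[labels == j] - centres[j]) ** 2).sum() for j in range(k))
    return labels, centres, ess, iteration
```

Called with `k=40` on a 16,000-case array, the function would return the cluster labels, the 40 cluster means, the ESS of the core model and the number of iterations used; the counts and timings quoted above come from ClustanGraphics, not from this sketch.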

Clustering by k-means analysis on 16,000 cases with 40 final clusters and outlier deletion

Summary tree for the 40-cluster model obtained using k-means analysis with outlier deletion.  Cluster labels show the code of each cluster's exemplar and its size in parentheses.

A summary tree for the 40-cluster model was then computed by Ward's method, as shown above.  Both the k-means and the hierarchical cluster analyses optimize the same objective function, the Euclidean Sum of Squares (ESS); most other k-means programs are not consistent in this respect.
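To see why the two stages share one objective, note that under Ward's method the increase in ESS when two clusters with means m_a, m_b and sizes n_a, n_b are fused is n_a n_b / (n_a + n_b) times the squared Euclidean distance between the means. A summary tree can therefore be built from the saved cluster means and sizes alone. The following is a minimal sketch of that calculation; the function name and return format are hypothetical, and it is not the ClustanGraphics implementation.

```python
import numpy as np

def ward_summary_tree(means, sizes):
    """Agglomerate cluster means by Ward's method.
    Returns a list of fusions (id_a, id_b, ess_increase, new_size);
    the initial clusters are numbered 0..k-1 and each fusion gets the next id."""
    clusters = {i: (np.asarray(m, dtype=float), int(n))
                for i, (m, n) in enumerate(zip(means, sizes))}
    next_id = len(clusters)
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                (ma, na), (mb, nb) = clusters[a], clusters[b]
                # Increase in ESS caused by fusing clusters a and b.
                cost = na * nb / (na + nb) * ((ma - mb) ** 2).sum()
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ma, na), (mb, nb) = clusters.pop(a), clusters.pop(b)
        # The fused cluster's mean is the size-weighted mean of its parts.
        clusters[next_id] = ((na * ma + nb * mb) / (na + nb), na + nb)
        merges.append((a, b, cost, na + nb))
        next_id += 1
    return merges
```

Because each fusion is chosen to give the smallest increase in ESS, the tree continues to work on the same criterion that the k-means stage minimized.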

The nodes of the summary tree are automatically labelled by cluster exemplars, with cluster sizes in parentheses (these labels can be edited).  The summary tree has also been optimally ordered, so that the horizontal cluster order can be more easily interpreted.
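As an illustration of how such labels might be produced, the sketch below takes the exemplar of a cluster to be the member case closest to the cluster mean. That definition is an assumption made for this example, as are the function name and the returned format of (case index, cluster size).

```python
import numpy as np

def cluster_exemplars(X, labels, centres):
    """For each cluster, return (index of the case nearest the mean, cluster size).
    Cases labelled -1 (outliers) are ignored. Hypothetical helper."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    exemplars = {}
    for j, centre in enumerate(np.asarray(centres, dtype=float)):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue  # an empty cluster has no exemplar
        d2 = ((X[idx] - centre) ** 2).sum(axis=1)
        exemplars[j] = (int(idx[d2.argmin()]), int(idx.size))
    return exemplars
```

A node label such as "exemplar code (size)" can then be formatted directly from each returned pair.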

The 6-cluster section of the 40-cluster model, highlighted above, is then displayed in summary form using Navigate Tree, as shown below.  Clusters are identified by their level in the tree; thus Cluster +4 is the right-hand cluster formed at the 4-cluster level.  Cluster sizes and the cluster means for the Response Rate variable are displayed.

Cluster means for the Response Rate variable and cluster sizes, up to the 6-cluster solution.

We observe that the cluster Response Rates generally increase from left to right because the tree has been optimally ordered, and this is confirmed by the t-tests shown in the cluster profiles:

Cluster profiles showing t-tests on 5 variables, for the k-means cluster model of 14,993 cases.

Cluster Profiles identifies significant cluster means in all the variables simultaneously.  In the example, the Response Rate variable is highlighted in red.  It shows at a glance how the cluster means for all the variables compare at each level from 1 to 6 clusters.
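The exact form of the t-test used by Cluster Profiles is not stated here. As a rough sketch, one common choice is a one-sample t statistic comparing each cluster mean with the overall mean of the retained cases, computed separately for every variable. The function below assumes that form; its name and the convention that outliers carry the label `-1` are hypothetical.

```python
import numpy as np

def cluster_t_statistics(X, labels, cluster):
    """t statistic per variable: (cluster mean - overall mean) / standard error,
    where the standard error uses the cluster's own sample standard deviation.
    Assumed form, for illustration only."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    members = X[labels == cluster]
    n = len(members)
    overall_mean = X[labels >= 0].mean(axis=0)  # outliers (-1) are excluded
    standard_error = members.std(axis=0, ddof=1) / np.sqrt(n)
    return (members.mean(axis=0) - overall_mean) / standard_error
```

Large positive or negative values flag the variables on which a cluster departs most from the overall mean, which is the kind of information the profile display highlights.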

It's easy to see that the 2-cluster level is differentiated on Response Rate, with means of 2.02 in cluster -2 and 6.89 in cluster +2.  The equivalent decision tree rule for the first split, or final fusion, would be: Response Rate > 4.5.
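One way to read this threshold: if cases were assigned on Response Rate alone, the boundary between the two nearest means would lie at their midpoint, (2.02 + 6.89) / 2 = 4.455, which is close to the quoted 4.5. The short sketch below works this through; the function names are hypothetical, and the midpoint interpretation is an assumption rather than a statement of how the rule was derived.

```python
def midpoint_threshold(mean_low, mean_high):
    """Boundary between two cluster means when assigning on a single variable."""
    return (mean_low + mean_high) / 2.0

def assign_by_rule(response_rate, threshold=4.5):
    """Apply the single-variable rule 'Response Rate > threshold'."""
    return "+2" if response_rate > threshold else "-2"

boundary = midpoint_threshold(2.02, 6.89)  # approximately 4.455
```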

At the next level the first variable differentiates clusters -3 and +3.  At the following cluster level, the first 3 variables are correlated in differentiating clusters -4 (high) and +4 (low), with variable 2 dominating.

Bear in mind that this is not a decision tree.  Clusters are formed on all variables simultaneously, so the analysis is multivariate at each clustering level.

This example illustrated the following ClustanGraphics features: k-means analysis with outlier deletion on a large survey, summary tree by hierarchical cluster analysis, optimal tree ordering, Navigate Tree, t-tests on variables and cluster profiling.