This illustrates the use of Navigate Tree with k-Means Analysis on a large survey of 16,000 cases. From a random start,
ClustanGraphics converged to a 40-cluster solution in 20 iterations, taking just 4 seconds! We used outlier deletion to tighten the final cluster model; this removed 1,007 outliers, leaving a core model of the remaining cases.
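The outlier deletion step can be pictured as follows. This is only a sketch under assumptions: ClustanGraphics' actual deletion criterion is configurable and not reproduced here, so a simple fixed-distance cutoff from the nearest cluster mean is used, on 1-D toy data for brevity.

```python
# Sketch of outlier deletion after a k-means fit: cases further from
# their nearest cluster mean than a cutoff are set aside, leaving a
# "core" model. The fixed-distance cutoff is an assumption, not
# ClustanGraphics' documented rule.

def trim_outliers(data, centers, cutoff):
    core, outliers = [], []
    for x in data:
        nearest = min(centers, key=lambda c: abs(x - c))
        (core if abs(x - nearest) <= cutoff else outliers).append(x)
    return core, outliers

data = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 30.0]   # 30.0 is far from both clusters
centers = [1.0, 5.0]
core, outliers = trim_outliers(data, centers, cutoff=2.0)
print(len(core), outliers)   # 6 core cases; [30.0] deleted as an outlier
```

In practice the cluster means would then be recomputed on the core cases, which is what "tightening" the final model amounts to.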
Cluster exemplars were computed, and the means for the final 40-cluster model were saved. A summary tree for the 40-cluster model was then computed by Ward's method, as shown above. Both the k-means
and hierarchical cluster analyses optimized the same objective function, the Euclidean Sum of Squares (ESS); most other k-means programs are not consistent in this respect.
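The shared criterion can be stated concretely. The sketch below gives the textbook definition of ESS on 1-D toy data, plus the ESS increase that Ward's method minimizes at each fusion; it is an illustration of the criterion only, not of ClustanGraphics' implementation, which is not public.

```python
# The Euclidean Sum of Squares (ESS) criterion that both the k-means
# and Ward's-method stages optimise. Pure-Python, 1-D illustration.

def ess(cluster):
    """Sum of squared deviations of a cluster's cases from its mean."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

def total_ess(clusters):
    """Within-cluster ESS of a whole partition."""
    return sum(ess(c) for c in clusters)

def ward_increase(a, b):
    """ESS increase if clusters a and b are fused; Ward's method
    fuses, at each step, the pair for which this is smallest."""
    return ess(a + b) - ess(a) - ess(b)

clusters = [[1.0, 2.0], [2.5, 3.0], [9.0, 10.0]]
print(total_ess(clusters))
print(ward_increase([1.0, 2.0], [2.5, 3.0]))   # cheap fusion of near clusters
print(ward_increase([2.5, 3.0], [9.0, 10.0]))  # expensive fusion of far clusters
```

Because k-means reallocation and Ward's fusions both reduce or minimally increase this same quantity, the flat 40-cluster model and the summary tree built on top of it are mutually consistent.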
The nodes of the summary tree are automatically labelled by cluster exemplars, with cluster sizes in parentheses (these labels can be edited). The summary tree has also been optimally ordered, so that the horizontal cluster
order can be more easily interpreted. The 6-cluster section of the 40-cluster model, highlighted above, is then displayed in summary form
using Navigate Tree, as below. Clusters are identified by their level in the tree; thus Cluster +4 is the right-hand
cluster at the 4-cluster level. Cluster sizes and the cluster means for the Response Rate variable are displayed. We observe that the cluster Response Rates generally increase from left to right due to the tree having been
re-ordered, and this is confirmed by the t-tests shown in the cluster profiles: Cluster Profiles identifies significant cluster means in all the variables simultaneously. In the example, the
Response Rate variable is highlighted in red. The profile display shows at a glance how the cluster means for all the variables compare at each level from 1 to 6 clusters.
It's easy to see that the 2-cluster level is differentiated on the Response Rate, with means of 2.02 in cluster -2
and 6.89 in cluster +2. The equivalent decision tree rule for the first split, or final fusion, would be: Response Rate > 4.5.
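One simple way to recover a cutoff of this kind, sketched below, is the midpoint of the two cluster means on the splitting variable. This convention is an assumption for illustration, not necessarily how the quoted rule was derived.

```python
# Hypothetical derivation of the decision-tree-style cutoff quoted in
# the text: take the midpoint of the two cluster means on the
# splitting variable (an assumed convention).

mean_low, mean_high = 2.02, 6.89   # Response Rate means of clusters -2 and +2
cutoff = (mean_low + mean_high) / 2
print(cutoff)   # midpoint ~4.46, consistent with the reported rule
                # "Response Rate > 4.5" after rounding
```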
At the next level, the first variable differentiates clusters -3 and +3. At the following cluster level, the first 3 variables are correlated in differentiating clusters -4 (high) and +4 (low), with variable 2 dominating.
Bear in mind that this is not a decision tree. Clusters are formed on all variables simultaneously, so the analysis is multivariate at each clustering level.
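The kind of t-test behind these profiles can be sketched as follows. A pooled-variance two-sample t statistic, comparing one cluster's mean on a variable with the mean of the remaining cases, is used here as an assumption; the exact test ClustanGraphics applies is not specified in the text.

```python
# Sketch of a Cluster Profiles-style significance test for one
# variable: a pooled-variance two-sample t statistic comparing a
# cluster's cases with all remaining cases (an assumed test form).
import math

def t_statistic(cluster, rest):
    n1, n2 = len(cluster), len(rest)
    m1, m2 = sum(cluster) / n1, sum(rest) / n2
    v1 = sum((x - m1) ** 2 for x in cluster) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in rest) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))

# Illustrative (made-up) Response Rate values for one cluster vs. the rest:
cluster = [6.5, 7.0, 7.2, 6.8]
rest    = [1.8, 2.1, 2.3, 1.9, 2.0]
print(t_statistic(cluster, rest))   # a large t flags a significant cluster mean
```

A large absolute t value is what Cluster Profiles would flag as a significant cluster mean for that variable; running this test over every variable at once gives the simultaneous view described above.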
This example illustrated the following ClustanGraphics features: k-means analysis with outlier deletion on a large survey, summary tree by hierarchical cluster analysis, optimal tree ordering, Navigate Tree, t-tests on variables
and cluster profiling.