divisive clustering tool designed to discover the key features that discriminate between clusters. It partitions any cluster on the basis of the best cut on a single key variable. It is also very fast, requiring only the data matrix to be read.

Cluster Keys starts with the whole dataset and examines each variable in turn for the best split into two cluster subsets, that is, the division into two clusters which yields the maximum reduction in the Euclidean Sum of Squares. It is therefore analogous to agglomerative clustering that minimizes the Euclidean Sum of Squares (Ward's Method), but it works by division rather than agglomeration.

If the best split of the dataset is on a binary variable, Cluster Keys divides on the presence or absence of that key binary variable. If the best split is on an ordinal or continuous variable, Cluster Keys divides on a cut value between x and y, where x and y are values chosen from the data by Cluster Keys: one cluster subset contains the values x and lower on the chosen variable, and the other comprises the values y and higher.

The following dialogue gives the abbreviated results for Cluster Keys with the Mammals Dataset. The first division gives a simple divisional key for classifying the data: the best discriminating variable is found that splits the dataset into two clusters. In the above example, the first division is on Lactose > 3.3%, which reduces the Euclidean Sum of Squares by 69.9, the maximum possible for a split on any variable, forming clusters 1 and 5. The next step is to examine each of the two resulting clusters and find the best further split of one of them into two subsets, thus forming 3 clusters. In the example, this second split is on Water > 46.5% in cluster 5, forming clusters 5 and 7. Cluster Keys continues in this way, at each step finding the split with the maximum reduction in the Euclidean Sum of Squares, until the whole dataset has been subdivided into singleton clusters or cases.
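The search for the best single-variable cut can be sketched as follows. This is an illustrative outline only, not Clustan's implementation: for each variable it tries every cut between adjacent distinct values and keeps the cut giving the largest reduction in the Euclidean Sum of Squares. The function names (ess, best_cut) and the example data are our own.

```python
import numpy as np

def ess(X):
    """Euclidean Sum of Squares: squared deviations from the cluster centroid."""
    if len(X) == 0:
        return 0.0
    return float(((X - X.mean(axis=0)) ** 2).sum())

def best_cut(X):
    """Find the single-variable cut maximizing the ESS reduction.

    Returns (variable index, cut value, reduction).  The cut value is the
    midpoint between adjacent distinct values x and y, so one subset holds
    values x and lower, the other values y and higher, as described above.
    """
    total = ess(X)
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])          # sorted distinct values of variable j
        for x, y in zip(values[:-1], values[1:]):
            cut = (x + y) / 2.0
            left = X[X[:, j] <= cut]
            right = X[X[:, j] > cut]
            reduction = total - ess(left) - ess(right)
            if reduction > best[2]:
                best = (j, cut, reduction)
    return best

# Toy data: variable 0 separates two clear groups, variable 1 is noise.
X = np.array([[0.0, 5.0], [0.1, 4.9], [10.0, 5.1], [10.2, 5.0]])
j, cut, reduction = best_cut(X)
```

Applying the winning cut to each resulting cluster in turn, always taking the largest available reduction, yields the divisive hierarchy described above.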
The result is a hierarchical classification obtained by division, the reverse of hierarchical clustering by agglomeration (e.g. Cluster/Proximities). When small clusters are being subdivided, the same split will frequently be identified for two or more variables, and of course this always occurs with clusters of size 2. In such cases the "best" split is ambiguous: any one of the tied variables could be selected as the divisional key. Cluster Keys therefore reports all ambiguous divisions, and these keys should not be applied in practice.

Upon completing the division tree, the results can be copied to a document or spreadsheet, for example into Excel as shown below. The keys can thus be used for further analysis or classification.

At present, Cluster Keys cannot sensibly divide on a nominal variable, though it is planned to provide this in a future version; any nominal variables are currently excluded from the analysis. However, it is of course open to the user to re-code each nominal variable as a set of dummy binary variables.

Cluster Keys completed a hierarchical division tree for 10,000 cases by 100 variables in under a minute on a Pentium III PC. A dataset of 120,000 cases and 100 continuous variables was divided hierarchically in just 22 minutes.

The results for Cluster Keys are currently displayed as a tree, for which cluster diagnostics, membership lists, best cut significance tests, navigate tree, and all other ClustanGraphics tree functions are available. Cluster Keys was introduced in ClustanGraphics 8.02, July 2005.
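The dummy-binary re-coding mentioned above for nominal variables can be sketched as follows. This is a minimal illustration, not part of Cluster Keys itself; the function name and the example categories are hypothetical.

```python
import numpy as np

def dummy_code(values):
    """Recode a nominal variable as one binary (0/1) column per category.

    Returns the sorted category labels and an array with one dummy
    column per category, so each case has exactly one 1 per variable.
    """
    categories = sorted(set(values))
    dummies = np.array([[1 if v == c else 0 for c in categories]
                        for v in values])
    return categories, dummies

# Hypothetical nominal variable for three cases:
cats, D = dummy_code(["land", "sea", "land"])
```

The resulting binary columns can then be included in the data matrix, allowing Cluster Keys to divide on the presence or absence of each category.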