To use direct data clustering, first read a data matrix into ClustanGraphics. You can do this by clicking the New Data button on the toolbar, or by choosing New Data on the File menu. Alternatively, use File Open to read a ClustanGraphics file that contains a data matrix. In this example, we read a data matrix of 5 columns and 40,000 rows.

Now select Cluster Data and choose a method of hierarchical cluster analysis from the pull-down list. Increase in Sum of Squares (Ward's Method) is the default - it minimizes the increase in the Euclidean Sum of Squares at each fusion in the resulting tree. Alternatively, you can choose Average Linkage, which minimizes the average of all the distances between the cases within the clusters.

Click OK and your cluster analysis will be completed. This was quite a large one, which required 51 iterations and took 15 minutes to compute on a Pentium III PC. It's unlikely you will have data this large, but it's reassuring to know that ClustanGraphics scales to virtually any data mining application you might encounter.

If you have clustered a very large data matrix, it's prudent to truncate the tree to a manageable cluster model or save the tree to a file. In this case, the tree was truncated to the last 40 clusters.

When you truncate your tree, cluster means are computed and your cluster model is available for further analysis. For example, you could use it as a starting strategy for k-means analysis, since optimum k-means solutions can often be found by starting from a tree section. Or you could do an outlier analysis to remove the outliers and tighten up the cluster centres. You could also use your cluster model to classify new cases.

That's all there is to it - a few clicks, and you've clustered 40,000 cases by Ward's Method, truncated the tree, saved the cluster model and displayed a summary tree!
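
If you want to see the same sequence of steps outside the ClustanGraphics interface, the workflow above can be sketched roughly in Python. This is only an illustration, not how ClustanGraphics itself works: it assumes NumPy and SciPy are installed, uses SciPy's `linkage`, `fcluster` and `kmeans2` in place of the menu commands, and runs on a small synthetic data matrix (400 rows, 5 columns) rather than the 40,000-row example. The function names `ward_cluster_model` and `euclidean_sum_of_squares` are hypothetical helpers invented for this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2


def ward_cluster_model(data, n_clusters):
    """Build a Ward's Method tree, cut it at n_clusters, return labels and means."""
    tree = linkage(data, method="ward")  # full hierarchical tree
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")  # truncate the tree
    # Convert SciPy's cluster ids to 0-based indices into the array of means.
    ids, assignment = np.unique(labels, return_inverse=True)
    means = np.vstack([data[assignment == i].mean(axis=0) for i in range(len(ids))])
    return assignment, means


def euclidean_sum_of_squares(data, centres, assignment):
    """Sum of squared distances from each case to its assigned cluster centre."""
    return float(((data - centres[assignment]) ** 2).sum())


# Synthetic stand-in for a data matrix: 4 groups of 100 cases, 5 variables each.
rng = np.random.default_rng(0)
true_centres = rng.normal(scale=10.0, size=(4, 5))
data = np.vstack([c + rng.normal(size=(100, 5)) for c in true_centres])

# Cluster by Ward's Method and truncate the tree to a 4-cluster model.
assignment, means = ward_cluster_model(data, n_clusters=4)

# Use the tree section as the starting strategy for k-means.
refined_centres, km_assignment = kmeans2(data, means, minit="matrix")

ess_tree = euclidean_sum_of_squares(data, means, assignment)
ess_kmeans = euclidean_sum_of_squares(data, refined_centres, km_assignment)
print(f"ESS from tree section: {ess_tree:.2f}")
print(f"ESS after k-means:     {ess_kmeans:.2f}")
```

Starting k-means from the cluster means of the tree section, rather than from random centres, is the design choice being illustrated: each k-means pass can only keep or reduce the Euclidean Sum of Squares of the starting model. Note that SciPy's `linkage` stores all pairwise distances, so this sketch is not suited to a matrix of 40,000 rows without considerably more memory.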