ClustanGraphics Tutorial

This is a short tutorial based on the Mammals' Milk case study, designed to introduce new students to hierarchical cluster analysis with ClustanGraphics. It cross-references Cluster Analysis, Fourth Edition, 2001, by Brian S. Everitt, Sabine Landau and Morven Leese, published by Arnold, London and Oxford University Press, New York. If you don't have a copy of the book yet, don't worry - it is helpful, but not essential, to this tutorial.

To complete the tutorial properly you should copy the answer sheet into a word processor such as Microsoft Word, because you will find it easiest to paste some results directly from ClustanGraphics into it.

Our data set Mammals.xls gives the percentage composition of 25 mammals' milk, measured on five variables. It was taken from Hartigan, J A (1975), Clustering Algorithms, Wiley, p. 304. The original source is Spector, W S (1956), Handbook of Biological Data, Saunders.

We suggest that you first view the raw data in Excel. Observe that there are large variations in water, from 45% to 90%, and in fat, from 1% to 42%. Can you suggest why? [Q1] By comparison, ash (or trace elements) ranges only from 0.1% to 2.3%. You can check these figures easily by sorting the spreadsheet on each variable in Excel. What are the ranges of protein and lactose? Write these values in your answer sheet. [Q2]
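If you prefer to check the ranges outside Excel, the same minimum and maximum lookup can be sketched in a few lines of Python. The rows below are hypothetical placeholders rather than the Mammals.xls figures, and the function name `variable_ranges` is our own.

```python
# Hypothetical rows for illustration only; these are not the Mammals.xls figures.
rows = [
    {"name": "case A", "water": 90.0, "fat": 1.0, "ash": 0.4},
    {"name": "case B", "water": 65.0, "fat": 20.0, "ash": 1.4},
    {"name": "case C", "water": 45.0, "fat": 42.0, "ash": 0.9},
]

def variable_ranges(rows, variables):
    """Return {variable: (minimum, maximum)} taken over all rows."""
    return {
        v: (min(r[v] for r in rows), max(r[v] for r in rows))
        for v in variables
    }

ranges = variable_ranges(rows, ["water", "fat", "ash"])
for name, (low, high) in ranges.items():
    print(f"{name}: {low} to {high}")
```

This does the same job as sorting each column in the spreadsheet and reading off the first and last values.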

If we were to classify the data table in its raw form, the resulting distances would be dominated by the larger variations in fat and water, and the contribution of ash would be comparatively negligible - hardly worth including in the data. Yet trace elements are arguably one of the most important constituents of milk. We therefore standardize the variables to zero mean and unit standard deviation, which gives them equal weight; this is done by default at step 2 of Clustan Wizard. When you have run the Wizard you can check the standard scores (z-scores) by clicking View/Data and comparing the figures with Table 4.4 in Cluster Analysis, p. 87. Copy this table into your answer sheet. [Q3]
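For readers who want to see the arithmetic behind a z-score, here is a minimal sketch. It assumes the sample standard deviation (divisor n - 1); ClustanGraphics may use a different divisor, so small differences from View/Data are possible. The values are hypothetical, not the Mammals.xls data.

```python
from statistics import mean, stdev

def standardize(column):
    """Return z-scores: (x - mean) / standard deviation."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1): an assumption
    return [(x - m) / s for x in column]

# Hypothetical values for two variables on very different scales.
fat = [1.0, 4.0, 10.0, 21.0, 42.0]
ash = [0.1, 0.4, 0.8, 1.2, 2.3]

z_fat = standardize(fat)
z_ash = standardize(ash)

# After standardization both variables have mean 0 and standard deviation 1,
# so neither dominates a distance calculation.
print([round(z, 2) for z in z_fat])
print([round(z, 2) for z in z_ash])
```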

We have clustered the standardized data using Average Linkage for consistency with the book, but you could now experiment with other clustering methods. Either go back to Clustan Wizard and change the method at step 4, or click Cluster/Proximities. Try out different methods - for example, if you use Single Linkage you may not find much structure, but you should discover four outliers - what are they? Write your suggestions in your answer sheet. [Q4]
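To make the difference between the two methods concrete, the following sketch implements a naive agglomerative procedure with a choice of single or average linkage. It is an illustration of the textbook definitions, not ClustanGraphics' own code; the function names and the five points are hypothetical.

```python
from itertools import combinations
from math import dist

def cluster_distance(a, b, points, method):
    """Distance between clusters a and b (lists of point indices)."""
    pair_distances = [dist(points[i], points[j]) for i in a for j in b]
    if method == "single":      # nearest pair of cases
        return min(pair_distances)
    if method == "average":     # mean over all between-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown method: {method}")

def agglomerate(points, method="average"):
    """Repeatedly merge the two closest clusters.

    Returns a list of (fusion value, merged cluster), one entry per fusion.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            d = cluster_distance(a, b, points, method)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters.remove(a)
        clusters.remove(b)
        merged = sorted(a + b)
        clusters.append(merged)
        history.append((d, merged))
    return history

# Hypothetical points: two tight pairs and one distant case (index 4).
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
single = agglomerate(points, "single")
average = agglomerate(points, "average")
for value, merged in single:
    print(f"single  {value:6.3f}  {merged}")
for value, merged in average:
    print(f"average {value:6.3f}  {merged}")
```

In this toy example the distant case joins only at the final fusion, which is how an outlier typically shows up in a tree.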

Returning to Average Linkage, the number of clusters was chosen by testing the fusion values for a significant departure from their distribution, using the upper tail rule in Best Cut. Find the criterion values using Tree/Best Cut and compare them with the table at the top of page 87 in Cluster Analysis. Copy these values to your answer sheet. [Q5]

The 4-cluster solution can be readily interpreted, and the t-statistic for this solution is significant at p=0.05.  This broadly means that the sequence of fusion values conforms reasonably to a normal distribution until the point at which 4 clusters are reduced to 3 clusters, when there is a significant jump and a departure from normality.  What is this value?  You can find it in View/Tree Data, and it occurs when the Cat and Deer clusters are combined.  Write the fusion value in your answer sheet.  [Q6]
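The idea of looking for a jump in the fusion sequence can be sketched as follows. This is a simplified upper tail rule in the spirit of Mojena's stopping rule: each fusion value is expressed as a standardized deviate from the mean of all fusion values, and the tree is cut just before the first fusion whose deviate exceeds a threshold. The threshold, the function name and the fusion values are all hypothetical, and Best Cut may compute its criterion and significance test differently.

```python
from statistics import mean, stdev

def upper_tail(fusion_values, n_cases, threshold):
    """Return (number of clusters, deviate) for the first fusion whose
    standardized deviate exceeds `threshold`, or None if none does."""
    m = mean(fusion_values)
    s = stdev(fusion_values)
    for step, value in enumerate(fusion_values):
        deviate = (value - m) / s
        if deviate > threshold:
            # Before fusion number `step` (0-based) there are
            # n_cases - step clusters; cutting here keeps them.
            return n_cases - step, deviate
    return None

# Hypothetical fusion values for 8 cases (7 fusions), in order of fusion.
fusions = [0.2, 0.3, 0.35, 0.5, 0.6, 2.4, 3.0]
result = upper_tail(fusions, n_cases=8, threshold=1.0)
print(result)
```

Here the sixth fusion is the first to stand out from the rest of the sequence, so the sketch suggests keeping the three clusters that exist just before it.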

How can we interpret these four clusters?  Click Cluster/Profiles and the means of the five variables for each cluster will be displayed - you can step through each cluster by clicking on the graph of cluster means.  Now click Table to display the whole table, copy it to the clipboard, and paste it into your answer sheet.  [Q7]  For the benefit of our website visitors, that is what we have done here:

Now compare these results with Table 4.5 on page 87 of Cluster Analysis.
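A table of cluster means like the one shown by Cluster/Profiles is simply the mean of each variable taken over the members of each cluster. A minimal sketch, using hypothetical rows and cluster labels rather than the tutorial's results:

```python
from collections import defaultdict
from statistics import mean

def cluster_means(rows, labels):
    """Return {cluster label: [mean of each variable]}."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: [mean(column) for column in zip(*members)]
        for label, members in sorted(groups.items())
    }

# Hypothetical standardized scores on two variables for five cases.
rows = [
    [1.0, -1.0],
    [3.0, -3.0],
    [-2.0, 0.5],
    [-4.0, 1.5],
    [0.0, 0.0],
]
labels = [1, 1, 2, 2, 3]  # hypothetical cluster membership

profiles = cluster_means(rows, labels)
for label, means in profiles.items():
    print(label, means)
```

Reading across a row of such a table shows which variables are high or low for that cluster, which is the basis for interpreting it.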

We mentioned on the Clustan Wizard page that the tree has been optimally seriated - this means that the cases are arranged along the left side of the tree so as to make it easier to interpret. You can view the standard tree order by clicking Order/Tree, and the optimal tree order by clicking Order/Serialize. Compare the two arrangements of the tree with figure 4.14 on page 88 of Cluster Analysis. Which dendrogram do you prefer? Copy it to your answer sheet. [Q8]
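Reordering a tree never changes the clusters; it only flips the two branches at each fusion. One plausible way to define a "best" order, assumed here for illustration and not necessarily the criterion ClustanGraphics uses, is to choose the flips that minimize the total distance between neighbouring cases. The brute-force sketch below is only practical for very small trees; the tree, points and function names are hypothetical.

```python
from itertools import product
from math import dist

def orders(tree):
    """Yield every leaf order obtainable by flipping branches of `tree`.

    A tree is either a leaf index (int) or a (left, right) tuple.
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for a, b in product(orders(left), orders(right)):
        yield a + b
        yield b + a

def path_length(order, points):
    """Sum of distances between cases that are adjacent in `order`."""
    return sum(dist(points[i], points[j]) for i, j in zip(order, order[1:]))

def seriate(tree, points):
    """Return the leaf order with the shortest adjacent-case path."""
    return min(orders(tree), key=lambda order: path_length(order, points))

# Hypothetical one-dimensional cases and a tree joining (0, 1) and (2, 3).
points = [(1.0,), (0.0,), (6.0,), (5.0,)]
tree = ((0, 1), (2, 3))

best = seriate(tree, points)
print(best, path_length(best, points))
print([0, 1, 2, 3], path_length([0, 1, 2, 3], points))
```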

Now open the table of cluster members in Excel, or another spreadsheet. You should find it in the text file "Members Table File.txt". The value in each column is the number of the cluster to which the corresponding case belongs. Try to sort the 4-cluster column in descending order and copy the sorted table to your answer sheet. [Q9] How many cases are assigned to each of the 4 clusters? [Q10]
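The sorting and counting steps can also be sketched in Python. The layout of "Members Table File.txt" is not described here, so the sketch works on a small in-memory list of hypothetical (case, cluster) pairs instead of reading the file.

```python
from collections import Counter

# Hypothetical membership for a 4-cluster solution: (case name, cluster number).
members = [
    ("case A", 1),
    ("case B", 3),
    ("case C", 1),
    ("case D", 4),
    ("case E", 2),
    ("case F", 1),
    ("case G", 3),
]

# Sort on the cluster column in descending order, as in the spreadsheet.
sorted_members = sorted(members, key=lambda row: row[1], reverse=True)

# Count how many cases fall in each cluster.
sizes = Counter(cluster for _, cluster in members)

for name, cluster in sorted_members:
    print(cluster, name)
for cluster in sorted(sizes):
    print(f"cluster {cluster}: {sizes[cluster]} cases")
```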

Congratulations! You have completed your first ClustanGraphics tutorial and, in doing so, you have explored some of the special features that make our software very easy to use, visual and unique. If you have some time left, jot down a few ideas on what you think of ClustanGraphics and the exercise in your answer sheet. If you like, you can e-mail your thoughts to us here.

This is just the start of your adventure with ClustanGraphics.  Try out some of the other features of the program, or load the more comprehensive file Mammals.cls and explore scatter plots and cluster models.

Thank you very much for your time and interest.

Clustan - A Class Act © 1998 Clustan Ltd