With ClustanGraphics it is possible to cluster with complex data structures involving different types of variables. But for the purposes of this discussion, let's suppose that all the variables are measured on a continuous or semicontinuous scale. For example, in the Mammals Case Study, the 5 variables are the percentages of Water, Protein, Fat, Lactose and Trace Elements (Ash) in the milk of different mammals. Five lines of data are reproduced below:
Composition of mammals' milk (percentages)
For the full data set, please refer to the file MammalsLabelled.txt distributed with ClustanGraphics. Note that the variables' ranges differ quite markedly, from water with a range of 44 and fat with a range of 40, to ash with a range of less than 2. If we wish the variables to be treated equally, as is often the case, it is appropriate to standardize the values by dividing by each variable's range or standard deviation. Otherwise, clustering on the above table would be dominated by the diversity of water and fat, and ash would have a negligible influence on any measure of cluster variance (ash, or trace elements, is arguably the most important variable).
We note that the percentages for rabbit and seal do not sum to 100, perhaps due to measurement or transcription error. Furthermore, it is not clear whether the value of lactose for seal is zero or missing; for this analysis we shall take a value of zero, although ClustanGraphics also allows the value to be denoted as missing. As the analyst, there is not a lot we can do about these data queries: the data were published in 1956, so asking the authors for an explanation is not very practicable.
Subtracting each variable's mean and dividing by its standard deviation yields z-scores, which have a mean of zero and a standard deviation of 1 for each variable. Thus, each variable contributes equally to the variance in the analysis:
Standard scores (z-scores)
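As a rough illustration of the z-score transformation, the following Python sketch standardizes two columns measured on very different scales. The function name `zscores`, the use of the population standard deviation (`pstdev`), and the sample values are all assumptions made for this example; the values are made up and are not taken from MammalsLabelled.txt, and ClustanGraphics may use a different convention for the standard deviation.

```python
from statistics import mean, pstdev


def zscores(column):
    """Return (x - mean) / sd for every value of one variable."""
    m = mean(column)
    sd = pstdev(column)  # population SD: an assumption, not confirmed by the source
    if sd == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(x - m) / sd for x in column]


# Made-up values for illustration only, NOT the published milk data.
fat = [2.0, 4.0, 6.0, 8.0]   # wide range
ash = [0.2, 0.4, 0.6, 0.8]   # narrow range

z_fat = zscores(fat)
z_ash = zscores(ash)

# After standardization both variables have mean 0 and SD 1,
# so neither dominates a variance-based clustering criterion.
print(z_fat)
print(z_ash)
```

Although the raw ranges of the two hypothetical columns differ by a factor of ten, their z-scores coincide, which is the sense in which standardized variables contribute equally.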
Dividing by the range gives a different spread of values, as shown below:
Standardization to Unit Range
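Standardization to unit range can be sketched in a similar way. In the Python example below, the function name `range_standardize`, the optional `subtract_min` flag, and the sample values are assumptions made for illustration; the values are made up and are not taken from MammalsLabelled.txt. The text describes dividing by the range, so that is the default; subtracting the minimum first, which maps each variable onto the interval from 0 to 1, is a common variant and is not confirmed here as what ClustanGraphics does.

```python
def range_standardize(column, subtract_min=False):
    """Divide each value by the variable's range (max - min).

    With subtract_min=True the minimum is subtracted first, so the
    result runs from 0 to 1 (a common variant; an assumption here).
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        raise ValueError("a constant variable has zero range")
    offset = lo if subtract_min else 0.0
    return [(x - offset) / (hi - lo) for x in column]


# Made-up values for illustration only, NOT the published milk data.
fat = [2.0, 4.0, 6.0, 8.0]

scaled = range_standardize(fat)                   # range becomes 1
unit = range_standardize(fat, subtract_min=True)  # values run from 0 to 1

print(scaled)
print(unit)
```

In either form the transformed variable has a range of exactly 1, so cases holding the maximum or minimum of a variable are easy to identify in the transformed values.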
Here we can see that rabbit has the maximum value for both protein and ash, while seal has the maximum for fat and the minimum for lactose. (These transformations relate to the full data set of 25 mammals, not to the selection of five cases used here for illustration.)
Having transformed the data, we are now in a position to compute proximities and start clustering. Please refer to the ClustanGraphics Primer for definitions. Further details about computing proximities can be found here.
