Data Transformations 

This section discusses aspects of data transformation, which is often necessary prior to clustering.  The general objective is to ensure that each variable in the data collected is given appropriate weight in the analysis.

With ClustanGraphics it is possible to cluster with complex data structures involving different types of variables.  But for the purposes of this discussion, let's suppose that all the variables are measured on a continuous or semi-continuous scale.

For example, in the Mammals Case Study, the 5 variables are the percentages of Water, Protein, Fat, Lactose and Trace Elements (Ash) in the milk of different mammals.  Five lines of data are reproduced below:

Composition of mammals milk (percentages)

          Water   Protein     Fat   Lactose    Ash
Deer       65.9      10.4    19.7       2.6    1.4
Donkey     90.3       1.7     1.4       6.2    0.4
Rabbit     71.3      12.3    13.1       1.9    2.3
Seal       46.4       9.7    42.0         -   0.85
Zebra      86.2       3.0     4.8       5.3    0.7

For the full data set, please refer to the file MammalsLabelled.txt distributed with ClustanGraphics.

Note that the variables' ranges differ markedly: water has a range of 44 and fat a range of 40, whereas ash has a range of less than 2.  If we wish the variables to be treated equally, as is often the case, it is appropriate to standardize the values by dividing by each variable's range or standard deviation.  Otherwise, clustering on the above table would be dominated by the diversity of water and fat, and ash would have a negligible influence on any measure of cluster variance, even though ash (trace elements) is arguably the most important variable.
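To make the point about unequal ranges concrete, the following minimal sketch computes each variable's range for the five illustrative rows and shows how much each variable contributes to the squared Euclidean distance between deer and donkey.  The use of squared Euclidean distance on the raw percentages is an assumption made for illustration, and the names VARIABLES, MILK, column_ranges and squared_difference_shares are hypothetical; they are not part of ClustanGraphics.

```python
# Five illustrative rows from the table above; None marks seal's "-" entry.
VARIABLES = ["Water", "Protein", "Fat", "Lactose", "Ash"]
MILK = {
    "Deer":   [65.9, 10.4, 19.7, 2.6, 1.4],
    "Donkey": [90.3, 1.7, 1.4, 6.2, 0.4],
    "Rabbit": [71.3, 12.3, 13.1, 1.9, 2.3],
    "Seal":   [46.4, 9.7, 42.0, None, 0.85],
    "Zebra":  [86.2, 3.0, 4.8, 5.3, 0.7],
}


def column_ranges(rows):
    """Return max - min for each variable, skipping missing (None) values."""
    ranges = {}
    for index, name in enumerate(VARIABLES):
        present = [v[index] for v in rows.values() if v[index] is not None]
        ranges[name] = max(present) - min(present)
    return ranges


def squared_difference_shares(a, b):
    """Share of the squared Euclidean distance between two cases
    that is attributable to each variable (assumed proximity measure)."""
    squares = {
        name: (x - y) ** 2
        for name, x, y in zip(VARIABLES, a, b)
        if x is not None and y is not None
    }
    total = sum(squares.values())
    return {name: value / total for name, value in squares.items()}


ranges = column_ranges(MILK)
shares = squared_difference_shares(MILK["Deer"], MILK["Donkey"])

for name in VARIABLES:
    print(f"{name:8s} range = {ranges[name]:5.2f}   "
          f"share of deer-donkey distance = {shares[name]:.3f}")
```

On these five rows, water and fat together account for over 90% of the raw deer-donkey distance, while ash accounts for well under 1%.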

We note that the percentages for rabbit and seal do not sum to 100 - perhaps due to measurement or transcription error.  Furthermore, it is not clear whether the value of lactose for seal is zero, or missing; for this analysis we shall take a value of zero, but with ClustanGraphics we can denote the value as missing.  As the analyst, there's not a lot we can do about these data queries - the data were published in 1956, so asking the authors for an explanation is not very practicable.
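A check of this kind is easy to automate.  The sketch below sums each row of the five illustrative cases and flags those that depart from 100.  It treats seal's missing lactose as zero, as the text does; the tolerance of 0.25 is an assumed allowance for rounding five values to one decimal place, and the names percentage_total, TOLERANCE and suspect are hypothetical.

```python
# Five illustrative rows; None marks seal's "-" entry for lactose.
MILK = {
    "Deer":   [65.9, 10.4, 19.7, 2.6, 1.4],
    "Donkey": [90.3, 1.7, 1.4, 6.2, 0.4],
    "Rabbit": [71.3, 12.3, 13.1, 1.9, 2.3],
    "Seal":   [46.4, 9.7, 42.0, None, 0.85],
    "Zebra":  [86.2, 3.0, 4.8, 5.3, 0.7],
}

# Assumed allowance: five values, each rounded to one decimal place,
# can drift from the true total by at most 5 * 0.05.
TOLERANCE = 0.25


def percentage_total(values, missing_as=0.0):
    """Sum one row of percentages, substituting missing_as for None."""
    return sum(missing_as if v is None else v for v in values)


totals = {animal: percentage_total(values) for animal, values in MILK.items()}
suspect = [animal for animal, total in totals.items()
           if abs(total - 100.0) > TOLERANCE]

for animal in suspect:
    print(f"{animal}: total = {totals[animal]:.2f}")
```

For the rows shown, this flags rabbit (total 100.9) and seal (total 98.95), in agreement with the observation above.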

Subtracting each variable's mean and then dividing by its standard deviation yields z-scores, which have a mean of zero and a standard deviation of 1 for each variable.  Thus, each variable contributes equally to the variance in the analysis:

Standard scores (z-scores)

           Water   Protein      Fat   Lactose      Ash
Deer      -0.955     1.147    0.893    -0.836    1.063
Donkey     0.946    -1.235   -0.847     1.129   -0.918
Rabbit    -0.535     1.667    0.265    -1.218    2.846
Seal      -2.475     0.955    3.013    -2.256   -0.026
Zebra      0.627    -0.879   -0.524     0.638   -0.323
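The z-score transformation can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not the ClustanGraphics implementation: it uses only the five rows shown, so its output will not match the table above, which is based on the full dataset; it takes seal's lactose as zero; and it uses the sample standard deviation (divisor n - 1), since the source does not say which divisor is used.  The function name z_scores is hypothetical.

```python
from statistics import mean, stdev

VARIABLES = ["Water", "Protein", "Fat", "Lactose", "Ash"]
# Five illustrative rows; seal's lactose is taken as zero, as in the text.
MILK = {
    "Deer":   [65.9, 10.4, 19.7, 2.6, 1.4],
    "Donkey": [90.3, 1.7, 1.4, 6.2, 0.4],
    "Rabbit": [71.3, 12.3, 13.1, 1.9, 2.3],
    "Seal":   [46.4, 9.7, 42.0, 0.0, 0.85],
    "Zebra":  [86.2, 3.0, 4.8, 5.3, 0.7],
}


def z_scores(rows):
    """Standardize each variable to mean 0 and standard deviation 1."""
    names = list(rows)
    columns = list(zip(*rows.values()))  # one tuple per variable
    standardized = []
    for column in columns:
        centre = mean(column)
        spread = stdev(column)  # assumption: sample SD with divisor n - 1
        standardized.append([(x - centre) / spread for x in column])
    # Transpose back so that each case has one row of z-scores.
    return {name: list(row) for name, row in zip(names, zip(*standardized))}


Z = z_scores(MILK)
for animal, row in Z.items():
    print(f"{animal:8s}" + "".join(f"{z:9.3f}" for z in row))
```

After the transformation every column has mean 0 and standard deviation 1, so no variable dominates simply because it was measured on a wider scale.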

Subtracting each variable's minimum and then dividing by its range rescales every variable to lie between 0 and 1, which gives a different spread of values, as shown below:

Standardization to Unit Range

          Water   Protein     Fat   Lactose     Ash
Deer      0.462     0.838   0.456     0.377   0.591
Donkey    0.998     0.094   0.010     0.899   0.136
Rabbit    0.580         1   0.295     0.275       1
Seal      0.033     0.778       1         0   0.341
Zebra     0.908     0.205   0.093     0.768   0.273

Here we can see that rabbit has the maximum value for both protein and ash, while seal has the maximum for fat and the minimum for lactose.  (These transformations relate to the full dataset of 25 mammals, and not to the selection of five cases used here for illustration.)
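Standardization to unit range can be sketched in the same way, as (value - minimum) / (maximum - minimum) for each variable.  The example below is an illustration only: it works on the five rows shown rather than the full dataset, so most of its values will differ from the table above, and it takes seal's lactose as zero.  The function name unit_range is hypothetical, and the sketch assumes that no variable is constant (a zero range would cause a division by zero).

```python
VARIABLES = ["Water", "Protein", "Fat", "Lactose", "Ash"]
# Five illustrative rows; seal's lactose is taken as zero, as in the text.
MILK = {
    "Deer":   [65.9, 10.4, 19.7, 2.6, 1.4],
    "Donkey": [90.3, 1.7, 1.4, 6.2, 0.4],
    "Rabbit": [71.3, 12.3, 13.1, 1.9, 2.3],
    "Seal":   [46.4, 9.7, 42.0, 0.0, 0.85],
    "Zebra":  [86.2, 3.0, 4.8, 5.3, 0.7],
}


def unit_range(rows):
    """Rescale each variable so its minimum maps to 0 and its maximum to 1."""
    names = list(rows)
    columns = list(zip(*rows.values()))  # one tuple per variable
    scaled = []
    for column in columns:
        low, high = min(column), max(column)
        # Assumes high > low, i.e. the variable is not constant.
        scaled.append([(x - low) / (high - low) for x in column])
    # Transpose back so that each case has one row of scaled values.
    return {name: list(row) for name, row in zip(names, zip(*scaled))}


U = unit_range(MILK)
for animal, row in U.items():
    print(f"{animal:8s}" + "".join(f"{u:8.3f}" for u in row))
```

Even within this five-case subset, rabbit scores 1 on protein and ash, and seal scores 1 on fat and 0 on lactose, which matches the pattern noted above.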

Having transformed the data, we are now in a position to compute proximities and start clustering.  Please refer to the ClustanGraphics Primer for definitions.  Further details about computing proximities can be found here.