Bootstrap Validation 

Many of our users ask: what is the best number of clusters for my data, and is the classification significant?  Our new procedure Tree/Validate attempts to answer this by comparing the tree obtained for the given data with a series of trees generated at random from the same data or proximity matrix.

It is often argued that cluster analysis always finds clusters even where there is no underlying structure in the data, and this is true.  For example, the sample datasets that we distribute include several files of randomly produced data, and ClustanGraphics will produce a tree from any of these files even though the data display no structure at all.

This is not unreasonable.  For example, if the data are drawn from a single multivariate normal distribution with no covariance, cluster analysis will partition it into a tessellation of spherical clusters that resembles a honeycomb.  This does not necessarily mean that such a partition is invalid or of no practical use.  For example, a graph of people's height against weight can reveal typical features corresponding to conditions of anorexia (models and ballerinas), obesity (fast-food diners) and normality.  Such cluster segments, though drawn from a single continuous pattern, contain meaningful types and are routinely used to assess whether people are overweight for their height.

Our approach to validating a classification is, therefore, to expect to find structure in data and to search for partitions that manifest the greatest departure from randomness.  In statistical terms, we test the null hypothesis that the structure displayed by a partition of a given tree is random, and seek to reject the hypothesis. 

To do this for a proximity matrix, we destroy any structure it contains by randomising the proximities.  Alternatively, we can randomise the input data, compute a proximity matrix from the randomised data, and hence obtain a tree.  The latter approach is recommended, especially for large datasets, because randomising a proximity matrix becomes expensive as the number of cases grows.
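As a concrete illustration of randomising the input data, here is a minimal sketch in Python (assuming numpy; the function name is ours, not part of ClustanGraphics) that permutes each variable independently, preserving its distribution of values while destroying any relationship between variables:

```python
import numpy as np

def randomise_data(X, rng):
    """Permute each variable (column) of the data matrix independently.
    Each variable keeps its original set of values, but any correlation
    between variables, and hence any cluster structure, is destroyed."""
    Xr = np.array(X, dtype=float)            # work on a copy of the data
    for j in range(Xr.shape[1]):
        Xr[:, j] = rng.permutation(Xr[:, j])
    return Xr

# usage: Xr = randomise_data(X, np.random.default_rng(0))
```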

Having obtained a randomised variant of the data or proximities, we repeat the process for a series of random trials.  Each trial randomises the given data afresh and generates a different tree, and the series of trials yields a mean tree and a confidence interval.  We then compare the given tree with the randomised trees, looking for a significant departure from random.  This is how we test the null hypothesis that the given data are random, and hence contain no structure.
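A hedged sketch of such a series of trials, in Python with numpy and scipy (Ward's method stands in for minimising the increase in the Euclidean Sum of Squares, and the names and defaults are illustrative rather than taken from ClustanGraphics):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def random_trial_fusions(X, n_trials=120, seed=0):
    """Cluster many randomised copies of X and collect the fusion value at
    each merge, giving a mean 'random' tree and a confidence interval."""
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(n_trials):
        Xr = np.array(X, dtype=float)
        for j in range(Xr.shape[1]):
            Xr[:, j] = rng.permutation(Xr[:, j])   # destroy any structure
        Z = linkage(Xr, method='ward')
        trials.append(Z[:, 2])                     # fusion values, first to last merge
    trials = np.asarray(trials)                    # shape (n_trials, n_cases - 1)
    return trials.mean(axis=0), trials.std(axis=0)
```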

For the Mammals sample data, a graph is obtained of the following form.

Sensitivity Graph and Tree for Mammals Milk

The solid blue line at the bottom of the graph shows the fusion values for the actual data as presented.  The yellow band shows the range of fusion values obtained from 120 trials of randomising the data; within this confidence interval, the central line represents the mean of the fusion values at each number of clusters, taken over the random trials.  The width of the yellow band is one standard deviation about the mean.

The red zones show where the fusion values for the given data depart significantly from random.  In the case of the Mammals data, the most significant departure from random occurs at 4 clusters.  This is because the fusion value at 4 clusters is 0.75, compared with a mean value of 2.31 for the randomised data; the difference of 1.56 is the largest across all partitions.  We also note that 16 of the 24 partitions are significant.
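Continuing the sketch above, the deviation shown in red can be read off by comparing the observed fusion values with the random mean.  The following hedged fragment expresses the departure in units of the random standard deviation, as a stand-in for the t-statistic described under the parameters below, and reports the level with the largest absolute deviation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def departure_from_random(X, random_mean, random_std):
    """Compare the given tree with the mean of the random trials.
    A fusion value well below the random mean indicates cluster structure."""
    observed = linkage(np.asarray(X, dtype=float), method='ward')[:, 2]
    deviation = random_mean - observed               # positive => tighter than random
    n_clusters = np.arange(len(observed), 0, -1)     # clusters remaining after each merge
    t_like = np.divide(deviation, random_std,
                       out=np.zeros_like(deviation), where=random_std > 0)
    best = n_clusters[np.argmax(np.abs(deviation))]  # largest absolute deviation
    return n_clusters, deviation, t_like, best

# usage, with random_mean, random_std = random_trial_fusions(X):
#   levels, dev, t_like, best = departure_from_random(X, random_mean, random_std)
```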

By comparison, we have clustered a file which contains 25 cases of data drawn at random from a uniform distribution in 4 dimensions.  The result is as follows.

Sensitivity Graph and Tree for random data

Here we can see that there are no significant partitions of the data, and the corresponding tree displays a characteristic shape for random data, having no large increases in fusion values.

It is evident that the tree for the randomised data exhibits a more uniform series of fusion values than the tree for the given data.  It is also interesting to note that if the increase in the Euclidean Sum of Squares is minimised at each fusion, both trees start at the same fusion values (at or near zero) and broadly end at the same cumulative sum.  This is because the Euclidean Sum of Squares is a constant for one cluster, even if the data or proximities are randomised.
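The invariance can be stated precisely.  In our notation (not taken from the ClustanGraphics documentation), the Euclidean Sum of Squares for the single cluster containing all n cases is

    E_1 = \sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - \bar{x}_j)^2 = \frac{1}{n} \sum_{i<k} d_{ik}^2

where d_{ik} is the Euclidean distance between cases i and k.  The first form depends only on the values present within each variable, so permuting each column of the data leaves it unchanged; the second form depends only on the sum of the squared proximities, so permuting the proximity matrix leaves it unchanged too.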

We can explain this intuitively.  To take a simple example, if the data comprise just two distinct points, each repeated n/2 times, then there should be a clear structure at the two-cluster stage.  In fact, for this simple example, the error is zero for 2 clusters regardless of n, the total number of points.  This would not be true of the randomised data, so the difference between the fusion values for the given data and the randomised data will be greatest at the two-cluster level.
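This claim is easy to check numerically.  The following hedged sketch (Python with numpy and scipy, Ward's method again as a stand-in, and an arbitrary choice of the two points) shows a zero fusion value at 2 clusters for the given data but not for a column-randomised copy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
n = 50
# two distinct points, each repeated n/2 times
X = np.repeat([[0.0, 0.0, 0.0, 0.0],
               [5.0, 1.0, 4.0, 2.0]], n // 2, axis=0)

given = linkage(X, method='ward')[:, 2]
print(given[-2])     # fusion value forming 2 clusters: 0.0, whatever n is

# column-wise randomisation mixes the coordinates of the two points,
# so the 2-cluster fusion value is no longer zero
Xr = X.copy()
for j in range(Xr.shape[1]):
    Xr[:, j] = rng.permutation(Xr[:, j])
print(linkage(Xr, method='ward')[:, 2][-2])
```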

It should be noted that the procedure does not work for a data matrix of a single variable, as randomising it merely rearranges the points.  However, where there are two or more variables, any correlation between the variables, and any structural relationships between the cases, will be destroyed.  If there is structure in the original data, it would normally be explainable by a reduced number of dimensions, such as can be obtained using multi-dimensional scaling on the proximity matrix.  Such reduced dimensionality will also be destroyed by the randomisation, so randomisation will usually increase the dimensionality.  However, the dimensionality will not increase beyond the number of variables, which is not the case if the proximity matrix is randomised.

Another way of presenting this concept is that each case in the randomised data matrix is valid in terms of the original survey or data source.  Its responses are all valid responses, and the distribution of the randomised responses on each variable is preserved.  If it is a survey, then each random case can be represented in a valid questionnaire that has been completed in a totally random manner.

It does not make sense to carry out random trials on the entire tree for a large or very large survey.  This would be very time-consuming and wasteful, because we are generally not at all interested in the early fusions.  A hierarchical classification is of most interest towards the end of the analysis, when the cases have been reorganised into a relatively small number of homogeneous clusters, and it is in this final part of the tree that we wish to know whether there is structure or randomness.

We have therefore devised a two-step validation procedure.  A hierarchical classification is first completed for the given data and a complete tree is obtained.  Suppose it represents a survey of 2500 cases.  At the early stages, identical and very similar cases are being clustered together; we can represent these small clusters by their cluster means, and thus reduce the complexity of what follows.  The tree is therefore truncated at a fairly large number of clusters, representing a tessellation of the given data.  For the example of a survey of 2500 cases we might choose 100 clusters, each comprising about 25 cases.  We extract the cluster means and then randomise these means variable by variable; this is repeated for a specified number of random trials.
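A hedged sketch of the two steps, in Python with numpy and scipy (fcluster cuts the tree into the tessellation, Ward's method again stands in for the Euclidean Sum of Squares criterion, and cluster weights are ignored for simplicity):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def tessellation_means(X, n_clusters=100):
    """Step 1: cluster the full data, cut the tree at a large number of
    clusters, and replace each small cluster by its mean."""
    X = np.asarray(X, dtype=float)
    Z = linkage(X, method='ward')
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    return np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])

def randomised_subtree(means, n_trials=120, seed=0):
    """Step 2: randomise the cluster means variable by variable and
    re-cluster, collecting the fusion values of each random sub-tree."""
    rng = np.random.default_rng(seed)
    trials = []
    for _ in range(n_trials):
        M = means.copy()
        for j in range(M.shape[1]):
            M[:, j] = rng.permutation(M[:, j])
        trials.append(linkage(M, method='ward')[:, 2])
    trials = np.asarray(trials)
    return trials.mean(axis=0), trials.std(axis=0)
```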

The result is a set of sub-trees from the point of the tessellation down to one cluster, and we collect the fusion values at each step for all the sub-trees.  The mean and standard deviation are obtained, and this range is presented visually and compared with the corresponding sub-tree for the given data.  If the tree for the given data is lower than the mean of the randomised data, this indicates a departure from randomness, represented by cluster structure in the given data.  The level at which this discrepancy is most significant indicates good structure in the given tree, and possibly the best number of clusters, though it may still be necessary to evaluate more than one cluster level, for example to reveal both a coarse (low) and a fine (higher) cluster model.

Parameters for Bootstrap Validation 

Tree/Validate provides an options page which displays the current options in effect.  All are set to sensible values by the procedure, but some can be varied by the user.  They are as follows (a configuration sketch follows the list):

Number of cases confirms the size of the given data set.  This cannot be varied.

Randomise tree at 100 cluster level is the level at which the tree for the given data is truncated to form the tessellation to be randomised.  Can be varied.

Number of random trials is the number of separate times the tessellated data is randomised to obtain a random sub-tree.  Can be varied.

Evaluate and display 10 final fusions is the length of the fusion sequence to be displayed on the final graph and in the output tables.  Can be varied.

Significance test is the t-statistic at which a departure from randomness is deemed to be significant.  Can be varied, and is also modified by a scroll button on the graph.

Confidence interval is a multiple of the standard deviation that is shaded yellow, to indicate the range of fusion values obtained in the random trials.  Can be varied.

Reset cluster weights causes the clusters to be given equal weights at the point where the tessellation is formed.  This corresponds to the cluster level at which the tree is first randomised, and is usually less than the number of cases for a large data set.

Sample with replacement allows the data to be drawn at random with replacement, so that some data values may appear more than once in the randomised data matrix.  If this is not checked, each data value in the given data matrix will appear only once in the randomised data matrix.  The "with replacement" option generates a wider confidence interval for the series of fusion values obtained from the random trials.
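Purely as an illustration of how these options fit together, here is a hedged sketch that groups them into a single structure.  The field names are ours; the values for the truncation level, number of trials, final fusions and band width echo the examples above, and the remaining defaults are placeholders rather than ClustanGraphics' actual settings:

```python
from dataclasses import dataclass

@dataclass
class ValidationOptions:
    n_cases: int                           # size of the given data set (cannot be varied)
    truncate_at: int = 100                 # cluster level of the tessellation
    n_trials: int = 120                    # number of random trials
    final_fusions: int = 10                # fusion levels displayed and tabulated
    t_threshold: float = 1.96              # placeholder significance threshold
    confidence_sd: float = 1.0             # width of the yellow band, in standard deviations
    reset_cluster_weights: bool = True     # equal weights at the tessellation (placeholder default)
    sample_with_replacement: bool = False  # resampling with replacement widens the band
```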

When the user exits Tree/Validate, the partition corresponding to the most significant departure from randomness is shaded as the best number of clusters.  This is currently the largest absolute deviation across all the fusion levels being evaluated.  We have not used the t-statistic for this test because trials have shown that a large value for the t-statistic can be obtained at too early a stage in the fusion sequence, mainly because the standard deviation for the randomised trials is zero or very small.

Tree/Validate is currently only available with Cluster/Proximities.  This is because it needs to generate a proximity matrix during random trials, which is not available in other clustering procedures.
