Missing Values 

Hierarchical cluster analysis, k-means and outlier analysis are now possible with missing values.  Both your data matrix and your proximity matrix can be incomplete.

There are two ways to specify missing values.  You can enter any non-numeric character or string wherever you have a missing value or proximity, and ClustanGraphics will automatically recognise it as a missing entry.

Alternatively, you can code all your missing values with a special missing value code, and specify that it represents a missing value.  Just click "Missing" when reading data or proximities and enter the missing value code.
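By analogy, the same two conventions can be illustrated in Python with pandas (pandas is not part of ClustanGraphics; the column names and the code -999 below are made up for illustration): a non-numeric string such as "NA" is recognised automatically, while a numeric missing value code must be declared.

```python
import io
import pandas as pd

# A tiny data matrix: "NA" is a non-numeric entry, recognised automatically;
# -999 is a numeric missing value code that must be declared explicitly.
raw = "height,weight\n1.72,NA\n-999,80\n1.65,55\n"

# Declaring the code, analogous to clicking "Missing" and entering it.
df = pd.read_csv(io.StringIO(raw), na_values=[-999])

print(df.isna().sum())  # one missing entry in each column
```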

When you compute proximities from a data matrix which includes missing values, ClustanGraphics estimates the proximity between any two cases from only those variables which have a valid entry for both cases.  This is sometimes referred to as the pairwise deletion treatment for missing values.  If no valid comparison can be made for any variable, then the proximity for the two cases will be missing.
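Pairwise deletion can be sketched in a few lines of NumPy. This is a generic illustration, not Clustan's exact formula: it uses squared Euclidean distance, and the rescaling by the number of variables (to keep distances comparable when different pairs share different numbers of valid variables) is a common convention assumed here.

```python
import numpy as np

def pairwise_distance(x, y):
    """Squared Euclidean distance between two cases using only the
    variables with a valid (non-NaN) entry in both -- pairwise deletion.
    Returns NaN when no variable is valid in both cases."""
    valid = ~np.isnan(x) & ~np.isnan(y)
    if not valid.any():
        return np.nan          # no valid comparison: proximity is missing
    d = x[valid] - y[valid]
    # Rescale to the full number of variables (an assumed convention).
    return float(d @ d) * len(x) / valid.sum()
```

For example, two cases that share a valid entry on only one of three variables are compared on that variable alone, and the result is scaled up by 3.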

A hierarchical cluster analysis obtained by clustering proximities allows for any missing proximities.  They are ignored in the search for the best cluster union at each fusion step.  When a cluster is formed, the proximities between it and the other remaining clusters are estimated from the observed proximities, again ignoring any that are missing.
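The search for the best cluster union while ignoring missing proximities can be sketched as follows. This is a minimal illustration of the fusion-step search only, assuming NumPy and a square proximity matrix with NaN for missing entries; it is not Clustan's implementation.

```python
import numpy as np

def best_fusion(D):
    """Return the pair (i, j) with the smallest observed proximity in the
    square proximity matrix D, ignoring missing (NaN) entries.
    Returns None if every off-diagonal proximity is missing."""
    M = D.astype(float).copy()
    np.fill_diagonal(M, np.nan)      # never fuse a cluster with itself
    if np.all(np.isnan(M)):
        return None
    i, j = np.unravel_index(np.nanargmin(M), M.shape)
    return (min(i, j), max(i, j))
```

When a cluster is formed, the proximities from the new cluster to the remaining clusters would then be estimated from the observed proximities only, again skipping NaN entries.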

You should be aware that ignoring missing proximities, and computing new proximities from an incomplete proximity matrix, might distort your resulting tree.  The more values that are missing, the greater the potential distortion.  On the other hand, there is a great degree of redundancy in a proximity matrix.  For example, a proximity matrix for 1000 cases contains 499,500 values, which are used to estimate the 999 fusion steps in the tree.  This represents an in-built redundancy of about 99.8%.  So the random deletion of a proportion of the proximities should not greatly distort your resulting tree.
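The redundancy figure follows directly from the size of the proximity matrix, as this short calculation shows:

```python
n = 1000
pairs = n * (n - 1) // 2         # 499,500 proximities in the matrix
steps = n - 1                    # 999 fusion steps in the tree
redundancy = 1 - steps / pairs   # about 0.998, i.e. roughly 99.8%
print(pairs, steps, round(redundancy, 3))
```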

At the highest classification levels the distortion should be least.  This arises simply from the fact that a cluster model is obtained by aggregating similar cases and estimating cluster means from the observed values.  If the missing values are randomly distributed, the effect on the resulting cluster centres should be minimal.

In k-means and outlier analysis, cluster means are estimated from the complete observations on each variable within each cluster, i.e. ignoring any missing values.  Cluster means can contain missing values, e.g. where there are no valid observations on a variable within a cluster.  This can easily occur where there are singleton clusters containing cases with incomplete data. 
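Estimating cluster means from only the valid observations on each variable can be sketched like this (a generic NumPy illustration of the treatment described above, not Clustan's code):

```python
import numpy as np

def cluster_means(X, labels, k):
    """Mean of each variable within each cluster, computed over the valid
    (non-NaN) observations only.  A mean stays NaN when a cluster has no
    valid observation on that variable."""
    centres = np.full((k, X.shape[1]), np.nan)
    for c in range(k):
        rows = X[labels == c]
        valid = ~np.isnan(rows)
        counts = valid.sum(axis=0)
        sums = np.where(np.isnan(rows), 0.0, rows).sum(axis=0)
        ok = counts > 0
        centres[c, ok] = sums[ok] / counts[ok]
    return centres
```

A singleton cluster whose one case has a missing entry on some variable yields a NaN mean on that variable, exactly the situation described above.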

The distance between a case and a cluster is estimated from all pairs of valid entries for the case and the cluster mean; distances are designated as missing where there are no complete pairs of valid entries on any variable.  The Euclidean Sum of Squares is estimated from the valid distances between the cases and the clusters to which they are assigned.  For further details, please refer to the ClustanGraphics Primer.
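The case-to-cluster distance and the Euclidean Sum of Squares can be sketched along the same lines. This is again an assumed squared-Euclidean formulation in NumPy, not the formula in the ClustanGraphics Primer:

```python
import numpy as np

def case_to_mean_distance(x, mean):
    """Squared Euclidean distance between a case and a cluster mean over
    the variables valid in both; NaN when no variable is valid in both."""
    valid = ~np.isnan(x) & ~np.isnan(mean)
    if not valid.any():
        return np.nan
    d = x[valid] - mean[valid]
    return float(d @ d)

def euclidean_sum_of_squares(X, labels, centres):
    """Sum the valid distances between each case and the mean of the
    cluster it is assigned to, skipping missing distances."""
    dists = [case_to_mean_distance(x, centres[c]) for x, c in zip(X, labels)]
    return float(np.nansum(dists))
```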

It can be shown that the missing value treatment used in ClustanGraphics generally introduces less bias than mean substitution, the main alternative missing value treatment offered by our competitors.  It's yet another example of a unique feature which we are able to offer by specialising in quality software design for cluster analysis applications.

Clustan - A Class Act © 1998 Clustan Ltd