General Distance Coefficients 


The General Distance Coefficients provided in ClustanGraphics are the complement of Gower's General Similarity Coefficient but with extensions that allow for distances between clusters, as well as distances between cases, to be computed for mixed data types.  For details of how to specify mixed data types click here.

Squared Euclidean Distance
The General Squared Euclidean Distance Coefficient compares two cases i and j, and is defined as follows:

            d_ij² = Σ_k w_ijk (x_ik - x_jk)² / Σ_k w_ijk
where:  x_ik   is the value of variable k in case i, and
        w_ijk  is a weight of 1 or 0 depending upon whether or not the comparison
               is valid for the kth variable; if differential variable weights are specified,
               it is the weight of the kth variable, or 0 if the comparison is not valid.

It should be noted that the effect of the denominator Σ_k w_ijk is to divide the sum of the distance scores by the number of valid comparisons; or, if variable weights have been specified, by the sum of the weights.  Unlike Gower's General Similarity Coefficient, the distance scores are squared - this is so that you can cluster using Increase in Sum of Squares, which minimizes the Euclidean Sum of Squares.
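As an illustration, the formula above can be sketched in Python (a hypothetical `squared_euclidean` helper, not ClustanGraphics code), with `None` standing in for a missing value so that the weight is 0 for invalid comparisons:

```python
def squared_euclidean(x, y, weights=None):
    """General Squared Euclidean Distance with validity weights.

    x, y: sequences of values; None marks a missing value.
    weights: optional per-variable weights (defaults to 1 per variable).
    A comparison on variable k is valid only when both values are present;
    invalid comparisons contribute a weight of 0.
    """
    if weights is None:
        weights = [1.0] * len(x)
    num = 0.0   # sum of weighted squared differences
    den = 0.0   # sum of weights over valid comparisons
    for xk, yk, wk in zip(x, y, weights):
        if xk is None or yk is None:
            continue  # w_ijk = 0: comparison not valid
        num += wk * (xk - yk) ** 2
        den += wk
    return num / den if den > 0 else 0.0

# The third variable is dropped because of the missing value,
# leaving (2 - 4)^2 averaged over the 2 valid comparisons.
print(squared_euclidean([1.0, 2.0, None], [1.0, 4.0, 5.0]))  # 4 / 2 = 2.0
```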

Ordinal and Continuous Variables
When a variable k has been standardised by dividing by its range r_k, its contribution to d_ij² is defined for ordinal and continuous variables as follows:

            d_ijk² = (x_ik/r_k - x_jk/r_k)²

                   = (x_ik - x_jk)²/r_k²

The components of distance d_ijk² for ordinal and continuous variables therefore range between 0, for identical values x_ik = x_jk, and 1, when the two values are the extremes x_max and x_min.

When an ordinal or continuous variable is standardised to z-scores, the transformed values x_ik* have a mean of zero and a standard deviation of 1.  Thus

            x_ik* = (x_ik - m_k)/s_k

where m_k is the mean and s_k is the standard deviation of variable k.  The distance component is therefore equivalent to:

            d_ijk² = (x_ik - x_jk)²/s_k²

Standardising to z-scores is generally preferable to standardising by range because the resulting values are not determined by the two extreme values x_max and x_min but by the dispersion of values on variable k, as measured by its variance s_k² or standard deviation s_k.  For this reason, standardising to z-scores is recommended.

If you choose not to transform an ordinal or continuous variable, the component of distance for that variable is left unscaled, i.e.

            d_ijk² = (x_ik - x_jk)²

This may be appropriate if all your variables are ordinal variables on the same scale, or if you have standardized your variables externally by some other transformation.  There is a further discussion of data transformations here.
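The two standardisations can be illustrated with a small sketch (hypothetical `range_scale` and `z_scale` helpers; the z-score version uses the population standard deviation, an assumption):

```python
def range_scale(values):
    """Divide each value by the range; distance components then lie in [0, 1]."""
    r = max(values) - min(values)
    return [v / r for v in values]

def z_scale(values):
    """Standardise to z-scores: subtract the mean, divide by the standard
    deviation (population form, dividing by n - an assumption here)."""
    n = len(values)
    m = sum(values) / n
    s = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return [(v - m) / s for v in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(range_scale(data))  # each value divided by the range r = 7
print(z_scale(data))      # transformed values have mean 0, standard deviation 1
```

Note that the z-scores depend on every observation, not just the two extremes, which is why outliers distort them less than range scaling.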

Binary and Nominal Variables
The value of d_ijk² for binary and nominal variables is 0 if x_ik = x_jk, or 1 if x_ik ≠ x_jk.

For a binary variable, d_ijk² = 0 if cases i and j both have attribute k "present" or both "absent", or 1 if attribute k is "present" in one case and "absent" in the other case.  This definition differs from Gower's General Similarity Coefficient in respect of negative matches, which are not ignored - this is an important distinction when it comes to estimating the distance between clusters of cases, where it is not possible to ignore negative matches.

For a nominal variable, d_ijk² = 0 if cases i and j have the same "state" for variable k, or 1 if they have different "states".
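These matching rules can be sketched for a single variable (a hypothetical `distance_component` helper; the continuous case assumes the values have already been standardised):

```python
def distance_component(xk, yk, kind):
    """Squared distance component d_ijk^2 for one variable of the given kind.

    kind: 'continuous' (values assumed already standardised),
          'binary', or 'nominal'.
    Binary and nominal variables score 0 on a match and 1 on a mismatch;
    negative (absent/absent) matches are NOT ignored.
    """
    if kind == 'continuous':
        return (xk - yk) ** 2
    # binary or nominal: simple matching, including negative matches
    return 0.0 if xk == yk else 1.0

print(distance_component(0, 0, 'binary'))            # negative match counts: 0.0
print(distance_component('red', 'blue', 'nominal'))  # different states: 1.0
```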

Differential Variable Weights
It was noted above that the weight w_ijk for the comparison on the kth variable is usually 1 or 0.  However, if you assign differential weights to your variables in ClustanGraphics, then w_ijk is either the weight of the kth variable or 0, depending upon whether the comparison is valid or not.  This allows larger weights to be given to important variables, or another type of external scaling of the variables to be specified.

If the weight of any variable is zero, then the variable is effectively ignored for the calculation of proximities.  Such variables are "masked" for clustering, but available for cluster profiling, to assist in the interpretation of a resulting cluster analysis.

Euclidean Distance
Euclidean Distance d_ij is obtained by taking the square root of Squared Euclidean Distance d_ij² as computed above.

Increase in Sum of Squares
Increase in Euclidean Sum of Squares is obtained from:

            I_ij = (w_i w_j / (w_i + w_j)) d_ij²

where w_i and w_j are the weights of cases i and j, respectively.  This proximity measure is important when the cases have differential weights, or are themselves clusters - for example, following truncation of a large tree or a k-means partition.  Where w_i = w_j = 1 the increase is equal to half the Squared Euclidean Distance d_ij².
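This weighted increase can be sketched as follows (a hypothetical `ess_increase` helper, assuming the usual w_i·w_j/(w_i + w_j) weighting for merging two clusters represented by their means):

```python
def ess_increase(ci, cj):
    """Increase in Euclidean Sum of Squares on merging two weighted clusters.

    ci, cj: (weight, mean_vector) pairs.  For a single unweighted case the
    weight is 1 and the 'mean' is just the case itself.
    """
    wi, mi = ci
    wj, mj = cj
    d2 = sum((a - b) ** 2 for a, b in zip(mi, mj))  # squared Euclidean distance
    return wi * wj / (wi + wj) * d2

# Two unweighted cases: the increase is half their squared Euclidean distance.
print(ess_increase((1, [0.0, 0.0]), (1, [3.0, 4.0])))  # 25 / 2 = 12.5
```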

City Block Distance
City Block Distance, or the "Manhattan" metric, is the sum of the distances on each variable, defined as follows:

            d_ij = Σ_k w_ijk |x_ik - x_jk| / Σ_k w_ijk

City Block Distance is akin to the walking distance between two points in a city like New York's Manhattan district, where each component is the number of blocks in the directions North-South and East-West.

Maximum Distance
Maximum Distance defines the distance between any two cases as the maximum distance score for any of the active variables:

            d_ij = max_k |x_ik - x_jk|

This measure is appropriate if you wish to locate all the cases in a cluster within a hypercube of side 2d, where d could be the outlier deletion threshold in k-means analysis.  Since every case is within a distance d of the mean, it must lie inside the cell of size 2d.  This criterion was added to ClustanGraphics so that it is comparable to CHAID analysis, which creates segments of hypercube shapes.
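Both metrics can be sketched together (hypothetical helpers; the City Block version averages over the variables, matching the weighted formula above with unit weights and no missing values):

```python
def city_block(x, y):
    """City Block (Manhattan) distance: sum of absolute differences,
    divided by the number of comparisons as in the weighted formula."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    return sum(diffs) / len(diffs)

def maximum_distance(x, y):
    """Maximum (Chebyshev) distance: the largest single-variable difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [1.0, 5.0, 2.0], [4.0, 1.0, 2.0]
print(city_block(x, y))        # (3 + 4 + 0) / 3
print(maximum_distance(x, y))  # 4.0
```

A case passes the Maximum Distance test for a cluster exactly when it lies inside the hypercube of side 2d around the cluster mean.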

Distances Between Clusters
In hierarchical cluster analysis and k-means analysis, it is necessary to compute the distance between a case and a cluster, or between two clusters.  We do this by forming the mean m_pk of variable k for cluster p and computing a distance component for each variable k as follows:

            d_ip² = Σ_k w_ipk (x_ik - m_pk)² / Σ_k w_ipk

This is straightforward for ordinal and continuous variables.  But with binary and nominal variables, m_pk is a vector of probabilities of occurrence for each state s of variable k for cluster p.  The distance component for variable k between two clusters (in hierarchical cluster analysis) or a case and a cluster (in k-means analysis) is computed between these vectors, and standardized.  The chosen criterion function, including Euclidean Sum of Squares, can then be optimized in terms of the sum of the distance components across all variables k, thereby allowing for mixed data types most generally in cluster analysis.

Of course, these definitions also allow for missing values to be present, and for differential variable and case weighting.  Where the data may be incomplete, the weights w_ijk are zero for any missing values, and the means m_pk are estimated from complete observations only, weighted by case weights where these may differ.
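One way the state-probability comparison might work can be sketched as follows - the halving used here to keep components in [0, 1] is an illustrative standardisation, not necessarily the exact formula used by ClustanGraphics:

```python
from collections import Counter

def cluster_state_probs(members):
    """Probability of each state of a nominal variable within a cluster."""
    counts = Counter(members)
    n = len(members)
    return {state: c / n for state, c in counts.items()}

def nominal_component(state, probs):
    """Squared component between a case's state (as an indicator vector) and a
    cluster's state-probability vector, halved so the result lies in [0, 1].
    This standardisation is an assumption for illustration only."""
    total = 0.0
    for s in set(probs) | {state}:
        indicator = 1.0 if s == state else 0.0
        total += (indicator - probs.get(s, 0.0)) ** 2
    return total / 2.0

probs = cluster_state_probs(['red', 'red', 'blue', 'red'])  # {'red': 0.75, 'blue': 0.25}
print(nominal_component('red', probs))            # close to the majority state: 0.0625
print(nominal_component('green', {'red': 1.0}))   # unseen state vs pure cluster: 1.0
```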

This uniquely flexible clustering capability sets Clustan software head and shoulders above the competition.  With ClustanGraphics, you don't have to force categorical data to behave like continuous variables, or categorize continuous variables to fit a χ² (chi-squared) analytical framework, as in decision trees.  With ClustanGraphics, we represent your data exactly as you present it to the program, and compile cluster statistics without warping your data to fit the method.

Our General Distance Coefficients have been available in Clustan since 1984, and in ClustanGraphics since release 5 in 2001.  It's one of our best-kept secrets! 

To order ClustanGraphics on-line click ORDER now.