The General Distance Coefficients provided in ClustanGraphics are the complement of Gower's General Similarity Coefficient but with extensions that allow for distances between clusters, as well as distances between cases, to be computed for mixed data types. For details of how to specify mixed data types click here.
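Squared Euclidean Distance between cases i and j is computed as:

$$d_{ij}^{2} = \frac{\sum_{k} w_{ijk}\,(x_{ik} - x_{jk})^{2}}{\sum_{k} w_{ijk}}$$

where: x_{ik} is the value of variable k in case i, and w_{ijk} is a weight of 1 or 0 depending upon whether or not the comparison is valid for the kth variable; if differential variable weights are specified it is the weight of the kth variable, or 0 if the comparison is not valid. It should be noted that the effect of the denominator Σ_{k} w_{ijk} is to divide the sum of the components by the number of valid comparisons or, where differential variable weights are specified, by the sum of the weights of the validly compared variables.

For each variable k, its contribution to d_{ij}^{2} is the component d_{ijk}^{2}. It is defined for ordinal and continuous variables as the squared difference between the two cases on that variable, scaled according to the transformation chosen. When a variable is standardised by range, the squared difference is divided by the squared range r_{k} of variable k:

$$d_{ijk}^{2} = \frac{(x_{ik} - x_{jk})^{2}}{r_{k}^{2}}$$

When an ordinal or continuous variable is standardised to z-scores, the transformed values x'_{ik} and the resulting component are:

$$x'_{ik} = \frac{x_{ik} - \bar{x}_{k}}{s_{k}}, \qquad d_{ijk}^{2} = (x'_{ik} - x'_{jk})^{2} = \frac{(x_{ik} - x_{jk})^{2}}{s_{k}^{2}}$$

where \bar{x}_{k} and s_{k} are the mean and standard deviation of variable k.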
Standardising to z-scores is generally preferable to standardising by range because the resulting values are not determined by the two extreme values of the variable (its minimum and maximum).
If you choose not to transform an ordinal or continuous variable, the component of distance for that variable will not be divided by anything, i.e.

$$d_{ijk}^{2} = (x_{ik} - x_{jk})^{2}$$

This may be appropriate if all your variables are ordinal variables on the same scale, or if you have standardised your variables externally by some other transformation. There is a further discussion of data transformations here.
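As a rough illustration of the transformation options discussed above, the following Python sketch computes the component d_{ijk}^{2} for a single continuous variable with no transformation, with z-score standardisation, and with standardisation by range. The helper name `component_sq`, the sample data, and the use of the population variance are assumptions made for this example; they are not taken from ClustanGraphics.

```python
import statistics

def component_sq(values, i, j, transform="none"):
    """Squared distance component between cases i and j on one variable.

    transform is "none", "zscore" (divide by the variance) or "range"
    (divide by the squared range). Hypothetical helper for illustration.
    """
    diff_sq = (values[i] - values[j]) ** 2
    if transform == "none":
        return diff_sq
    if transform == "zscore":
        # Population variance is assumed; a sample variance is equally plausible.
        return diff_sq / statistics.pvariance(values)
    if transform == "range":
        return diff_sq / (max(values) - min(values)) ** 2
    raise ValueError(f"unknown transform: {transform}")

heights = [150.0, 160.0, 170.0, 180.0, 250.0]  # the last value is an outlier
for t in ("none", "zscore", "range"):
    print(t, round(component_sq(heights, 0, 1, t), 4))
```

In this made-up data set the range is set entirely by the outlier at 250, whereas the variance draws on all five values, which is the point made above about z-scores.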
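The component d_{ijk}^{2} for binary and nominal variables is 0 if x_{ik} = x_{jk}, or 1 if x_{ik} ≠ x_{jk}. For a binary variable, d_{ijk}^{2} is therefore 1 only when one case has the attribute and the other does not. For a nominal variable, d_{ijk}^{2} is 1 whenever the two cases fall in different categories.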
The weight w_{ijk} for the comparison on the kth variable is usually 1 or 0. However, if you assign differential weights to your variables in ClustanGraphics, then w_{ijk} is either the weight of the kth variable or 0, depending upon whether the comparison is valid or not. This allows larger weights to be given to important variables, or for another type of external scaling of the variables to be specified.

If the weight of any variable is zero, then the variable is effectively ignored for the calculation of proximities. Such variables are "masked" for clustering, but remain available for cluster profiling, to assist in the interpretation of a resulting cluster analysis.
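The pieces described so far (numeric components, 0/1 components for binary and nominal variables, validity weights and variable weights) can be combined in a short Python sketch. This is a minimal illustration under stated assumptions, not the ClustanGraphics implementation: the function name `general_sq_distance`, the use of `None` for a missing value, the `scales` divisor, and returning NaN when no comparison is valid are all choices made for the example.

```python
import math

def general_sq_distance(a, b, kinds, scales=None, var_weights=None):
    """Squared general distance between two cases with mixed variable types.

    a, b        : sequences of values; None marks a missing value (assumption)
    kinds       : "numeric" (ordinal/continuous) or "categorical" (binary/nominal)
    scales      : divisor per numeric variable, e.g. a standard deviation or
                  a range; 1.0 means the variable is not transformed
    var_weights : weight per variable; 0 masks the variable
    """
    n = len(kinds)
    scales = scales or [1.0] * n
    var_weights = var_weights or [1.0] * n
    num = den = 0.0
    for k in range(n):
        if a[k] is None or b[k] is None:
            continue  # comparison not valid, so w_ijk = 0
        if kinds[k] == "numeric":
            comp = ((a[k] - b[k]) / scales[k]) ** 2
        else:
            comp = 0.0 if a[k] == b[k] else 1.0
        num += var_weights[k] * comp
        den += var_weights[k]
    # NaN when nothing could be compared (a choice made for this sketch).
    return num / den if den > 0 else math.nan

case_a = [1.0, "red", None, 1]
case_b = [3.0, "blue", 5.0, 1]
kinds = ["numeric", "categorical", "numeric", "categorical"]
scales = [2.0, 1.0, 1.0, 1.0]

print(general_sq_distance(case_a, case_b, kinds, scales))
# Masking the colour variable with a zero weight removes it from both sums.
print(general_sq_distance(case_a, case_b, kinds, scales, [1.0, 0.0, 1.0, 1.0]))
```

The third variable is missing in `case_a`, so it contributes to neither the numerator nor the denominator; the distance is averaged over the remaining valid comparisons only.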
Euclidean Distance d_{ij} is obtained by taking the square root of Squared Euclidean Distance, √(d_{ij}^{2}), as computed above.
When cases carry weights, the proximity between cases i and j is computed as:

$$\frac{w_{i}\,w_{j}\,d_{ij}^{2}}{w_{i} + w_{j}}$$

where w_{i} and w_{j} are the weights of cases i and j, respectively. This proximity measure is important when the cases have differential weights, or are themselves clusters, for example following truncation of a large tree or a k-means partition. Where w_{i} = w_{j} = 1 the distance is equal to ½d_{ij}^{2}.
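A small Python sketch of the case-weighted measure may help to check the special case quoted above. The function name `weighted_proximity` and the example weights are hypothetical.

```python
def weighted_proximity(d_sq, w_i, w_j):
    """w_i * w_j * d_ij^2 / (w_i + w_j) for case (or cluster) weights w_i, w_j."""
    return w_i * w_j * d_sq / (w_i + w_j)

d_sq = 4.0
# Unit weights: the result is half the squared distance.
print(weighted_proximity(d_sq, 1, 1))
# A cluster of weight 10 compared with a single case of weight 1.
print(weighted_proximity(d_sq, 10, 1))
```

With unit weights the value is 2.0, i.e. ½d_{ij}^{2}, in agreement with the statement above.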
City Block Distance is computed as:

$$d_{ij} = \frac{\sum_{k} w_{ijk}\,|x_{ik} - x_{jk}|}{\sum_{k} w_{ijk}}$$

City Block Distance is akin to the walking distance between two points in a city like New York's Manhattan district, where each component is the number of blocks in the directions North-South and East-West.
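The City Block formula can be sketched in Python as follows. The function name `city_block`, the `None` convention for missing values, and the decision to raise an error when no comparison is valid are assumptions made for the example.

```python
def city_block(a, b, var_weights=None):
    """Weighted City Block distance, averaged over the valid comparisons."""
    var_weights = var_weights or [1.0] * len(a)
    num = den = 0.0
    for x, y, w in zip(a, b, var_weights):
        if x is None or y is None:
            continue  # comparison not valid: weight 0
        num += w * abs(x - y)
        den += w
    if den == 0:
        raise ValueError("no valid comparisons between the two cases")
    return num / den

# Three blocks East-West and four blocks North-South.
print(city_block([0.0, 0.0], [3.0, 4.0]))
```

Because of the denominator, this example yields 3.5 (the mean number of blocks per direction) rather than the total of 7 blocks walked; the two differ only by the constant factor Σ_{k} w_{ijk} when no values are missing.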
The maximum-difference distance is the largest absolute difference between the two cases on any single variable:

$$d_{ij} = \max_{k} |x_{ik} - x_{jk}|$$

This measure is appropriate if you wish to locate all the cases in a cluster within a hypercube of side 2d, where d could be the outlier deletion threshold in k-means analysis. Since every case is within a distance d of the mean, it must lie inside the cell of size 2d. This criterion was added to ClustanGraphics so that it is comparable to CHAID analysis, which creates segments of hypercube shapes.
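The hypercube argument can be checked with a few lines of Python. This is only a sketch: the function name `max_distance`, the cluster mean, the threshold and the cases are invented for the illustration.

```python
def max_distance(a, b):
    """Largest absolute difference between two cases on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))

mean = [0.0, 0.0, 0.0]
threshold = 1.5  # plays the role of d, e.g. an outlier deletion threshold
cases = [
    [1.0, -1.4, 0.2],
    [0.5, 0.5, 1.6],
    [-1.5, 1.5, 1.5],
]

# Cases within the threshold lie in the hypercube of side 2 * threshold
# centred on the mean; the second case exceeds it on the third variable.
inside = [c for c in cases if max_distance(c, mean) <= threshold]
print(inside)
```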
Distances between a case i and a cluster p are computed by taking the mean m_{pk} of variable k for cluster p and computing a distance component for each variable k as follows:

$$d_{ip}^{2} = \frac{\sum_{k} w_{ipk}\,(x_{ik} - m_{pk})^{2}}{\sum_{k} w_{ipk}}$$

where w_{ipk} is the weight for the comparison between case i and cluster p on the kth variable. This is straightforward for ordinal and continuous variables. But with binary and nominal variables, ...

Of course, these definitions also allow for missing values to be present, and for differential variable and case weighting. Where the data may be incomplete, the weights w are 0 for any comparison that is not valid, as described above.

This uniquely flexible clustering capability sets Clustan software head and shoulders above the competition.
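For ordinal and continuous variables, the case-to-cluster distance can be sketched in Python as below. The sketch covers numeric variables only, since it does not attempt to guess how binary and nominal variables are handled. The function names `cluster_means` and `case_to_cluster_sq`, the `None` convention for missing values, and the sample data are assumptions for the example.

```python
def cluster_means(cluster):
    """Mean of each variable over the cases in a cluster, ignoring None."""
    means = []
    for column in zip(*cluster):
        present = [v for v in column if v is not None]
        means.append(sum(present) / len(present) if present else None)
    return means

def case_to_cluster_sq(case, means, var_weights=None):
    """Squared distance from one case to a cluster described by its means."""
    var_weights = var_weights or [1.0] * len(case)
    num = den = 0.0
    for x, m, w in zip(case, means, var_weights):
        if x is None or m is None:
            continue  # comparison not valid: weight 0
        num += w * (x - m) ** 2
        den += w
    if den == 0:
        raise ValueError("no valid comparisons between case and cluster")
    return num / den

cluster = [[1.0, 2.0], [3.0, None], [5.0, 4.0]]
means = cluster_means(cluster)
print(means)
print(case_to_cluster_sq([4.0, 1.0], means))
print(case_to_cluster_sq([4.0, None], means))
```

The second case in the cluster has a missing value on the second variable, so that variable's mean is taken over the two cases where it is present.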
With ClustanGraphics, you don't have to force categorical data to behave like continuous variables, or categorize continuous variables to fit a c...

Our General Distance Coefficients have been available in Clustan since 1984, and in ClustanGraphics since release 5 in 2001. It's one of our best-kept secrets!