Gower's Similarity Coefficient 

Gower's General Similarity Coefficient is one of the most popular measures of proximity for mixed data types.  For details of mixed data types click here.

Gower's General Similarity Coefficient sij compares two cases i and j, and is defined as follows:
$$ s_{ij} = \frac{\sum_k w_{ijk} \, s_{ijk}}{\sum_k w_{ijk}} $$

where:  sijk  denotes the contribution provided by the kth variable, and
               wijk is usually 1 or 0 depending upon whether or not the comparison is
                     valid for the kth variable; if differential variable weights are specified
                     it is the weight of the kth variable or 0 if the comparison is not valid.

Note that the effect of the denominator, the sum of wijk over all variables k, is to divide the sum of the similarity scores by the number of variables for which the comparison is valid; or, if variable weights have been specified, by the sum of the weights of those variables.
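The definition above can be sketched in a few lines of Python. This is only an illustration of the formula, not Clustan's code; the function name gower_similarity and the choice to return None when no comparison is valid are assumptions.

```python
def gower_similarity(scores, weights):
    """Weighted mean of the per-variable scores s_ijk, using the weights w_ijk."""
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: with no valid comparison the similarity is left undefined.
        return None
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


# Three variables; the third comparison is invalid (weight 0), so it is skipped.
print(gower_similarity([1.0, 0.5, 0.0], [1, 1, 0]))  # 0.75
```

Only the two valid comparisons enter the result: (1.0 + 0.5) / 2 = 0.75.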

Ordinal and Continuous Variables
Gower defines the value of sijk for ordinal and continuous variables as follows:

$$ s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{r_k} $$

where:   rk is the range of values for the kth variable.

For continuous variables sijk ranges between 1, for identical values xik = xjk, and 0, when xik and xjk are the two extreme values xmax and xmin, so that | xik - xjk | equals the range.
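As a small worked illustration of the range-scaled score, the sketch below uses hypothetical ages for one variable. The function name range_score and the handling of a zero range are assumptions, not part of Gower's definition.

```python
def range_score(x_i, x_j, value_range):
    """s_ijk = 1 - |x_ik - x_jk| / r_k for an ordinal or continuous variable."""
    if value_range == 0:
        # Assumption: a constant variable cannot separate cases, so score 1.
        return 1.0
    return 1.0 - abs(x_i - x_j) / value_range


ages = [20, 35, 60]            # hypothetical values of variable k over all cases
r_k = max(ages) - min(ages)    # range of the variable: 40

print(range_score(20, 60, r_k))  # 0.0, the two extreme values
print(range_score(35, 35, r_k))  # 1.0, identical values
print(range_score(20, 35, r_k))  # 0.625
```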

Binary Variables
 

Value of attribute k

Case i      +    +    -    -
Case j      +    -    +    -
sijk        1    0    0    0
wijk        1    1    1    0


For a binary variable (or dichotomous character), Gower defines the component of similarity and the weight according to the table above, where + denotes that attribute k is "present" and - denotes that attribute k is "absent".

Thus sijk = 1 if cases i and j both have attribute k "present" or 0 otherwise, and the weight wijk causes negative matches to be ignored.  If negative matches are not to be ignored, the variable should be specified as a nominal variable (see below).

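If all your variables are binary, then Gower's General Similarity Coefficient is equivalent to Jaccard's Similarity Coefficient A/(A+B+C), where A is the number of attributes present in both cases, B and C are the numbers of attributes present in only one of the two cases, and D is the number absent from both.  The negative matches counted in cell D are ignored.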
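The equivalence with Jaccard can be checked on a small example. The sketch below assumes unit weights and encodes "present" as 1 and "absent" as 0; the function names and the two example cases are hypothetical. Here A counts attributes present in both cases, and B and C those present in only one.

```python
def binary_component(present_i, present_j):
    """Return (s_ijk, w_ijk) for one binary attribute."""
    if present_i and present_j:
        return 1, 1      # positive match
    if present_i or present_j:
        return 0, 1      # mismatch
    return 0, 0          # negative match: weight 0, so it is ignored


def gower_binary(case_i, case_j):
    """Gower's coefficient when every variable is binary (unit weights)."""
    pairs = [binary_component(a, b) for a, b in zip(case_i, case_j)]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None      # assumption: undefined when every match is negative
    return sum(s * w for s, w in pairs) / total_weight


def jaccard(case_i, case_j):
    """Jaccard's coefficient A / (A + B + C)."""
    a = sum(1 for x, y in zip(case_i, case_j) if x and y)
    b_plus_c = sum(1 for x, y in zip(case_i, case_j) if bool(x) != bool(y))
    if a + b_plus_c == 0:
        return None
    return a / (a + b_plus_c)


case_i = [1, 1, 0, 0, 1]
case_j = [1, 0, 1, 0, 1]
print(gower_binary(case_i, case_j), jaccard(case_i, case_j))  # 0.5 0.5
```

In this example A = 2, B + C = 2 and D = 1, so both coefficients equal 2/4.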

Nominal Variables
The value of sijk for nominal variables is 1 if xik = xjk, or 0 if xik ≠ xjk.  Thus sijk = 1 if cases i and j have the same "state" for attribute k, or 0 if they have different "states", and wijk = 1 if both cases have observed states for attribute k.
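The nominal rule is a simple equality test. In the sketch below, representing an unobserved state by None is an assumption made for illustration, as are the function name and the example states.

```python
def nominal_component(state_i, state_j):
    """Return (s_ijk, w_ijk) for one nominal attribute."""
    if state_i is None or state_j is None:
        # Assumption: None marks a missing state, so the comparison is invalid.
        return 0, 0
    return (1 if state_i == state_j else 0), 1


print(nominal_component("married", "married"))  # (1, 1)
print(nominal_component("married", "single"))   # (0, 1)
print(nominal_component("married", None))       # (0, 0)
```

Unlike the binary case, a match on any state scores 1, which is why a binary variable whose negative matches should count can be declared nominal instead.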

Differential Variable Weights
It was noted above that the weight wijk for the comparison on the kth variable is usually 1 or 0.  However, if you assign differential weights to your variables in ClustanGraphics, then wijk is either the weight of the kth variable or 0, depending upon whether the comparison is valid or not.  This allows larger weights to be given to important variables, or for another type of external scaling of the variables to be specified.

If the weight of any variable is zero, then the variable is effectively ignored for the calculation of proximities.  Such variables are "masked" for clustering, but available for cluster profiling, to assist in the interpretation of a resulting cluster analysis.
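The effect of differential weights, including masking with a zero weight, can be illustrated as follows. This is a sketch of the arithmetic only, not of ClustanGraphics itself; weighted_gower and its arguments are hypothetical names.

```python
def weighted_gower(scores, valid, variable_weights):
    """Gower similarity where w_ijk is the variable weight if the comparison
    on variable k is valid, and 0 otherwise."""
    w = [vw if ok else 0 for vw, ok in zip(variable_weights, valid)]
    total = sum(w)
    if total == 0:
        return None  # assumption: undefined when nothing can be compared
    return sum(wk * sk for wk, sk in zip(w, scores)) / total


scores = [1.0, 0.0, 0.5]     # per-variable scores s_ijk for one pair of cases
valid = [True, True, True]   # every comparison is valid

print(weighted_gower(scores, valid, [1, 1, 1]))  # 0.5, equal weights
print(weighted_gower(scores, valid, [2, 1, 1]))  # 0.625, first variable doubled
print(weighted_gower(scores, valid, [1, 0, 1]))  # 0.75, second variable masked
```

Masking the second variable removes it from both the numerator and the denominator, so the result is the same as if the variable had not been measured.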

General Distance Coefficients
If you specify mixed data types in ClustanGraphics and select Gower's Similarity Coefficient in Compute/Proximities, your proximity matrix will be calculated according to the above definitions.

However, the clustering options available using Gower are restricted to those applicable to similarity measures, and not to dissimilarities.  Thus, for example, you will not be able to optimize the Euclidean Sum of Squares without first transforming your proximities into distances.  For details of the corresponding General Distance Coefficient, click here.
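One widely used way to turn a similarity in the range 0 to 1 into a distance is d = sqrt(1 - s). The sketch below uses that transform purely as an assumption for illustration; it is not necessarily how the General Distance Coefficient is defined in Clustan.

```python
import math


def similarity_to_distance(s):
    """Assumed transform d = sqrt(1 - s) for a similarity s in [0, 1]."""
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - s))


print(similarity_to_distance(1.0))   # 0.0, identical cases
print(similarity_to_distance(0.75))  # 0.5
print(similarity_to_distance(0.0))   # 1.0, maximally dissimilar cases
```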

Our implementation of Gower's General Similarity Coefficient is another example of the great flexibility provided in Clustan software.  Mixed data types frequently occur in social surveys and databases, but you are unlikely to find that other software for cluster analysis or neural networks adequately caters for such practical diversity.

Gower's General Similarity Coefficient has been available in Clustan since 1984, and in ClustanGraphics since release 5 in 2001.  A worked example of Gower's coefficient with psychiatric data is given here.
