Computing Proximities 

ClustanGraphics5 computes proximities from an input data matrix which may or may not be standardized. 

Supposing that the data have been transformed to standard scores, or z-scores, then a typical selection of five cases taken from the Mammals Case Study is as follows:

Standard scores (z-scores)

          Water   Protein      Fat   Lactose      Ash
Deer     -0.955     1.147    0.893    -0.836    1.063
Donkey    0.946    -1.235   -0.847     1.129   -0.918
Rabbit   -0.535     1.667    0.265    -1.218    2.846
Seal     -2.475     0.955    3.013    -2.256   -0.026
Zebra     0.627    -0.879   -0.524     0.638   -0.323

Standardization to z-scores gives each column a mean of zero and a standard deviation of 1 across the full data set (not just the five cases shown here). See Data Transformations for further details. The next step is to calculate proximities between all pairs of cases. In ClustanGraphics it's simple - just click Compute on the Proximities menu, and select a proximity coefficient from the drop-down list.

In this example, we have selected Squared Euclidean Distance. The result is a proximity matrix of order 25. But since the diagonal elements are not important, and the matrix is symmetrical about the diagonal, there are effectively only 300 different distances (25 x 24 / 2). For ease of presentation, we just show the proximities for five of the cases below:

Squared Euclidean Distance Matrix

           Deer   Donkey   Rabbit     Seal    Zebra
Deer      0.000    4.020    0.833    2.009    2.542
Donkey    4.020    0.000    6.306    8.731    0.186
Rabbit    0.833    6.306    0.000    4.230    4.388
Seal      2.009    8.731    4.230    0.000    6.792
Zebra     2.542    0.186    4.388    6.792    0.000

Just by examining the proximities we can determine interesting details, such as the fact that donkey and zebra are quite similar, with a squared distance of 0.186, whereas donkey and seal are the most dissimilar of the five cases shown, with a squared distance of 8.731.
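The distances quoted here can be reproduced from the z-scores listed earlier. The following Python sketch is not ClustanGraphics code; it assumes that Squared Euclidean Distance is the mean of the squared differences across the five variables (the sum divided by the number of variables), which is the definition that matches the tabulated values. The names zscores, squared_euclidean and matrix are illustrative only.

```python
# z-scores for the five cases, in the order Water, Protein, Fat, Lactose, Ash
zscores = {
    "Deer":   [-0.955,  1.147,  0.893, -0.836,  1.063],
    "Donkey": [ 0.946, -1.235, -0.847,  1.129, -0.918],
    "Rabbit": [-0.535,  1.667,  0.265, -1.218,  2.846],
    "Seal":   [-2.475,  0.955,  3.013, -2.256, -0.026],
    "Zebra":  [ 0.627, -0.879, -0.524,  0.638, -0.323],
}


def squared_euclidean(x, y):
    """Mean squared difference across variables (assumed definition)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


names = list(zscores)

# Full symmetric proximity matrix, keyed by (case, case)
matrix = {
    (i, j): squared_euclidean(zscores[i], zscores[j])
    for i in names
    for j in names
}

# Print the matrix to three decimal places, one row per case
print(" " * 8 + "".join(f"{name:>8}" for name in names))
for i in names:
    print(f"{i:<8}" + "".join(f"{matrix[i, j]:8.3f}" for j in names))
```

Dividing by the number of variables keeps distances comparable when cases are measured on different numbers of variables, for example when some values are missing.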

Of course it's not very practicable to examine all 300 proximities by inspecting the proximity matrix for 25 cases; and it certainly would not be practicable with 10,000 cases. However, we can use Nearest Neighbour analysis to find the nearest neighbours; and of course, clustering the proximity matrix is the main way we can group the cases into clusters and thus describe the structure and diversity of the data.
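As an illustration of the idea behind a nearest-neighbour search, the sketch below scans the five-case Squared Euclidean Distance matrix and reports, for each case, the other case at the smallest distance. It is a simple brute-force scan written for this page, not the method used by the Nearest Neighbour analysis in ClustanGraphics; the function name nearest_neighbours is hypothetical.

```python
names = ["Deer", "Donkey", "Rabbit", "Seal", "Zebra"]

# Squared Euclidean distances for the five cases, rows and columns in `names` order
dist = [
    [0.000, 4.020, 0.833, 2.009, 2.542],
    [4.020, 0.000, 6.306, 8.731, 0.186],
    [0.833, 6.306, 0.000, 4.230, 4.388],
    [2.009, 8.731, 4.230, 0.000, 6.792],
    [2.542, 0.186, 4.388, 6.792, 0.000],
]


def nearest_neighbours(names, dist):
    """For each case, return (nearest other case, distance), ignoring the diagonal."""
    result = {}
    for i, name in enumerate(names):
        others = (k for k in range(len(names)) if k != i)
        j = min(others, key=lambda k: dist[i][k])
        result[name] = (names[j], dist[i][j])
    return result


for case, (neighbour, d) in nearest_neighbours(names, dist).items():
    print(f"{case:<8} nearest: {neighbour:<8} distance: {d:.3f}")
```

For these five cases the scan pairs donkey with zebra and deer with rabbit, and gives deer as the nearest neighbour of seal.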

See the ClustanGraphics Preview, where the Mammals Case Study is taken further.

Continuous Data
The similarity or dissimilarity coefficients provided with ClustanGraphics5 for continuous variables are as follows:

Binary Data
If your data comprise only binary variables, you can select from the following binary proximity coefficients:

These coefficients compare any two cases i and j across all M unmasked binary variables, as follows:
A = number of variables "present" in both cases i and j
B = number of variables "present" in case i and "absent" from case j
C = number of variables "absent" from case i and "present" in case j
D = number of variables "absent" from both cases i and j
and A + B + C + D = M, the number of variables observed for both cases i and j.

If a variable is "missing" for either case i or case j, then it is not considered for the computation of the coefficient.  In this case M is the number of variables that are observed for both cases.

Binary Euclidean Distance, (B+C)/M, is a dissimilarity coefficient; the other two are similarity coefficients. Use Binary Euclidean Distance if you intend to cluster by minimizing the Euclidean Sum of Squares (Ward's Method).
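The counts A, B, C and D, and the Binary Euclidean Distance (B+C)/M, can be sketched in Python as follows. This is an illustration only, not ClustanGraphics code: it assumes each case is a list holding 1 ("present"), 0 ("absent") or None ("missing"), and the function names binary_counts and binary_euclidean are hypothetical.

```python
def binary_counts(x, y):
    """Return (A, B, C, D) for two cases, skipping variables missing in either."""
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue  # not observed for both cases, so excluded from M
        if xi and yi:
            a += 1    # present in both
        elif xi:
            b += 1    # present in case i, absent from case j
        elif yi:
            c += 1    # absent from case i, present in case j
        else:
            d += 1    # absent from both
    return a, b, c, d


def binary_euclidean(x, y):
    """Binary Euclidean Distance (B + C) / M, where M = A + B + C + D."""
    a, b, c, d = binary_counts(x, y)
    m = a + b + c + d
    if m == 0:
        raise ValueError("no variable is observed for both cases")
    return (b + c) / m


# Six binary variables; the last two are each missing for one of the cases
case_i = [1, 1, 0, 0, None, 1]
case_j = [1, 0, 1, 0, 1, None]

print(binary_counts(case_i, case_j))     # (1, 1, 1, 1), so M = 4
print(binary_euclidean(case_i, case_j))  # 0.5
```

In the usage example only four of the six variables are observed for both cases, so M = 4 and the distance is (1 + 1) / 4 = 0.5.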

Mixed Data
If you have mixed data types, you can compute a proximity matrix from the following measures of proximity:

There is no program limit to the size of the proximity matrix which can be computed; the limit is determined by the memory and disk resources available on the user's PC. As a rough guide only, a reasonable Pentium PC is capable of computing proximities for up to about 10,000 cases with ClustanGraphics. If you have a larger data matrix, we recommend that you use Direct Data Clustering, which can produce a hierarchical cluster analysis for 100,000 cases or more.
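To see why the proximity matrix becomes the limiting factor, note that n cases give n(n-1)/2 distinct off-diagonal proximities. The Python sketch below estimates the storage this implies. The figure of 4 bytes per proximity is an assumption made for illustration; it is not a statement about how ClustanGraphics stores the matrix, and the function names are hypothetical.

```python
def distinct_pairs(n):
    """Number of distinct off-diagonal entries in a symmetric n x n matrix."""
    return n * (n - 1) // 2


def storage_mib(n, bytes_per_value=4):
    """Approximate storage in MiB; bytes_per_value is an assumed figure."""
    return distinct_pairs(n) * bytes_per_value / 2**20


for n in (25, 10_000, 100_000):
    print(f"{n:>7} cases: {distinct_pairs(n):>13,} proximities, "
          f"about {storage_mib(n):,.1f} MiB")
```

For 10,000 cases this gives 49,995,000 proximities, and the count grows roughly with the square of the number of cases, which is why a method that avoids holding the full proximity matrix is preferable for much larger data sets.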

Note that ClustanGraphics can compute proximities from incomplete data - see missing values for details.

For further definitions and other details, please refer to the ClustanGraphics Primer and the ClustanGraphics Help file.