Reading Proximities 

In some clustering applications we start with proximities between cases.  For example, in our Proteins case study, we were given the matrix of all comparisons between 20 species.  The first 5 rows of this proximity matrix are as follows:

 

                Man   Monkey    Dog   Horse   Donkey
    Man           0        1     13      17        M
    Monkey        1        0     12      16       15
    Dog          13       12      0      10        8
    Horse        17       16     10       0        1
    Donkey        M       15      8       1        0

... and so on, for 20 rows and columns.  It is a 20x20 matrix, i.e. 400 cells.

The value in each cell is the number of positions in the cytochrome-c molecule where the proteins for the two species have different amino acids.   Note that the diagonal elements are all zero, and that the matrix is symmetric; so in reality, there are only 190 proximities in the matrix.
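The count of 190 follows from n(n-1)/2 with n = 20. As a minimal sketch in Python (the function names here are our own, for illustration, and are not part of ClustanGraphics), the count and the two properties noted above can be checked like this:

```python
from itertools import combinations

def count_unique_pairs(n_cases):
    """Number of distinct off-diagonal pairs in a symmetric n x n matrix."""
    return n_cases * (n_cases - 1) // 2

def is_symmetric_zero_diagonal(matrix):
    """True if every diagonal cell is 0 and matrix[i][j] == matrix[j][i]."""
    n = len(matrix)
    return all(matrix[i][i] == 0 for i in range(n)) and all(
        matrix[i][j] == matrix[j][i] for i, j in combinations(range(n), 2)
    )

# Monkey, Dog, Horse, Donkey block of the table above (no missing values).
block = [
    [0, 12, 16, 15],
    [12, 0, 10, 8],
    [16, 10, 0, 1],
    [15, 8, 1, 0],
]

n_pairs = count_unique_pairs(20)             # 20 * 19 / 2
block_ok = is_symmetric_zero_diagonal(block)
```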

Note also that some proximities can be specified as missing, e.g. the proximity between Man and Donkey in the above example.  Just enter any non-numeric value such as "M" and ClustanGraphics will detect that the corresponding proximity is missing and allow for it.
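If you prepare or check proximities in your own scripts before loading them, the same convention is easy to mimic. The sketch below is an assumption about how one might do it in Python, not a description of how ClustanGraphics works internally: any token that cannot be read as a number is stored as None.

```python
def parse_proximity(token):
    """Return the token as a float, or None if it is non-numeric (missing)."""
    try:
        return float(token)
    except ValueError:
        return None

# The five rows shown above, with "M" marking the Man/Donkey proximity.
rows = [
    "0 1 13 17 M",
    "1 0 12 16 15",
    "13 12 0 10 8",
    "17 16 10 0 1",
    "M 15 8 1 0",
]

matrix = [[parse_proximity(token) for token in row.split()] for row in rows]

# (row, column) positions of the missing proximities, counting from 0.
missing = [
    (i, j)
    for i, row in enumerate(matrix)
    for j, value in enumerate(row)
    if value is None
]
```

Here missing holds the two mirror-image cells for Man and Donkey.  Note that Python's float() also accepts words such as "nan" and "inf", so a marker like "M" is a safer choice in this sketch.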

Now it's fairly easy to see that Man and Monkey are closely related on this criterion, as are Horse and Donkey, and that Dog is fairly distinct from these two clusters.  Clustering proximities "by inspection" is quite easy with 5 cases, or even with 10; but it's not so easy when you have 500 cases, or 5000 cases.  That's where Clustan software becomes essential.

Proximities can be either of type dissimilarity (as above) or of type similarity.  In the proteins case study, the proximities would be similarities if the value in each cell were the number of positions in the cytochrome-c molecule where the proteins for the two species have the same amino acids.
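The two types are related: for each pair of species, the positions that match plus the positions that differ add up to the number of positions compared.  A minimal Python sketch of the conversion follows.  The function name is our own, and the value of N_POSITIONS is a hypothetical figure chosen purely for illustration; it is not taken from the case study.

```python
def to_similarity(dissimilarity, n_positions):
    """Matching positions = positions compared - positions that differ."""
    if dissimilarity is None:      # a missing proximity stays missing
        return None
    return n_positions - dissimilarity

N_POSITIONS = 100                  # hypothetical, for illustration only

man_row = [0, 1, 13, 17, None]     # Man's row above, "M" stored as None
similar = [to_similarity(d, N_POSITIONS) for d in man_row]
```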

ClustanGraphics can read proximities in four formats, illustrated below:

      Square         Upper             Lower         Proximity
      Matrix       Triangular       Triangular         List
    o x x x x       x x x x                           2 1 x
    x o x x x         x x x           x               3 1 x
    x x o x x           x x           x x             3 2 x
    x x x o x             x           x x x           4 2 x
    x x x x o                         x x x x         1 3 x
     

In all formats, the diagonal elements are disregarded; however, a value should be entered for each diagonal element when using Square Matrix format.

The Upper Triangular and Lower Triangular formats treat the proximity matrix as symmetric.
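To see what treating a triangular matrix as symmetric means in practice, here is a small Python sketch that expands upper triangular rows into a full square matrix.  It assumes the layout in the illustration above, where the diagonal is omitted and each row is one value shorter than the last; the function name and the use of None for a missing value are our own choices.

```python
def upper_to_square(rows, diagonal=0):
    """Expand strict upper triangular rows into a full symmetric matrix.

    rows[i] holds the proximities between case i and cases i+1 .. n-1,
    so the last case has no row of its own.
    """
    n = len(rows) + 1
    square = [[diagonal] * n for _ in range(n)]
    for i, row in enumerate(rows):
        expected = n - 1 - i
        if len(row) != expected:
            raise ValueError(f"row {i} should have {expected} values")
        for offset, value in enumerate(row):
            j = i + 1 + offset
            square[i][j] = value   # cell above the diagonal
            square[j][i] = value   # its mirror image below the diagonal
    return square

# The five species from the table above; None marks the missing value.
upper = [
    [1, 13, 17, None],   # Man    v. Monkey, Dog, Horse, Donkey
    [12, 16, 15],        # Monkey v. Dog, Horse, Donkey
    [10, 8],             # Dog    v. Horse, Donkey
    [1],                 # Horse  v. Donkey
]

square = upper_to_square(upper)
```

Lower Triangular input can be handled the same way by filling the cells below the diagonal first.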

The Square Matrix format can be symmetric or asymmetric.  If it is asymmetric, then ClustanGraphics will convert it to a symmetric matrix.  There are 4 conversion options:

    Sum proximities       pij  =  pij + pji

    Average proximity     pij  =  ( pij + pji ) / 2

    Minimum proximity     pij  =  min ( pij, pji )

    Maximum proximity     pij  =  max ( pij, pji )
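The four options can be sketched in a few lines of Python.  This is an illustration only: the function name, the option keywords and the small asymmetric matrix are made up for the example, and Average is taken to mean the arithmetic mean of the two values.

```python
def symmetrize(matrix, how="average"):
    """Return a symmetric copy of a square matrix; the diagonal is kept as is."""
    combine = {
        "sum": lambda a, b: a + b,
        "average": lambda a, b: (a + b) / 2,
        "minimum": min,
        "maximum": max,
    }[how]
    n = len(matrix)
    result = [list(row) for row in matrix]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Always read from the original, so the order of visits
                # cannot affect the result.
                result[i][j] = combine(matrix[i][j], matrix[j][i])
    return result

# Made-up asymmetric proximities between three cases.
asymmetric = [
    [0, 2, 6],
    [4, 0, 1],
    [8, 3, 0],
]

averaged = symmetrize(asymmetric, "average")
summed = symmetrize(asymmetric, "sum")
smallest = symmetrize(asymmetric, "minimum")
largest = symmetrize(asymmetric, "maximum")
```

For the pair of cases 1 and 2, with proximities 2 and 4, the four options give 6, 3, 2 and 4 respectively.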

Proximity List format can be symmetric or asymmetric.  The first 2 values on each line are case numbers (in any order), and the third value is the proximity for that pair of cases.  The list does not have to be exhaustive, as ClustanGraphics will assume that any omitted proximities correspond to maximum dissimilarity, or minimum similarity.  This type of format is useful for large, sparse matrices - for example, telephone calling traffic where the traffic between nodes is heavily localised.
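As a rough Python sketch of how such a list fills out a full matrix: the version below handles dissimilarities, numbers cases from 1 as in the illustration, treats every pair as symmetric, and takes "maximum dissimilarity" to mean the largest value found in the list.  That last point, the function name and the example values are all assumptions made for illustration; the text above does not say how ClustanGraphics chooses the maximum.

```python
def list_to_square(n_cases, triples, default=None, diagonal=0):
    """Build a symmetric dissimilarity matrix from (case, case, value) triples.

    Case numbers start at 1.  Pairs absent from the list get `default`,
    which falls back to the largest value in the list.
    """
    if default is None:
        default = max(value for _, _, value in triples)
    square = [[default] * n_cases for _ in range(n_cases)]
    for i in range(n_cases):
        square[i][i] = diagonal
    for a, b, value in triples:
        if a == b:
            continue                     # diagonal entries are disregarded
        square[a - 1][b - 1] = value
        square[b - 1][a - 1] = value
    return square

# Made-up dissimilarities for four pairs out of five cases.
triples = [
    (2, 1, 5.0),
    (3, 1, 2.0),
    (3, 2, 7.0),
    (4, 2, 1.0),
]

square = list_to_square(5, triples)
```

Case 5 never appears in the list, so every proximity involving it is set to the fallback value of 7.0.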

Hint: Compute your proximities in a spreadsheet such as Excel or Lotus 1-2-3, then select the matrix and copy and paste the values into ClustanGraphics using Square Matrix format.  An example is distributed with ClustanGraphics.

We have also discussed Network Analysis using Clustan software in another web page.