Clustering Variables 

A feature that is popular with some of our users is clustering variables instead of cases.  Compared to factor analysis, clustering variables identifies the key variables which explain the principal dimensionality in the data, rather than abstract factors; allows much larger correlation or covariance matrices to be analyzed; and greatly simplifies interpretation.

In this example we clustered a matrix of correlation coefficients for the test scores of a sample of 220 Scottish pupils on six school subjects, copied from Lawley and Maxwell*.  The correlation matrix was read into ClustanGraphics as a square proximity matrix (see Reading Proximities).  We also had to declare that the matrix contained proximities of type similarity, since product-moment correlation coefficients are larger for variables that are more alike.
 

                  Gaelic    English   History   Arithmetic   Algebra   Geometry
    Gaelic         1.000     0.439     0.410      0.288       0.329     0.248
    English        0.439     1.000     0.351      0.354       0.320     0.329
    History        0.410     0.351     1.000      0.164       0.190     0.181
    Arithmetic     0.288     0.354     0.164      1.000       0.595     0.470
    Algebra        0.329     0.320     0.190      0.595       1.000     0.464
    Geometry       0.248     0.329     0.181      0.470       0.464     1.000

The correlation matrix was next clustered hierarchically by complete linkage.  The result is a set of hierarchically nested clusters such that every pair of variables within a cluster is correlated at or above the level at which the cluster was formed, which is the smallest correlation coefficient in the cluster.  The complete linkage tree has the following values:

    First Cluster     Second Cluster      Fusion Value
      Arithmetic         Algebra              0.595
      Arithmetic         Geometry             0.464
      Gaelic             English              0.439
      Gaelic             History              0.351
      Gaelic             Arithmetic           0.164
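
The fusion values in this table can be reproduced with a few lines of code.  The following is a minimal sketch in plain Python, assuming the usual definition of complete linkage on similarities (the similarity between two clusters is the smallest correlation between any member of one and any member of the other, and the most similar pair of clusters is fused at each step).  It is not the ClustanGraphics implementation, and the names SUBJECTS, R and complete_linkage are ours, chosen for illustration.

```python
# Correlation matrix from the table above, subjects in the same order.
SUBJECTS = ["Gaelic", "English", "History", "Arithmetic", "Algebra", "Geometry"]
R = [
    [1.000, 0.439, 0.410, 0.288, 0.329, 0.248],
    [0.439, 1.000, 0.351, 0.354, 0.320, 0.329],
    [0.410, 0.351, 1.000, 0.164, 0.190, 0.181],
    [0.288, 0.354, 0.164, 1.000, 0.595, 0.470],
    [0.329, 0.320, 0.190, 0.595, 1.000, 0.464],
    [0.248, 0.329, 0.181, 0.470, 0.464, 1.000],
]


def complete_linkage(sim):
    """Return the fusions as (members_a, members_b, fusion_value) tuples.

    Works on similarities: the similarity of two clusters is the smallest
    pairwise similarity between them, and the pair of clusters with the
    largest such value is fused at each step.
    """
    clusters = [[i] for i in range(len(sim))]
    fusions = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                value = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or value > best[0]:
                    best = (value, a, b)
        value, a, b = best
        fusions.append((clusters[a], clusters[b], value))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return fusions


for left, right, value in complete_linkage(R):
    left_names = ", ".join(SUBJECTS[i] for i in left)
    right_names = ", ".join(SUBJECTS[i] for i in right)
    print(f"{left_names:30s} {right_names:30s} {value:.3f}")
```

The sketch lists every member of each cluster being fused, whereas the table above labels each cluster by a single member.  The exhaustive search over cluster pairs is fine for six variables but would need a more efficient algorithm for the large matrices discussed later.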

We can see that at the two-cluster level, the subjects Arithmetic, Algebra and Geometry are all inter-correlated at a value of 0.464, or higher; and that Gaelic, English and History are inter-correlated at 0.351, or higher.  The two-cluster level, illustrated below, neatly separates the "verbal" subjects from the "mathematical" subjects.
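
The two thresholds quoted here can be checked directly from the correlation matrix.  The sketch below, again in plain Python with illustrative names of our own (min_within, verbal, mathematical), takes the two clusters as given and finds the smallest correlation between any pair of subjects inside each one.

```python
from itertools import combinations

SUBJECTS = ["Gaelic", "English", "History", "Arithmetic", "Algebra", "Geometry"]
R = [
    [1.000, 0.439, 0.410, 0.288, 0.329, 0.248],
    [0.439, 1.000, 0.351, 0.354, 0.320, 0.329],
    [0.410, 0.351, 1.000, 0.164, 0.190, 0.181],
    [0.288, 0.354, 0.164, 1.000, 0.595, 0.470],
    [0.329, 0.320, 0.190, 0.595, 1.000, 0.464],
    [0.248, 0.329, 0.181, 0.470, 0.464, 1.000],
]


def min_within(cluster):
    """Smallest correlation between any two subjects in the cluster."""
    idx = [SUBJECTS.index(name) for name in cluster]
    return min(R[i][j] for i, j in combinations(idx, 2))


verbal = ["Gaelic", "English", "History"]
mathematical = ["Arithmetic", "Algebra", "Geometry"]

print(min_within(mathematical))  # 0.464
print(min_within(verbal))        # 0.351
```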

We also took the opportunity to re-order the tree optimally, so that the sequence from top to bottom makes most presentational sense.  ClustanGraphics can display a shaded version of the correlation matrix in which the two clusters of variables are highlighted in green.

These results are very similar to those obtained by factor analysis (see source, below), though we would suggest that interpretation is much easier when the subjects are clustered.  Our cluster analysis on variables is a discrete form of factor analysis, where each variable belongs to only one cluster (factor).  Lawley and Maxwell state that the analysis demonstrates that individuals who do well on verbal subjects tend to do less well on mathematical subjects, and vice versa.  We agree.

This application may appear trivial with only six variables, as the factoring of the correlation matrix can be done by inspection.  It would be far less obvious with 50 variables, or with the 5000 variables that can occur in, for example, gene expression studies.  Clustering correlation matrices of this size is quite feasible using ClustanGraphics, and can give helpful insight into structure in the data that is not so straightforward to find using factor analysis.  Indeed, it is not really practicable to use factor analysis with more than about 100 variables, because of the cost of inverting the correlation or covariance matrix.

ClustanGraphics also finds the cluster exemplars, or the most typical members of each cluster.  In this context, the exemplar is that variable which has the highest average correlation with the other variables in the cluster.  At the two-cluster level illustrated above, the cluster exemplars are Gaelic and Arithmetic.  These are the two key subjects which, on the basis of the scores of the 220 Scottish pupils analyzed, most exemplify the verbal versus mathematical dichotomy in the data.
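
The exemplar rule described above is simple enough to sketch.  The following plain Python example applies it to the two clusters, using the definition given in the text (highest average correlation with the other variables in the cluster).  The function name exemplar and the handling of single-member clusters are our own assumptions, not taken from ClustanGraphics.

```python
SUBJECTS = ["Gaelic", "English", "History", "Arithmetic", "Algebra", "Geometry"]
R = [
    [1.000, 0.439, 0.410, 0.288, 0.329, 0.248],
    [0.439, 1.000, 0.351, 0.354, 0.320, 0.329],
    [0.410, 0.351, 1.000, 0.164, 0.190, 0.181],
    [0.288, 0.354, 0.164, 1.000, 0.595, 0.470],
    [0.329, 0.320, 0.190, 0.595, 1.000, 0.464],
    [0.248, 0.329, 0.181, 0.470, 0.464, 1.000],
]


def exemplar(cluster):
    """Subject with the highest mean correlation with the rest of its cluster."""
    idx = [SUBJECTS.index(name) for name in cluster]
    if len(idx) == 1:
        # Assumption: a single-member cluster is its own exemplar.
        return cluster[0]

    def mean_corr(i):
        others = [j for j in idx if j != i]
        return sum(R[i][j] for j in others) / len(others)

    return SUBJECTS[max(idx, key=mean_corr)]


print(exemplar(["Gaelic", "English", "History"]))        # Gaelic
print(exemplar(["Arithmetic", "Algebra", "Geometry"]))   # Arithmetic
```

In the mathematical cluster the margin is narrow: Arithmetic averages 0.5325 against 0.5295 for Algebra, so the choice of exemplar there should not be over-interpreted.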

A large matrix containing inter-correlated variables can thus be reduced to a more focussed subset of key variables which account for the principal dimensionality in the data, without resorting to the more abstract, and hence less meaningful, transformation to factor scores obtained by factor analysis.

* Factor Analysis as a Statistical Method, by D N Lawley and A E Maxwell, Butterworths 1971, p. 66.