Analyzing Large Surveys 

If you are involved in data mining or analyzing large social surveys, remember that our k-means analysis can handle mixed variable types, such as those found in survey questionnaires and database records.  You are unlikely to find similar flexibility in other clustering or neural network software.
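
ClustanGraphics handles mixed types natively; for comparison, here is a minimal sketch of the usual workaround in open-source Python - scale the numeric variables and one-hot encode the categoricals before running k-means.  The file name and column names are hypothetical.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("survey.csv")             # hypothetical survey extract
    numeric = ["age", "income"]                # hypothetical column names
    categorical = ["region", "employment"]

    prep = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    model = make_pipeline(prep, KMeans(n_clusters=40, n_init=10, random_state=0))
    labels = model.fit_predict(df)             # one cluster label per case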

As you should know by now, a unique ClustanGraphics feature is its ability to complete a hierarchical cluster analysis on very large datasets by clustering directly on your data.  In the following example, we clustered a large sample of 16,000 cases by Ward's Method in under 3 minutes on a basic PC.  [Pause to reflect on the competition!]

[Screenshot: Clustering 16,000 cases hierarchically]
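
For readers who want to reproduce the shape of this run outside ClustanGraphics, a minimal SciPy sketch follows.  It is not Clustan's direct-data algorithm, and the 5-variable sample here is simulated.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(16_000, 5))   # stand-in for the 16,000 x 5 sample

    # linkage() materializes the condensed distance matrix internally
    # (roughly 1 GB for 16,000 cases), unlike Clustan's direct-data method.
    Z = linkage(X, method="ward")      # Z encodes the full 16,000-leaf tree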

However, a tree this large may be only partially optimal.  Using k-means analysis in ClustanGraphics, you can further optimize the cluster model and thus sharpen your classification profile.  In this example, we ran k-means using the 40-cluster partition from Cluster Data as the initial cluster centres.

[Screenshot: k-means analysis on 16,000 cases]
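
A sketch of that seeding step, assuming X and the Ward tree Z from the snippet above: cut the tree at 40 clusters, take the cluster means as initial centres, and refine with k-means.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster
    from sklearn.cluster import KMeans

    part = fcluster(Z, t=40, criterion="maxclust")   # labels 1..40 from the tree
    centres = np.vstack([X[part == k].mean(axis=0) for k in range(1, 41)])

    km = KMeans(n_clusters=40, init=centres, n_init=1).fit(X)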

We opted to tighten the 40 clusters by removing outliers - cases remote from all the clusters; in this example, cases at distances of more than 3.00 from any cluster centre.  The analysis reduced the Euclidean Sum of Squares (ESS) by 17%.  Of the 16,000 cases in the original sample, 1,144 outliers were removed, leaving 14,856 cases grouped into 40 tight clusters, each with a radius (in 5 dimensions) of no more than 3.00.
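
A sketch of the trim, assuming X and the fitted km from above: drop every case lying more than 3.00 from its nearest centre, then re-fit on the retained cases.

    import numpy as np
    from sklearn.cluster import KMeans

    # distance of each case to its assigned centre (= its nearest centre)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = d <= 3.0                    # 14,856 of 16,000 cases in the example
    Xk = X[keep]
    km_tight = KMeans(n_clusters=40, init=km.cluster_centers_, n_init=1).fit(Xk)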

ClustanGraphics k-means converged in 81 iterations, taking just 16 seconds on a basic PC.  Cluster exemplars were then computed - the 40 cases, one per cluster, lying nearest to each cluster centre.  They can be extracted, if you wish, as specimens which exemplify the full diversity of your data: just select exemplars as your final cluster centres before truncation.  [Reflect!]
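
A sketch of exemplar extraction, assuming Xk and km_tight from the trim above: for each of the 40 centres, find the nearest retained case.

    from scipy.spatial.distance import cdist

    dist = cdist(km_tight.cluster_centers_, Xk)   # (40, n_kept) distances
    exemplar_rows = dist.argmin(axis=1)           # nearest case per cluster
    exemplars = Xk[exemplar_rows]                 # the 40 exemplar profiles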

Click k-means Statistics and the following choices are offered.  They allow you to check each cluster's size, location (in 5 dimensions) and exemplar, its contribution to the Euclidean Sum of Squares, and how it compares with the other clusters.

[Screenshot: k-means Statistics]
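
The equivalent numbers are easy to derive by hand.  A sketch, assuming Xk and km_tight from above, computing each cluster's size and its contribution to the ESS:

    for k in range(40):
        members = Xk[km_tight.labels_ == k]
        ess_k = ((members - km_tight.cluster_centers_[k]) ** 2).sum()
        print(f"cluster {k}: size={len(members)}, ESS contribution={ess_k:.2f}")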

With t-tests on variables, you can easily explore the contribution of each variable to each cluster profile.  These highlight large positive or negative departures from the sample mean on the key discriminating variables.
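
A sketch of one way to reproduce this check, assuming Xk and km_tight from above: compare each cluster's variable means with the overall sample mean via one-sample t-tests.  This mirrors, but may not exactly match, Clustan's statistics.

    import numpy as np
    from scipy import stats

    sample_mean = Xk.mean(axis=0)
    for k in range(40):
        members = Xk[km_tight.labels_ == k]
        t, p = stats.ttest_1samp(members, popmean=sample_mean)
        # large |t| flags the variables on which cluster k departs
        # most sharply from the overall sample mean
        print(f"cluster {k}: t per variable = {np.round(t, 2)}")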

You can also examine the classification of each case individually - the cluster to which it has been assigned, its distance from that cluster's centre, or its distances from all of the clusters.  This is useful, for example, for sorting all the cases by their nearest-cluster distances, or for investigating outliers - sometimes the rare nuggets are more interesting than the bulk data obscuring them (cf. mining fraudulent insurance claims).
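
A sketch of that per-case view, assuming Xk and km_tight from above: distances to all 40 centres, sorted so the most outlying cases surface first.

    import numpy as np
    from scipy.spatial.distance import cdist

    all_d = cdist(Xk, km_tight.cluster_centers_)   # (n_kept, 40) distances
    nearest = all_d.min(axis=1)
    order = np.argsort(nearest)[::-1]              # most remote cases first
    for i in order[:10]:
        print(f"case {i}: cluster={km_tight.labels_[i]}, distance={nearest[i]:.2f}")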

All k-means results can be displayed, output to a file, or copied to a document for publication or a spreadsheet for further analysis.

Next we truncated our model to the 40 final clusters found by k-means.  This had the beneficial effect of substantially reducing the size of the model - from 16,000 cases to 40 clusters - making it much more amenable to publication (see below).

A proximity matrix of Euclidean Sum of Squares was computed between clusters, and a summary tree obtained using Increase in Sum of Squares (Ward's Method), as shown below.  This allows any cluster level from 2 to 40 to be studied in detail, using the full panoply of ClustanGraphics features applicable to small datasets.  In the example, the 9-cluster level has been highlighted and the clusters are labelled by their exemplars and sizes [in brackets].  You can, of course, substitute more descriptive labels when you profile your classification or publish it.   [Reflect again!]

[Screenshot: Summary tree for the 40-cluster k-means model]
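
A rough sketch of the summary tree, assuming km_tight from above.  Note that running Ward's method directly on the 40 centres ignores cluster sizes, so it only approximates Clustan's ESS-based tree on the truncated model.

    from scipy.cluster.hierarchy import fcluster, linkage

    # Unweighted Ward on the 40 final centres; Clustan weights the merge
    # costs by cluster size, so treat this as an approximation only.
    Z40 = linkage(km_tight.cluster_centers_, method="ward")
    nine = fcluster(Z40, t=9, criterion="maxclust")   # the 9-cluster level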

We should mention that our k-means procedure can also be used on very large datasets without first calculating a tree.  For example, we have run 1 million cases on a basic PC.  A dataset of that size is not amenable to hierarchical cluster analysis, even using our super-fast direct data clustering method.

We also offer a crow's nest view of the tree, which can be edited; for further details on this go to Navigate k-Means.

With very large datasets, it's necessary to specify the initial cluster centres in other ways.  ClustanGraphics lets you choose between selecting k initial cases, randomly assigning all the cases to k initial clusters, reading k cluster centres from a file, or pasting cluster centres from another application, such as a spreadsheet.
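
For comparison, a sketch of the analogous seeding options in scikit-learn; the option names differ from ClustanGraphics, and the centres file here is hypothetical.

    import numpy as np
    from sklearn.cluster import KMeans

    km_a = KMeans(n_clusters=40, init="random", n_init=10)   # k cases at random
    seeds = np.loadtxt("centres.csv", delimiter=",")         # hypothetical file
    km_b = KMeans(n_clusters=40, init=seeds, n_init=1)       # centres from file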

Of course, you can also apply your k-means model to classifying new data.  Just save it, and run Classify.   This allows an unlimited number of new cases to be compared with your cluster model and assigned to their nearest clusters, at any level in the final tree. 
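
A sketch of the nearest-centre assignment step, assuming km_tight from above.  The new cases here are simulated; in practice they must be transformed with the same parameters as the training data.

    import numpy as np

    rng = np.random.default_rng(1)
    new_X = rng.normal(size=(1_000, 5))   # stand-in for new survey cases

    new_labels = km_tight.predict(new_X)  # nearest-centre assignment
    new_d = np.linalg.norm(new_X - km_tight.cluster_centers_[new_labels], axis=1)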

You can distribute your model to co-workers - just upload it to a website, or attach it to your email messages.  It's compact and self-contained, with all the truncated model data, case labelling and transformation parameters they need to run your model on their own data.  This presupposes, of course, that they too have ClustanGraphics.

To find out more, you and your co-workers will just have to become users - so ORDER ClustanGraphics on-line now.  To download the published abstract of this paper as a zip file (100k), click here.