Clustering Large Datasets 

Home
About Clustan
Cluster Analysis
Applications
ClustanGraphics
User Support
Clustan/PC
Orders
What's New
White Papers
Contact Us
ClustanGraphics includes our super-fast procedure for clustering very large datasets by Ward's method or Average Linkage.  It uses our fast Cluster Data algorithm which works directly on your data and does not require a proximity matrix to be computed and saved.  It is very, very fast indeed!Hierarchical cluster analysis of 100,000 cases by 10 variables

In this example we clustered a survey of 100,000 cases by 10 variables, using Increase in Sum of Squares (or Ward's Method).  Admittedly, this was rather a large cluster analysis - more of a data mining production run in fact.  But as you can see, ClustanGraphics completed it in under 50 minutes on a regular Dell retail PC running Windows XP.

Notice that our cluster analysis required just 66 iterations - this compares with standard hierarchical clustering programs that complete n iterations, n being the number of cases.  Furthermore, as each iteration involves searching the proximity matrix, which contains n2 proximities, the complexity of the standard procedure is n3 .  By comparison, our unique algorithm substantially reduces both the computation time and the storage required to analyze large data mining problems.

The n2 proximity matrix for 100,000 cases comprises 10,000,000,000 cells - which would require 40Gb for storage alone, hardly a feasible proposition.  It's therefore easy to see why the standard programs run out of computer resources and time long before they can tackle this size of application.  By comparison, the process monitor for ClustanGraphics revealed that 40MB of memory was allocated to it, less than a sixth of the PC's memory, with no disk activity recorded during computation.

When you have completed the analysis, you can save your tree or compute cluster memberships for any section, and these can be saved for further analysis, e.g. in Excel.  However, a tree of 100,000 nodes can be somewhat difficult to display, so we usually truncate it to a manageable number of final clusters which conveniently represents the cluster model of interest. 

The following screen-shot shows the tree of 100,000 cases truncated to the last 50 clusters, where the numbers in square brackets are the cluster sizes.  The summary model can now be profiled for any section of the tree from 50 to 2 clusters, and the summary tree can be saved as a ClustanGraphics file for further analysis.  For example, you can use it with the ClustanGraphics procedure Classify to identify the cluster of best fit for new cases by reference to the model.

Summary tree for a hierarchical cluster analysis of 100,000 cases, sectioned at the 10-cluster solutionHere we display the summary tree for 50 clusters, shaded in yellow at the 10-cluster section.  Each cluster is labelled by a case member, with the cluster sizes in brackets (average 2000 cases per cluster).  By pointing the cursor at a section and clicking the mouse you select any sub-model for profiling.  In this example, the 10-cluster section was selected because it was the highest partition identified by Best Cut as significant.

We can now create a cluster model of any section of the tree into 50 or fewer clusters.  Options include profiling cluster means and membership, or plotting the clusters against explanatory variables such as principal components.  A section of the tree can also be used as a starting classification for k-means, or to identify outliers, or to classify new cases.

We ran several ClustanGraphics timing trials on different sized data matrices, each with 10 variables (columns) and different numbers of cases, with the following results:Hierarchical cluster analysis of 100 cases by 40,000 variables (DNA sequence data)
10,000 cases - 30 seconds
20,000 cases - 105 seconds
40,000 cases - 7 minutes
60,000 cases - 16 minutes
80,000 cases - 28 minutes
100,000 cases - 50 minutes
200,000 cases - 3 hours

ClustanGraphics will cluster datasets smaller than 5,000 cases almost instantaneously - check that for performance against the competition!  Even if you don't have clustering applications this large at present, it's reassuring to know that ClustanGraphics is scable when the really big ones come along.  These times were obtained using a Dell Dimension 4500 with a 2.5GHz Pentium processor and 256MB memory running Windows XP.

One of our users maps geographic regions on remotely sensed satellite data.  Each map typically comprises around 80,000 cells or pixels, each with 44 measurements - that's a dataset of over 3m cells!   The cells are clustered hierarchically by Average Linkage (UPGMA) and the tree is saved so that any level of classification from coarse to fine can be instantly displayed graphically in a viewer - details here.  Apparently he waited years for an affordable clustering package that can do this!  For reference, "Package Plod" - our generic term for the competition - would need 25GB to store the proximity matrix alone - not really a practical proposition.

What if I have more than 10 variables, you ask?   With ClustanGraphics, it's a breeze!  There are no fixed limits to the number of variables or the number of cases because our internal working arrays are dimensioned dynamically.  The operational limit is the overall size of data matrix you can store in the memory available on your PC.  Although the above tests were completed on a PC with 256MB of memory, we have also run large datasets on a basic Pentium III PC with 32MB, and you can easily do much more on a higher-specification power desktop PC.  See Reading Large Datasets for further details.

Another of our current large-scale applications involves classifying DNA micro-arrays which can comprise 40,000 gene chips per sequence.  ClustanGraphics clustered a data matrix of 100 cases by 40,000 variables hierarchically in 12 seconds (see above).

How do we do it?  Please don't ask!  We've been plagiarized too many times to reveal how our best algorithms work.  We know that our competition struggle to classify more than about 5,000 cases by hierarchical cluster analysis, and we're not ready to help them out again.  Simply order ClustanGraphics and test it for yourself.

To order ClustanGraphics on-line click ORDER now.