Reading Large Datasets 

ClustanGraphics can read a dataset of any size.  There are no fixed limits on the number of variables or the number of cases, because our working arrays are dimensioned dynamically.  The only operational limit is the size of the data matrix that fits in the memory available on your PC.
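
As a rough illustration of that operational limit, the following Python sketch estimates how much memory a dense data matrix would need.  It assumes 4 bytes per cell, which is our assumption (though it is consistent with the 5000 by 1000 dataset of over 20MB mentioned further down this page).  The function names are hypothetical and are not part of ClustanGraphics.

```python
def matrix_bytes(rows: int, cols: int, bytes_per_cell: int = 4) -> int:
    """Storage needed for a dense rows x cols data matrix.

    bytes_per_cell=4 is an assumption (single-precision values),
    not a documented ClustanGraphics setting.
    """
    return rows * cols * bytes_per_cell


def fits_in_memory(rows: int, cols: int, available_bytes: int,
                   bytes_per_cell: int = 4) -> bool:
    """True if the whole data matrix fits in the given amount of memory."""
    return matrix_bytes(rows, cols, bytes_per_cell) <= available_bytes


# The three dataset shapes discussed on this page.
for rows, cols in [(100_000, 20), (50, 40_000), (5_000, 1_000)]:
    size_mb = matrix_bytes(rows, cols) / 1e6
    print(f"{rows:,} x {cols:,}: {rows * cols:,} cells, about {size_mb:.0f} MB")
```

The point of the sketch is that the limit depends on the product of rows and columns, not on either dimension alone.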

To demonstrate this we successfully read a data matrix of 100,000 rows and 20 columns, or 2m cells in total, as follows:

Reading a million cases

We also completed a hierarchical cluster analysis on this data matrix, using a Pentium II 400MHz processor.  See Clustering Large Datasets.

In another study, we had a very large number of variables.  The following screen confirms ClustanGraphics reading a dataset of 50 rows and 40,000 columns, again 2m cells in total.

Reading data with 40,000 variables

We went on to produce a hierarchical cluster analysis on this dataset in under 3 minutes.  The results are shown in Clustering Large Datasets, which describes how ClustanGraphics has been used to produce hierarchical cluster analyses on various datasets of up to 100,000 rows and up to 40,000 columns.  This far exceeds the analytical scope of any other published software.  It means that ClustanGraphics is the perfect choice for clustering in large survey and data mining applications.

Don't expect to be able to calculate a proximity matrix for a dataset of 100,000 cases.  It's simply not technically feasible, for the reasons set out in Very Large Surveys and Databases.  This also explains why our Cluster Data Matrix procedure is well ahead of what other packages can offer.  However, you can calculate a proximity matrix in ClustanGraphics for 5000 cases or more, and it can then be analyzed by Cluster Proximities.
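
One way to see the difference between 5000 and 100,000 cases is to count the entries in a proximity matrix.  The sketch below is our own arithmetic, not a summary of Very Large Surveys and Databases.  It assumes that only the lower triangle is stored, n(n-1)/2 entries for n cases, at 4 bytes per entry; both the storage layout and the entry size are assumptions, and the function names are hypothetical.

```python
def proximity_entries(n_cases: int) -> int:
    """Number of distinct pairs of cases: n(n-1)/2 (lower triangle only)."""
    return n_cases * (n_cases - 1) // 2


def proximity_bytes(n_cases: int, bytes_per_entry: int = 4) -> int:
    """Storage for the lower triangle; 4 bytes per entry is an assumption."""
    return proximity_entries(n_cases) * bytes_per_entry


for n in (5_000, 100_000):
    print(f"{n:,} cases: {proximity_entries(n):,} proximities, "
          f"about {proximity_bytes(n) / 1e6:,.0f} MB")
```

Because the count grows with the square of the number of cases, a 20-fold increase in cases gives roughly a 400-fold increase in proximities.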

When reading space-delimited files, ClustanGraphics has to examine every character and this can take some time.  We therefore offer an option to read large datasets from binary files.  For example, the following dataset consists of 5000 cases by 1000 variables, and has a size of over 20MB:

Reading a data matrix of 5m cells

This file was read on a Pentium II 400MHz PC in about 3 seconds by ClustanGraphics, and a k-means analysis on it took under 100 seconds.
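
To illustrate why a binary file can be read faster than a space-delimited one, here is a minimal Python sketch of the two approaches.  The binary layout used here (two integers for the row and column counts, followed by the rows as 4-byte floats) is purely hypothetical; the actual ClustanGraphics binary format is not described on this page.  The text reader has to scan and convert every token, whereas the binary reader copies each row in one block.

```python
import os
import tempfile
from array import array


def write_text(path, rows):
    """Write the matrix as space-delimited text, one case per line."""
    with open(path, "w") as f:
        for row in rows:
            f.write(" ".join(f"{v:.6g}" for v in row) + "\n")


def read_text(path):
    """Read space-delimited text: every token is scanned and converted."""
    with open(path) as f:
        return [[float(tok) for tok in line.split()] for line in f]


def write_binary(path, rows):
    """Write a hypothetical layout: (n_rows, n_cols) header, then float rows."""
    with open(path, "wb") as f:
        array("i", [len(rows), len(rows[0])]).tofile(f)
        for row in rows:
            array("f", row).tofile(f)


def read_binary(path):
    """Read the hypothetical layout: each row is loaded as one block."""
    with open(path, "rb") as f:
        header = array("i")
        header.fromfile(f, 2)
        n_rows, n_cols = header
        rows = []
        for _ in range(n_rows):
            row = array("f")
            row.fromfile(f, n_cols)
            rows.append(list(row))
        return rows


# Small demonstration matrix; the values are multiples of 0.5, so they
# survive both the text round trip and the 4-byte float round trip exactly.
data = [[i + 0.5 * j for j in range(4)] for i in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "matrix.txt")
    bin_path = os.path.join(tmp, "matrix.bin")
    write_text(text_path, data)
    write_binary(bin_path, data)
    from_text = read_text(text_path)
    from_binary = read_binary(bin_path)
    binary_size = os.path.getsize(bin_path)

print(from_text == data, from_binary == data, binary_size)
```

The binary file also has a predictable size: a fixed number of bytes per cell plus the small header, which is consistent with a 5m-cell matrix occupying a little over 20MB at 4 bytes per cell.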

ClustanGraphics can also read Excel workbooks, and you can paste data into it from any spreadsheet or word processor.

Banking describes a data mining application in which we obtained a tree for a sample of 16,000 cases, which was then used to classify the 4.2m customers of a bank.  This is an example of large-scale customer segmentation in marketing, an area in which we offer academic research and technical expertise.

If you are using, or have used, a CHAID algorithm to split large datasets, you might care to read our critique Clustering versus Decision Trees.

To order ClustanGraphics on-line click ORDER now.