Classifying New Cases 

ClustanGraphics includes Classify New Cases, which classifies new cases by reference to any hierarchical or k-means classification.  Datasets of any size can be classified because Classify Cases requires only a single pass through the data, which can be read sequentially from a text file.  See banking for a case study in Customer Relationship Management which involved over 4 million cases.
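
To illustrate why a single sequential pass is enough, here is a minimal sketch in Python.  It is not ClustanGraphics code: it assumes the classification is summarised by one centre per cluster, that each line of the text file holds the whitespace-separated values for one case, and that proximity is squared Euclidean distance.  The names `classify_stream` and `centres` are hypothetical.

```python
import io
from typing import Iterable, Iterator, List, Sequence, Tuple


def squared_distance(case: Sequence[float], centre: Sequence[float]) -> float:
    """Squared Euclidean distance (an assumed proximity measure)."""
    return sum((x - c) ** 2 for x, c in zip(case, centre))


def classify_stream(
    lines: Iterable[str], centres: List[Sequence[float]]
) -> Iterator[Tuple[int, int, float]]:
    """Yield (case number, nearest cluster, distance) one case at a time.

    Only the current line is held in memory, so the size of the input
    does not matter.
    """
    case_number = 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        case_number += 1
        values = [float(v) for v in line.split()]
        distances = [squared_distance(values, c) for c in centres]
        best = min(range(len(centres)), key=distances.__getitem__)
        yield case_number, best + 1, distances[best]  # clusters numbered from 1


# Usage with a small in-memory "file"; a real run would pass an open text file.
centres = [(0.0, 0.0), (10.0, 10.0)]
sample = io.StringIO("1 1\n9 11\n")
results = list(classify_stream(sample, centres))
print(results)
```

Because each case is classified and written out before the next one is read, memory use depends on the number of clusters and variables, not on the number of cases.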

Classify New Cases handles missing values, and can be used with mixed data types - continuous, binary, ordinal or nominal variables.  Such variables are typically found in complex survey questionnaires, and generally need no tedious pre-processing before they can be classified.  ClustanGraphics takes care of the transformations internally, leaving you more time to focus on the post-clustering analysis.
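
This page does not describe how ClustanGraphics performs these transformations, so the following is only a sketch of one common approach, a Gower-style dissimilarity, and should not be read as the method ClustanGraphics uses.  It assumes each cluster is described by a reference profile, that numeric variables are scaled by a known range, that nominal and binary variables score 0 for a match and 1 for a mismatch, and that a missing value (`None`) simply drops that variable from the average.  The function name and arguments are hypothetical.

```python
from typing import Optional, Sequence


def mixed_dissimilarity(
    case: Sequence[object],
    profile: Sequence[object],
    kinds: Sequence[str],
    ranges: Sequence[Optional[float]],
) -> float:
    """Average per-variable dissimilarity over the variables present in both."""
    total = 0.0
    used = 0
    for x, p, kind, value_range in zip(case, profile, kinds, ranges):
        if x is None or p is None:
            continue  # missing value: skip this variable
        if kind in ("nominal", "binary"):
            total += 0.0 if x == p else 1.0
        elif value_range:
            # continuous or ordinal: absolute difference scaled to 0..1
            total += abs(x - p) / value_range
        used += 1
    if used == 0:
        raise ValueError("no variables in common")
    return total / used


# One case with a missing second value, compared with one cluster profile.
kinds = ("continuous", "continuous", "nominal", "ordinal")
ranges = (10.0, 6.0, None, 4.0)
case = (5.0, None, "yes", 2)
profile = (7.0, 3.0, "no", 2)
d = mixed_dissimilarity(case, profile, kinds, ranges)
print(round(d, 3))
```

Here three variables contribute (0.2, 1.0 and 0.0), and the missing one is left out of the average rather than being imputed.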

In the example shown below, we are classifying new cases at the 6 cluster level in a tree generated by Cluster Data for 500 cases.  There were 4 variables in the original dataset, so we need to specify 4 values for each new case.  We could enter the new cases interactively as directed.  Since we have quite a few to enter, however, we put them in an input file, which we have already specified by clicking "Input File".

We also specified an output file, and chose to save the model's results for further analysis.  The Classify screen shows the data for case 11 from our input file, and its proximities to each of the last 6 clusters in our tree.  A neat bar chart quickly confirms that cluster 2 is the best fit for case 11 at a distance of 3.24, and we can also see that the case is next closest to cluster 4 at a distance of 9.38.
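
The same reading of the bar chart can be reproduced with a few lines of Python.  This is only an illustrative sketch: the six distances are those recorded for case 11 in the output file, and the text bars stand in for the on-screen chart.

```python
# Distances from case 11 to clusters 1-6, as recorded in the output file.
proximities = {1: 21.470, 2: 3.240, 3: 14.352, 4: 9.376, 5: 16.620, 6: 20.960}

# Rank the clusters from nearest to furthest.
ranked = sorted(proximities.items(), key=lambda item: item[1])
best, runner_up = ranked[0], ranked[1]

# A crude text bar chart: one '#' per unit of distance (rounded).
for cluster, distance in sorted(proximities.items()):
    print(f"cluster {cluster}: {'#' * round(distance)} {distance:.3f}")

print("best fit:", best)
print("next closest:", runner_up)
```

The shortest bar belongs to cluster 2, followed by cluster 4, in agreement with the Classify screen.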

If we have a lot of data in our input file, we can click "Run Model".  This takes the model out of interactive mode, and simply runs it for the rest of the input file, posting the classification results to the output file.  The resulting output file looks like this:

Case  Cluster  Distance  Distances to clusters 1-6
10    4        1.126     30.020  16.084  17.352  1.126   14.670  14.810
11    2        3.240     21.470  3.240   14.352  9.376   16.620  20.960
12    3        3.852     6.770   18.240  3.852   13.103  19.020  14.360
13    2        6.178     21.670  6.178   38.352  31.376  27.220  21.960
14    1        1.070     1.070   11.990  10.102  19.921  17.720  12.060
..... etc
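
If you want to read such a file in your own code, the sketch below parses the five rows above and checks that each case's cluster is indeed the one with the smallest distance.  It assumes whitespace-separated columns in the order case number, cluster, distance to that cluster, then the distances to clusters 1 to 6; the exact delimiters and header written by ClustanGraphics may differ, and the function name `parse_row` is hypothetical.

```python
rows = """\
10 4 1.126 30.020 16.084 17.352 1.126 14.670 14.810
11 2 3.240 21.470 3.240 14.352 9.376 16.620 20.960
12 3 3.852 6.770 18.240 3.852 13.103 19.020 14.360
13 2 6.178 21.670 6.178 38.352 31.376 27.220 21.960
14 1 1.070 1.070 11.990 10.102 19.921 17.720 12.060
"""


def parse_row(line: str):
    """Split one output line into (case, cluster, distance, all distances)."""
    fields = line.split()
    case, cluster = int(fields[0]), int(fields[1])
    distance = float(fields[2])
    distances = [float(f) for f in fields[3:]]
    return case, cluster, distance, distances


parsed = [parse_row(line) for line in rows.splitlines() if line.strip()]

# Check each row: the assigned cluster should have the smallest distance.
consistent = []
for case, cluster, distance, distances in parsed:
    nearest = distances.index(min(distances)) + 1  # clusters numbered from 1
    consistent.append(nearest == cluster and min(distances) == distance)
    print(case, cluster, distance, "ok" if consistent[-1] else "mismatch")
```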

We're showing only 5 rows here from what is actually a much longer output file.  Case 11 is seen to be nearest to cluster 2 at a distance of 3.24.  In addition, we chose to write all the proximities between the cases and the 6 clusters to the output file.  The output file for Classify New Cases can be specified quite flexibly, as shown below:

Once the output file has been created, it can be easily copied into a spreadsheet for further analysis, as below:

We opened the Classify output file "ClassifyResults.txt" directly in Excel and then sorted the cases on Cluster and Distance.  To save space, we have truncated the table to 5 of the 50 rows, including case 11, which is shown as classified in cluster 2 at a distance of 3.24.  You're now set to do any further analysis you want on the results.
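
The same sort on Cluster and then Distance can be done without a spreadsheet.  The sketch below is illustrative only: it uses the five rows listed earlier on this page, typed in as (case, cluster, distance) tuples, rather than reading "ClassifyResults.txt" itself.

```python
from collections import Counter

# (case, cluster, distance) for the five rows shown in the output listing.
results = [
    (10, 4, 1.126),
    (11, 2, 3.240),
    (12, 3, 3.852),
    (13, 2, 6.178),
    (14, 1, 1.070),
]

# Sort by cluster first, then by distance within each cluster.
by_cluster = sorted(results, key=lambda row: (row[1], row[2]))
for case, cluster, distance in by_cluster:
    print(f"{case:>4} {cluster:>7} {distance:>8.3f}")

# Count how many cases fall in each cluster.
sizes = Counter(cluster for _, cluster, _ in results)
print(dict(sorted(sizes.items())))
```

Within cluster 2, case 11 sorts ahead of case 13 because it lies closer to the cluster.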

Classify Cases can be run on any size of dataset comprising either standard continuous data or mixed data types.  See banking for an example of a customer segmentation study on 4.2 million cases.