**Two-Stage k-Means Clustering Method with Outlier Detection and Model Calibration**

D. Wishart^{1}, S. Simmons^{2}

^{1}Department of Management, University of St. Andrews; ^{2}McKinsey & Co, London

*Keywords*: ClustanGraphics, cluster analysis, conjoint surveys, data mining, k-means, knowledge management, market segmentation, outliers
The paper describes a two-stage k-means clustering method that searches for an optimum cluster model. The first stage provides six starting strategies, randomized trials, differential variable and case weights, outlier deletion, and saving and profiling of the top solutions. In the second stage, outliers are re-assigned to clusters and the model is calibrated. The six starting strategies are tree partition, dense cliques, cluster exemplars, contingency table, initial points, and random assignment; any combination of these can be used to specify the initial search partitions for k-means analysis. The Euclidean Sum of Squares is optimized, and it will be shown that the algorithm always converges given sufficient iterations.
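The convergence argument can be illustrated with a minimal k-means sketch: the assignment step and the centroid-update step each can only lower the Euclidean Sum of Squares, so the criterion decreases monotonically until the partition stops changing. This is a generic illustration, not the paper's implementation; all function names are ours.

```python
import numpy as np

def euclidean_sum_of_squares(X, labels, centroids):
    """Sum of squared distances from each case to its cluster centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

def kmeans(X, k, seed=0, max_iter=100):
    """Basic k-means; each iteration can only decrease the ESS."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: nearest centroid (never increases the ESS).
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: the cluster mean minimizes within-cluster squares.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # fixed point reached: the partition has converged
        centroids = new
    return labels, centroids
```

Because the ESS is bounded below by zero and strictly decreases while the partition changes, the loop must terminate at a (locally) optimal partition.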
Outliers and intermediate cases can be excluded from the model, thereby tightening the core clusters and separating densely populated segments.
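One simple way to realize this exclusion, sketched below, is to set aside cases whose distance to their nearest centroid exceeds a cutoff; the percentile rule here is our assumption for illustration, as the paper does not state its deletion criterion.

```python
import numpy as np

def split_outliers(X, centroids, percentile=90):
    """Separate core cases from outliers by distance to nearest centroid.

    The percentile cutoff is an illustrative assumption, not the
    paper's rule. Returns (core cases, excluded outliers).
    """
    dist = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2)
                   .sum(axis=2).min(axis=1))
    cutoff = np.percentile(dist, percentile)
    core = dist <= cutoff
    return X[core], X[~core]
```

Re-clustering only the core cases then yields tighter, better-separated segments.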
The variables used to specify the cluster model can be selected from the input data matrix and standardized or assigned
differential weights. Cases can also be assigned differential weights, and random split samples can be selected for cluster
model validation. Weighting of cases is particularly important for aggregated units, such as geographical regions weighted by their population sizes, or customers weighted by their transaction values or volumes.
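In the centroid update, a case weight acts as if the case were replicated that many times, so a region weighted by its population pulls the centroid proportionally. A minimal sketch (the weights shown are illustrative):

```python
import numpy as np

def weighted_centroid(X, w):
    """Centroid where case i counts with weight w[i], e.g. a region's
    population or a customer's transaction value."""
    w = np.asarray(w, dtype=float)
    return (w[:, None] * X).sum(axis=0) / w.sum()
```

With weights [1, 3] on the cases [0] and [1], the centroid is 0.75 rather than the unweighted 0.5.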
A specified number of the top convergent solutions can be saved, together with their criterion values and frequencies of
occurrence. Any of these top solutions can then be used as a core cluster model for second-stage calibration, evaluation, and case assignment.
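The randomized-trials bookkeeping can be sketched as follows: run k-means from many random starts, canonicalize each final partition so that permuted cluster numbers compare equal, and tabulate the distinct solutions with their criterion values and frequencies. This is our self-contained illustration, not the paper's software.

```python
from collections import Counter
import numpy as np

def canonical(labels):
    # Relabel clusters by order of first appearance so identical
    # partitions with permuted cluster numbers compare equal.
    m = {}
    return tuple(m.setdefault(x, len(m)) for x in labels)

def kmeans_once(X, k, rng, iters=50):
    """One randomized k-means trial; returns (partition, ESS)."""
    c = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        c = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                      else c[j] for j in range(k)])
    ess = float(((X - c[labels]) ** 2).sum())
    return canonical(labels), round(ess, 9)

def top_solutions(X, k, trials=25, keep=3, seed=0):
    """Return (ESS, frequency) for the best distinct convergent partitions."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(trials)]
    freq = Counter(lab for lab, _ in runs)
    ess = {lab: e for lab, e in runs}
    best = sorted(freq, key=lambda lab: ess[lab])[:keep]
    return [(ess[lab], freq[lab]) for lab in best]
```

A high frequency of occurrence for the best-criterion solution is informal evidence that the optimum is stable rather than an artifact of one lucky start.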
Second-stage model calibration involves selecting the subset of the best discriminatory variables for the core cluster model;
weighting variables by t-tests on cluster means or F-tests on cluster variances, to focus the classificatory sub-space;
summarizing the model by hierarchical cluster analysis; optimal ordering of the core clusters; and assigning outliers and
intermediate cases. The model can then be used to classify new cases, such as in data mining applications of any size.
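The variable-weighting step can be illustrated with an F-like score: rate each variable by its between-cluster to within-cluster variance ratio, so variables that discriminate the clusters dominate the classificatory sub-space. The exact weighting rule is the paper's; this stand-in is our assumption for illustration.

```python
import numpy as np

def f_ratio_weights(X, labels):
    """Weight each variable by between- vs within-cluster variance
    (an F-like statistic), normalized to sum to 1."""
    grand = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    groups = np.unique(labels)
    for j in groups:
        G = X[labels == j]
        between += len(G) * (G.mean(axis=0) - grand) ** 2
        within += ((G - G.mean(axis=0)) ** 2).sum(axis=0)
    k, n = len(groups), len(X)
    f = (between / (k - 1)) / (within / (n - k) + 1e-12)
    return f / f.sum()
```

A variable that separates the clusters cleanly receives a far larger weight than a pure-noise variable, which is what focuses the sub-space used for classifying new cases.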
The paper will be illustrated by applications in market segmentation. New software for Windows 95/2000 and NT will be demonstrated.