The k-means clustering technique: General considerations and implementation in Mathematica (paper)
Abstract
Data clustering techniques are valuable tools for researchers working with large databases of multivariate data. In this tutorial, we present a simple yet powerful one, the k-means clustering technique, through three different algorithms: the Forgy/Lloyd algorithm, the MacQueen algorithm, and the Hartigan & Wong algorithm. We then present an implementation in Mathematica and various examples of the different options available to illustrate the application of the technique. Data clustering techniques are descriptive data analysis techniques that can be applied to multivariate data sets to uncover the structure present in the data. They are particularly useful when classical second-order statistics (the sample mean and covariance) cannot be used. In particular, exploratory data analysis assumes that no prior knowledge about the dataset, and therefore about its distribution, is available. In such a situation, data clustering can be a valuable tool. Data