Scalability for clustering algorithms revisited (paper)

2000, ACM SIGKDD Explorations Newsletter. Citations: 248

Algorithms and Data Compression; Data Management and Algorithms; Advanced Clustering Algorithms Research

Abstract

This paper presents a simple new algorithm that performs k-means clustering in one scan of a dataset, while using a fixed-size buffer for points from the dataset. Experiments show that the new method is several times faster than standard k-means, and that it produces clusterings of equal or almost equal quality. The new method is a simplification of an algorithm due to Bradley, Fayyad and Reina that uses several data compression techniques in an attempt to improve speed and clustering quality. Unfortunately, the overhead of these techniques makes the original algorithm several times slower than standard k-means on materialized datasets, even though standard k-means scans a dataset multiple times. Also, lesion studies show that the compression techniques do not improve clustering quality. All results hold for 400-megabyte synthetic datasets and for a dataset created from the real-world data used in the 1998 KDD data mining contest. All algorithm implementations and experiments are designed so that results generalize to datasets of many gigabytes and larger.
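
The one-scan, fixed-buffer idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simple scheme in which each full buffer is clustered together with per-cluster running sums and counts of previously discarded points, after which the buffer is folded into those summaries and emptied. The function names (`single_pass_kmeans`, `nearest`), the parameters (`buffer_size`, `iters`, `seed`), the random initialization from the first buffer, and the handling of empty clusters (keep the old center) are all assumptions made for illustration.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
    )


def single_pass_kmeans(stream, k, buffer_size, iters=10, seed=0):
    """Cluster a stream of equal-length numeric tuples in one scan.

    Hypothetical sketch: only `buffer_size` raw points are held at a time;
    older points survive only as per-cluster sums and counts.
    Assumes the first buffer holds at least k points.
    """
    if iters < 1:
        raise ValueError("iters must be at least 1")
    rng = random.Random(seed)
    centers = sums = counts = None
    buffer = []

    def flush():
        nonlocal centers, sums, counts
        if centers is None:
            # Assumed initialization: k random points from the first buffer.
            centers = [list(p) for p in rng.sample(buffer, k)]
            sums = [[0.0] * len(buffer[0]) for _ in range(k)]
            counts = [0] * k
        for _ in range(iters):
            # Start each iteration from the summaries of discarded points,
            # which stay attached to the cluster they were folded into.
            new_sums = [list(s) for s in sums]
            new_counts = list(counts)
            for p in buffer:
                j = nearest(p, centers)
                new_counts[j] += 1
                for d, v in enumerate(p):
                    new_sums[j][d] += v
            # An empty cluster keeps its previous center (an assumption).
            centers = [
                [s / new_counts[j] for s in new_sums[j]] if new_counts[j] else centers[j]
                for j in range(k)
            ]
        # Fold the buffer into the summaries and discard the raw points.
        sums, counts = new_sums, new_counts
        buffer.clear()

    for point in stream:
        buffer.append(point)
        if len(buffer) == buffer_size:
            flush()
    if buffer:
        flush()  # cluster the final, partially filled buffer
    return centers, counts
```

Calling `single_pass_kmeans(iter(points), k=2, buffer_size=1000)` on any iterable of numeric tuples returns the final centers and the number of points summarized in each cluster. Memory use is bounded by the buffer plus `k` sums and counts, which is what would let such a scheme run over datasets larger than main memory.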

Authors (3)

Charles Elkan

James F. Lewis

Fredrik Farnstrom
