Fast and intuitive clustering of web documents (Paper)

1997 · Cited by 267
Algorithms and Data Compression; Web Data Mining and Analysis; Data Mining Algorithms and Applications

Abstract

Conventional document retrieval systems (e.g., Alta Vista) return long lists of ranked documents in response to user queries. Recently, document clustering has been put forth as an alternative method of organizing the results of a retrieval [4]. A person browsing the clusters can discover patterns that would be overlooked in the traditional ranked-list presentation. In this context, a document clustering algorithm has two key requirements. First, the algorithm ought to produce clusters that are easy to browse: a user needs to determine at a glance whether the contents of a cluster are of interest. Second, the algorithm has to be fast even when applied to thousands of documents with no preprocessing. This paper describes several novel clustering methods, which intersect the documents in a cluster to determine the set of words (or phrases) shared by all the documents in the cluster. We report on experiments that evaluate these intersection-based clustering methods on collections of sn...
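
The intersection idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the paper's actual algorithms, so everything here is an assumption made for illustration: the function names `tokenize`, `shared_words`, and `greedy_intersection_clustering`, the `min_shared` threshold, and the greedy merge strategy are all hypothetical. The sketch represents each cluster by the set of words common to all of its documents and repeatedly merges the two clusters whose shared set is largest.

```python
import re


def tokenize(text):
    """Lowercase a document and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_words(docs):
    """Return the words that occur in every document of a cluster."""
    word_sets = [tokenize(d) for d in docs]
    return set.intersection(*word_sets) if word_sets else set()


def greedy_intersection_clustering(docs, min_shared=2):
    """Greedily merge clusters by the size of their word intersection.

    Hypothetical illustration only, not the paper's method. Each cluster
    is a pair (document indices, words shared by all of its documents).
    Merging stops when no two clusters share at least `min_shared` words.
    """
    clusters = [([i], tokenize(d)) for i, d in enumerate(docs)]
    while True:
        best = None  # (index a, index b, shared word set)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                shared = clusters[a][1] & clusters[b][1]
                if len(shared) < min_shared:
                    continue
                if best is None or len(shared) > len(best[2]):
                    best = (a, b, shared)
        if best is None:
            return clusters
        a, b, shared = best
        merged = (clusters[a][0] + clusters[b][0], shared)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
```

The shared word set of each resulting cluster doubles as a label, which is one way a user could judge at a glance whether a cluster is of interest. This naive pairwise search recomputes every intersection on each pass and makes no attempt to meet the speed requirement stated in the abstract.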