Impact of Similarity Measures on Web-page Clustering (paper)

Published 2000; cited 663 times

Topics: Text and Document Classification Technologies; Complex Network Analysis Techniques; Advanced Clustering Algorithms Research

Abstract

Clustering of web documents enables (semi-)automated categorization and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and their associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as Yahoo that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hyper-graph partitioning, generalized k-means, weighted graph partitioning), on high-dimensional sparse data rep...
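To make the four similarity measures named in the abstract concrete, here is a minimal Python sketch using common textbook definitions. The abstract itself does not give formulas, so everything below is an assumption rather than the paper's exact formulation. In particular, the exp(-d^2) mapping from Euclidean distance to a similarity, the function names, and the toy term-count vectors are illustrative choices.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean_similarity(x, y):
    # Assumption: turn squared Euclidean distance into a similarity in (0, 1]
    # via exp(-d^2). Other monotone mappings are equally plausible.
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2)


def cosine_similarity(x, y):
    # Cosine of the angle between x and y; both vectors must be nonzero.
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def pearson_similarity(x, y):
    # Pearson correlation is the cosine of the mean-centered vectors.
    # Both vectors must be non-constant, otherwise the denominator is zero.
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return cosine_similarity([a - mx for a in x], [b - my for b in y])


def extended_jaccard_similarity(x, y):
    # Extended (Tanimoto) Jaccard: x.y / (|x|^2 + |y|^2 - x.y).
    # At least one vector must be nonzero.
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


# Hypothetical term-count vectors: doc_b is doc_a with every count doubled.
doc_a = [2, 0, 1, 0]
doc_b = [4, 0, 2, 0]

for name, fn in [
    ("euclidean", euclidean_similarity),
    ("cosine", cosine_similarity),
    ("pearson", pearson_similarity),
    ("extended jaccard", extended_jaccard_similarity),
]:
    print(f"{name}: {fn(doc_a, doc_b):.4f}")
```

Because doc_b is a scaled copy of doc_a, cosine and Pearson treat the two as maximally similar, while the Euclidean-based and extended Jaccard measures penalize the difference in vector length. This kind of qualitative disagreement is why the choice of measure can affect cluster quality.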

Authors

Raymond J. Mooney

Joydeep Ghosh

Alexander L. Strehl
