Parallel crawlers (paper)

2002, cited by 314

Web Data Mining and Analysis; Algorithms and Data Compression; Caching and Content Delivery

Abstract

In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.

Authors

Héctor García-Molina

Junghoo Cho
