An adaptive model for optimizing performance of an incremental web crawler
Abstract
This paper outlines the design of a web crawler implemented for IBM Almaden's WebFountain project and describes an optimization model for controlling the crawl strategy. This crawler is scalable and incremental. The model makes no assumptions about the statistical behaviour of web page changes, but rather uses an adaptive approach to maintain data on actual change rates, which are in turn used as inputs for the optimization. Computational results with simulated but realistic data show that there is no 'magic bullet': different, but equally plausible, objectives lead to conflicting 'optimal' strategies. However, we find that there are compromise objectives which lead to good strategies that are robust against a number of criteria.

Categories and Subject Descriptors
H3.4 [Systems and Software]: Performance Evaluation (efficiency and effectiveness); H4.3 [Communications Applications]: Information Browsers; G1.6 [Optimization]: Nonlinear Programming

General Terms
Algorithms, Experimentation, Performance

Keywords
Crawler, incremental crawler, scalability, optimization

This work was completed while the author was on leave at IBM Almaden Research Center.

Copyright is held by the author/owner. WWW10, May 1-5, 2001, Hong Kong. ACM 1-58113-348-0/01/0005.

1.