Cluster-based language models for distributed retrieval (paper)

1999, cited 296 times
Information Retrieval and Search Behavior, Topic Modeling, Web Data Mining and Analysis

Abstract

Effective retrieval in a distributed environment is an important but difficult problem. Lack of effectiveness appears to have three causes. First, collection selection based on word histograms is not appropriate for heterogeneous collections. Second, relevant documents are scattered over many collections and searching a few collections misses many relevant documents. Third, most existing collection selection metrics lack sound theoretical justifications and hence may not be well tuned to the problem. We propose a new approach to distributed retrieval based on document clustering and language modeling. Document clustering is used to organize collections around topics. Language modeling is used to properly represent topics and effectively select the right topics for a query. Based on these ideas, three methods are proposed to suit different environments. We show that all three methods improve effectiveness of distributed retrieval.

1 Introduction

Information has become highly distribut...
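
As a rough illustration of the topic-selection idea in the abstract, the sketch below represents each topic by a unigram language model and ranks topics by how likely they are to generate the query. The abstract does not say which scoring function or smoothing the paper uses, so the query-likelihood score, the linear interpolation with a background model, the weight `lam`, and the names `unigram_counts` and `rank_topics` are all assumptions made for this example. The topic clusters are also supplied by hand here rather than produced by a clustering algorithm.

```python
import math
from collections import Counter


def unigram_counts(docs):
    """Pool whitespace-tokenized documents into one term-count table."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def rank_topics(query, topics, lam=0.5):
    """Rank topic names by smoothed query log-likelihood, best first.

    Assumption: each topic model is interpolated with a background model
    built from all topics, weighted by the hypothetical parameter `lam`.
    """
    topic_counts = {name: unigram_counts(docs) for name, docs in topics.items()}

    # Background model: term counts pooled over every topic.
    background = Counter()
    for counts in topic_counts.values():
        background.update(counts)
    bg_total = sum(background.values())

    scores = {}
    for name, counts in topic_counts.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            p_topic = counts[term] / total if total else 0.0
            p_bg = background[term] / bg_total if bg_total else 0.0
            p = lam * p_topic + (1 - lam) * p_bg
            # Terms unseen in every topic get a tiny floor so log() is defined.
            score += math.log(p if p > 0 else 1e-12)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy topics standing in for the output of document clustering.
topics = {
    "sports": ["the team won the match", "players scored late in the game"],
    "finance": ["the bank raised interest rates", "stock markets fell sharply"],
}
ranking = rank_topics("interest rates", topics)
print(ranking)
```

In a distributed setting, only the few top-ranked topics would then be searched, which is where the quality of the topic models matters most.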