An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. (Paper)

1997 · Cited by 354

Topics: Cybersecurity; Advanced Data Storage Technologies; Cryptography and Data Security; Privacy-Preserving Technologies in Data

Abstract

Detecting database records that are approximate duplicates, but not exact duplicates, is an important task. Databases may contain duplicate records concerning the same real-world entity because of data entry errors, because of unstandardized abbreviations, or because of differences in the detailed schemas of records from multiple databases, among other reasons. In this paper, we present an efficient algorithm for recognizing clusters of approximately duplicate records. Three key ideas distinguish the algorithm presented. First, a version of the Smith-Waterman algorithm for computing minimum edit-distance is used as a domain-independent method to recognize pairs of approximately duplicate records. Second, the union/find algorithm is used to keep track of clusters of duplicate records incrementally, as pairwise duplicate relationships are discovered. Third, the algorithm uses a priority queue of cluster subsets to respond adaptively to the size and homogeneity of the clusters discovered as...
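The first two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scoring parameters (`match`, `mismatch`, `gap`), the normalization in `similarity`, the `threshold` value, and all function and class names are assumptions made for illustration. It also compares all pairs of records and omits the priority queue of cluster subsets that the abstract names as the third idea.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between strings a and b.

    Linear gap penalty; the parameter values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def similarity(a, b, match=2):
    """Score scaled to [0, 1] by the best score the shorter string could get.

    This normalization is an assumption of the sketch.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman_score(a, b, match=match) / (match * shorter)


class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]


def cluster_duplicates(records, threshold=0.8):
    """Group record indices into clusters of approximate duplicates."""
    uf = UnionFind(len(records))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if uf.find(i) == uf.find(j):
                continue  # already clustered together; skip the costly comparison
            if similarity(records[i], records[j]) >= threshold:
                uf.union(i, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(uf.find(i), []).append(i)
    return sorted(clusters.values())


print(cluster_duplicates(["John Smith", "Jon Smith", "Mary Jones"]))
# [[0, 1], [2]]
```

In this sketch the union/find structure does two jobs: it records each discovered duplicate pair incrementally, and it lets the loop skip a string comparison whenever two records are already known to share a cluster.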

Authors

Charles Elkan

Alvaro Monge
