Towards domain-independent information extraction from web tables (paper)

Published 2007. Cited by 228.

Web Data Mining and Analysis; Advanced Database Systems and Queries; Data Quality and Management

Enterprise Software

Abstract

Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of <table> tags. A multitude of different HTML implementations of web tables make these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of web pages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen. The thereby obtained topological and style information allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current “Visual Web.”
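To make the idea of working from rendered boxes rather than from markup concrete, the following is a minimal illustrative sketch, not the authors' algorithm. It assumes the pixel coordinates of each text box have already been obtained from a browser engine, and every name in it (`Box`, `cluster_positions`, `nearest_index`, `boxes_to_grid`, the `tolerance` parameter) is hypothetical. It recovers a row-and-column grid purely from the alignment of box edges, so it would behave the same whether the page was built with <table> tags or with some other markup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rendered text box: top-left position and size in pixels, plus its text."""
    x: float
    y: float
    width: float
    height: float
    text: str


def cluster_positions(values, tolerance):
    """Group nearly equal coordinates and return the smallest value of each group."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tolerance:
            groups[-1].append(v)  # close enough to the previous value: same edge
        else:
            groups.append([v])    # start a new edge
    return [g[0] for g in groups]


def nearest_index(value, representatives):
    """Index of the representative coordinate closest to value."""
    return min(range(len(representatives)),
               key=lambda i: abs(representatives[i] - value))


def boxes_to_grid(boxes, tolerance=3.0):
    """Arrange boxes into a row-major grid using only their left and top edges."""
    col_edges = cluster_positions([b.x for b in boxes], tolerance)
    row_edges = cluster_positions([b.y for b in boxes], tolerance)
    grid = [[None] * len(col_edges) for _ in row_edges]
    for b in boxes:
        row = nearest_index(b.y, row_edges)
        col = nearest_index(b.x, col_edges)
        grid[row][col] = b.text
    return grid


# Hypothetical coordinates, slightly misaligned as rendered layouts often are.
boxes = [
    Box(10, 5, 100, 20, "Name"),
    Box(120, 6, 60, 20, "Price"),
    Box(11, 30, 100, 20, "Widget"),
    Box(121, 30, 60, 20, "9.99"),
]
grid = boxes_to_grid(boxes)
```

This sketch uses only topological information (edge alignment). Style cues such as font weight or background color, which the abstract also mentions, would need additional fields on the box record and are left out here.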

Authors (5)

Bernhard Pollak

Bernhard Krüpl

Marcus Herzog

Paul Bohunsky
