NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents (paper)

Year: 1998 · Citations: 303

Topics: Enterprise Software · Web Data Mining and Analysis · Advanced Database Systems and Queries · Semantic Web and Ontologies

Abstract

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mi...
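The workflow in the abstract (the user marks a few regions, and a mining component generalizes from them) can be illustrated with a toy sketch. This is not NoDoSE's actual algorithm, which the abstract does not specify. The names `Region`, `infer_delimiter`, and `extend_regions` are hypothetical, and the inference rule (a single literal delimiter shared by all gaps between adjacent examples) is an assumption chosen for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Region:
    """A labeled span [start, end) of the document with nested child regions."""
    label: str
    start: int
    end: int
    children: list[Region] = field(default_factory=list)


def infer_delimiter(text: str, examples: list[tuple[int, int]]) -> str:
    """Guess the separator from user-marked example spans.

    Assumption: the text between every pair of adjacent examples is the
    same non-empty literal string. Anything more irregular is rejected.
    """
    ordered = sorted(examples)
    gaps = {
        text[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    }
    if len(gaps) != 1:
        raise ValueError("examples do not share a single delimiter")
    delimiter = gaps.pop()
    if not delimiter:
        raise ValueError("adjacent examples leave no delimiter between them")
    return delimiter


def extend_regions(
    text: str, parent: Region, examples: list[tuple[int, int]], label: str
) -> list[Region]:
    """Split the parent region on the inferred delimiter and attach children."""
    delimiter = infer_delimiter(text, examples)
    position = parent.start
    for part in text[parent.start:parent.end].split(delimiter):
        parent.children.append(Region(label, position, position + len(part)))
        position += len(part) + len(delimiter)
    return parent.children
```

Calling `extend_regions` again on each child, with examples marked inside that child, gives the hierarchical decomposition the abstract describes: a file region containing record regions, each containing field regions.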

Author

Brad Adelberg
