NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents (paper)

1998 · Citations: 303
Web Data Mining and Analysis; Advanced Database Systems and Queries; Semantic Web and Ontologies

Abstract

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mi...
