Survey of Post-OCR Processing Approaches (Paper)

2021 | ACM Computing Surveys | Citations: 251

Handwritten Text Recognition Techniques | Natural Language Processing Techniques | Web Data Mining and Analysis

Abstract

Optical character recognition (OCR) is one of the most popular techniques used for converting printed documents into machine-readable ones. While OCR engines can do well with modern text, their performance is unfortunately significantly reduced on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies the current trend and outlines some research directions of this field.

Authors

Antoine Doucet

Mickaël Coustaty

Adam Jatowt

Thi Tuyet Haï Nguyen
