DocFormer: End-to-End Transformer for Document Understanding (paper)

Year: 2021. Venue: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Citations: 283.

Topics: Artificial Intelligence; Domain Adaptation and Few-Shot Learning; Multimodal Machine Learning Applications; Handwritten Text Recognition Techniques

Abstract

We present DocFormer - a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).
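The abstract does not spell out the attention layer, so the following is only an illustrative sketch of the general idea it mentions: one spatial embedding table reused by both text and visual tokens, so that tokens at the same place on the page carry the same positional signal. The grid quantization, the identity query/key/value projections, the joint attention over the concatenated sequence, and all names (`SPATIAL_TABLE`, `spatial_embedding`, `fuse`) are assumptions made for illustration, not DocFormer's actual layer.

```python
import math
import random

DIM = 8    # feature width (hypothetical)
GRID = 4   # page quantized into GRID x GRID cells (hypothetical)

rng = random.Random(0)

# One spatial embedding table, reused by BOTH modalities (the "shared" part).
# Random values stand in for what would be learned parameters.
SPATIAL_TABLE = {
    (gx, gy): [rng.uniform(-1, 1) for _ in range(DIM)]
    for gx in range(GRID)
    for gy in range(GRID)
}


def spatial_embedding(box):
    """Map a normalized box (x0, y0, x1, y1) to the embedding of its center cell."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    gx = min(int(cx * GRID), GRID - 1)
    gy = min(int(cy * GRID), GRID - 1)
    return SPATIAL_TABLE[(gx, gy)]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Scaled dot-product attention weights with identity Q/K projections."""
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens])
        for q in tokens
    ]


def fuse(text_feats, text_boxes, vis_feats, vis_boxes):
    """Add the shared spatial embedding to each modality, then attend jointly."""
    text = [add(f, spatial_embedding(b)) for f, b in zip(text_feats, text_boxes)]
    vis = [add(f, spatial_embedding(b)) for f, b in zip(vis_feats, vis_boxes)]
    tokens = text + vis
    weights = attention_weights(tokens)
    # Each output is the attention-weighted average of all tokens (identity V).
    outputs = [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(DIM)]
        for row in weights
    ]
    return outputs, weights


# Toy input: two word features and two image-patch features with page boxes.
text_feats = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2)]
vis_feats = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2)]
text_boxes = [(0.05, 0.05, 0.20, 0.10), (0.60, 0.80, 0.95, 0.85)]
vis_boxes = [(0.00, 0.00, 0.25, 0.25), (0.75, 0.75, 1.00, 1.00)]

outputs, weights = fuse(text_feats, text_boxes, vis_feats, vis_boxes)
print(len(outputs), len(outputs[0]))
```

In this toy setup the first word and the first image patch fall in the same grid cell, so both receive the identical spatial vector. That is the property the abstract credits with making it easy to correlate text with visual tokens.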

Authors (5)

R. Manmatha

Yusheng Xie

Bhargava Urala Kota

Bhavan Jasani
