In the digital era, where documents are generated and distributed at ever-increasing speed, it is crucial to be able to understand them automatically. Consider tasks such as extracting payment information from invoices or digitizing historical records, where layouts and handwritten notes play an important role in understanding context. These scenarios highlight the complexity of document understanding, which requires not only recognizing text but also interpreting visual elements and their spatial relationships.
At this year’s meeting of the Association for the Advancement of Artificial Intelligence (AAAI 2024), we presented a model we call DocFormerv2, which doesn’t just read documents but understands them, making sense of both textual and visual information in a way that mimics human understanding. For example, just as a person can derive the key points of a report from its layout, headings, text, and associated tables, DocFormerv2 analyzes these elements together to grasp the document’s overall message.
Unlike its predecessor, DocFormerv2 pays attention to local features within documents: small, specific details such as the style of a font, the way a section is arranged, or how images are placed next to text. This means it can distinguish the importance of layout elements with higher accuracy than previous models.
A prominent feature of DocFormerv2 is its use of self-supervised learning, the approach behind many of today’s most successful AI models, such as GPT. Self-supervised learning uses unannotated data, which enables training on huge public datasets. In language modeling, for example, next-token prediction (used by GPT) and masked-token prediction (used by T5 and BERT) are popular self-supervised objectives.
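To make the idea concrete, here is a minimal Python sketch of how masked-token targets can be constructed from a token sequence. The function name, mask symbol, and masking rate are illustrative choices, not the exact recipe used in DocFormerv2 pretraining.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Corrupt a token sequence for masked-token prediction.

    A random subset of tokens is replaced by MASK; the model is trained
    to reconstruct the originals at those positions.
    """
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # prediction target at position i
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "Total amount due : $ 4.32".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.3)
print(corrupted)  # the input sequence with some positions replaced by [MASK]
print(targets)    # {position: original token} pairs the model must predict
```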
For DocFormerv2, in addition to standard masked-token prediction, we propose two additional pretraining tasks, token-to-line prediction and token-to-grid prediction. These tasks are designed to deepen the model’s understanding of the intricate relationship between text and its spatial arrangement within documents. Let’s take a closer look at them.
Token to line
The token-to-line task trains DocFormerv2 to recognize how text elements are grouped into lines, which is necessary for an understanding that goes beyond individual words to include the structure of text as it appears on the page. This follows the intuition that most of the information needed for key-value prediction in a form or for visual question answering (VQA) lies either on the same line or on adjacent lines of a document. For example, in the figure below, to predict the value of “total” (box A), the model should look on the same line (box D, “$4.32”). Through this task, the model learns to attach importance to the relative positions of tokens and their semantic implications.
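As a rough illustration of where such line labels could come from, the sketch below groups OCR tokens into lines using a simple vertical-position heuristic on their bounding boxes and assigns each token a line id. The helper name, tolerance value, and grouping rule are assumptions made for illustration, not DocFormerv2’s actual label-construction code.

```python
def assign_line_ids(ocr_tokens, y_tolerance=5):
    """Group OCR tokens into text lines by vertical position.

    ocr_tokens: list of (text, (x0, y0, x1, y1)) bounding boxes.
    Tokens whose vertical centers fall within y_tolerance pixels of the
    previous token's center share a line id. Returns {token index: line id}.
    """
    # Visit tokens top-to-bottom, then left-to-right.
    order = sorted(
        range(len(ocr_tokens)),
        key=lambda i: ((ocr_tokens[i][1][1] + ocr_tokens[i][1][3]) / 2,
                       ocr_tokens[i][1][0]),
    )
    line_ids, current_line, last_center = {}, -1, None
    for i in order:
        _, (x0, y0, x1, y1) = ocr_tokens[i]
        center = (y0 + y1) / 2
        if last_center is None or abs(center - last_center) > y_tolerance:
            current_line += 1  # start a new line
        line_ids[i] = current_line
        last_center = center
    return line_ids

# Toy example: "Total" and "$4.32" share a line; "Thank you" sits below.
ocr = [("Total", (10, 100, 60, 112)),
       ("$4.32", (400, 101, 450, 113)),
       ("Thank", (10, 140, 55, 152)),
       ("you",   (60, 140, 90, 152))]
print(assign_line_ids(ocr))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

During pretraining, the model would be asked to predict these line relationships from the token and layout inputs, which encourages it to use relative position, not just word identity.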
Token to grid
Semantic information varies across a document’s different regions. For example, financial documents may have headings at the top, fillable fields in the middle, and footnotes or instructions at the bottom. Page numbers are usually found at the top or bottom of a document, while company names on receipts or invoices often appear at the top. Understanding a document requires recognizing exactly how its content is organized within a specific visual layout and structure. Armed with this intuition, the token-to-grid task pairs the semantics of text with its location (visual, spatial, or both) in the document. Specifically, a grid is superimposed on the document, and each OCR token is assigned a grid cell number. During training, DocFormerv2 is tasked with predicting the grid number for each token.
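The sketch below shows one simple way such grid labels could be derived: superimpose a rows-by-cols grid on the page and assign each OCR token the index of the cell containing its bounding-box center. The grid size, numbering scheme, and helper name are illustrative assumptions, not the exact configuration used in DocFormerv2.

```python
def assign_grid_ids(ocr_tokens, page_w, page_h, rows=3, cols=3):
    """Assign each OCR token the index of the grid cell containing its
    bounding-box center, with cells numbered row by row from the top left.

    ocr_tokens: list of (text, (x0, y0, x1, y1)) boxes in page coordinates.
    """
    grid_ids = []
    for text, (x0, y0, x1, y1) in ocr_tokens:
        cx = (x0 + x1) / 2 / page_w   # normalized horizontal center
        cy = (y0 + y1) / 2 / page_h   # normalized vertical center
        col = min(int(cx * cols), cols - 1)
        row = min(int(cy * rows), rows - 1)
        grid_ids.append((text, row * cols + col))
    return grid_ids

# Toy 600x800 receipt: the company name sits in the top band, the total
# in the middle band, and a closing note in the bottom band.
ocr = [("ACME",   (250, 30, 350, 60)),
       ("Total",  (40, 420, 110, 445)),
       ("$4.32",  (450, 420, 520, 445)),
       ("Thanks", (250, 760, 360, 790))]
print(assign_grid_ids(ocr, page_w=600, page_h=800))
# [('ACME', 1), ('Total', 3), ('$4.32', 5), ('Thanks', 7)]
```

Predicting these cell indices pushes the model to associate each token’s meaning with where it tends to appear on the page, which is exactly the intuition behind the task.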
Target tasks and impact
On nine different datasets covering a range of document-understanding tasks, DocFormerv2 outperforms previous models of comparable size and even does better than much larger models, including one that is 106 times the size of DocFormerv2. Since text is extracted from documents using OCR models, which make prediction errors, we also show that DocFormerv2 is more resistant to OCR errors than its predecessor.
One of the tasks on which we trained DocFormerv2 is table VQA, a challenging task where the model must answer questions about tables (with either images, text, or both as input). DocFormerv2 achieved a 4.3% absolute performance improvement over the next-best model.
DocFormerv2 also showed several qualitative benefits over its predecessors. Because it is trained to make sense of local features, DocFormerv2 can answer correctly when asked questions such as which of the stations in a table do not have a particular letter in their call signs, or “How many of the schools does the Roman Catholic Diocese of Cleveland serve?” (The second question requires counting, a hard skill to learn.)
To demonstrate the versatility and generalizability of DocFormerv2, we also tested it on scene-text VQA, a task whose goals differ from those of document understanding. Again, it surpasses strong prior models of comparable size.
While DocFormerv2 has made significant strides in interpreting complex documents, several challenges and exciting opportunities lie ahead, such as teaching the model to handle diverse document layouts and improving multimodal integration.