In the digital era, where documents are generated and distributed at ever-increasing speed, it is crucial to be able to understand them automatically. Consider tasks such as extracting payment information from invoices or digitizing historical records, where layouts and handwritten notes play an important role in understanding context. These scenarios highlight the complexity of document understanding, which requires not only recognizing text but also interpreting visual elements and their spatial relationships.
At this year’s meeting of the Association for the Advancement of Artificial Intelligence (AAAI 2024), we presented a model we call DocFormerv2, which doesn’t just read documents but understands them, making sense of both textual and visual information in a way that mimics human understanding. For example, just as a person can derive the key points of a report from its layout, headings, text, and associated tables, DocFormerv2 analyzes these elements together to grasp the document’s overall message.
Unlike its predecessor, DocFormerv2 pays attention to local features within documents: small, specific details such as the style of a font, the way a section is arranged, or how images are placed next to text. This means it can distinguish the importance of layout elements with higher accuracy than previous models.
A prominent feature of DocFormerv2 is its use of self-supervised learning, the approach behind many of today’s most successful AI models, such as GPT. Self-supervised learning uses unannotated data, which enables training on huge public datasets. In language modeling, for example, next-token prediction (used by GPT) and masked-token prediction (used by T5 and BERT) are popular self-supervised objectives.
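To make the idea concrete, here is a minimal Python sketch of how masked-token targets can be constructed from a token sequence. The function name, mask symbol, and masking rate are illustrative choices, not the exact recipe used in DocFormerv2 pretraining.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Corrupt a token sequence for masked-token prediction.

    A random subset of tokens is replaced by MASK; the model is trained
    to reconstruct the originals at those positions.
    """
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted.append(MASK)
            targets[i] = tok  # prediction target at position i
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "Total amount due : $ 4.32".split()
corrupted, targets = mask_tokens(tokens, mask_prob=0.3)
print(corrupted)  # the input sequence with some positions replaced by [MASK]
print(targets)    # {position: original token} pairs the model must predict
```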
For DocFormerv2, in addition to standard masked-token prediction, we propose two additional pretraining tasks, token-to-line prediction and token-to-grid prediction. These tasks are designed to deepen the model’s understanding of the intricate relationship between text and its spatial arrangement within documents. Let’s take a closer look at them.
Token to line
The token-to-line task trains DocFormerv2 to recognize how text elements are grouped into lines, which is necessary for an understanding that goes beyond individual words to include the structure of text as it appears on the page. This follows the intuition that most of the information needed for key-value prediction in a form or for visual question answering (VQA) lies either on the same line or on adjacent lines of a document. For example, in the figure below, to predict the value of “total” (box A), the model should look on the same line (box D, “$4.32”). Through this task, the model learns to attach importance to the relative positions of tokens and their semantic implications.
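As a rough illustration of where such line labels could come from, the sketch below groups OCR tokens into lines using a simple vertical-position heuristic on their bounding boxes and assigns each token a line id. The helper name, tolerance value, and grouping rule are assumptions made for illustration, not DocFormerv2’s actual label-construction code.

```python
def assign_line_ids(ocr_tokens, y_tolerance=5):
    """Group OCR tokens into text lines by vertical position.

    ocr_tokens: list of (text, (x0, y0, x1, y1)) bounding boxes.
    Tokens whose vertical centers fall within y_tolerance pixels of the
    previous token's center share a line id. Returns {token index: line id}.
    """
    # Visit tokens top-to-bottom, then left-to-right.
    order = sorted(
        range(len(ocr_tokens)),
        key=lambda i: ((ocr_tokens[i][1][1] + ocr_tokens[i][1][3]) / 2,
                       ocr_tokens[i][1][0]),
    )
    line_ids, current_line, last_center = {}, -1, None
    for i in order:
        _, (x0, y0, x1, y1) = ocr_tokens[i]
        center = (y0 + y1) / 2
        if last_center is None or abs(center - last_center) > y_tolerance:
            current_line += 1  # start a new line
        line_ids[i] = current_line
        last_center = center
    return line_ids

# Toy example: "Total" and "$4.32" share a line; "Thank you" sits below.
ocr = [("Total", (10, 100, 60, 112)),
       ("$4.32", (400, 101, 450, 113)),
       ("Thank", (10, 140, 55, 152)),
       ("you",   (60, 140, 90, 152))]
print(assign_line_ids(ocr))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

During pretraining, the model would be asked to predict these line relationships from the token and layout inputs, which encourages it to use relative position, not just word identity.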
Token to grid
Semantic information varies across a document’s different regions. For example, financial documents may have headings at the top, fillable fields in the middle, and footnotes or instructions at the bottom. Page numbers are usually found at the top or bottom of a document, while company names on receipts or invoices often appear at the top. Understanding a document requires recognizing exactly how its content is organized within a specific visual layout and structure. Armed with this intuition, the token-to-grid task pairs the semantics of text with its location (visual, spatial, or both) in the document. Specifically, a grid is superimposed on the document, and each OCR token is assigned a grid cell number. During training, DocFormerv2 is tasked with predicting the grid number for each token.
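The sketch below shows one simple way such grid labels could be derived: superimpose a rows-by-cols grid on the page and assign each OCR token the index of the cell containing its bounding-box center. The grid size, numbering scheme, and helper name are illustrative assumptions, not the exact configuration used in DocFormerv2.

```python
def assign_grid_ids(ocr_tokens, page_w, page_h, rows=3, cols=3):
    """Assign each OCR token the index of the grid cell containing its
    bounding-box center, with cells numbered row by row from the top left.

    ocr_tokens: list of (text, (x0, y0, x1, y1)) boxes in page coordinates.
    """
    grid_ids = []
    for text, (x0, y0, x1, y1) in ocr_tokens:
        cx = (x0 + x1) / 2 / page_w   # normalized horizontal center
        cy = (y0 + y1) / 2 / page_h   # normalized vertical center
        col = min(int(cx * cols), cols - 1)
        row = min(int(cy * rows), rows - 1)
        grid_ids.append((text, row * cols + col))
    return grid_ids

# Toy 600x800 receipt: the company name sits in the top band, the total
# in the middle band, and a closing note in the bottom band.
ocr = [("ACME",   (250, 30, 350, 60)),
       ("Total",  (40, 420, 110, 445)),
       ("$4.32",  (450, 420, 520, 445)),
       ("Thanks", (250, 760, 360, 790))]
print(assign_grid_ids(ocr, page_w=600, page_h=800))
# [('ACME', 1), ('Total', 3), ('$4.32', 5), ('Thanks', 7)]
```

Predicting these cell indices pushes the model to associate each token’s meaning with where it tends to appear on the page, which is exactly the intuition behind the task.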
Target tasks and impact
On nine different datasets covering a range of document-understanding tasks, DocFormerv2 outperforms previous models of comparable size and even does better than much larger models, including one that is 106 times the size of DocFormerv2. Since text is extracted from documents using OCR models, which make prediction errors, we also show that DocFormerv2 is more resistant to OCR errors than its predecessor.
One of the tasks on which we trained DocFormerv2 is table VQA, a challenging task where the model must answer questions about tables (with either images, text, or both as input). DocFormerv2 achieved a 4.3% absolute performance improvement over the next-best model.
DocFormerv2 also showed several qualitative benefits over its predecessors. Because it is trained to make sense of local features, DocFormerv2 can answer correctly when asked questions such as which of the stations in a table do not have a particular letter in their call signs, or “How many of the schools does the Roman Catholic Diocese of Cleveland serve?” (The second question requires counting, a hard skill to learn.)
To demonstrate the versatility and generalizability of DocFormerv2, we also tested it on scene-text VQA, a task whose goals differ from those of document understanding. Again, it surpasses strong prior models of comparable size.
While DocFormerv2 has made significant strides in interpreting complex documents, several challenges and exciting opportunities lie ahead, such as teaching the model to handle diverse document layouts and improving multimodal integration.