A knowledge distillation method for better vision-language models

Large machine learning models based on the transformer architecture have recently demonstrated extraordinary results on a range of vision and language tasks. But such large models are often too slow for real-time use, so practical systems frequently depend on knowledge distillation to transfer large models' knowledge to leaner, faster models.

The defining characteristic of the transformer model is its reliance on attention mechanisms, which determine the influence that previously seen data should have on the model's handling of the current data. Attention mechanisms are typically organized into several heads, each of which attends to a different aspect of the data.

Typically, distillation of large transformer models involves aligning the attention heads of the large, trained model (the teacher) with those of the leaner target model (the student) on a one-to-one basis. But limiting the number of attention heads is one of the ways a student model can reduce complexity.

At this year's meeting of the Association for the Advancement of Artificial Intelligence (AAAI), we proposed an alternative in which knowledge from all the attention heads in the teacher model is distilled into all the attention heads of the student model. Because the student has fewer heads than the teacher, a single attention head in the student model may end up receiving information contained in several of the teacher's attention heads.

We evaluated our approach on two different vision-language models that map images and texts to the same vector space. The models had been fine-tuned on a visual-question-answering task, an image-captioning task, and a translation task that uses images for context, and we compared our distillation method with two state-of-the-art baselines. Our approach outperformed the baselines across the board.

Target tasks

Typically, a vision-language model (VLM) has a separately pretrained sub-module for each of its modalities, and the whole network is then further pretrained on multimodal representations. Finally, the pretrained model is fine-tuned on a specific task.
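
To make the three stages concrete, here is a minimal sketch of such a pipeline, assuming PyTorch-style modules; the module names (ImageEncoder, TextEncoder, fusion, task head) are illustrative placeholders, not components from our paper.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Hypothetical VLM: two pretrained unimodal encoders, a multimodal
    fusion network, and a task-specific head added at fine-tuning time."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 fusion: nn.Module, task_head: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # stage 1: separately pretrained
        self.text_encoder = text_encoder     # stage 1: separately pretrained
        self.fusion = fusion                 # stage 2: multimodal pretraining
        self.task_head = task_head           # stage 3: task-specific fine-tuning

    def forward(self, image: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        v = self.image_encoder(image)        # image embedding
        t = self.text_encoder(text_tokens)   # text embedding
        joint = self.fusion(torch.cat([v, t], dim=-1))  # shared vector space
        return self.task_head(joint)
```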

In our experiments, we distilled the student model only on the fine-tuned task. However, we also considered the case in which the teacher model did not have any multimodal pretraining and found that our distillation method could largely compensate for that deficiency.

Weighted sums

For a given input or set of inputs, each attention head of a transformer builds an attention map, a matrix indicating the influence that each element of the input exerts on each of the other elements. In an LLM, the attention map maps the words of a text sequence against themselves; when deciding on each new output word, the LLM uses the attention weights in the matrix column corresponding to that word. In a vision model, the map might capture the influence that each region of an image exerts on the interpretation of every other region.
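
As a concrete illustration (not code from the paper), a single head's attention map can be computed from query and key projections of the input using the standard scaled-dot-product formulation:

```python
import torch
import torch.nn.functional as F

def attention_map(x: torch.Tensor, w_q: torch.Tensor, w_k: torch.Tensor) -> torch.Tensor:
    """Scaled-dot-product attention map for one head.

    x:        (seq_len, d_model) input embeddings (words or image regions)
    w_q, w_k: (d_model, d_head) query and key projection matrices
    Returns a (seq_len, seq_len) matrix whose entry (i, j) reflects the
    influence of input element j on input element i.
    """
    q = x @ w_q
    k = x @ w_k
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # scaled dot products
    return F.softmax(scores, dim=-1)          # each row sums to 1
```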

A diagram with three image embeddings on the left, arrows extending to a neural network, and additional arrows extending to nodes representing attention heads on the other side of the network. The network's output also connects to all three attention nodes, giving each attention node two incoming arrows.

The rows of any matrix can be concatenated to produce a single vector, and our approach to distillation relies on the vector, or "flattened", versions of the attention maps.
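
A tiny illustration of what "flattening" means here: row-major reshaping of an n-by-n map is the same as concatenating its rows into one vector of length n².

```python
import torch

attn_map = torch.rand(4, 4)       # a toy 4x4 attention map
flat = attn_map.reshape(-1)       # row-major flattening: shape (16,)
# Equivalent to concatenating the rows one after another.
assert torch.equal(flat, torch.cat([row for row in attn_map]))
```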

The loss function for the distillation process has two components. One is a term that seeks to minimize the difference between the teacher's and the student's outputs; obviously, it is important that the student reproduce the functionality of the teacher model as accurately as possible. The second component of the loss function aligns attention maps.

Specifically, for a given training example and a given attention head in the teacher model, the attention-map-alignment loss seeks to minimize the distance between the teacher's attention map and a weighted sum of the maps generated by all of the student's attention heads.
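
The sketch below shows what such a two-part loss could look like in PyTorch. It is an assumption-laden illustration, not the loss from the paper: the mean-squared-error distance, the alpha balancing factor, and the function names are all hypothetical, and the per-head weights are passed in as an argument (how they are chosen is described below).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                      student_maps: torch.Tensor, teacher_maps: torch.Tensor,
                      head_weights: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Hypothetical two-component distillation loss.

    student_maps: (H_s, n, n) attention maps of the student's heads
    teacher_maps: (H_t, n, n) attention maps of the teacher's heads
    head_weights: (H_t, H_s) weights over student heads for each teacher head
    """
    # 1) Output-matching term: the student should reproduce the teacher's outputs.
    output_loss = F.mse_loss(student_logits, teacher_logits)

    # 2) Attention-map-alignment term: for each teacher head, compare its
    #    flattened map with a weighted sum of all flattened student maps.
    s_flat = student_maps.flatten(start_dim=1)   # (H_s, n*n)
    t_flat = teacher_maps.flatten(start_dim=1)   # (H_t, n*n)
    mixed = head_weights @ s_flat                # (H_t, n*n) weighted sums
    align_loss = F.mse_loss(mixed, t_flat)

    return output_loss + alpha * align_loss
```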

A schematic comparison of conventional attention head knowledge distillation (right) and our approach, attention map alignment distillation (AMAD). In the conventional approach, each teacher attention head is mapped to exactly one student head; extra teacher heads are simply discarded. In our approach, each teacher head is mapped to multiple student heads in a weighted fashion. The thickness of the colored lines indicates the weights.

The weights of the weighted sum are based on the cosine similarities between the flattened teacher map and the flattened student maps. In other words, the student maps that already look most like the teacher's map count more toward the weighted sum. Over successive iterations of the training process, the similarities should increase, and so should the weights assigned to the most similar student maps.
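
One plausible way to turn those cosine similarities into the weights used above is sketched here; the post does not specify the normalization, so the softmax is an assumption, as is the function name.

```python
import torch
import torch.nn.functional as F

def head_weights(teacher_maps: torch.Tensor, student_maps: torch.Tensor) -> torch.Tensor:
    """Weights over student heads for each teacher head, from cosine similarity.

    teacher_maps: (H_t, n, n); student_maps: (H_s, n, n)
    Returns (H_t, H_s): row i weights the student maps used to match teacher head i.
    """
    t_flat = F.normalize(teacher_maps.flatten(start_dim=1), dim=-1)  # (H_t, n*n), unit norm
    s_flat = F.normalize(student_maps.flatten(start_dim=1), dim=-1)  # (H_s, n*n), unit norm
    sims = t_flat @ s_flat.T              # cosine similarities, (H_t, H_s)
    return F.softmax(sims, dim=-1)        # assumed normalization over student heads
```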

If the student had exactly the same number of attention heads as the teacher, and there were no correlation among the maps generated by the teacher's attention heads, this process could result in something like the one-to-one mapping of the standard distillation process. But of course, the point of the procedure is to preserve attention map information even when the student has fewer attention heads than the teacher.

And empirically, there is usually some correlation between the attention maps generated by different heads. Indeed, these correlations may help explain the success of our method, since they mean that several attention maps generated by the teacher can be distilled into a single map generated by the student.

Acknowledgments: Srikar Appalaraju, Peng Tang, Vijay Mahadevan, R. Manmatha, Ying Nian Wu.
