Remote-object grounding is the task of automatically determining where in the local environment to find an object specified in natural language. It is an important capability for household robots, which must be able to execute commands such as “Bring me the pair of glasses on the counter in the children’s bathroom.”
In a paper we are presenting at the International Conference on Intelligent Robots and Systems (IROS), my colleagues and I describe a new approach to grounding that leverages a foundation model: a large, self-supervised model that learns joint representations of language and images. By treating remote-object grounding as an information retrieval problem and using a “bag of tricks” to adapt the foundation model to this new application, we achieve a 10% improvement over the state of the art on one benchmark dataset and a 5% improvement on another.
Vision-language models
In recent years, foundation models, such as large language models, have revolutionized several branches of AI. Foundation models are usually trained through masking: elements of the input data, whether text or images, are masked out, and the model must learn to fill in the gaps. Since masking requires no human annotation, it makes it possible to train the models on huge corpora of publicly available data. Our approach to remote-object grounding is based on a vision-language (VL) model, a model that has learned to jointly represent text descriptions and visual depictions of the same objects.
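As a quick illustration of the masked-training objective (this example uses a public masked language model and has nothing to do with our paper; the model name and prompt are arbitrary choices):

```python
from transformers import pipeline

# Load a public masked language model via the fill-mask pipeline;
# BERT was trained to predict masked-out tokens in unlabeled text.
fill = pipeline("fill-mask", model="bert-base-uncased")

# The model fills the gap using patterns learned without human annotation.
for prediction in fill("The glasses are on the [MASK] in the bathroom."):
    print(prediction["token_str"], round(prediction["score"], 3))
```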
We consider a scenario in which a household robot has had enough time to build a 3-D map of its immediate environment, including visual representations of the objects in that environment. We treat object grounding as an information retrieval problem, meaning that the model takes a linguistic description, such as “the glasses on the counter in the children’s bathroom”, and retrieves the corresponding object from its representation of its visual environment.
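In a minimal sketch of this retrieval formulation, assuming the VL model has already embedded both the command and the mapped objects into a shared vector space (the function name and dimensions below are hypothetical), grounding reduces to a nearest-neighbor search:

```python
import numpy as np

def ground_query(query_embedding, object_embeddings):
    """Retrieve the mapped object whose embedding best matches the query.

    Both the command and the objects are assumed to be embedded by the
    VL model; grounding is then an argmax over cosine similarities.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    objs = object_embeddings / np.linalg.norm(object_embeddings, axis=1, keepdims=True)
    scores = objs @ q
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
objects = rng.normal(size=(100_000, 512))  # one embedding per mapped object
query = rng.normal(size=512)               # embedding of the spoken command
best, _ = ground_query(query, objects)
print(f"best-matching object index: {best}")
```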
Adapting a VL model to this problem poses two major challenges. The first is the scale of the problem. A single household can contain 100,000 discrete objects; using a large foundation model to query that many candidates at once would be prohibitively expensive. The second challenge is that VL models are typically trained on 2-D images, while a household robot builds a 3-D map of its environment.
Gunnar A. Sigurdsson on adapting vision-language foundation models to the problem of remote-object grounding.
A bag of tricks
In our paper, we present a “bag of tricks” that helps our model surmount these and other challenges.
1. Negative examples
The obvious way to address the scale of the retrieval problem is to break it up: score candidate objects in each room separately, then choose the most likely candidates from each room’s list.
The problem with this approach is that the scores of the objects on each list are relative to each other. A high-scoring object is one that is much more likely than the others on its list to be the correct referent of a command; relative to the candidates on a different list, however, its score may be meaningless. To improve consistency across lists, we augment the model’s training data with negative examples: views in which the target objects are not visible. This prevents the model from becoming overconfident in its scoring of candidate objects.
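The numbers below are made up, and anchoring each room’s list with the score of a negative (target-absent) view is just one way to realize this idea, but the sketch shows why raw per-room scores can mislead and how a negative example restores comparability:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical raw scores for candidate objects, listed per room.
# The bathroom's scores are uniformly inflated relative to the kitchen's.
scores = {"kitchen": np.array([2.1, 0.3, -0.5]),
          "bathroom": np.array([5.0, 4.8, 4.6])}

# Score of a negative example per room: a view from that room in which
# no target object is visible. It shares whatever bias inflates the room.
negatives = {"kitchen": -1.0, "bathroom": 4.9}

raw_best = max(scores, key=lambda r: scores[r].max())

# Normalizing each room's list against its negative example puts the
# per-room winners on a common scale before comparing across rooms.
calibrated = {r: softmax(np.append(s, negatives[r]))[:-1].max()
              for r, s in scores.items()}
calibrated_best = max(calibrated, key=calibrated.get)

print(raw_best)         # bathroom: raw scores favor the inflated room
print(calibrated_best)  # kitchen: the confident match wins after calibration
```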
2. Distance-limited search
Our second trick for taming the scale problem is to limit the radius within which we search for candidate objects. During training, the model learns not only which objects best correspond to which requests but also how far away it typically has to look to find them. Limiting the search radius makes the problem much more tractable with little loss of accuracy.
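A minimal sketch of this filtering step, with a made-up radius standing in for whatever distance statistics training would actually produce:

```python
import numpy as np

def candidates_within_radius(robot_pos, object_positions, radius):
    """Keep only objects within `radius` meters of the robot.

    In practice the radius would come from training statistics, such as
    how far away target objects typically are for similar commands; the
    value used below is a hypothetical placeholder.
    """
    dists = np.linalg.norm(object_positions - robot_pos, axis=1)
    return np.flatnonzero(dists <= radius)

robot_pos = np.zeros(3)
object_positions = np.random.uniform(-20, 20, size=(100_000, 3))
keep = candidates_within_radius(robot_pos, object_positions, radius=8.0)
print(f"scoring {keep.size} of {len(object_positions)} candidates")
```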
3. 3-D representations
To bridge the discrepancy between the 2-D data used to train the VL model and the 3-D data the robot uses to map its environment, we convert the 2-D coordinates of the “bounding box” around an object (the rectangular demarcation of the object’s region in an image) into a set of 3-D coordinates: the three spatial dimensions of the center of the bounding box, plus a radius, defined as half the box’s diagonal.
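A minimal sketch of one such conversion, assuming a standard pinhole camera model with known intrinsics and a depth estimate for the object; the paper’s exact construction may differ, and all values below are hypothetical:

```python
import numpy as np

def box_to_3d(box, depth, fx, fy, cx, cy):
    """Convert a 2-D bounding box to a 3-D center-plus-radius representation.

    box: (x_min, y_min, x_max, y_max) in pixels. depth: distance in meters
    to the object along the camera axis. (fx, fy, cx, cy) are camera
    intrinsics, assumed known from calibration. Works in the camera frame.
    """
    x_min, y_min, x_max, y_max = box
    # Back-project the two opposite box corners onto the object's depth plane.
    corner_a = np.array([(x_min - cx) * depth / fx, (y_min - cy) * depth / fy, depth])
    corner_b = np.array([(x_max - cx) * depth / fx, (y_max - cy) * depth / fy, depth])
    center = (corner_a + corner_b) / 2                 # 3-D center of the box
    radius = np.linalg.norm(corner_b - corner_a) / 2   # half the box diagonal
    return center, radius

center, radius = box_to_3d((310, 200, 410, 280), depth=2.5,
                           fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(center, radius)
```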
4. Context vectors
Our final trick aims to improve the model’s overall performance. For each viewpoint, that is, each location from which the robot captures one or more images of the surrounding environment, our model produces a context vector, which is an average of the embedding vectors of all the objects visible from that viewpoint. Adding the context vector to the representation of a specific candidate object allows the robot to, say, distinguish the mirror over the sink in one bathroom from the mirror over the sink in another.
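A minimal sketch of the context vector computation, assuming per-object embeddings are available from the VL model; adding the context to the candidate’s embedding is one plausible fusion, chosen here for simplicity:

```python
import numpy as np

def add_context(object_embeddings, candidate_idx):
    """Augment a candidate object's embedding with its viewpoint context.

    object_embeddings: (N, D) array of embeddings for all objects visible
    from one viewpoint. The context vector is their mean; it is simply
    added to the candidate's own embedding here.
    """
    context = object_embeddings.mean(axis=0)
    return object_embeddings[candidate_idx] + context

# Two bathrooms with identical "mirror" embeddings (index 0) but different
# co-visible objects yield different context-augmented representations.
bath_a = np.array([[1.0, 0.0], [0.0, 1.0]])  # mirror, rubber duck
bath_b = np.array([[1.0, 0.0], [0.5, 0.5]])  # mirror, towel
print(add_context(bath_a, 0))  # differs from the result for bath_b
print(add_context(bath_b, 0))
```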
We tested our approach on two benchmark datasets, each containing tens of thousands of commands and the corresponding sensor readings, and found that it significantly outperforms the previous state-of-the-art model. To test our algorithm’s practicality, we also deployed it on a real robot and found that it was able to execute commands in real time with high accuracy.