Vision-language models that can handle input with more images
Vision-language models that map images and text into a common representative space have shown remacable performance on a wide range of multimodal AI tasks. But they are typically trained on text images: Each text input is connected to a single image. This limits the usability of the models. For example, you may wish that a … Read more