A quick guide to Amazon’s papers at CVPR 2024

In the last few years, foundation models and generative AI, and especially large language models (LLMs), have become an important topic of AI research. That is true even in computer vision, with its increased focus on vision-language models, which combine LLMs with image encoders.

This shift can be seen in the mix of Amazon papers accepted at this year’s Conference on Computer Vision and Pattern Recognition (CVPR 2024). A plurality of the papers deal with vision-language models, while a number of others address related topics, such as visual question answering, hallucination mitigation, and retrieval-augmented generation. At the same time, however, classic computer vision topics, such as 3D reconstruction, object tracking, and pose estimation, remain well represented.

3D reconstruction

No more ambiguity in 360° room layout via bi-layout estimation
Yu-Ju Tsai, Jin-Cheng Jhang, Jingjing Zheng, Wei Wang, Albert Chen, Min Sun, Cheng-Hao Kuo, Ming-Hsuan Yang

ViewFusion: Towards multi-view consistency via interpolated denoising
Xianghui Yang, Yan Zuo, Sameera Ramasinghe, Loris Bazzani, Gil Avraham, Anton Van Den Hengel

Algorithmic information theory

Interpretable measures of conceptual similarity by complexity-constrained descriptive auto-encoding
Alessandro Achille, Greg Ver Steeg, Tian Yu Liu, Matthew Trager, Carson Klingenberg, Stefano Soatto

Geospatial analysis

Bridging remote sensors with multisensor geospatial foundation models
Boran Han, Shuai Zhang, Xingjian Shi, Markus Reichstein

Hallucination mitigation

Multimodal hallucination control by visual information grounding
Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto

THRONE: An object-based hallucination benchmark for the free-form generations of large vision-language models
Prannay Kaul, Zhizhong Li, Hao Yang, Yonatan Dukler, Ashwin Swaminathan, CJ Taylor, Stefano Soatto

Metric learning

Learning for transductive threshold calibration in open world recognition
Qin Zhang, Dongsheng An, Tianjun Xiao, Tong He, Qingming Tang, Ying Nian Wu, Joe Tighe, Yifan Xing, Stefano Soatto

Model robustness

GDA: Generalized diffusion for robust test-time adaptation
Yun-Yun Tsai, Fu-Chen Chen, Albert Chen, Junfeng Yang, Che-Chun Su, Min Sun, Cheng-Hao Kuo

Object-centric learning

Adaptive slot attention: Object discovery with dynamic slot number
Ke Fan, Zechen Bai, Tianjun Xiao, Tong He, Max Horn, Yanwei Fu, Francesco Locatello, Zheng Zhang

Object tracking

Self-supervised multi-object tracking with path consistency
Zijia Lu, Bing Shuai, Yanbei Chen, Zhenlin Xu, Davide Modolo

Pose estimation

MRC-Net: 6-DoF pose estimation with multiscale residual correlation
Yuelong Li, Yafei Mao, Raja Bala, Sunil Hadap

Image pairs in which the left image is a camera image and the right image overlays colored 3D models of objects, with estimated six-degree-of-freedom poses, on the original image.

Responsible AI

FairRAG: Fair human generation via fair retrieval augmentation
Robik Shrestha, Yang Zou, James Chen, Zhiheng Li, Yusheng Xie, Tiffany Deng

Retrieval-augmented generation

CPR: Retrieval augmented generation for copyright protection
Aditya Golatkar, Alessandro Achille, Luca Zancato, Yu-Xiang Wang, Ashwin Swaminathan, Stefano Soatto

Security

Sharpness-aware optimization for real-world adversarial attacks for diverse compute platforms with enhanced transferability
Muchao Ye, Xiang Xu, Qin Zhang, Jon Wu

Video-language models

VidLA: Video-language alignment at scale
Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Yao, Belinda Zeng, Mubarak Shah, Trishul Chilimbi

Vision-language models

Accept the modality gap: An exploration in the hyperbolic space
Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, Ajanthan Thalaiyasingam

Enhancing vision-language pre-training with rich supervisions
Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto

GROUNDHOG: Grounding large language models to holistic segmentation
Yichi Zhang, Martin Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi (QZ) Gao, Joyce Chai

Hyperbolic learning with synthetic captions for open-world detection
Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo

Non-autoregressive sequence-to-sequence vision-language models
Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto

On the scalability of diffusion-based text-to-image generation
Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R. Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, Stefano Soatto

The effect of U-Net scaling on text-image alignment. In “On the scalability of diffusion-based text-to-image generation”, Amazon scientists vary the U-Net along two dimensions: channel number (left) and transformer depth (right). The prompts are (1) “square blue apples on a tree with circular yellow leaves”; (2) “five frosted glass bottles”; (3) “a yellow box to the right of a blue sphere”; (4) “the International Space Station flying in front of the Moon”.

Visual question answering

GRAM: Global reasoning for multi-page VQA
Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman

Question aware vision transformer for multimodal reasoning
Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman

Synthesize step by step: Tools, templates, and LLMs as data generators for reasoning-based chart VQA
Zhuowan Li, Bhavan Jasani, Peng Tang, Shabnam Ghadar
