The International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024) takes place April 14 to 19 in Seoul, South Korea. Amazon is a bronze sponsor of “the world’s largest and most comprehensive technical conference focused on signal processing and its applications.”
Amazon’s presence includes a workshop on trustworthy speech processing, two of whose organizers are researchers with Amazon’s Artificial General Intelligence (AGI) Foundations organization: Anil Ramakrishna, a senior applied scientist, and Rahul Gupta, a senior manager for applied science. In addition, Wontak Kim, a senior manager of research science with Amazon Devices, will give a spotlight talk titled “Synthetic data for algorithm development: Real-world examples and experiences.”
As in previous years, many of Amazon’s accepted papers focus on automatic speech recognition, and topics such as speech enhancement, spoken language understanding, and wake word recognition are all well represented. This year’s publications also touch on dialogue, paralinguistics, pitch estimation, and responsible AI. Below is a quick guide to Amazon’s more than 20 papers at the conference.
Addressee detection
Long-term social interaction context: The key to egocentric addressee detection
Deqian Kong, Furqan Khan, Xu Zhang, Prateek Singhal, Ying Nian Wu
Audio event detection
Cross-trigger detection and remediation of audio events
Huy Phan, Byeonggeun Kim, Set Nguyen, Andrew Bydlon, Qingming Tang, Chieh-Chi Kao, Chao Wang
Automatic Speech Recognition (ASR)
Max-margin transducer loss: Improving sequence-discriminative training using a large-margin learning strategy
Rupak Vignesh Swaminathan, Grant Strimel, Ariya Rastrow, Harish Mallidi, Kai Zhen, Hieu Duy Nguyen, Nathan Susanj, Thanasis Mouchtaris
Promptformer: Prompted conformer transducer for ASR
Sergio Duarte Torres, Arunasish Sen, Aman Rana, Lukas Drude, Alejandro Gomez Alanis, Andreas Schwarz, Leif Rädel, Volker Leutnant
Significant ASR error detection for conversational voice assistants
John Harvill, Rinat Khaziev, Scarlett Li, Randy Cogill, Lidan Wang, Gopinath Chennupati, Hari Thadakamalla
Task-oriented dialogue as a catalyst for self-supervised automatic speech recognition
David M. Chan, Shalini Ghosh, Hitesh Tulsiani, Ariya Rastrow, Björn Hoffmeister
Computer vision
Skin tone disentanglement in 2D makeup transfer with graph neural networks
Masoud Mokhtari, Fatima Taheri Dezaki, Timo Bolkart, Betty Mohler Tesch, Rahul Suresh, Amin Banitalbi
Dialogue
Turn-taking and backchannel prediction with acoustic and large language model fusion
Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran
Paralinguistics
Paralinguistics-enhanced large language modeling of spoken dialogue
Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yi Gu, Shalini Ghosh, Andreas Stolcke, Hung-Yi Lee, Ivan Bulyko
Pitch estimation
Noise-robust DSP-assisted neural pitch estimation
Krishna Subramani, Jean-Marc Valin, Jan Buethe, Paris Smaragdis, Mike Goodwin
Responsible AI
Leveraging confidence models to identify challenging data subgroups in speech models
Alkis Koudounas, Eliana Pastor, Vittorio Mazzia, Manuel Giolo, Thomas Gueuder, Elisa Real, Giuseppe Attanasio, Luca Cagliero, Sandro Cumani, Luca de Alfaro, Elena Baralis, Daniele Amberti
Speaker recognition
Post-training embedding alignment for decoupling enrollment and runtime speaker recognition models
Chenyang Gao, Brecht Desplanques, Chelsea J.-T. Ju, Aman Chadha, Andreas Stolcke
Speech enhancement
NoLACE: Improving low-complexity speech codec enhancement through adaptive temporal shaping
Jan Buethe, Ahmed Mustafa, Jean-Marc Valin, Karim Helwani, Mike Goodwin
Real-time stereo speech enhancement with spatial-cue preservation based on a dual-path structure
Masahito Togami, Jean-Marc Valin, Karim Helwani, Ritwik Giri, Umut Isik, Mike Goodwin
Scalable and efficient speech enhancement using modified cold diffusion: A residual learning approach
Minje Kim, Trausti Kristjansson
Spoken language understanding
S2E: Towards an end-to-end entity resolution solution from acoustic signal
Kangrui Ruan, Cynthia He, Jiyang Wang, Xiaozhou Joey Zhou, Helian Feng, Ali Kebarighotbi
Towards ASR-robust spoken language understanding through in-context learning with word confusion networks
Kevin Everson, Yi Gu, Huck Yang, Prashanth Gurunath Shivakumar, Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh, Wael Hamza, Hung-Yi Lee, Ariya Rastrow, Andreas Stolcke
Text-to-speech
Mapache: Masked parallel transformer for advanced speech editing and synthesis
Guillermo Cambara Ruiz, Patrick Tobing, Mikolaj Babianski, Ravi Chander Vipperla, Duo Wang, Ron Shmelkin, Giuseppe Coccia, Orazio Angelini, Arnaud Joly, Mateusz Lajszczak, Vincent Pollet
Wake word recognition
Hot-fixing wake word recognition for end-to-end ASR via neural model reprogramming
Pin-Jui Ku, I-Fan Chen, Huck Yang, Anirudh Raju, Pranav Dheram, Pegah Ghahremani, Brian King, Jing Liu, Roger Ren, Phani Nidadavolu
Maximal adversarial audio augmentation for keyword spotting
Zuzhao Ye, Gregory Ciccarelli, Brian Kulis
On-device constrained self-supervised learning for keyword spotting via quantization-aware pre-training and fine-tuning
Gene-Ping Yang, Yue Gu, Sashank Macha, Qingming Tang, Yuzong Liu