


ASRU 2025: Honolulu, HI, USA
IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025, Honolulu, HI, USA, December 6-10, 2025. IEEE 2025, ISBN 979-8-3315-4426-3

- Heitor R. Guimarães, Ke Tan, Juan Azcarreta, Jesus Alvarez, Prabhav Agrawal, Ashutosh Pandey, Buye Xu:

Improving Resource-Efficient Speech Enhancement via Neural Differentiable DSP Vocoder Refinement. 1-7
- Shusuke Komatsu, Kazuyo Onishi, Koki Tanaka, Dohyun Kim, Koichiro Yoshino:
Efficient ASR Domain Adaptation with Long Noun Phrases: Harnessing the Linguistic Characteristics of Japanese. 1-7
- Jisoo Park, Seonghak Lee, Guisik Kim, Taewoo Kim, Junseok Kwon:
Lightweight Wasserstein Audio-Visual Model for Unified Speech Enhancement and Separation. 1-8
- Waris Quamer, Ricardo Gutierrez-Osuna:
DarkStream: real-time speech anonymization with low latency. 1-7
- Lilit Grigoryan, Vladimir Bataev, Nikolay Karpov, Andrei Andrusenko, Vitaly Lavrukhin, Boris Ginsburg:
FlexCTC: GPU-powered CTC Beam Decoding With Advanced Contextual Abilities. 1-7
- Yueqian Lin, Zhengmian Hu, Jayakumar Subramanian, Qinsi Wang, Nikos Vlassis, Hai Li, Yiran Chen:
AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning. 1-4
- Wonjune Kang, Deb Roy:
Expressive Speech Retrieval using Natural Language Descriptions of Speaking Style. 1-8
- Yamato Ohtani, Takuma Okamoto, Tomoki Toda, Hisashi Kawai:
Voice Factor Control Using FIR-Based Fast Neural Vocoder for Speech Generation Applications. 1-4
- Mu Yang, Szu-Jui Chen, Jiamin Xie, John H. L. Hansen:
Bridging the Modality Gap: Softly Discretizing Audio Representation for LLM-based Automatic Speech Recognition. 1-7
- Phuong Tuan Dat, Tran Huy Dat:
KAN-AST: Kolmogorov-Arnold Network based Audio Spectrogram Transformer for Audio Classification. 1-6
- Yi-Cheng Lin, Huang-Cheng Chou, Yu-Hsuan Li Liang, Hung-Yi Lee:
EMO-Debias: Benchmarking Gender Debiasing Techniques in Multi-Label Speech Emotion Recognition. 1-8
- Kashaf Gulzar, Dominik Wagner, Sebastian P. Bayerl, Florian Hönig, Tobias Bocklet, Korbinian Riedhammer:
On the Difficulty of Token-Level Modeling of Dysfluency and Fluency Shaping Artifacts. 1-6
- Yangui Fang, Baixu Cheng, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong:
Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction. 1-7
- Bin Wu, Shinnosuke Takamichi, Sakriani Sakti, Satoshi Nakamura:
Learning Marmoset Vocal Patterns with a Masked Autoencoder for Robust Call Segmentation, Classification, and Caller Identification. 1-7
- Maja J. Hjuler, Harald V. Skat-Rørdam, Line H. Clemmensen, Sneha Das:
EmoTale: An Enacted Speech-emotion Dataset in Danish. 1-7
- P. E. Ameenudeen, Charumathi Narayanan, Sriram Ganapathy:
ULTRAS - Unified Learning of Transformer Representations for Audio and Speech Signals. 1-7
- Edresson Casanova, Chen Chen, Kevin Hu, Ankita Pasad, Elena Rastorgueva, Seelan Lakshmi Narasimhan, Slyne Deng, Ehsan Hosseini-Asl, Piotr Zelasko, Valentin Mendelev, Subhankar Ghosh, Yifan Peng, Zhehuai Chen, Jason Li, Jagadeesh Balam, Vitaly Lavrukhin, Boris Ginsburg:
Open Full-duplex Voice Agent with Speech-to-Speech Language Model. 1-4
- Siyuan Chen, Mojtaba Kadkhodaie Elyaderani, Jing Su, Susanne Burger, Thomas Schaaf:
LLM-Based Dictation Detection from Doctor-Patient Conversations. 1-7
- Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, Xie Chen:
Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model. 1-6
- Takafumi Moriya, Masato Mimura, Tomohiro Tanaka, Hiroshi Sato, Ryo Masumura, Atsunori Ogawa:
All-in-One ASR: Unifying Encoder-Decoder Models of CTC, Attention, and Transducer in Dual-Mode ASR. 1-8
- Wen-Yu Chang, Tzu-Hung Huang, Chih-Ho Chen, Yun-Nung Chen:
From Simulation to Strategy: Automating Personalized Interaction Planning for Conversational Agents. 1-7
- Jingwen Liu, Kan Jen Cheng, Jiachen Lian, Akshay Anand, Rishi Jain, Faith Qiao, Robin Netzorg, Huang-Cheng Chou, Tingle Li, Guan-Ting Lin, Gopala Anumanchipalli:
EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems. 1-8
- Shaun Cassini, Thomas Hain, Anton Ragni:
Emphasis Sensitivity in Speech Representations. 1-8
- Jianwei Cui, Shihao Chen, Yu Gu, Jie Zhang, Liping Chen, Na Li, Chengxing Li, Shan Yang, Li-Rong Dai:
Sinba: Singing-To-Accompaniment Generation With Pitch Guidance Via Mamba-Based Language Model. 1-8
- Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid:
Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data. 1-8
- Yu-Chun Liu, Li-Ting Pai, Yi-Cheng Wang, Bi-Cheng Yan, Hsin-Wei Wang, Chi-Han Lin, Juan-Wei Xu, Berlin Chen:
PRIME: Novel Prompting Strategies for Effective Biasing Word Recognition in Contextualized ASR. 1-7
- Fabian Ritter Gutierrez, Yi-Cheng Lin, Jui-Chiang Wei, Jeremy H. M. Wong, Nancy F. Chen, Hung-Yi Lee:
ASTAR-NTU solution to AudioMOS Challenge 2025 Track1. 1-4
- Biao Liu, Zengqiang Shang, Haoyuan Xie, Mou Wang, Xin Liu, Pengyuan Zhang:
Pitch-Assistant Harmonic Recovery for Efficient Speech Enhancement. 1-5
- Chien-Chun Wang, En-Lun Yu, Jeih-Weih Hung, Shih-Chieh Huang, Berlin Chen:
SincQDR-VAD: A Noise-Robust Voice Activity Detection Framework Leveraging Learnable Filters and Ranking-Aware Optimization. 1-8
- Yu Zhang, Baotong Tian, Zhiyao Duan:
Conan: A Chunkwise Online Network for Zero-Shot Adaptive Voice Conversion. 1-8
- Muhammad Shakeel, Yui Sudo, Yifan Peng, Chyi-Jiunn Lin, Shinji Watanabe:
Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder. 1-7
- Shi-Wook Lee:
Robust Speech Emotion Recognition via Classifier Retraining on Mixup-Augmented Representations. 1-7
- Liming Wang, Saurabhchand Bhati, Cody Karjadi, Rhoda Au, James R. Glass:
Recognizing Dementia from Neuropsychological Tests with State Space Models. 1-7
- Wenjie Tian, Xinfa Zhu, Hanke Xie, Zhen Ye, Wei Xue, Lei Xie:
Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis. 1-7
- Umberto Cappellazzo, Minsu Kim, Stavros Petridis:
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs. 1-8
- Xi Xuan, Zimo Zhu, Wenxin Zhang, Yi-Cheng Lin, Tomi Kinnunen:
Fake-Mamba: Real-Time Speech Deepfake Detection Using Bidirectional Mamba as Self-Attention's Alternative. 1-8
- Bingshen Mu, Yiwen Shao, Kun Wei, Dong Yu, Lei Xie:
Efficient Scaling for LLM-based ASR. 1-7
- Qingzheng Wang, Hye-Jin Shim, Jiancheng Sun, Shinji Watanabe:
Geolocation-Aware Robust Spoken Language Identification. 1-7
- Chiori Hori, Yoshiki Masuyama, Siddarth Jain, Radu Corcodel, Devesh K. Jha, Diego Romeres, Jonathan Le Roux:
Robot Confirmation Generation and Action Planning Using Long-context Q-Former Integrated with Multimodal LLM. 1-7
- Sung-Lin Yeh, Yen Meng, Hao Tang:
Whisper Has an Internal Word Aligner. 1-7
- Shashi Kumar, Srikanth R. Madikeri, Esaú Villatoro-Tello, Sergio Burdisso, Pradeep Rangappa, Roberto Carofilis, Petr Motlícek, Karthik Pandia, Shankar Venkatesan, Kadri Hacioglu, Andreas Stolcke:
TokenVerse++: Towards Flexible Multitask Learning with Dynamic Task Activation. 1-7
- George Saon, Avihu Dekel, Alexander Brooks, Tohru Nagano, Abraham Daniels, Aharon Satt, Ashish R. Mittal, Brian Kingsbury, David Haws, Edmilson da Silva Morais, Gakuto Kurata, Hagai Aronowitz, Ibrahim Ibrahim, Hong-Kwang Kuo, Kate Soule, Luis A. Lastras, Masayuki Suzuki, Ron Hoory, Samuel Thomas, Sashi Novitasari, Takashi Fukuda, Vishal Sunder, Xiaodong Cui, Zvi Kons:
Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities. 1-7
- Tuan Nguyen, Huy-Dat Tran:
AsyncSwitch: Asynchronous Text-Speech Adaptation for Code-Switched ASR. 1-7
- Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Ken'ichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li:
Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model. 1-7
- Katsuhiko Yamamoto, Koichi Miyazaki, Shogo Seki:
The T12 System for AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- and VERSA-based Models. 1-4
- Gil Ayache, Menachem Pirchi, Aviv Navon, Aviv Shamsian, Gill Hetz, Joseph Keshet:
WhisperNER: Unified Open Named Entity and Speech Recognition. 1-6
- DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang:
Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion. 1-7
- Salima Mdhaffar, Haroun Elleuch, Chaimae Chellaf, Ha Nguyen, Yannick Estève:
SENSE models: an open source solution for multilingual and multimodal semantic-based tasks. 1-8
- Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner:
Scalable Controllable Accented TTS. 1-8
- Ramesh Gundluru, Shubham Gupta, K. Sri Rama Murty:
Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting. 1-7
- Duygu Altinok:
Whispering Context: Distilling Syntax and Semantics for Long Speech Transcripts. 1-7
- Jian You, Xiangfeng Li, Erwan Zerhouni:
Enhancing Fully Formatted End-to-End Speech Recognition with Knowledge Distillation via Multi-Codebook Vector Quantization. 1-6
- Erica Cooper, Takuma Okamoto, Yamato Ohtani, Tomoki Toda, Hisashi Kawai:
Layer-wise Analysis for Quality of Multilingual Synthesized Speech. 1-7
- Chien-Chun Wang, Kuan-Tang Huang, Cheng-Yeh Yang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen:
QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems. 1-4
- Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan:
CLAIRA: Leveraging Large Language Models to Judge Audio Captions. 1-7
- Huakang Chen, Yuepeng Jiang, Guobin Ma, Chunbo Hao, Shuai Wang, Jixun Yao, Ziqian Ning, Meng Meng, Jian Luan, Lei Xie:
DiffRhythm+: Controllable and Flexible Full-Length Song Generation with Preference Optimization. 1-8
- Kyo-Won Koo, Chan-yeong Lim, Jee-Weon Jung, Hye-Jin Shim, Ha-Jin Yu:
Token-based Attractors and Cross-attention in Spoof Diarization. 1-7
- Bin Wang, Xunlong Zou, Shuo Sun, Wenyu Zhang, Yingxu He, Zhuohan Liu, Chengwei Wei, Nancy F. Chen, AiTi Aw:
MNSC: Advancing Singlish Speech Understanding with Carefully Curated Corpora. 1-8
- Takashi Maekaku, Keita Goto, Jinchuan Tian, Yusuke Shinohara, Shinji Watanabe:
Evaluating Self-Supervised Speech Models Via Text-Based LLMs. 1-8
- Pu Wang, Shinji Watanabe, Hugo Van hamme:
SSVD: Structured SVD for Parameter-Efficient Fine-Tuning and Benchmarking under Domain Shift in ASR. 1-7
- Tianlun Zuo, Jingbin Hu, Yuke Li, Xinfa Zhu, Hai Li, Ying Yan, Junhui Liu, Danming Xie, Lei Xie:
XEmoRAG: Cross-Lingual Emotion Transfer with Controllable Intensity Using Retrieval-Augmented Generation. 1-7
- Chih-Kai Yang, Neo Ho, Yi-Jyun Lee, Hung-Yi Lee:
AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models. 1-8
- Dominik Wagner, Ilja Baumann, Natalie Engert, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet:
Joint ASR and Speech Attribute Prediction for Conversational Dysarthric Speech Analysis with Multimodal Language Models. 1-8
- Heng-Jui Chang, Saurabhchand Bhati, James R. Glass, Alexander H. Liu:
USAD: Universal Speech and Audio Representation via Distillation. 1-8
- Jie-Shiang Yang, Jing-Tong Tzeng, Chi-Chun Lee:
Personalized Federated Learning with Fuzzy Clustering for Dysarthric Speech Recognition. 1-7
- Wei Wang, Wangyou Zhang, Chenda Li, Jaitong Shi, Shinji Watanabe, Yanmin Qian:
Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment. 1-8
- Linping Xu, Ziqian Wu, Dejun Zhang:
Audio Aesthetics Prediction System QAM16k Based on Pre-trained Audio Encoder. 1-4
- Wanying Ge, Xin Wang, Xuechen Liu, Junichi Yamagishi:
Post-training for Deepfake Speech Detection. 1-8
- Fabian Ritter Gutierrez, Yi-Cheng Lin, Jeremy H. M. Wong, Hung-Yi Lee, Eng Siong Chng, Nancy F. Chen:
A correlation-permutation approach for speech-music encoders model merging. 1-7
- Yerin Ryu, Inseop Shin, Chanwoo Kim:
Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence. 1-7
- Carlos Carvalho, Francisco Teixeira, Catarina Botelho, Anna Pompili, Rubén Solera-Ureña, Sérgio Paulo, Mariana Julião, Thomas Rolland, John Mendonça, Diogo A. P. Nunes, Isabel Trancoso, Alberto Abad:
CAMÕES: A Comprehensive Automatic Speech Recognition Benchmark for European Portuguese. 1-8
- Yi-Jen Shih, David Harwath:
Unifying Model and Layer Fusion for Speech Foundation Models. 1-7
- Yurie Koga, Shunsuke Kando, Yusuke Miyao:
Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition? 1-8
- Thomas Thebaud, Yen-Ju Lu, Matthew Wiesner, Peter Viechnicki, Najim Dehak:
Enhancing Dialogue Annotation with Speaker Characteristics Leveraging a Frozen LLM. 1-7
- Mark Lindsey, Francis Kubala, Richard M. Stern:
Iterative Feedback in the Online Active Learning Paradigm. 1-6
- Ryo Masumura, Tomohiro Tanaka, Naoki Makishima, Mana Ihori, Shota Orihashi, Naotaka Kawata, Taiga Yamane, Satoshi Suzuki, Takafumi Moriya:
Phoneme Overlapping-Aware Pre-Training with External Text Resources for Multi-Talker ASR. 1-8
- Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei:
Qieemo: Multimodal Emotion Recognition Based on the ASR Backbone. 1-5
- Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogério Feris, James R. Glass:
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM? 1-7
- Minh Vu, Phuong Tuan Dat, Kah Kuan Teh, Van Tuan Nguyen, Tran Huy Dat:
Utilizing Kolmogorov-Arnold Network in Self-Supervised Learning for Speaker Diarization. 1-6
- Sara Barahona, Ladislav Mosner, Themos Stafylakis, Oldrich Plchot, Junyi Peng, Lukás Burget, Jan Cernocký:
State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data. 1-7
- Yangui Fang, Jing Peng, Xu Li, Yu Xi, Chengwei Zhang, Guohui Zhong, Kai Yu:
Low-Resource Domain Adaptation for Speech LLMs via Text-Only Fine-Tuning. 1-7
- Amartyaveer, Murali Kadambi, Chandra Mohan Sharma, Anupam Mandal, Prasanta Kumar Ghosh:
Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction. 1-7
- Imen Talbi, Christopher Gebauer, Lars Rumberg, Edith Beaulac, Hanna Ehlert, Jörn Ostermann:
Reliability of Lexical Richness Measures for ASR-Based Children's Speech Assessment. 1-7
- Tomoya Mizumoto, Yusuke Fujita, Hao Shi, Lianbo Liu, Atsushi Kojima, Yui Sudo:
Evaluating Japanese Dialect Robustness Across Speech and Text-based Large Language Models. 1-7
- Adria Mallol-Ragolta, Björn W. Schuller:
ProtoCLAP - Prototypical Contrastive Language-Audio Pretraining. 1-7
- Ilja Baumann, Dominik Wagner, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet:
Text-Guided Speech Representations for Language Acquisition Assessment. 1-8
- I-Ming Lin, Xuanjun Chen, Lin Zhang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang:
Towards Generalized Source Tracing for Codec-Based Deepfake Speech. 1-8
- Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, Saikat Chatterjee:
Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech. 1-6
- Xiaoxun Wu, Kailai Shen, Yuheng Huang, Naiyuan Li, Diqun Yan:
DyMEvalNet: Dynamic Text-Audio-Personalization Fusion for Multimodal Music Quality Assessment. 1-4
- Ruizhe Huang, Xiaohui Zhang, Zhaoheng Ni, Moto Hira, Jeff Hwang, Vineel Pratap, Ju Lin, Ming Sun, Florian Metze:
Long-Form Fuzzy Speech-to-Text Alignment for 1000+ Languages. 1-3
- Wanting Huang, Weiran Wang:
A Neural Model for Contextual Biasing Score Learning and Filtering. 1-7
- Virat Shejwalkar, Om Thakkar, Steve Chien, Nicole Rafidi, Arun Narayanan:
Improving Streaming ASR via Differentially Private Fusion of Data from Multiple Sources. 1-6
- Jiajun He, Naoki Sawada, Koichi Miyazaki, Tomoki Toda:
PARCO: Phoneme-Augmented Robust Contextual ASR via Contrastive Entity Disambiguation. 1-7
- Joonyong Park, Shinnosuke Takamichi, David M. Chan, Shunsuke Kando, Yuki Saito, Hiroshi Saruwatari:
Analysing the Language of Neural Audio Codecs. 1-7
- Takuma Okamoto:
Speech Masking System Based on Spatially Separated Multiple TTS Maskers With A Compact Circular Loudspeaker Array. 1-4
- Jiatong Shi, Chunlei Zhang, Jinchuan Tian, Junrui Ni, Hao Zhang, Shinji Watanabe, Dong Yu:
Continual Pre-training for Codec-Based Speech LLMs: Balancing Understanding and Generation. 1-8
- Jeremy H. M. Wong, Nancy F. Chen:
Obtaining objective labels and analysing annotator subjectivity by using a Rasch model for ordinal speech processing. 1-7
- Dongji Gao, Chenda Liao, Changliang Liu, Matthew Wiesner, Leibny Paola García-Perera, Daniel Povey, Sanjeev Khudanpur, Jian Wu:
WST: Weakly Supervised Transducer for Automatic Speech Recognition. 1-7
- Zongli Ye, Jiachen Lian, Akshaj Gupta, Xuanru Zhou, Haodong Li, Krish Patel, Hwi Joo Park, Dingkun Zhou, Chenxu Guo, Shuhe Li, Sam Wang, Iris Zhou, Cheol Jun Cho, Zoe Ezzes, Jet Vonk, Brittany Morin, Rian Bogley, Lisa Wauters, Zachary A. Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli:
LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness. 1-8
- Meng Yu, Dong Yu:
Deep Audio Zooming: Creating a Sound Barrier With Microphone Array Processing. 1-8
- Hashim Ali, Surya Subramani, Nithin Sai Adupa, Lekha Bollinani, Sali El-Loh, Hafiz Malik:
Multilingual Dataset Integration Strategies for Robust Audio Deepfake Detection: A SAFE Challenge System. 1-8
- James Tavernor, Emily Mower Provost:
More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition. 1-7
- Haoran Wang, Guanyu Chen, Bohan Li, Hankun Wang, Yiwei Guo, Zhihan Li, Xie Chen, Kai Yu:
Towards General Discrete Speech Codec for Complex Acoustic Environments: A Study of Reconstruction and Downstream Task Consistency. 1-6
- Piotr Zelasko, Kunal Dhawan, Daniel Galvez, Krishna C. Puvvada, Ankita Pasad, Travis M. Bartley, Nithin Rao Koluguri, Ke Hu, Vitaly Lavrukhin, Jagadeesh Balam, Boris Ginsburg:
Training and Inference Efficiency of Encoder-Decoder Speech Models. 1-7
- Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-Yi Lee, Hao Tang:
Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech Transformers. 1-7
- Ming-Hao Hsu, Hung-Yi Lee:
SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition. 1-7
- Go Nishikawa, Wataru Nakata, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari, Tomohiko Nakamura:
Multi-Sampling-Frequency Naturalness MOS Prediction Using Self-Supervised Learning Model with Sampling-Frequency-Independent Layer. 1-4
- Wei-Ping Huang, Guan-Ting Lin, Hung-Yi Lee:
SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR. 1-7
- Yuezhang Peng, Yuxin Liu, Yao Li, Sheng Wang, Fei Wen, Xie Chen:
ZO-ASR: Zeroth-Order Fine-Tuning of Speech Foundation Models without Back-Propagation. 1-7
- Tolúlopé Ògúnrèmí, Christopher D. Manning, Dan Jurafsky, Karen Livescu:
Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models. 1-7
- Hao Shi, Yusuke Fujita, Tomoya Mizumoto, Lianbo Liu, Atsushi Kojima, Yui Sudo:
Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition. 1-8
- Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah:
JOOCI: a Novel Method for Learning Comprehensive Speech Representations. 1-8
- Yang Cui, Peter Pan, Lei He, Sheng Zhao:
Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation. 1-7
- Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, Hsin-Min Wang, Yu Tsao:
Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings. 1-4
- Yu Xi, Xiaoyu Gu, Haoyu Li, Jun Song, Bo Zheng, Kai Yu:
Masked Self-distilled Transducer-based Keyword Spotting with Semi-autoregressive Decoding. 1-7
- Xingyu Shen, Wei-Ping Zhu, Benoît Champagne:
PhysMVNet: Physics-Informed End-to-End MVDR Beamformer with Residual Spectral Mapping for Multichannel Speech Enhancement. 1-7
- Ju-Chieh Chou, Jiawei Zhou, Karen Livescu:
Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling. 1-7
- Wen-Chin Huang, Hui Wang, Cheng Liu, Yi-Chiao Wu, Andros Tjandra, Wei-Ning Hsu, Erica Cooper, Yong Qin, Tomoki Toda:
The AudioMOS Challenge 2025. 1-8
- Hongming Guo, Ruibo Fu, Yizhong Geng, Shuchen Shi, Tao Wang, Chunyu Qiang, Ya Li, Zhengqi Wen, Yukun Liu, Xuefei Liu, Chenxing Li:
Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation. 1-6
- Arnon Turetzky, Avihu Dekel, Nimrod Shabtay, Slava Shechtman, David Haws, Hagai Aronowitz, Ron Hoory, Yossi Adi:
Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion. 1-8
- Sarenne Wallbridge, Adaeze Adigwe, Peter Bell:
Can self-supervised speech models predict the perceived acceptability of prosodic variation? 1-8
- Alexander Polok, Santosh Kesiraju, Karel Benes, Bolaji Yusuf, Lukás Burget, Jan Cernocký:
DeCRED: Decoder-Centric Regularization for Encoder-Decoder Based Speech Recognition. 1-7
- Chaohao Lin, Xu Zheng, Kaida Wu, Peihao Xiang, Ou Bai:
Emotional Styles Hide in Deep Speaker Embeddings: Disentangle Deep Speaker Embeddings for Speaker Clustering. 1-6
- Yiwen Zhao, Jiatong Shi, Yuxun Tang, William Chen, Shinji Watanabe:
Robust Training of Singing Voice Synthesis Using Prior and Posterior Uncertainty. 1-8
- Kentaro Onda, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu:
Benchmarking Prosody Encoding in Discrete Speech Tokens. 1-8
- Mana Ihori, Taiga Yamane, Naotaka Kawata, Naoki Makishima, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura:
Few-shot Personalization via In-Context Learning for Speech Emotion Recognition based on Speech-Language Model. 1-8
- Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, Hung-Yi Lee:
Full-Duplex-Bench: A Benchmark to Evaluate Full-Duplex Spoken Dialogue Models on Turn-taking Capabilities. 1-8
- Cihan Xiao, Ruixing Liang, Xiangyu Zhang, Mehmet Emre Tiryaki, Veronica Bae, Lavanya Shankar, Rong Yang, Ethan Poon, Emmanuel Dupoux, Sanjeev Khudanpur, Leibny Paola García-Perera:
CASPER: A Large Scale Spontaneous Speech Dataset. 1-7
- Guo Chen, Kai Li, Runxuan Yang, Xiaolin Hu:
Time-Frequency-Based Attention Cache Memory Model for Real-Time Speech Separation. 1-6
- Yunbin Deng:
Acoustic Phonetic Temporal Speech Representation. 1-7
- Tien-Hong Lo, Szu-Yu Chen, Yao-Ting Sung, Berlin Chen:
An Effective Strategy for Modeling Score Ordinality and Non-uniform Intervals in Automated Speaking Assessment. 1-7
- Chenda Li, Wangyou Zhang, Wei Wang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Yihui Fu, Marvin Sach, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian:
Less is More: Data Curation Matters in Scaling Speech Enhancement. 1-8
- Saurabh Kumar, Sumit Sharma, Deekshitha G, Abhayjeet Singh, Amartyaveer, Sathvik Udupa, Sandhya Badiger, Sanjeev Khudanpur, Sunayana Sitaram, Srinivasan Umesh, Bhuvana Ramabhadran, Brian Kingsbury, Hema A. Murthy, Srikanth S. Narayanan, Howard Lakougna, Prasanta Kumar Ghosh:
MADASR 2.0: Multi-Lingual Multi-Dialect ASR Challenge in 8 Indian Languages. 1-7
- Beilong Tang, Xiaoxiao Miao, Xin Wang, Ming Li:
SEF-MK: Speaker-Embedding-Free Voice Anonymization through Multi-k-means Quantization. 1-8
- Shreya G. Upadhyay, Carlos Busso, Chi-Chun Lee:
Speaker Style-Aware Phoneme Anchoring For Improved Cross-Lingual Speech Emotion Recognition. 1-8
- Jeremy H. M. Wong, Muhammad Huzaifah, Hardik B. Sailor, Shuo Sun, Kye Min Tan, Bin Wang, Qiongqiong Wang, Wenyu Zhang, Xunlong Zou, Nancy F. Chen, Ai Ti Aw:
Diversity and complementarity of speech encoders across diverse tasks in a multi-modal large language model. 1-8
- Jiahe Wang, Chenda Li, Wei Wang, Wangyou Zhang, Samuele Cornell, Marvin Sach, Robin Scheibler, Kohei Saijo, Yihui Fu, Zhaoheng Ni, Anurag Kumar, Tim Fingscheidt, Shinji Watanabe, Yanmin Qian:
URGENT-PK: Perceptually-Aligned Ranking Model Designed for Speech Enhancement Competition. 1-7
- Thomas Ranzenberger, Dominik Wagner, Steffen Freisinger, Tobias Bocklet, Korbinian Riedhammer:
Improving Multimodal Speech-To-Slide Alignment for Academic Lectures with Vision LLMs. 1-7
- Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg:
TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree. 1-7
- Shaoshi Ling, Guoli Ye:
Customizing Speech Recognition Model with Large Language Model Feedback. 1-6
- Mohamed Elminshawi, Srikanth Raj Chetupalli, Emanuël A. P. Habets:
AdaBit-TasNet: Speech Separation with Inference Adaptable Precision. 1-6
- Massa Baali, Shuo Han, Syed Abdul Hannan, Purusottam Samal, Karanveer Singh, Soham Deshmukh, Rita Singh, Bhiksha Raj:
CoLMbo: Speaker Language Model for Descriptive Profiling. 1-7
- Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen:
Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora. 1-8
- Ashi Garg, Zexin Cai, Henry Li Xinyuan, Leibny Paola García-Perera, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews:
Rapidly Adapting to New Voice Spoofing: Few-Shot Detection of Synthesized Speech Under Distribution Shifts. 1-7
- Hyun-Soo Kim, Da-Hee Yang, Joon-Hyuk Chang:
A Momentum-Based Framework with Contrastive Data Generation for Robust Sound Source Localization. 1-7
- Tzu-Wen Hsu, Ke-Han Lu, Cheng-Han Chiang, Hung-Yi Lee:
Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding. 1-7
- Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocký:
Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training. 1-8
- Shota Horiguchi, Naohiro Tawara, Takanori Ashihara, Atsushi Ando, Marc Delcroix:
Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization? 1-8
- Kuan-Tang Huang, Li-Wei Chen, Hung-Shin Lee, Berlin Chen, Hsin-Min Wang:
Revealing the Role of Audio Channels in ASR Performance Degradation. 1-7
- Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, David Ifeoluwa Adelani:
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks. 1-8
- Simon Dahl Jepsen, Mads Græsbøll Christensen, Jesper Rindom Jensen:
A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References*. 1-8
- Yi-Cheng Lin, Jia-Hung Chen, Hung-Yi Lee:
MMMOS: Multi-domain Multi-axis Audio Quality Assessment. 1-4
- Zhu Han, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, Daniel Povey:
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching. 1-8
- Séverin Baroudi, Hervé Bredin, Joseph Razik, Ricard Marxer:
On the Use of Self-Supervised Representation Learning for Speaker Diarization and Separation. 1-7
- Cheng-Kang Chou, Chan-Jan Hsu, Ho-Lam Chung, Liang-Hsuan Tseng, Hsi-Chun Cheng, Yu-Kuan Fu, Kuan-Po Huang, Hung-Yi Lee:
A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data. 1-8
- Jiajian Chen, Jiakang Chen, Hang Chen, Qing Wang, Yu Gao, Jun Du:
MEAN-RIR: Multi-Modal Environment-Aware Network for Robust Room Impulse Response Estimation. 1-7
- Ryo Fukuda, Takatomo Kano, Naohiro Tawara, Marc Delcroix, Atsunori Ogawa, Yuya Chiba, Atsushi Ando:
Predictive ASR and Turn-taking Prediction at Once: Towards More Responsive Spoken Dialog System. 1-7
- Haibin Yu, Jiayi Zhou, Wei Wang, Zhiming Wang, Huijia Zhu, Yanmin Qian:
Advancing Controllable Music Generation with Latent Rectified Flow Guided by Rhythm and Harmony. 1-7
- Yuan Tseng, Titouan Parcollet, Rogier C. van Dalen, Shucong Zhang, Sourav Bhattacharya:
Evaluation of LLMs in Speech is Often Flawed: Test Set Contamination in Large Language Models for Speech Recognition. 1-8
- Yu-Hsuan Fang, Tien-Hong Lo, Yao-Ting Sung, Berlin Chen:
Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning. 1-7
- Jing-Han Chen, Bo-Hao Su, Ya-Tse Wu, Chi-Chun Lee:
RE-LLM: Refining Empathetic Speech-LLM Responses by Integrating Emotion Nuance. 1-7
- Yuekai Zhang, Shuang Yu, Junjie Lai:
Efficient Deployment of Large Speech Recognition Models on GPU. 1-4
- Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, Wei-Ning Hsu:
Meta Audiobox Aesthetics: Unified Automatic Assessment for Speech, Music and Sound. 1-8
- Yingke Zhu, Lahiru Samarakoon:
Non-Autoregressive Multi-Speaker ASR with Decoupled Speaker Change Detection. 1-6
- Jiatong Shi, Haoran Wang, William Chen, Chenda Li, Wangyou Zhang, Jinchuan Tian, Shinji Watanabe:
PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning. 1-8
- Beilong Tang, Bang Zeng, Ming Li:
LauraTSE: Target Speaker Extraction using Auto-Regressive Decoder-Only Language Models. 1-8
- Ya-Tse Wu, Chi-Chun Lee:
ASR for Affective Speech: Investigating Impact of Emotion and Speech Generative Strategy. 1-7
- Jesús Villalba, Jonas Borgstrom, Prabhav Singh, Leibny Paola García, Pedro A. Torres-Carrasquillo, Najim Dehak:
The JHU-MIT System for NIST SRE24: Post-Evaluation Analysis. 1-7
- Haoyu Wang, Bei Liu, Hang Shao, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian:
OOQ: Outlier-Oriented Quantization for Efficient Large Language Models. 1-7
- Bo Ren, Yu Shi, Jinyu Li:
Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems. 1-7
- Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim:
Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models. 1-7
- Saba Tabatabaee, Suzanne Boyce, Liran Oren, Mark Tiede, Carol Y. Espy-Wilson:
Acoustic to Articulatory Speech Inversion for Children with Velopharyngeal Insufficiency. 1-7
- Navin Raj Prabhu, Danilo de Oliveira, Nale Lehmann-Willenbrock, Timo Gerkmann:
Enhancing In-the-Wild Speech Emotion Conversion with Resynthesis-based Duration Modeling. 1-7
- Wei-Cheng Tseng, David Harwath:
Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs. 1-7
- Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe:
Spiralformer: Low Latency Encoder for Streaming Speech Recognition with Circular Layer Skipping and Early Exiting. 1-7
- Leander Melroy Maben, Gayathri Ganesh Lakshmy, Srijith Radhakrishnan, Siddhant Arora, Shinji Watanabe:
AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks. 1-4
- Yuxuan Hu, Haibin Wu, Ruchao Fan, Xiaofei Wang, Heng Lu, Yao Qian, Jinyu Li:
SLM-S2ST: A multimodal language model for direct speech-to-speech translation. 1-8
- Jungwoo Heo, Hyun-seo Shin, Chan-Yeong Lim, Kyo-Won Koo, Seung-bin Kim, Jisoo Son, Ha-Jin Yu:
SV-Mixer: Replacing the Transformer Encoder with Lightweight MLPs for Self-Supervised Model Compression in Speaker Verification. 1-8
- Zihan Pan, Hardik B. Sailor, Jinyang Wu:
MoLEx: Mixture of LoRA Experts in Speech Self-Supervised Models for Audio Deepfake Detection. 1-8
- Jiatong Shi, Bo-Hao Su, Shikhar Bharadwaj, Yiwen Zhao, Shih-Heng Wang, Jionghao Hang, Haoran Wang, Wei Wang, Wenhao Feng, Yuxun Tang, Nezih Topaloglu, Siddhant Arora, Jinchuan Tian, William Chen, Hye-jin Shim, Wangyou Zhang, Wen-Chin Huang, Shinji Watanabe:
VERSA-v2: A Modular and Scalable Toolkit for Speech and Audio Evaluation with Expanded Metrics, Visualization, and LLM Integration. 1-4
- Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Hung-Yi Lee:
CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition. 1-8
- Yang Liu, Li Wan, Yiteng Huang, Yong Xu, Yangyang Shi, Saurabh Adya, Ming Sun, Florian Metze:
MMW: Side Talk Rejection Multi-Microphone Whisper On Smart Glasses. 1-8
- Chen Zhang, Linfeng Feng, Zhi Liu, Xiao-Lei Zhang, Xuelong Li:
MBENet: Bone-conduction and Air-conduction Fusion Network for Target Speaker Extraction. 1-5
- Yangbiao Li, Xiaofen Xing, Jialong Mai, Jingyuan Xing, Xiangmin Xu:
Intermediate-Selective Feature Enhancement for Speech Emotion Recognition. 1-7
- Aristeidis Papadopoulos, Naomi Harte:
Interpreting the Role of Visemes in Audio-Visual Speech Recognition. 1-8
- Zexin Cai, Henry Li Xinyuan, Ashi Garg, Leibny Paola García-Perera, Kevin Duh, Sanjeev Khudanpur, Matthew Wiesner, Nicholas Andrews:
GenVC: Self-Supervised Zero-Shot Voice Conversion. 1-8
- Jinsheng Chen, Yuki Saito, Dong Yang, Naoko Tanji, Hironori Doi, Byeongseon Park, Yuma Shirahata, Kentaro Tachibana, Hiroshi Saruwatari:
CAVIARES: Corpus for Audio-Visual Expressive Voice Agent. 1-4
- Tina Raissi, Nick Rossenbach, Ralf Schlüter:
Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions. 1-7
- Shuju Shi:
L2 Vowel Acquisition Analysis at the Inventory Level. 1-7
- Alexandrine Fortier, Sonal Joshi, Thomas Thebaud, Jesús Antonio Villalba López, Najim Dehak, Patrick Cardinal:
Multi-Target Backdoor Attacks Against Speaker Recognition. 1-7
- Henry Grafé, Hugo Van hamme:
Graph Connectionist Temporal Classification for Phoneme Recognition. 1-6
- Xiaodan Chen, Xiaoxue Gao, Mathias Quoy, Alexandre Pitti, Nancy F. Chen:
Confidence-Based Self-Training for EMG-to-Speech: Leveraging Synthetic EMG for Robust Modeling. 1-8
- Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly:
ChipChat: Low-Latency Cascaded Conversational Agent in MLX. 1-4
- Mingyue Huo, Yuheng Zhang, Yan Tang:
Identifying and Calibrating Overconfidence in Noisy Speech Recognition. 1-7
- Yuepeng Jiang, Ziqian Ning, Shuai Wang, Chengjia Wang, Mengxiao Bi, Pengcheng Zhu, Zhonghua Fu, Lei Xie:
REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers. 1-6
- Shucong Zhang, Titouan Parcollet, Rogier C. van Dalen, Sourav Bhattacharya:

Benchmarking Rotary Position Embeddings for Automatic Speech Recognition. 1-7 - Jui-Chiang Wei, Yi-Cheng Lin, Fabian Ritter-Gutierrez, Hung-Yi Lee:

Multi-Distillation from Speech and Music Representation Models. 1-8 - Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ivan Bulyko:

Group Relative Policy Optimization for Speech Recognition. 1-7 - Jakaria Islam Emon, Md Abu Salek, Kazi Tamanna Alam:

WhisQ: Cross-Modal Representation Learning for Text-to-Music MOS Prediction. 1-4 - Yeseul Park, Bowon Lee:

Towards Scalable and Robust Multilingual ASR for Indian Languages with MixLoRA-Whisper. 1-4 - Jeremy H. M. Wong, Muhammad Huzaifah, Nancy F. Chen, Ai Ti Aw:

Speech in-context learning of paralinguistic tasks. 1-7 - Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly:

Omni-Router: Sharing Routing Decisions in Sparse Mixture-of-Experts for Speech Recognition. 1-8 - Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Ryandhimas E. Zezario, Szu-Wei Fu, Sung-Feng Huang, Erica Cooper, Haibin Wu, Hung-Yu Wei, Hsin-Min Wang, Hung-Yi Lee, Yu Tsao:

HighRateMOS: Sampling-Rate Aware Modeling for Speech Quality Assessment. 1-4 - Wangjin Zhou, Yizhou Zhang, Keisuke Imoto, Tatsuya Kawahara:

KyotoMOS2: MOS Prediction for Speech Across Multiple Sampling Rates. 1-4 - Jinsung Yoon, Wooyeol Jeong, Jio Gim, Young-Joo Suh:

Maestro-EVC: Controllable Emotional Voice Conversion Guided by References and Explicit Prosody. 1-7 - Insung Ham, Bonwha Ku, Hanseok Ko:

EmoBiMamba-TTS: Bidirectional State Space Model for Emotion-Intensity Controllable Text-to-Speech. 1-8 - Qiongqiong Wang, Hardik Bhupendra Sailor, Jeremy H. M. Wong, Tianchi Liu, Shuo Sun, Wenyu Zhang, Muhammad Huzaifah, Nancy F. Chen, Ai Ti Aw:

Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models. 1-8
