主页 | 谢磊

谢磊，西北工业大学计算机学院教授、博士生导师，音频语音与语言处理实验室（ASLP@NPU）负责人。主要研究方向包括语音处理、对话式人工智能，以及面向语音与语言技术的先进神经网络模型与大模型技术，在语音增强、自动语音识别、语音合成与语音对话等领域开展了系统性研究。

他长期致力于建设面向学术界的开源工具与数据资源，指导了被广泛使用的 WeNet 语音识别工具包和 WenetSpeech 开源语音数据系列等项目。

他曾获得多项荣誉，包括教育部新世纪优秀人才支持计划、陕西省青年科技新星、全球前2%顶尖科学家（斯坦福大学 & Elsevier）以及华为云 AI 名师等。已发表论文 400 余篇，Google Scholar 引用超过 18000 次，H-index 为 63。曾获多项国际会议最佳论文奖及国际评测冠军，诸多研究成果已实现产业落地。现任 ISCA SIG-CSLP 副主席，并担任 IEEE/ACM TASLP 与 IEEE SPL 的高级领域编委（SAE）。

邮箱：lxie@nwpu.edu.cn
地址：西安市长安区西北工业大学长安校区计算机学院 207 室

展开详细简介

谢磊，西北工业大学计算机学院教授、博士生导师，音频语音与语言处理实验室（ASLP@NPU）负责人。其研究聚焦于语音处理、对话式人工智能，以及面向语音与语言技术的先进神经网络模型，在语音增强、自动语音识别和语音合成等方向做出了重要贡献。

他也长期致力于面向学术界建设开源研究基础设施，指导了被广泛使用的 WeNet 语音识别工具包以及 WenetSpeech 开源语音数据系列等项目。

谢磊博士于西北工业大学获得计算机工程博士学位，博士阶段主要从事语音识别研究。在加入西北工业大学任教之前，曾在比利时布鲁塞尔自由大学（Vrije Universiteit Brussel）、香港城市大学和香港中文大学从事科研工作。

他曾获得多项荣誉，包括教育部新世纪优秀人才支持计划、陕西省青年科技新星、全球前2%顶尖科学家（斯坦福大学 & Elsevier）以及华为云 AI 名师等。

谢磊教授已在音频、语音与语言处理领域发表400余篇同行评议论文，Google Scholar 引用超过 18000 次，H-index 为 63。其研究成果曾多次获得国际学术会议最佳论文奖，并在多项国际评测竞赛中取得冠军。诸多研究成果也已成功应用于产业实践。

在 ASLP@NPU，他指导着一批背景多元的学生和研究人员，围绕语音、音频与语言智能开展前沿研究。他也长期活跃于国际学术共同体，担任多个学术组织和期刊的重要职务。目前，他担任国际语音通信协会 ISCA 中文口语语言处理兴趣组（SIG-CSLP）副主席，以及 IEEE/ACM Transactions on Audio, Speech, and Language Processing 和 IEEE Signal Processing Letters 的高级领域编委（Senior Area Editor）。

高光成果

开源！SoulX-Transcriber——面向复杂对话的端到端多说话人语音转录框架来啦！

详细了解 >

SmoothConv & DuplexConv：面向对话式 AI 的大规模中文全双工语音数据集开源！

详细了解 >

新闻公告

Jun 06, 2026	11 篇论文被语音研究旗舰会议 Interspeech 2026 录用！
Jun 06, 2026	我们联合 Soul APP 和 MoonStep AI 开源了多说话人语音转写模型 SoulX-Transcriber，在多个公开基准上获得最佳性能！
May 19, 2026	我们联合 WeNet 开源社区推出了 S2Accompanist，以402M轻量参数斩获 ICME 2026 ATTM 效率赛道冠军！
May 19, 2026	我们很高兴宣布，第二届多语言对话式语音语言模型挑战赛（MLC-SLM）设立总计 2 万美元奖金池。欢迎参与挑战，赢取大奖！
Apr 20, 2026	IEEE SLT 2026 SmartGlasses 挑战赛盛大开启！聚焦第一视角下的真实社交语音交互
Apr 10, 2026	2026 届硕士同学顺利毕业，人均 6+offer，获选腾讯青云计划，京东顶尖青年技术天才计划等，入职阿里巴巴（Alibaba）、腾讯（Tencent）、京东（JD.com）等业界头部企业或读博深造。祝贺！
Apr 07, 2026	WenetSpeech-Wu —— 迄今为止最大的吴语数据集（Wu Chinese dataset），已被 ACL 2026 接收
Apr 07, 2026	LLM-forced Aligner —— Qwen3-ForcedAligner 背后的核心技术，已被 ACL 2026 接收
Apr 05, 2026	恭喜姚继珣博士获得腾讯青云计划，入职腾讯！
Mar 17, 2026	4 篇论文被 ICME 2026 录用

实验室

音频语音与语言处理实验室（ASLP@NPU）由西北工业大学谢磊教授领衔，是国内外语音、音频与语言智能领域具有广泛影响力和知名度高的研究团队。实验室围绕语音识别、语音合成、语音增强、口语对话系统以及新兴音频语言模型等方向开展前沿研究，始终坚持学术创新与实际应用并重。

ASLP@NPU 高度重视科研成果的工程化与产业落地，长期与工业界保持紧密而深入的合作关系。实验室多项研究成果已成功应用于实际场景，所建设的 WeNet 工具平台与 WenetSpeech 数据资源也已被学术界和工业界广泛采用。

实验室同时高度重视人才培养，已为语音与人工智能领域培养了大批优秀人才，众多毕业生和成员已成长为头部科技企业和科研机构中的技术领军人物、资深研究人员与核心技术骨干。

通过融合学术深度、工程能力与产业视野，ASLP@NPU 持续推动语音智能与下一代人机交互技术的发展。

开源项目概览

SoulX-Transcriber — 面向复杂对话的端到端多说话人语音转录框架
YingMusic-Singer+ — 支持灵活歌词操控与无标注旋律引导的可控歌声合成
SoulX-Podcast — 基于文本生成高保真播客，支持多人对话、多种方言
DiffRhythm — 基于潜在扩散的端到端全长歌曲生成模型
OSUM — 面向学术有限资源的开放语音理解模型
SongEval — 歌曲美学评估工具包
WenetSpeech-Yue — 大规模多维度标注粤语语音语料库
MeanVC — 基于均值流的轻量级流式零样本语音转换
VoiceSculptor — 基于 LLaSA 和 CosyVoice2 的指令式语音合成方案
WenetSpeech-Chuan — 大规模四川方言语音语料库
DiffRhythm2 — 基于块流匹配的高效高保真歌曲生成
WenetSpeech-Wu-Repo — 大规模吴方言语音语料库
SongFormer — 超快超准音乐标注神器

近期论文

Collaborators

Interspeech

OmniCodec: Low Frame Rate Universal Audio Codec with Semantic-Acoustic Disentanglement

Jingbin Hu, Haoyu Zhang, Dake Guo, Qirui Zhan, Wenhao Li, Huakang Chen, and 7 more authors

In Interspeech, 2026

Abstract arXiv

Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec tailored for low frame rate. It adopts a hierarchical multi-codebook design with semantic-acoustic decoupling by leveraging the audio encoder of the pre-trained understanding model, along with a self-guidance strategy to improve codebook utilization and reconstruction. Compared with the Mimi codec, experiments show that OmniCodec achieves outstanding performance at the same bitrate, delivering superior reconstruction quality while also providing more semantically informative representations that benefit downstream generation tasks. Our model and code will be open-sourced. Our demo page is available.
Interspeech

Joint Learning Global-Local Speaker Classification to Enhance End-to-End Speaker Diarization and Recognition

Yuhang Dai, Haopeng Lin, Jiale Qian, Ruiqi Yan, Hao Meng, Hanke Xie, and 6 more authors

In Interspeech, 2026

Abstract arXiv

Large Audio-Language Models (LALMs) have demonstrated remarkable performance in end-to-end speaker diarization and recognition. However, their speaker discriminability remains limited due to the scarcity of large-scale conversational data and the absence of explicit speaker representation optimization. To address this, we propose GLSC-SDR, a paradigm that jointly trains speaker classification with diarization and recognition. We further introduce a Global-Local Speaker Classification strategy, which uses clustered speakers as global labels and re-encoded intra-cluster speakers as local labels. This hierarchical design enhances fine-grained speaker discrimination while preserving semantic transcription accuracy. Experiments on AliMeeting, AISHELL-4, and AMI-SDM demonstrate that GLSC-SDR achieves competitive or superior performance compared to simulation-based and multi-encoder approaches, without relying on large-scale real conversational data.
Interspeech

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, and 7 more authors

In Interspeech, 2026

Abstract arXiv

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed isolated: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.
Interspeech

YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

Chunbo Hao, Junjie Zheng, Guobin Ma, Yuepeng Jiang, Huakang Chen, Wenjie Tian, and 3 more authors

In Interspeech, 2026

Abstract arXiv

Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer-Plus, a fully diffusion-based model enabling melody-controllable singing voice synthesis with flexible lyric manipulation. The model takes three inputs: an optional timbre reference, a melody-providing singing clip, and modified lyrics, without manual alignment. Trained with curriculum learning and Group Relative Policy Optimization, YingMusic-Singer-Plus achieves stronger melody preservation and lyric adherence than Vevo2, the most comparable baseline supporting melody control without manual alignment. We also introduce LyricEditBench, the first benchmark for melody-preserving lyric modification evaluation. The code, weights, benchmark, and demos are publicly available.
Interspeech

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

Hanke Xie, Xiaming Ren, Dake Guo, Ruonan You, Wenhao Li, Jingbin Hu, and 7 more authors

In Interspeech, 2026

Abstract arXiv

Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325ms compared to robust streaming baselines, all while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.
Interspeech

G-MaP-SE: Guided Speech Enhancement via GMM-Based Prior Matching

Yike Zhu, Ziqian Wang, Zikai Liu, Xingchen Li, Zhuangqi Chen, Xianjun Xia, and 2 more authors

In Interspeech, 2026

Abstract arXiv

Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.
Interspeech

Beyond Semantic Dominance: Cognitive Affective Reasoning and Empathetic Response Alignment in Audio Language Models

Zhixian Zhao, Shuiyuan Wang, Wenjie Tian, Jingbin Hu, Ziyu Zhang, and Lei Xie

In Interspeech, 2026

Abstract arXiv

While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM, a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a “lexically-identical, multi-emotion” dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.
Interspeech

DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching

Yuepeng Jiang, Huakang Chen, Ziqian Ning, Jixun Yao, Zerui Han, Di Wu, and 4 more authors

In Interspeech, 2026

Abstract arXiv

Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocal. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels and constraints, all while preserving the high generation quality and efficiency of NAR models. To make this framework computationally tractable for long sequences, we implement a music variational autoencoder (VAE) that achieves a low frame rate of 5 Hz while still enabling high-fidelity audio reconstruction. In addition, to overcome the limitations of multi-preference optimization in RLHF, we propose cross-pair preference optimization. This method effectively mitigates the performance drop typically associated with model merging, allowing for more robust optimization across diverse human preferences. We further enhance musicality and structural coherence by introducing stochastic block representation alignment loss.
Interspeech

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

Guobin Ma, Yuxuan Xia, Yuepeng Jiang, Dake Guo, Hanke Xie, Jingbin Hu, and 3 more authors

In Interspeech, 2026

Abstract arXiv

Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
Interspeech

MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios

Zhaokai Sun, Shuai Wang, Zhennan Lin, Chengyou Wang, Dehui Guo, Yunang Cao, and 3 more authors

In Interspeech, 2026

Abstract arXiv

Spoken Language Understanding (SLU) has progressed from traditional single-task methods to large audio language model (LALM) solutions. Yet, most existing speech benchmarks focus on single-speaker or isolated tasks, overlooking the challenges posed by multi-speaker conversations that are common in real-world scenarios. We introduce MSU-Bench, a comprehensive benchmark for evaluating multi-speaker conversational understanding with a speaker-centric design. Our hierarchical framework covers four progressive tiers: single-speaker static attribute understanding, single-speaker dynamic attribute understanding, multi-speaker background understanding, and multi-speaker interaction understanding. This structure ensures all tasks are grounded in speaker-centric contexts, from basic perception to complex reasoning across multiple speakers. By evaluating state-of-the-art models on MSU-Bench, we demonstrate that as task complexity increases across the benchmark’s tiers, all models exhibit a significant performance decline. We also observe a persistent capability gap between open-source models and closed-source commercial ones, particularly in multi-speaker interaction reasoning. These findings validate the effectiveness of MSU-Bench for assessing and advancing conversational understanding in realistic multi-speaker environments. Demos can be found in the supplementary material.
ICME

Integrating Fine-Grained Audio-Visual Evidence for Robust Multimodal Emotion Reasoning

Zhixian Zhao, Wenjie Tian, Xiaohai Tian, Jun Zhang, and Lei Xie

In ICME, 2026

Abstract arXiv

Multimodal emotion analysis is shifting from static classification to generative reasoning. Beyond simple label prediction, robust affective reasoning must synthesize fine-grained signals such as facial micro-expressions and prosodic which shifts to decode the latent causality within complex social contexts. However, current Multimodal Large Language Models (MLLMs) face significant limitations in fine-grained perception, primarily due to data scarcity and insufficient cross-modal fusion. As a result, these models often exhibit unimodal dominance which leads to hallucinations in complex multimodal interactions, particularly when visual and acoustic cues are subtle, ambiguous, or even contradictory. To address this, we introduce SABER-LLM, a framework designed for robust multimodal reasoning. First, we construct SABER, a large-scale emotion reasoning dataset comprising 600K video clips, annotated with a novel six-dimensional schema that jointly captures audiovisual cues and causal logic. Second, we propose the structured evidence decomposition paradigm, which enforces a “perceive-then-reason” separation between evidence extraction and reasoning to alleviate unimodal dominance. The ability to perceive complex scenes is further reinforced by consistency-aware direct preference optimization, which explicitly encourages alignment among modalities under ambiguous or conflicting perceptual conditions. Experiments on EMER, EmoBench-M, and SABER-Test demonstrate that SABER-LLM significantly outperforms open-source baselines and achieves robustness competitive with closed-source models in decoding complex emotional dynamics. The dataset and model are available.
ICME

SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Yanbo Wang, and 2 more authors

In ICME, 2026

Abstract arXiv

Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are open-sourced.
ICME

S2Accompanist: A Semantic-Aware and Structure-Guided Diffusion Model for Music Accompaniment Generation

Huakang Chen, Wenkai Cheng, Guobin Ma, Chunbo Hao, Yuxuan Xia, Mengqi Wei, and 4 more authors

In ICME, 2026

Abstract arXiv

High-fidelity text-to-music generation typically relies on massive proprietary datasets and immense computational resources. Existing models often struggle to generate coherent pure musical accompaniments and lack precise, localized semantic control due to their reliance on coarse, track-level annotations. To address these limitations under constrained data and computing resources, we propose S2Accompanist, a Semantic-Aware and Structure-Guided Diffusion Model developed for the ICME2026 ATTM Grand Challenge. Specifically, we design an automated data pipeline comprising structural segmentation, Large Audio-Language Model driven segment-level captioning, and dual-metric quality grading to overcome the absence of localized metadata in raw datasets. Furthermore, we propose a semantic-aware Variational Autoencoder fine-tuning strategy that explicitly distills foundational LeadSheet structures into the acoustic latent space, effectively improving the overall audio fidelity. Extensive experiments demonstrate that S2Accompanist achieves state-of-the-art objective performance on the ATTM Grand Challenge benchmark across both the Efficiency and Performance Tracks. With only 402M parameters, our model remains competitive compared to larger-scale unconstrained models and secured first place in the Efficiency Track.
ICME

SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

Xingchen Li, Hanke Xie, Ziqian Wang, Zihan Zhang, Longshuai Xiao, and Lei Xie

In ICME, 2026

Abstract arXiv

Generative Universal Speech Enhancement (USE) methods aim to leverage generative models to improve speech quality under various types of distortions. However, existing generative speech enhancement methods often suffer from semantic inconsistency in the generated outputs. Therefore, we propose SenSE, a novel two-stage generative universal speech enhancement framework, by modeling semantic priors with a language model, the flow matching-based speech enhancement process is guided to generate semantically faithful speech, thereby effectively improving context fidelity. In addition, we introduce a dual-path masked conditioning training strategy that enables flow matching-based enhancement to flexibly integrate multi-source conditioning signals from degraded speech, semantic tokens, and reference speech, thereby improving model flexibility and adaptability. Experimental results demonstrate that SenSE achieves state-of-the-art performance among generative speech enhancement models and exhibits a high performance ceiling, particularly under challenging distortion conditions. Codes and demos are available.
ACL

WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem

Chengyou Wang, Mingchen Shao, Jingbin Hu, Zeyu Zhu, Hongfei Xue, Bingshen Mu, and 8 more authors

In ACL, 2026

Abstract arXiv

Speech processing for low-resource dialects remains a fundamental challenge in developing inclusive and robust speech technologies. Despite its linguistic significance and large speaker population, the Wu dialect of Chinese has long been hindered by the lack of large-scale speech data, standardized evaluation benchmarks, and publicly available models. In this work, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source speech corpus for the Wu dialect, comprising approximately 8,000 hours of diverse speech data. Building upon this dataset, we introduce WenetSpeech-Wu-Bench, the first standardized and publicly accessible benchmark for systematic evaluation of Wu dialect speech processing, covering automatic speech recognition (ASR), Wu-to-Mandarin translation, speaker attribute prediction, speech emotion recognition, text-to-speech (TTS) synthesis, and instruction-following TTS (instruct TTS). Furthermore, we release a suite of strong open-source models trained on WenetSpeech-Wu, establishing competitive performance across multiple tasks and empirically validating the effectiveness of the proposed dataset. Together, these contributions lay the foundation for a comprehensive Wu dialect speech processing ecosystem, and we open-source proposed datasets, benchmarks, and models to support future research on dialectal speech intelligence.
ACL

LLM-ForcedAligner: A Non-Autoregressive and Accurate LLM-Based Forced Aligner for Multilingual and Long-Form Speech

Bingshen Mu, Xian Shi, Xiong Wang, Hexin Liu, Jin Xu, and Lei Xie

In ACL, 2026

Abstract arXiv

Forced alignment (FA) predicts start and end timestamps for words or characters in speech, but existing methods are language-specific and prone to cumulative temporal shifts. The multilingual speech understanding and long-sequence processing abilities of speech large language models (SLLMs) make them promising for FA in multilingual, crosslingual, and long-form speech settings. However, directly applying the next-token prediction paradigm of SLLMs to FA results in hallucinations and slow inference. To bridge the gap, we propose LLM-ForcedAligner, reformulating FA as a slot-filling paradigm: timestamps are treated as discrete indices, and special timestamp tokens are inserted as slots into the transcript. Conditioned on the speech embeddings and the transcript with slots, the SLLM directly predicts the time indices at slots. During training, causal attention masking with non-shifted input and label sequences allows each slot to predict its own timestamp index based on itself and preceding context, with loss computed only at slot positions. Dynamic slot insertion enables FA at arbitrary positions. Moreover, non-autoregressive inference is supported, avoiding hallucinations and improving speed. Experiments across multilingual, crosslingual, and long-form speech scenarios show that LLM-ForcedAligner achieves a 69% 78% relative reduction in accumulated averaging shift compared with prior methods. Checkpoint and inference code are available at https://github.com/QwenLM/Qwen3-ASR.
ICASSP

Summary on The Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods

Bingshen Mu, Pengcheng Guo, Zhaokai Sun, Shuai Wang, Hexin Liu, Mingchen Shao, and 5 more authors

In ICASSP, 2026

Abstract arXiv Code

This paper summarizes the Interspeech2025 Multilingual Conversational Speech Language Model (MLC-SLM) challenge, which aims to advance the exploration of building effective multilingual conversational speech LLMs (SLLMs). We provide a detailed description of the task settings for the MLC-SLM challenge, the released real-world multilingual conversational speech dataset totaling approximately 1,604 hours, and the baseline systems for participants. The MLC-SLM challenge attracts 78 teams from 13 countries to participate, with 489 valid leaderboard results and 14 technical reports for the two tasks. We distill valuable insights on building multilingual conversational SLLMs based on submissions from participants, aiming to contribute to the advancement of the community.
ICASSP

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing

Yuhang Dai, Ziyu Zhang, Shuai Wang, Longhao Li, Zhao Guo, Tianlun Zuo, and 10 more authors

In ICASSP, 2026

Abstract arXiv Code Demo HIGHLIGHT

The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. To address this critical gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel Chuan-Pipeline, a complete data processing framework for dialectal speech. To facilitate rigorous evaluation and demonstrate the corpus’s effectiveness, we also release high-quality ASR and TTS benchmarks, WenetSpeech-Chuan-Eval, with manually verified transcriptions. Experiments show that models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and demonstrate results comparable to commercial services. As the largest open-source corpus for Sichuanese dialects, WenetSpeech-Chuan not only lowers the barrier to research in dialectal speech processing but also plays a crucial role in promoting AI equity and mitigating bias in speech technologies. The corpus, benchmarks, models, and receipts are publicly available on our project page.
ICASSP

Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

Mingchen Shao, Bingshen Mu, Chengyou Wang, Hai Li, Ying Yan, Zhonghua Fu, and 1 more author

In ICASSP, 2026

Abstract arXiv

Speech large language models (SLLMs) built on speech encoders, adapters, and LLMs demonstrate remarkable multitask understanding performance in high-resource languages such as English and Chinese. However, their effectiveness substantially degrades in low-resource languages such as Thai. This limitation arises from three factors: (1) existing commonly used speech encoders, like the Whisper family, underperform in low-resource languages and lack support for broader spoken language understanding tasks; (2) the ASR-based alignment paradigm requires training the entire SLLM, leading to high computational cost; (3) paired speech-text data in low-resource languages is scarce. To overcome these challenges in the low-resource language Thai, we introduce XLSR-Thai, the first self-supervised learning (SSL) speech encoder for Thai. It is obtained by continuously training the standard SSL XLSR model on 36,000 hours of Thai speech data. Furthermore, we propose U-Align, a speech-text alignment method that is more resource-efficient and multitask-effective than typical ASR-based alignment. Finally, we present Thai-SUP, a pipeline for generating Thai spoken language understanding data from high-resource languages, yielding the first Thai spoken language understanding dataset of over 1,000 hours. Multiple experiments demonstrate the effectiveness of our methods in building a Thai multitask-understanding SLLM. We open-source XLSR-Thai and Thai-SUP to facilitate future research.
ICASSP

MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

Guobin Ma, Jixun Yao, Ziqian Ning, Yuepeng Jiang, Lingxin Xiong, Lei Xie, and 1 more author

In ICASSP, 2026

Abstract arXiv Code Demo HIGHLIGHT

Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.

全部论文 →

学术兼职

Senior Area Editor, IEEE/ACM Transactions on Audio, Speech, and Language Processing
Senior Area Editor, IEEE Signal Processing Letters
Member, IEEE Speech and Language Processing Technical Committee (SLTC)
Vice Chairperson (2022–2024), ISCA Special Interest Group on Chinese Spoken Language Processing (SIG-CSLP)
Board Member (2020–2023), APSIPA Speech and Language Processing (SLP) Technical Committee

获奖

季军, Single Track, Interspeech 2026 Audio Reasoning Challenge
冠军, In-Domain Singing Style Conversion Track, ASRU 2025 The Singing Voice Conversion Challenge
冠军, Zero-Shot Singing Style Conversion Track, ASRU 2025 The Singing Voice Conversion Challenge
冠军, 通用音频分离赛道, NCMMSC 2025 CCF 先进音频技术竞赛
亚军, Target Speaker Lipreading Track, ICME 2024 Chat-scenario Chinese Lipreading (ChatCLR) Challenge
冠军, Source Speaker Verification Against Voice Conversion Track, SLT 2024 Source Speaker Tracing Challenge（SSTC）
冠军, ICASSP 2024 Packet Loss Concealment (PLC) Challenge
亚军, Real-time Track, ICASSP 2024 Speech Signal Improvement Challenge
季军, Non-real-time Track, ICASSP 2024 Speech Signal Improvement Challenge
亚军, ICASSP 2024 Multimodal Information based Speech Processing (MISP) Challenge
冠军, 多模态远距离拾音赛道，2024中国声学学会声华杯声学技术大赛
冠军, 单说话人视觉语音识别赛道, NCMMSC 2024 中文连续视觉语音识别挑战赛 (CNVSRC)
冠军, 多说话人视觉语音识别赛道, NCMMSC 2024 中文连续视觉语音识别挑战赛 (CNVSRC)
冠军, SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge(LRDWWS Challenge)
冠军, Speech-to-Speech Translation (Offline) Track, ACL 2023 Speech-to-Speech Translation (S2ST)
冠军, Any-to-one, In-domain Singing Voice Conversion Track, ASRU 2023 The Singing Voice Conversion Challenge
亚军, Any-to-one, Cross-domain Singing Voice Conversion Track, ASRU 2023 The Singing Voice Conversion Challenge
亚军, Audio-Visual Target Speaker Extraction (AVTSE) Track, ICASSP 2023 Multi-modal Information based Speech Processing (MISP) Challenge
冠军, UDASE (Unsupervised Domain Adaptation for Speech Enhancement) Track, Interspeech 2023 CHiME Speech Separation and Recognition Challenge (CHiME-7)
冠军, Non-personalized AEC Track, ICASSP 2023 Acoustic Echo Cancellation Challenge (AEC Challenge)
亚军, Personalized AEC Track, ICASSP 2023 Acoustic Echo Cancellation Challenge (AEC Challenge)
亚军, Audio-Visual Diarization & Recognition Track, ICASSP 2023 Multimodal Information based Speech Processing (MISP) - Challenge
季军, Audio-Visual Speaker Diarization Track, ICASSP 2023 Multimodal Information based Speech Processing (MISP) Challenge
冠军, Headset Speech Enhancement Track, ICASSP 2023 Deep Noise Suppression Challenge
冠军, Speakerphone Speech Enhancement Track, ICASSP 2023 Deep Noise Suppression Challenge
冠军, 语音增强赛道, 2023 声华杯声学技术大赛
冠军, ASRU 2023 MultiLingual Speech processing Universal PERformance Benchmark (SUPERB)
冠军, 单说话人视觉语音识别赛道, NCMMSC 2023 中文连续视觉语音识别挑战赛 (CNVSRC)
冠军, 多说话人视觉语音识别赛道, NCMMSC 2023 中文连续视觉语音识别挑战赛 (CNVSRC)
2023年度华为云AI名师奖
冠军, Speaker Anonymization Track, Interspeech 2022 VoicePrivacy 2022 Challenge (VPC 2022)
亚军, Fully-supervised Track, Interspeech 2022 Far-field Speaker Verification Challenge (FFSVC)
亚军, Semi-supervised Track, Interspeech 2022 Far-field Speaker Verification Challenge (FFSVC)
亚军, ISCSLP 2022 Magichub Code-Switching ASR Challenge
季军, ISCSLP 2022 Conversational Short-phrase Speaker Diarization Challenge
冠军, Constrained Track, O-COCOSDA 2022 Indic Multilingual Speaker Verification Challenge (I-MSV)
季军, Unconstrained Track, O-COCOSDA 2022 Indic Multilingual Speaker Verification Challenge (I-MSV)
季军, NCMMSC 2022 面向蒙古语的低资源语音合成竞赛
2022年度华为云优秀合作伙伴奖
2022华为云优秀创新合作团队
亚军, Training with VoxCeleb 1/2 Only Track, VoxSRC 2021 Workshop 2021 VoxCeleb Speaker Recognition Challenge (VoxSRC)
亚军, Additional Public Data Allowed (e.g., MUSAN, RIR) Track, VoxSRC 2021 Workshop 2021 VoxCeleb Speaker Recognition - Challenge (VoxSRC)
季军, Real-Time Wideband Speech Enhancement Track, Interspeech 2021 Deep Noise Suppression Challenge (DNS Challenge)
季军, Real-Time AEC & Speech Enhancement Track, Interspeech 2021 Acoustic Echo Cancellation Challenge (AEC Challenge)
冠军, Close-talking Single-channel Track, ISCSLP 2021 Personalized Voice Trigger Challenge (PVTC)
冠军, Real-Time Wideband Speech Enhancement Track, Interspeech 2020 Deep Noise Suppression Challenge (DNS Challenge)
亚军, Non-Real-Time Wideband Speech Enhancement Track, Interspeech 2020 Deep Noise Suppression Challenge (DNS Challenge)
华为2020年优秀技术成果合作奖
冠军, Closed-set Word-level Audio-Visual Speech Recognition Track, ICMI 2019 Mandarin Audio-Visual Speech Recognition - Challenge
2019-2020年度美团科研合作实践奖
季军, Interspeech 2018 CHiME Speech Separation and Recognition Challenge (CHiME-5)
亚军, Unsupervised Subword Unit Modeling Track, Interspeech 2017 Zero Resource Speech Challenge
冠军, Spoken Term Discovery Track, Interspeech 2015 Zero Resource Speech Challenge
冠军, QUESST (Query-by-Example Speech Search) Track, MediaEval Multimedia Benchmark Workshop 2015 Query-by-Example Search on Speech Task (QUESST)
亚军, QUESST (Query-by-Example Speech Search) Track, MediaEval Multimedia Benchmark Workshop 2014 Query-by-Example Search on Speech Task (QUESST)