---
base_model:
- Qwen/Qwen2.5-7B
- openai/whisper-large-v3
- utter-project/mHuBERT-147
datasets:
- Nexdata/INTERSPEECH_2025_MLC-SLM_Challenge_Dataset
- bsmu/MLC-SLM-Eval
language:
- en
- fr
- it
- ja
- ko
- vi
- th
- pt
- ru
- es
- de
license: apache-2.0
metrics:
- cer
- wer
pipeline_tag: automatic-speech-recognition
tags:
- speech-llm
- conversational-asr
---
|
# MLC-SLM: Bridging the Gap in Multilingual Conversational ASR |
|
This repository contains the models and code presented in the paper [Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR](https://huggingface.co/papers/2601.01461). |
|
The project was developed for the INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM). |
|
- **Paper:** [arXiv:2601.01461](https://huggingface.co/papers/2601.01461) |
- **Code:** [GitHub - MLC-SLM](https://github.com/1535176727/MLC-SLM) |
|
## Description |
|
The proposed **Speech-LLM** framework integrates fine-tuned Whisper and mHuBERT speech encoders with a large language model (Qwen2.5-7B) to enrich speech representations for multilingual conversational ASR. A cross-attention-based fusion mechanism exploits the complementary information carried by the generative (Whisper) and discriminative (mHuBERT) speech features.
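
The exact fusion code lives in the linked repository; the sketch below is only an illustrative single-head cross-attention in NumPy, where Whisper frames act as queries over mHuBERT frames as keys/values. All dimensions, the residual connection, and the random projection weights are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(q_feats, kv_feats, wq, wk, wv):
    """Single-head cross-attention: q_feats attends over kv_feats.

    q_feats:  (T_q, D_q)   e.g. Whisper encoder output
    kv_feats: (T_kv, D_kv) e.g. mHuBERT encoder output
    wq/wk/wv: projections into a shared dimension d (illustrative)
    """
    q = q_feats @ wq                            # (T_q, d)
    k = kv_feats @ wk                           # (T_kv, d)
    v = kv_feats @ wv                           # (T_kv, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (T_q, T_kv)
    attn = softmax(scores, axis=-1)             # each query row sums to 1
    return q + attn @ v                         # residual: queries enriched by kv stream

# Toy shapes: 100 Whisper frames (dim 1280), 200 mHuBERT frames (dim 768).
rng = np.random.default_rng(0)
whisper = rng.standard_normal((100, 1280))
mhubert = rng.standard_normal((200, 768))
d = 256
wq = rng.standard_normal((1280, d)) * 0.02
wk = rng.standard_normal((768, d)) * 0.02
wv = rng.standard_normal((768, d)) * 0.02
fused = cross_attention_fuse(whisper, mhubert, wq, wk, wv)
print(fused.shape)  # (100, 256)
```

Note that the two streams may have different frame rates, so the attention matrix is rectangular; the fused output keeps the query stream's time resolution.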
|
## Results |
|
Performance (CER/WER) on the MLC-SLM Challenge datasets: |
|
| **System**                | **Dev**   | **Eval**  | **CV-Test** |
|---------------------------|-----------|-----------|-------------|
| Whisper (LoRA fine-tuned) | 11.40     | 10.71     | **11.47**   |
| Whisper (full fine-tuned) | **10.99** | **10.07** | 13.11       |
| **Proposed Speech-LLM**   | 11.74     | 10.69     | 15.26       |
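
Both metrics are normalized edit distances: WER over word tokens, CER over characters. The official challenge scorer should be used for comparable numbers; the following is just a minimal pure-Python sketch of the computation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (or match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: edit distance over words / reference length."""
    ref_toks = ref.split()
    return edit_distance(ref_toks, hyp.split()) / len(ref_toks)

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over characters / reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

CER is typically reported for languages without whitespace word boundaries (e.g. Japanese, Thai) and WER for the others; consult the paper for the exact per-language protocol.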
|
## Dataset |
|
The models were trained on the official ~1,500-hour training set of the MLC-SLM Challenge, which covers 11 languages across 15 categories (English is split into several regional accents).
|
## Citation |
|
```bibtex
@article{mlcslm2025bridging,
  title={Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2601.01461},
  year={2025}
}
```