# MDL-0.6B
Related work: 🛠️ MiDashengLM · 📚 ACAVCaps Dataset · 📊 MECAT Benchmark
MiDashengLM-0.6B (MDL-0.6B) represents a strategic shift toward high-density, on-device multimodal intelligence. As a compact, efficient audio-language model, it targets holistic audio understanding. By pairing the Qwen3-0.6B backbone with the dasheng-base-tokenizer, we have built a model that is substantially smaller than MDL-7B yet delivers superior performance on several discriminative tasks.
A pivotal factor in its success is the Null-Instruction Modality Alignment strategy, which uses caption data without task-specific prompts to anchor audio-text representations before the supervised fine-tuning stage. Despite its lightweight 0.6B-parameter architecture, it delivers competitive performance across a range of audio understanding benchmarks, making it well suited to lightweight deployments and efficient inference.
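As a rough sketch of the idea (the exact training data format is not documented here, so the structure below is an assumption), a null-instruction alignment sample pairs raw audio with its caption and carries no task-specific instruction, forcing the model to ground the text purely in the audio:

```python
def make_null_instruction_sample(audio_path: str, caption: str) -> dict:
    """Hypothetical caption-only alignment pair: the user turn contains only
    audio (no task prompt), and the assistant turn is the plain caption."""
    return {
        "messages": [
            {"role": "user", "content": [{"type": "audio", "path": audio_path}]},
            {"role": "assistant", "content": [{"type": "text", "text": caption}]},
        ]
    }

sample = make_null_instruction_sample(
    "clip_0001.wav", "A dog barks twice in the distance."
)
```

During SFT, task-specific prompts are then layered on top of representations already aligned by such prompt-free pairs.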
This repository provides the model weights, inference code, and preliminary evaluation results.
## Usage
### Load Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-0.6b-fp32"  # Replace with your actual model id
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```
### Construct Prompt
```python
user_prompt = "Write the detailed caption about this audio within 1-2 sentences."  # You may try any other prompt

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
            },
        ],
    },
]
```
### Generate Output
```python
import torch

sample_kwargs = dict(
    do_sample=True,
    top_p=0.8,
    top_k=50,
    temperature=1.0,
    repetition_penalty=1.05,
)

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs, **sample_kwargs)

output = tokenizer.batch_decode(generation, skip_special_tokens=True)
print(output)
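Note that decoding the full `generation` tensor also decodes the echoed prompt tokens. A common post-processing step (an assumption here, not part of the model card) is to slice off the prompt before decoding, e.g. `generation[:, model_inputs["input_ids"].shape[1]:]`. The idea, shown on plain token-id lists:

```python
def trim_prompt(generated_ids, prompt_len):
    """Keep only the tokens produced after the prompt for each sequence."""
    return [seq[prompt_len:] for seq in generated_ids]

# Hypothetical token ids: a 3-token prompt followed by 2 generated tokens.
batch = [[101, 7592, 102, 2023, 2003]]
print(trim_prompt(batch, 3))  # → [[2023, 2003]]
```

Passing the trimmed ids to `tokenizer.batch_decode` yields only the model's answer.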
## Results
The following tables present the preliminary evaluation results of MDL-0.6B (Checkpoint: 75W). We compare our compact model against the baseline MiDashengLM-7B-1021, as well as two multimodal large language models: Qwen2.5-Omni-7B and Kimi-Audio-Instruct.
### Audio Captioning Results
We first evaluate on MECAT-Caption, which organizes captions into three strands. Systemic Captions comprise a concise short caption centered on the primary audio content and a long caption that adds contextual detail and describes how events interact. Content-Specific Captions use three branches (speech, music, and sound events), each evaluated independently; the table reports pure and mixed variants for each. The Content-Unrelated Caption strand focuses on acoustic properties (e.g., recording quality and reverberation) rather than semantic scene content. All MECAT-Caption results are reported with DATE↑. Beyond MECAT-Caption, we report standard music captioning on MusicCaps and SongDescriber (FENSE↑) and general audio captioning on AudioCaps (dev), Clotho (test), and AutoACD (test) (FENSE↑).
If you are interested in the datasets and evaluation used for audio caption training and benchmarking, see ACAVCaps and MECAT.
| Dataset | Domain | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-Caption | Long | DATE↑ | 49.50 | 61.10 | 72.50 | 75.60 |
| MECAT-Caption | Short | DATE↑ | 54.20 | 56.50 | 72.30 | 74.70 |
| MECAT-Caption | Pure Speech | DATE↑ | 30.00 | 39.90 | 64.40 | 64.00 |
| MECAT-Caption | Mixed Speech | DATE↑ | 31.30 | 40.90 | 59.90 | 64.30 |
| MECAT-Caption | Pure Music | DATE↑ | 27.70 | 32.10 | 58.30 | 57.60 |
| MECAT-Caption | Mixed Music | DATE↑ | 16.90 | 30.90 | 36.10 | 58.20 |
| MECAT-Caption | Pure Sound | DATE↑ | 43.10 | 50.70 | 57.50 | 58.40 |
| MECAT-Caption | Mixed Sound | DATE↑ | 16.20 | 23.80 | 23.00 | 42.40 |
| MECAT-Caption | Environment | DATE↑ | 7.00 | 17.90 | 26.90 | 31.20 |
| MusicCaps | Music | FENSE↑ | 35.43 | 43.71 | 59.11 | 60.70 |
| SongDescriber | Music | FENSE↑ | 44.63 | 45.31 | 46.62 | 51.90 |
| AudioCaps-Dev | Sound | FENSE↑ | 49.00 | 60.79 | 62.13 | 59.70 |
| Clotho-Test | Sound | FENSE↑ | 48.01 | 47.55 | 49.35 | 44.30 |
| AutoACD-Test | Sound | FENSE↑ | 44.76 | 55.93 | 67.13 | 59.20 |
Metrics: Higher is better.
### Audio and Paralinguistic Classification
| Dataset | Task | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| VoxCeleb1 | Speaker ID | ACC↑ | 82.72 | 59.71 | 92.66 | 90.84 |
| VoxLingua107 | Language ID | ACC↑ | 73.65 | 51.03 | 93.72 | 86.39 |
| VoxCeleb-Gender | Gender ID | ACC↑ | 99.69 | 99.82 | 97.72 | 96.80 |
| VGGSound | Sound Event | MAP↑ | 2.20 | 0.97 | 52.19 | 28.05 |
| CochlScene | Sound Scene | ACC↑ | 18.34 | 23.88 | 75.81 | 75.78 |
| NSynth-Instrument | Music Instrument | ACC↑ | 38.09 | 60.45 | 80.32 | 64.33 |
| FreeMusicArchive | Music Genre | ACC↑ | 27.91 | 66.77 | 62.94 | 17.50 |
| FSDKaggle2018 | Sound Event | MAP↑ | 24.75 | 31.38 | 73.38 | 80.84 |
| AudioSet | Sound Event | MAP↑ | 3.47 | 6.48 | 9.90 | 6.82 |
| FSD50K | Sound Event | MAP↑ | 27.23 | 23.87 | 38.10 | 34.15 |
Metrics: Higher is better.
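The mAP figures above are the mean over classes of per-class average precision, where AP averages the precision at the rank of each true positive in the score-sorted list. A minimal pure-Python sketch of that computation (the evaluation pipelines behind the table are not specified here):

```python
def average_precision(scores, labels):
    """AP for one class: mean precision at each true positive, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(score_matrix, label_matrix):
    """mAP: average the per-class AP over classes (columns)."""
    n_classes = len(score_matrix[0])
    aps = [
        average_precision([row[c] for row in score_matrix],
                          [row[c] for row in label_matrix])
        for c in range(n_classes)
    ]
    return sum(aps) / n_classes

# Both positives ranked above the negative → perfect AP of 1.0.
print(average_precision([0.9, 0.1, 0.8], [1, 0, 1]))  # → 1.0
```

Accuracy (ACC) for single-label tasks is simply the fraction of correctly predicted labels.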
### ASR Performance
| Dataset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|
| LibriSpeech-Clean | WER↓ | 1.30 | 1.70 | 3.60 | 4.41 |
| LibriSpeech-Other | WER↓ | 2.40 | 3.40 | 5.90 | 10.66 |
| People's Speech | CER↓ | 22.30 | 28.60 | 26.12 | 29.15 |
| AISHELL-2-Mic | CER↓ | 2.70 | 2.50 | 3.20 | 6.30 |
| AISHELL-2-iOS | CER↓ | 2.60 | 2.60 | 2.90 | 5.67 |
| AISHELL-2-Android | CER↓ | 2.60 | 2.70 | 3.10 | 7.37 |
| GigaSpeech2-Indonesian | WER↓ | >100 | 21.20 | 22.30 | 26.00 |
| GigaSpeech2-Thai | WER↓ | >100 | 53.80 | 38.40 | 23.20 |
| GigaSpeech2-Viet | WER↓ | >100 | 18.60 | 17.70 | 68.88 |
Metrics: WER/CER (lower is better).
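WER is the word-level edit distance between reference and hypothesis divided by the reference length; CER is the same computation over characters. A self-contained sketch (benchmark pipelines typically also apply text normalization, which is omitted here):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(wer("the cat sat", "the cat sat"))  # → 0.0
```

Because substitutions, insertions, and deletions all count as errors, WER can exceed 1.0 (reported above as >100%).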
### Question Answering Results
We first evaluate on MECAT-QA, which targets reasoning and assessment over audio, with subsets covering direct perception, sound characteristics, quality assessment, environment reasoning, inference/judgment, and application-oriented content, all reported with DATE. Notably, the compact MDL-0.6B improves markedly on these MECAT-QA subsets over the 7B MiDashengLM-7B-1021, reflecting strong zero-shot reasoning, logical inference, and fine-grained analysis despite its smaller size.
MMAU-Pro follows, using the Answer task and ACC↑ across capability splits (IF, Multi-Audio, Music, Open-Ended, Sound, cross-modal combinations, Spatial, Speech, Voice, and the overall Average).
Finally, we include additional QA benchmarks: AudioCaps-QA and MusicQA (FENSE↑), and MuChoMusic (ACC↑).
| Dataset | Subset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-QA | Direct Perception | DATE | 45.60 | 57.80 | 64.20 | 71.70 |
| MECAT-QA | Sound Characteristics | DATE | 39.20 | 52.90 | 31.20 | 69.30 |
| MECAT-QA | Quality Assessment | DATE | 18.70 | 39.10 | 20.20 | 57.50 |
| MECAT-QA | Environment Reasoning | DATE | 34.60 | 44.00 | 20.10 | 66.30 |
| MECAT-QA | Inference / Judgment | DATE | 48.90 | 53.20 | 35.30 | 66.40 |
| MECAT-QA | Application Content | DATE | 41.20 | 50.80 | 33.60 | 65.80 |
| MMAU-Pro | IF | ACC↑ | 42.30 | 61.30 | 37.93 | 56.32 |
| MMAU-Pro | Multi-Audio | ACC↑ | 17.20 | 24.30 | 42.33 | 35.12 |
| MMAU-Pro | Music | ACC↑ | 57.60 | 61.50 | 62.20 | 38.15 |
| MMAU-Pro | Open-Ended | ACC↑ | 34.50 | 52.30 | 63.21 | 50.21 |
| MMAU-Pro | Sound | ACC↑ | 46.00 | 47.60 | 58.36 | 33.24 |
| MMAU-Pro | Sound–Music | ACC↑ | 46.00 | 40.00 | 42.00 | 38.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 42.80 | 28.50 | 71.43 | 42.86 |
| MMAU-Pro | Spatial | ACC↑ | 43.70 | 41.20 | 18.77 | 53.23 |
| MMAU-Pro | Speech | ACC↑ | 52.20 | 57.40 | 61.17 | 33.56 |
| MMAU-Pro | Speech–Music | ACC↑ | 54.30 | 53.20 | 58.70 | 23.91 |
| MMAU-Pro | Speech–Sound | ACC↑ | 48.90 | 60.20 | 51.14 | 30.68 |
| MMAU-Pro | Voice | ACC↑ | 50.60 | 60.00 | 54.83 | 28.97 |
| MMAU-Pro | Average | ACC↑ | 46.60 | 52.20 | 55.92 | 38.69 |
| AudioCaps-QA | — | FENSE↑ | 47.34 | 53.28 | 54.20 | 41.70 |
| MusicQA | — | FENSE↑ | 40.00 | 60.60 | 61.56 | 36.10 |
| MuChoMusic | — | ACC↑ | 67.40 | 64.79 | 73.04 | 35.80 |
Metrics: Higher is better.
## Citation
If you find MDL-0.6B useful in your research or business applications, please cite the underlying work it builds on: MiDashengLM (efficient audio understanding with general audio captions) and DashengTokenizer (unified continuous audio tokenization for understanding and generation).
```bibtex
@techreport{midashenglm7b,
  title       = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author      = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year        = {2025},
  note        = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  eprint      = {2508.03983},
  url         = {https://arxiv.org/abs/2508.03983},
}

@misc{dinkel2026dashengtokenizer,
  title         = {DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author        = {Heinrich Dinkel and Xingwei Sun and Gang Li and Jiahao Mei and Yadong Niu and Jizhong Liu and Xiyang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan},
  year          = {2026},
  eprint        = {2602.23765},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  url           = {https://arxiv.org/abs/2602.23765},
}
```
If you are interested in caption datasets and evaluation, see ACAVCaps and MECAT.
```bibtex
@misc{niu2026acavcaps,
  title         = {ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Junbo Zhang and Jian Luan},
  year          = {2026},
  eprint        = {2603.24038},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2603.24038},
}

@misc{niu2025mecat,
  title         = {MECAT: a multi-experts constructed benchmark for fine-grained audio understanding tasks},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Xiyang Liu and Junbo Zhang and Jian Luan},
  year          = {2025},
  eprint        = {2507.23511},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2507.23511},
}
```