MDL-0.6B

Related work: 🛠️ MiDashengLM · 📚 ACAVCaps Dataset · 📊 MECAT Benchmark

MiDashengLM-0.6B (MDL-0.6B) represents a strategic shift toward high-density, on-device multimodal intelligence. It is a compact, efficient audio-language model built for holistic audio understanding. By pairing the Qwen3-0.6B backbone with the dasheng-base-tokenizer, the model remains substantially smaller than MDL-7B yet delivers superior performance on several discriminative tasks.

A pivotal factor in its success is the Null-Instruction Modality Alignment strategy, which uses caption data without task-specific prompts to anchor audio-text representations before supervised fine-tuning. Despite its lightweight architecture (0.6B parameters), the model delivers competitive performance across a range of audio understanding benchmarks, making it well suited to lightweight deployment and efficient inference.

This repository provides the model weights, inference code, and preliminary evaluation results.

Usage

Load Model

from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-0.6b-fp32"  # replace with your actual model id
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Construct Prompt

user_prompt = "Write a detailed caption for this audio in 1-2 sentences."  # feel free to try other prompts

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
            },
        ],
    },
]
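If no audio file is at hand, a short test clip can be synthesized with the Python standard library. The `example.wav` filename and the 440 Hz tone below are placeholders for illustration, not part of the model's tooling:

```python
import math
import struct
import wave

# A stand-in for /path/to/example.wav: 1 second of a 440 Hz sine tone,
# written as 16-bit mono PCM at 16 kHz.
sample_rate = 16000
n_samples = sample_rate  # 1 second of audio
frames = b"".join(
    struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / sample_rate)))
    for i in range(n_samples)
)

with wave.open("example.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit samples
    f.setframerate(sample_rate)
    f.writeframes(frames)
```

Point the `"path"` field of the audio message at the generated file.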

Generate Output

import torch

sample_kwargs = dict(
    do_sample=True,
    top_p=0.8,
    top_k=50,
    temperature=1.0,
    repetition_penalty=1.05,
)

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs, **sample_kwargs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)
    print(output)
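The `top_p=0.8` setting above restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.8, renormalized before drawing. A minimal sketch of that filtering step on a toy distribution (illustrative only; `generate` applies this to the model's logits internally):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of outcomes whose cumulative probability
    reaches p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Toy vocabulary: with p=0.8 only "a" and "b" survive and are renormalized.
print(top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, 0.8))
```

Lower `top_p` makes generation more conservative; `top_k` caps the candidate set by count in the same spirit.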

Results

The following tables present the preliminary evaluation results of MDL-0.6B (Checkpoint: 75W). We compare our compact model against the baseline MiDashengLM-7B-1021, as well as two multimodal large language models: Qwen2.5-Omni-7B and Kimi-Audio-Instruct.

Audio Captioning Results

We first evaluate on MECAT-Caption, which organizes captions into three strands. Systemic Captions comprise a concise short caption centered on the primary audio content and a long caption that adds contextual detail and describes how events interact. Content-Specific Captions use three branches—speech, music, and sound events—evaluated independently; the table reports pure vs. mixed variants for each. The Content-Unrelated Caption strand focuses on acoustic properties (e.g., recording quality and reverberation) rather than semantic scene content. Metrics are reported with DATE↑. Beyond MECAT-Caption, we report standard music captioning on MusiCaps and SongDescriber (FENSE↑) and general environmental / audio captioning on AudioCaps (dev), Clotho (test), and AutoACD (test) (FENSE↑).

If you are interested in the datasets and evaluation used for audio caption training and benchmarking, see ACAVCaps and MECAT.

| Dataset | Domain | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-Caption | Long | DATE↑ | 49.50 | 61.10 | 72.50 | 75.60 |
| MECAT-Caption | Short | DATE↑ | 54.20 | 56.50 | 72.30 | 74.70 |
| MECAT-Caption | Pure Speech | DATE↑ | 30.00 | 39.90 | 64.40 | 64.00 |
| MECAT-Caption | Mixed Speech | DATE↑ | 31.30 | 40.90 | 59.90 | 64.30 |
| MECAT-Caption | Pure Music | DATE↑ | 27.70 | 32.10 | 58.30 | 57.60 |
| MECAT-Caption | Mixed Music | DATE↑ | 16.90 | 30.90 | 36.10 | 58.20 |
| MECAT-Caption | Pure Sound | DATE↑ | 43.10 | 50.70 | 57.50 | 58.40 |
| MECAT-Caption | Mixed Sound | DATE↑ | 16.20 | 23.80 | 23.00 | 42.40 |
| MECAT-Caption | Environment | DATE↑ | 7.00 | 17.90 | 26.90 | 31.20 |
| MusiCaps | Music | FENSE↑ | 35.43 | 43.71 | 59.11 | 60.70 |
| SongDescriber | Music | FENSE↑ | 44.63 | 45.31 | 46.62 | 51.90 |
| AudioCaps-Dev | Sound | FENSE↑ | 49.00 | 60.79 | 62.13 | 59.70 |
| Clotho-Test | Sound | FENSE↑ | 48.01 | 47.55 | 49.35 | 44.30 |
| AutoACD-Test | Sound | FENSE↑ | 44.76 | 55.93 | 67.13 | 59.20 |

Metrics: higher is better.

Audio and Paralinguistic Classification

| Dataset | Task | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| VoxCeleb1 | Speaker ID | ACC↑ | 82.72 | 59.71 | 92.66 | 90.84 |
| VoxLingua107 | Language ID | ACC↑ | 73.65 | 51.03 | 93.72 | 86.39 |
| VoxCeleb-Gender | Gender ID | ACC↑ | 99.69 | 99.82 | 97.72 | 96.80 |
| VGGSound | Sound Event | MAP↑ | 2.20 | 0.97 | 52.19 | 28.05 |
| CochlScene | Sound Scene | ACC↑ | 18.34 | 23.88 | 75.81 | 75.78 |
| NSynth-Instrument | Music Instrument | ACC↑ | 38.09 | 60.45 | 80.32 | 64.33 |
| FreeMusicArchive | Music Genre | ACC↑ | 27.91 | 66.77 | 62.94 | 17.50 |
| FSDKaggle2018 | Sound Event | MAP↑ | 24.75 | 31.38 | 73.38 | 80.84 |
| AudioSet | Sound Event | MAP↑ | 3.47 | 6.48 | 9.90 | 6.82 |
| FSD50K | Sound Event | MAP↑ | 27.23 | 23.87 | 38.10 | 34.15 |

Metrics: Higher is better.
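MAP here is mean average precision over multi-label sound-event predictions: average precision is computed per label, then averaged across labels. A minimal sketch of average precision for a single label (toy scores, not the evaluation harness used for the tables):

```python
def average_precision(scores, labels):
    """AP for one label: rank items by score (descending) and average
    the precision at each rank where a positive item appears."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

# A perfect ranking puts both positives first, so AP = 1.0.
print(average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # → 1.0
```

Mean AP is simply the mean of these per-label values over the label set.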

ASR Performance

| Dataset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|
| LibriSpeech-Clean | WER↓ | 1.30 | 1.70 | 3.60 | 4.41 |
| LibriSpeech-Other | WER↓ | 2.40 | 3.40 | 5.90 | 10.66 |
| People's Speech | CER↓ | 22.30 | 28.60 | 26.12 | 29.15 |
| AISHELL-2-Mic | CER↓ | 2.70 | 2.50 | 3.20 | 6.30 |
| AISHELL-2-iOS | CER↓ | 2.60 | 2.60 | 2.90 | 5.67 |
| AISHELL-2-Android | CER↓ | 2.60 | 2.70 | 3.10 | 7.37 |
| GigaSpeech2-Indonesian | WER↓ | >100 | 21.20 | 22.30 | 26.00 |
| GigaSpeech2-Thai | WER↓ | >100 | 53.80 | 38.40 | 23.20 |
| GigaSpeech2-Viet | WER↓ | >100 | 18.60 | 17.70 | 68.88 |

Metrics: WER/CER (lower is better).
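WER (word error rate) and CER (character error rate) are the Levenshtein edit distance between hypothesis and reference, divided by the reference length, computed over words and characters respectively. A minimal sketch with a toy sentence pair (not the scoring harness used for the tables above):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# One substitution ("a" for "the") over six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

CER is the same computation over character sequences instead of word lists; rates above 100% occur when the hypothesis contains more errors than the reference has units, as in the GigaSpeech2 rows.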

Question Answering Results

We first evaluate on MECAT-QA, which targets reasoning and assessment over audio (subsets such as direct perception, sound characteristics, quality assessment, environment reasoning, inference / judgment, and application-oriented content), reported with DATE↑. Notably, the compact MDL-0.6B improves markedly on these MECAT-QA subsets over the 7B MiDashengLM-7B-1021, reflecting strong zero-shot reasoning, logical inference, and fine-grained analysis despite its smaller size.

MMAU-Pro follows, using the Answer task and ACC↑ across capability splits (IF, Multi-Audio, Music, Open-Ended, Sound, cross-modal combinations, Spatial, Speech, Voice, and the overall Average).

Finally, we include additional QA benchmarks: AudioCaps-QA and MusicQA (FENSE↑), and MuChoMusic (ACC↑).

| Dataset | Subset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-QA | Direct Perception | DATE↑ | 45.60 | 57.80 | 64.20 | 71.70 |
| MECAT-QA | Sound Characteristics | DATE↑ | 39.20 | 52.90 | 31.20 | 69.30 |
| MECAT-QA | Quality Assessment | DATE↑ | 18.70 | 39.10 | 20.20 | 57.50 |
| MECAT-QA | Environment Reasoning | DATE↑ | 34.60 | 44.00 | 20.10 | 66.30 |
| MECAT-QA | Inference / Judgment | DATE↑ | 48.90 | 53.20 | 35.30 | 66.40 |
| MECAT-QA | Application Content | DATE↑ | 41.20 | 50.80 | 33.60 | 65.80 |
| MMAU-Pro | IF | ACC↑ | 42.30 | 61.30 | 37.93 | 56.32 |
| MMAU-Pro | Multi-Audio | ACC↑ | 17.20 | 24.30 | 42.33 | 35.12 |
| MMAU-Pro | Music | ACC↑ | 57.60 | 61.50 | 62.20 | 38.15 |
| MMAU-Pro | Open-Ended | ACC↑ | 34.50 | 52.30 | 63.21 | 50.21 |
| MMAU-Pro | Sound | ACC↑ | 46.00 | 47.60 | 58.36 | 33.24 |
| MMAU-Pro | Sound–Music | ACC↑ | 46.00 | 40.00 | 42.00 | 38.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 42.80 | 28.50 | 71.43 | 42.86 |
| MMAU-Pro | Spatial | ACC↑ | 43.70 | 41.20 | 18.77 | 53.23 |
| MMAU-Pro | Speech | ACC↑ | 52.20 | 57.40 | 61.17 | 33.56 |
| MMAU-Pro | Speech–Music | ACC↑ | 54.30 | 53.20 | 58.70 | 23.91 |
| MMAU-Pro | Speech–Sound | ACC↑ | 48.90 | 60.20 | 51.14 | 30.68 |
| MMAU-Pro | Voice | ACC↑ | 50.60 | 60.00 | 54.83 | 28.97 |
| MMAU-Pro | Average | ACC↑ | 46.60 | 52.20 | 55.92 | 38.69 |
| AudioCaps-QA | | FENSE↑ | 47.34 | 53.28 | 54.20 | 41.70 |
| MusicQA | | FENSE↑ | 40.00 | 60.60 | 61.56 | 36.10 |
| MuChoMusic | | ACC↑ | 67.40 | 64.79 | 73.04 | 35.80 |

Metrics: higher is better. An empty Subset cell means the benchmark has no subset split.

Citation

If you find MDL-0.6B useful in your research or business applications, please cite the underlying work it builds on: MiDashengLM (efficient audio understanding with general audio captions) and DashengTokenizer (unified continuous audio tokenization for understanding and generation).

@techreport{midashenglm7b,
  title      = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author     = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year       = {2025},
  note       = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url        = {https://arxiv.org/abs/2508.03983},
  eprint     = {2508.03983},
}

@misc{dinkel2026dashengtokenizer,
  title        = {DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author       = {Heinrich Dinkel and Xingwei Sun and Gang Li and Jiahao Mei and Yadong Niu and Jizhong Liu and Xiyang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan},
  year         = {2026},
  eprint       = {2602.23765},
  archivePrefix= {arXiv},
  primaryClass = {cs.SD},
  url          = {https://arxiv.org/abs/2602.23765},
}

If you are interested in caption datasets and evaluation, see ACAVCaps and MECAT.

@misc{niu2026acavcaps,
  title         = {ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Junbo Zhang and Jian Luan},
  year          = {2026},
  eprint        = {2603.24038},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2603.24038},
}

@misc{niu2025mecat,
  title         = {MECAT: a multi-experts constructed benchmark for fine-grained audio understanding tasks},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Xiyang Liu and Junbo Zhang and Jian Luan},
  year          = {2025},
  eprint        = {2507.23511},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2507.23511},
}