MDL-0.6B

Related work: 🛠️ MiDashengLM · 📚 ACAVCaps Dataset · 📊 MECAT Benchmark

MiDashengLM-0.6B (MDL-0.6B) represents a strategic shift toward high-density, on-device multimodal intelligence. It is a compact, efficient audio-language model built for holistic audio understanding. By pairing the Qwen3-0.6B backbone with the dasheng-base-tokenizer, the model remains substantially smaller than MDL-7B yet delivers superior performance on several discriminative tasks.

A pivotal factor in its success is the Null-Instruction Modality Alignment strategy, which uses caption data without task-specific prompts to anchor audio-text representations before supervised fine-tuning. Despite its lightweight architecture (0.6B parameters), the model delivers competitive performance across a range of audio understanding benchmarks, making it well suited to lightweight deployment and efficient inference.

This repository provides the model weights, inference code, and preliminary evaluation results.

Usage

Load Model

from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_id = "mispeech/midashenglm-0.6b-fp32"  # replace with your actual model id
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Construct Prompt

user_prompt = "Write a detailed caption for this audio in 1-2 sentences."  # feel free to try other prompts

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "audio",
                "path": "/path/to/example.wav",
                # or "url": "https://example.com/example.wav"
            },
        ],
    },
]
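If no audio file is at hand, a short test clip can be synthesized with the Python standard library. The `example.wav` filename and the 440 Hz tone below are placeholders for illustration, not part of the model's tooling:

```python
import math
import struct
import wave

# A stand-in for /path/to/example.wav: 1 second of a 440 Hz sine tone,
# written as 16-bit mono PCM at 16 kHz.
sample_rate = 16000
n_samples = sample_rate  # 1 second of audio
frames = b"".join(
    struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / sample_rate)))
    for i in range(n_samples)
)

with wave.open("example.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit samples
    f.setframerate(sample_rate)
    f.writeframes(frames)
```

Point the `"path"` field of the audio message at the generated file.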

Generate Output

import torch

sample_kwargs = dict(
    do_sample=True,
    top_p=0.8,
    top_k=50,
    temperature=1.0,
    repetition_penalty=1.05,
)

with torch.no_grad():
    model_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        add_special_tokens=True,
        return_dict=True,
    ).to(device=model.device, dtype=model.dtype)
    generation = model.generate(**model_inputs, **sample_kwargs)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)
    print(output)
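The `top_p=0.8` setting above restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.8, renormalized before drawing. A minimal sketch of that filtering step on a toy distribution (illustrative only; `generate` applies this to the model's logits internally):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of outcomes whose cumulative probability
    reaches p, then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Toy vocabulary: with p=0.8 only "a" and "b" survive and are renormalized.
print(top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, 0.8))
```

Lower `top_p` makes generation more conservative; `top_k` caps the candidate set by count in the same spirit.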

Results

The following tables present the preliminary evaluation results of MDL-0.6B (Checkpoint: 75W). We compare our compact model against the baseline MiDashengLM-7B-1021, as well as two multimodal large language models: Qwen2.5-Omni-7B and Kimi-Audio-Instruct.

Audio Captioning Results

We first evaluate on MECAT-Caption, which organizes captions into three strands. Systemic Captions comprise a concise short caption centered on the primary audio content and a long caption that adds contextual detail and describes how events interact. Content-Specific Captions use three branches—speech, music, and sound events—evaluated independently; the table reports pure vs. mixed variants for each. The Content-Unrelated Caption strand focuses on acoustic properties (e.g., recording quality and reverberation) rather than semantic scene content. Metrics are reported with DATE↑. Beyond MECAT-Caption, we report standard music captioning on MusiCaps and SongDescriber (FENSE↑) and general environmental / audio captioning on AudioCaps (dev), Clotho (test), and AutoACD (test) (FENSE↑).

If you are interested in the datasets and evaluation used for audio caption training and benchmarking, see ACAVCaps and MECAT.

| Dataset | Domain | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-Caption | Long | DATE↑ | 49.50 | 61.10 | 72.50 | 75.60 |
| MECAT-Caption | Short | DATE↑ | 54.20 | 56.50 | 72.30 | 74.70 |
| MECAT-Caption | Pure Speech | DATE↑ | 30.00 | 39.90 | 64.40 | 64.00 |
| MECAT-Caption | Mixed Speech | DATE↑ | 31.30 | 40.90 | 59.90 | 64.30 |
| MECAT-Caption | Pure Music | DATE↑ | 27.70 | 32.10 | 58.30 | 57.60 |
| MECAT-Caption | Mixed Music | DATE↑ | 16.90 | 30.90 | 36.10 | 58.20 |
| MECAT-Caption | Pure Sound | DATE↑ | 43.10 | 50.70 | 57.50 | 58.40 |
| MECAT-Caption | Mixed Sound | DATE↑ | 16.20 | 23.80 | 23.00 | 42.40 |
| MECAT-Caption | Environment | DATE↑ | 7.00 | 17.90 | 26.90 | 31.20 |
| MusiCaps | Music | FENSE↑ | 35.43 | 43.71 | 59.11 | 60.70 |
| SongDescriber | Music | FENSE↑ | 44.63 | 45.31 | 46.62 | 51.90 |
| AudioCaps-Dev | Sound | FENSE↑ | 49.00 | 60.79 | 62.13 | 59.70 |
| Clotho-Test | Sound | FENSE↑ | 48.01 | 47.55 | 49.35 | 44.30 |
| AutoACD-Test | Sound | FENSE↑ | 44.76 | 55.93 | 67.13 | 59.20 |

Metrics: higher is better.

Audio and Paralinguistic Classification

| Dataset | Task | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| VoxCeleb1 | Speaker ID | ACC↑ | 82.72 | 59.71 | 92.66 | 90.84 |
| VoxLingua107 | Language ID | ACC↑ | 73.65 | 51.03 | 93.72 | 86.39 |
| VoxCeleb-Gender | Gender ID | ACC↑ | 99.69 | 99.82 | 97.72 | 96.80 |
| VGGSound | Sound Event | MAP↑ | 2.20 | 0.97 | 52.19 | 28.05 |
| CochlScene | Sound Scene | ACC↑ | 18.34 | 23.88 | 75.81 | 75.78 |
| NSynth-Instrument | Music Instrument | ACC↑ | 38.09 | 60.45 | 80.32 | 64.33 |
| FreeMusicArchive | Music Genre | ACC↑ | 27.91 | 66.77 | 62.94 | 17.50 |
| FSDKaggle2018 | Sound Event | MAP↑ | 24.75 | 31.38 | 73.38 | 80.84 |
| AudioSet | Sound Event | MAP↑ | 3.47 | 6.48 | 9.90 | 6.82 |
| FSD50K | Sound Event | MAP↑ | 27.23 | 23.87 | 38.10 | 34.15 |

Metrics: Higher is better.
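MAP here is mean average precision over multi-label sound-event predictions: average precision is computed per label, then averaged across labels. A minimal sketch of average precision for a single label (toy scores, not the evaluation harness used for the tables):

```python
def average_precision(scores, labels):
    """AP for one label: rank items by score (descending) and average
    the precision at each rank where a positive item appears."""
    ranked = sorted(zip(scores, labels), key=lambda sl: sl[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

# A perfect ranking puts both positives first, so AP = 1.0.
print(average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # → 1.0
```

Mean AP is simply the mean of these per-label values over the label set.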

ASR Performance

| Dataset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|
| LibriSpeech-Clean | WER↓ | 1.30 | 1.70 | 3.60 | 4.41 |
| LibriSpeech-Other | WER↓ | 2.40 | 3.40 | 5.90 | 10.66 |
| People's Speech | CER↓ | 22.30 | 28.60 | 26.12 | 29.15 |
| AISHELL-2-Mic | CER↓ | 2.70 | 2.50 | 3.20 | 6.30 |
| AISHELL-2-iOS | CER↓ | 2.60 | 2.60 | 2.90 | 5.67 |
| AISHELL-2-Android | CER↓ | 2.60 | 2.70 | 3.10 | 7.37 |
| GigaSpeech2-Indonesian | WER↓ | >100 | 21.20 | 22.30 | 26.00 |
| GigaSpeech2-Thai | WER↓ | >100 | 53.80 | 38.40 | 23.20 |
| GigaSpeech2-Viet | WER↓ | >100 | 18.60 | 17.70 | 68.88 |

Metrics: WER/CER (lower is better).
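WER (word error rate) and CER (character error rate) are the Levenshtein edit distance between hypothesis and reference, divided by the reference length, computed over words and characters respectively. A minimal sketch with a toy sentence pair (not the scoring harness used for the tables above):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# One substitution ("a" for "the") over six reference words.
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

CER is the same computation over character sequences instead of word lists; rates above 100% occur when the hypothesis contains more errors than the reference has units, as in the GigaSpeech2 rows.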

Question Answering Results

We first evaluate on MECAT-QA, which targets reasoning and assessment over audio (subsets such as direct perception, sound characteristics, quality assessment, environment reasoning, inference / judgment, and application-oriented content), reported with DATE↑. Notably, the compact MDL-0.6B improves markedly on these MECAT-QA subsets over the 7B MiDashengLM-7B-1021, reflecting strong zero-shot reasoning, logical inference, and fine-grained analysis despite its smaller size.

MMAU-Pro follows, using the Answer task and ACC↑ across capability splits (IF, Multi-Audio, Music, Open-Ended, Sound, cross-modal combinations, Spatial, Speech, Voice, and the overall Average).

Finally, we include additional QA benchmarks: AudioCaps-QA and MusicQA (FENSE↑), and MuChoMusic (ACC↑).

| Dataset | Subset | Metric | Kimi-Audio-Instruct | Qwen2.5-Omni-7B | MiDashengLM-7B-1021 | MiDashengLM-0.6B |
|---|---|---|---|---|---|---|
| MECAT-QA | Direct Perception | DATE↑ | 45.60 | 57.80 | 64.20 | 71.70 |
| MECAT-QA | Sound Characteristics | DATE↑ | 39.20 | 52.90 | 31.20 | 69.30 |
| MECAT-QA | Quality Assessment | DATE↑ | 18.70 | 39.10 | 20.20 | 57.50 |
| MECAT-QA | Environment Reasoning | DATE↑ | 34.60 | 44.00 | 20.10 | 66.30 |
| MECAT-QA | Inference / Judgment | DATE↑ | 48.90 | 53.20 | 35.30 | 66.40 |
| MECAT-QA | Application Content | DATE↑ | 41.20 | 50.80 | 33.60 | 65.80 |
| MMAU-Pro | IF | ACC↑ | 42.30 | 61.30 | 37.93 | 56.32 |
| MMAU-Pro | Multi-Audio | ACC↑ | 17.20 | 24.30 | 42.33 | 35.12 |
| MMAU-Pro | Music | ACC↑ | 57.60 | 61.50 | 62.20 | 38.15 |
| MMAU-Pro | Open-Ended | ACC↑ | 34.50 | 52.30 | 63.21 | 50.21 |
| MMAU-Pro | Sound | ACC↑ | 46.00 | 47.60 | 58.36 | 33.24 |
| MMAU-Pro | Sound–Music | ACC↑ | 46.00 | 40.00 | 42.00 | 38.00 |
| MMAU-Pro | Sound–Music–Speech | ACC↑ | 42.80 | 28.50 | 71.43 | 42.86 |
| MMAU-Pro | Spatial | ACC↑ | 43.70 | 41.20 | 18.77 | 53.23 |
| MMAU-Pro | Speech | ACC↑ | 52.20 | 57.40 | 61.17 | 33.56 |
| MMAU-Pro | Speech–Music | ACC↑ | 54.30 | 53.20 | 58.70 | 23.91 |
| MMAU-Pro | Speech–Sound | ACC↑ | 48.90 | 60.20 | 51.14 | 30.68 |
| MMAU-Pro | Voice | ACC↑ | 50.60 | 60.00 | 54.83 | 28.97 |
| MMAU-Pro | Average | ACC↑ | 46.60 | 52.20 | 55.92 | 38.69 |
| AudioCaps-QA | | FENSE↑ | 47.34 | 53.28 | 54.20 | 41.70 |
| MusicQA | | FENSE↑ | 40.00 | 60.60 | 61.56 | 36.10 |
| MuChoMusic | | ACC↑ | 67.40 | 64.79 | 73.04 | 35.80 |

Metrics: higher is better. An empty Subset cell means the benchmark has no subset split.

Citation

If you find MDL-0.6B useful in your research or business applications, please cite the underlying work it builds on: MiDashengLM (efficient audio understanding with general audio captions) and DashengTokenizer (unified continuous audio tokenization for understanding and generation).

@techreport{midashenglm7b,
  title      = {MiDashengLM: Efficient Audio Understanding with General Audio Captions},
  author     = {{Horizon Team, MiLM Plus}},
  institution = {Xiaomi Inc.},
  year       = {2025},
  note       = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)},
  url        = {https://arxiv.org/abs/2508.03983},
  eprint     = {2508.03983},
}

@misc{dinkel2026dashengtokenizer,
  title        = {DashengTokenizer: One layer is enough for unified audio understanding and generation},
  author       = {Heinrich Dinkel and Xingwei Sun and Gang Li and Jiahao Mei and Yadong Niu and Jizhong Liu and Xiyang Li and Yifan Liao and Jiahao Zhou and Junbo Zhang and Jian Luan},
  year         = {2026},
  eprint       = {2602.23765},
  archivePrefix= {arXiv},
  primaryClass = {cs.SD},
  url          = {https://arxiv.org/abs/2602.23765},
}

If you are interested in caption datasets and evaluation, see ACAVCaps and MECAT.

@misc{niu2026acavcaps,
  title         = {ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Junbo Zhang and Jian Luan},
  year          = {2026},
  eprint        = {2603.24038},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2603.24038},
}

@misc{niu2025mecat,
  title         = {MECAT: a multi-experts constructed benchmark for fine-grained audio understanding tasks},
  author        = {Yadong Niu and Tianzi Wang and Heinrich Dinkel and Xingwei Sun and Jiahao Zhou and Gang Li and Jizhong Liu and Xiyang Liu and Junbo Zhang and Jian Luan},
  year          = {2025},
  eprint        = {2507.23511},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2507.23511},
}