---
license: apache-2.0
language:
- en
- zh
tags:
- audio
- speech
- music
- understanding
- multimodal
- instruct
pipeline_tag: audio-text-to-text
---

# MOSS-Audio

MOSS-Audio is an open-source **audio understanding model** from [MOSI.AI](https://mosi.cn/#hero), the [OpenMOSS team](https://www.open-moss.com/), and [Shanghai Innovation Institute](https://www.sii.edu.cn/). It performs unified modeling over complex real-world audio, supporting **speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning**.

In this release, we provide **four models**: **MOSS-Audio-4B-Instruct**, **MOSS-Audio-4B-Thinking**, **MOSS-Audio-8B-Instruct**, and **MOSS-Audio-8B-Thinking**. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.

## News

* 2026.4.13: 🎉🎉🎉 We have released [MOSS-Audio](https://huggingface.co/collections/OpenMOSS-Team/moss-audio). Blog and paper coming soon!

## Contents

- [Introduction](#introduction)
- [Model Architecture](#model-architecture)
  - [DeepStack Cross-Layer Feature Injection](#deepstack-cross-layer-feature-injection)
  - [Time-Aware Representation](#time-aware-representation)
- [Released Models](#released-models)
- [Evaluation](#evaluation)
- [Quickstart](#quickstart)
  - [Environment Setup](#environment-setup)
  - [Basic Usage](#basic-usage)
  - [Gradio App](#gradio-app)
  - [SGLang Serving](#sglang-serving)
- [More Information](#more-information)
- [Citation](#citation)

## Introduction

Understanding audio requires more than simply transcribing words — it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. **MOSS-Audio** is built to unify these capabilities within a single model.

- **Speech & Content Understanding**: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. Supports both word-level and sentence-level timestamp alignment.
- **Speaker, Emotion & Event Analysis**: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.
- **Scene & Sound Cue Extraction**: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.
- **Music Understanding**: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.
- **Audio Question Answering & Summarization**: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.
- **Time-Aware QA**: Supports time-aware questions, including word-level and sentence-level timestamp ASR.
- **Complex Reasoning**: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.

## Model Architecture

MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by **MOSS-Audio-Encoder** into continuous temporal representations at **12.5 Hz**, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation. Rather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.

### DeepStack Cross-Layer Feature Injection

Using only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. To address this, we design a **DeepStack**-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers, preserving multi-granularity information from low-level acoustic details to high-level semantic abstractions. This design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure — information that a single high-level representation cannot fully capture.

### Time-Aware Representation

Time is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a **time-marker insertion** strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn "what happened when" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.
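The time-marker idea can be illustrated with a small sketch. The marker token format (`<|time_…s|>`) and the 2-second interval below are illustrative assumptions, not the model's actual vocabulary or configuration; only the 12.5 Hz frame rate comes from the description above.

```python
# Sketch of time-marker insertion: explicit time tokens are interleaved
# with the 12.5 Hz audio-frame representations at a fixed interval so the
# LLM sees absolute temporal positions. Token format and interval are
# illustrative assumptions, not MOSS-Audio's real tokenizer.

FRAME_RATE_HZ = 12.5      # MOSS-Audio-Encoder output rate (from the model card)
MARKER_INTERVAL_S = 2.0   # assumed interval; 2 s corresponds to 25 frames

def insert_time_markers(num_frames: int) -> list[str]:
    """Interleave explicit time tokens with audio-frame placeholders."""
    frames_per_marker = int(MARKER_INTERVAL_S * FRAME_RATE_HZ)  # 25
    seq: list[str] = []
    for i in range(num_frames):
        if i % frames_per_marker == 0:
            seq.append(f"<|time_{i / FRAME_RATE_HZ:.1f}s|>")
        seq.append(f"<frame_{i}>")
    return seq

# 4 seconds of audio -> 50 frames -> markers at 0.0 s and 2.0 s
tokens = insert_time_markers(50)
print(tokens[0], tokens[26], len(tokens))
```

Because the markers live in the same token stream as the frames, "what happened when" becomes ordinary next-token prediction rather than a separate alignment head.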
## Released Models

| Model | Audio Encoder | LLM Backbone | Total Size | Hugging Face |
|---|---|---|---:|---|
| **MOSS-Audio-4B-Instruct** | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-4B-Instruct) |
| **MOSS-Audio-4B-Thinking** | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-4B-Thinking) |
| **MOSS-Audio-8B-Instruct** | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-8B-Instruct) |
| **MOSS-Audio-8B-Thinking** | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | [![Hugging Face](https://img.shields.io/badge/Huggingface-Model-orange?logo=huggingface)](https://huggingface.co/OpenMOSS-Team/MOSS-Audio-8B-Thinking) |

> More model families, sizes, and variants will be released in the future. Stay tuned!

## Evaluation

We evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. Key results:

- **General Audio Understanding**: MOSS-Audio-8B-Thinking achieves an average accuracy of **70.80**, outperforming all open-source models.
- **Speech Captioning**: MOSS-Audio-Instruct variants lead across **11 out of 13** fine-grained speech description dimensions, with **MOSS-Audio-8B-Instruct** achieving the best overall average score (**3.7252**).
- **ASR**: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio achieves the **lowest overall CER (11.30)**, with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.
- **Timestamp ASR**: MOSS-Audio-8B-Instruct achieves **35.77 AAS** on AISHELL-1 and **131.61 AAS** on LibriSpeech, dramatically outperforming Qwen3-Omni (833.66) and Gemini-3.1-Pro (708.24) in timestamp ASR accuracy.

### General Audio Understanding (Accuracy↑)

| Model | Model Size | MMAU | MMAU-Pro | MMAR | MMSU | Avg |
|---|---:|---:|---:|---:|---:|---:|
| *Open Source (small)* | | | | | | |
| Kimi-Audio | 7B | 72.41 | 56.58 | 60.82 | 54.74 | 61.14 |
| Qwen2.5-Omni | 7B | 65.60 | 52.20 | 56.70 | 61.32 | 58.96 |
| Audio Flamingo 3 | 7B | 61.23 | 51.70 | 57.96 | 60.04 | 57.73 |
| MiMo-Audio-7B | 7B | 74.90 | 53.35 | 61.70 | 61.94 | 62.97 |
| MiniCPM-o-4.5 | 9B | 70.97 | 39.65 | 55.75 | 60.96 | 56.83 |
| MOSS-Audio-4B-Instruct | 4B | 75.79 | 58.16 | 59.68 | 59.68 | 64.04 |
| MOSS-Audio-4B-Thinking | 4B | 77.64 | 60.75 | 63.91 | 71.20 | 68.37 |
| MOSS-Audio-8B-Instruct | 8B | 77.03 | 57.48 | 64.42 | 66.36 | 66.32 |
| MOSS-Audio-8B-Thinking | 8B | 77.13 | 64.29 | 65.73 | 76.06 | 70.80 |
| *Open Source (large)* | | | | | | |
| Qwen3-Omni-30B-A3B-Instruct | 30B | 75.00 | 61.22 | 66.40 | 69.00 | 67.91 |
| Step-Audio-R1.1 | 33B | 72.18 | 60.80 | 68.75 | 64.18 | 66.48 |
| Step-Audio-R1 | 33B | 78.67 | 59.68 | 69.15 | 75.18 | 70.67 |
| *Closed Source* | | | | | | |
| GPT4o-Audio | - | 65.66 | 52.30 | 59.78 | 58.76 | 59.13 |
| Gemini-3-Pro | - | 80.15 | 68.28 | 81.73 | 81.28 | 77.86 |
| Gemini-3.1-Pro | - | 81.10 | 73.47 | 83.70 | 81.30 | 79.89 |
### Speech Captioning (LLM-as-a-Judge Score↑)

<details>
<summary>Speech Captioning (click to expand)</summary>

| Model | gender | age | accent | pitch | volume | speed | texture | clarity | fluency | emotion | tone | personality | summary | Avg |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |
| Qwen3-Omni-30B-A3B-Thinking | 4.419 | **4.026** | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |
| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |
| Gemini-3.1-Pro | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | **3.328** | 3.224 | 3.292 | 3.179 | 3.5986 |
| MOSS-Audio-4B-Instruct | **4.697** | 3.980 | 4.497 | 3.628 | **3.722** | 3.564 | **3.407** | 3.841 | 3.744 | 3.311 | **3.282** | **3.305** | 3.259 | 3.7105 |
| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | **4.572** | **3.682** | 3.709 | **3.638** | 3.403 | **3.869** | **3.747** | 3.314 | 3.253 | 3.272 | **3.307** | **3.7252** |

</details>
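The Avg column is the unweighted mean of the 13 dimension scores. A quick check in plain Python, using the MOSS-Audio-8B-Instruct row from the table:

```python
# Verify that Avg is the unweighted mean of the 13 speech-captioning
# dimensions, using the MOSS-Audio-8B-Instruct scores from the table.
scores = {
    "gender": 4.683, "age": 3.979, "accent": 4.572, "pitch": 3.682,
    "volume": 3.709, "speed": 3.638, "texture": 3.403, "clarity": 3.869,
    "fluency": 3.747, "emotion": 3.314, "tone": 3.253,
    "personality": 3.272, "summary": 3.307,
}
avg = sum(scores.values()) / len(scores)
print(round(avg, 4))  # 3.7252, matching the Avg column
```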
### ASR (CER↓)

| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Environment (Clean) | Acoustic Environment (Noisy) | Acoustic Characteristics: Whisper | Acoustic Characteristics: Far-Field / Near-Field | Multi-Speaker | Age | Semantic Content |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |
| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |
| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | **14.95** | 6.08 |
| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |
| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | **2.20** | **2.15** | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |
| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |
| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |
| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | **1.90** | **17.08** | **18.15** | **11.46** | **5.74** |
| **MOSS-Audio-4B-Instruct** | 11.58 | 21.11 | 11.84 | 10.79 | **4.01** | **10.11** | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |
| **MOSS-Audio-8B-Instruct** | **11.30** | **19.18** | **8.76** | **9.81** | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |
<details>
<summary>Detailed ASR Results (click to expand)</summary>

| Dimension | Test Set | Paraformer-Large | GLM-ASR-Nano | Fun-ASR-Nano | SenseVoice-Small | Kimi-Audio-7B-Instruct | Qwen2.5-Omni-3B | Qwen2.5-Omni-7B | Qwen3-Omni-30B-A3B-Instruct | MOSS-Audio-4B-Instruct | MOSS-Audio-8B-Instruct |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Acoustic Environment (Clean) | AISHELL-1 (test) | 1.98 | 2.89 | 2.16 | 3.23 | 0.79 | 1.51 | 1.16 | 0.95 | 2.26 | 1.82 |
| | AISHELL-2 (Android) | 3.28 | 3.75 | 3.04 | 4.16 | 2.91 | 3.10 | 2.88 | 2.70 | 3.22 | 2.97 |
| | AISHELL-2 (IOS) | 3.21 | 3.73 | 2.99 | 4.02 | 3.03 | 2.94 | 2.77 | 2.72 | 3.20 | 2.95 |
| | AISHELL-2 (Mic) | 3.00 | 3.78 | 3.07 | 3.96 | 2.88 | 2.93 | 2.73 | 2.57 | 3.33 | 2.91 |
| | THCHS-30 (test) | 4.07 | 4.23 | 3.65 | 5.26 | 1.39 | 3.32 | 3.06 | 2.21 | 3.53 | 2.82 |
| Acoustic Environment (Noisy) | MAGICDATA-READ (test) | 4.67 | 5.02 | 3.46 | 4.93 | 2.15 | 3.56 | 3.16 | 2.47 | 3.72 | 3.20 |
| Acoustic Characteristics: Whisper | AISHELL6-Whisper (normal) | 1.11 | 0.83 | 0.81 | 1.25 | 0.69 | 0.82 | 0.71 | 0.59 | 0.73 | 0.69 |
| | AISHELL6-Whisper (whisper) | 8.92 | 9.06 | 6.76 | 9.88 | 4.63 | 7.82 | 6.57 | 3.22 | 5.86 | 4.80 |
| Acoustic Characteristics: Far-Field / Near-Field | AliMeeting (Test_Ali_far) | 25.64 | 40.27 | 27.21 | 37.01 | 28.22 | 32.14 | 32.03 | 25.72 | 27.27 | 36.82 |
| | AliMeeting (Test_Ali_near) | 9.27 | 14.76 | 9.55 | 16.31 | 13.82 | 12.16 | 18.73 | 8.44 | 9.68 | 11.25 |
| Multi-Speaker | AISHELL-4 (test) | 20.33 | 28.02 | 19.82 | 24.06 | 20.61 | 22.91 | 21.01 | 18.15 | 20.33 | 24.36 |
| Age | SeniorTalk (sentence) | 17.31 | 20.33 | 16.96 | 21.07 | 19.70 | 17.38 | 19.96 | 14.13 | 16.93 | 17.42 |
| | ChildMandarin (test) | 12.60 | 14.06 | 12.94 | 14.18 | 13.79 | 12.96 | 12.29 | 8.79 | 13.25 | 13.10 |
| Health Condition | AISHELL-6A (mild) | 6.98 | 8.74 | 6.60 | 7.62 | 7.00 | 6.87 | 7.27 | 6.20 | 6.36 | 5.84 |
| | AISHELL-6A (moderate) | 9.30 | 12.11 | 8.81 | 9.85 | 9.34 | 10.55 | 10.94 | 8.88 | 9.77 | 8.94 |
| | AISHELL-6A (severe) | 13.34 | 14.38 | 12.98 | 14.39 | 12.56 | 14.57 | 12.92 | 11.59 | 12.68 | 11.52 |
| | AISHELL-6A (StutteringSpeech) | 10.74 | 12.29 | 10.30 | 11.47 | 10.75 | 11.33 | 10.53 | 10.25 | 10.28 | 9.72 |
| | AISHELL_6B (LRDWWS) | 47.59 | 50.34 | 47.42 | 52.92 | 44.44 | 54.54 | 51.99 | 45.80 | 43.35 | 39.76 |
| | AISHELL_6B (Uncontrol) | 45.08 | 49.09 | 45.84 | 47.97 | 42.57 | 50.03 | 49.45 | 41.65 | 44.25 | 39.27 |
| Semantic Content | WenetSpeech (test-meeting) | 7.88 | 9.70 | 7.39 | 8.35 | 7.15 | 9.04 | 8.43 | 6.64 | 8.17 | 7.86 |
| | Fleurs (cmn_hans_cn) | 6.40 | 4.94 | 4.76 | 6.75 | 5.10 | 5.45 | 5.13 | 4.84 | 8.13 | 7.52 |
| Code-Switching | CS-Dialogue (test) | 10.64 | 11.06 | 10.47 | 12.81 | 14.56 | 10.78 | 14.02 | 12.94 | 9.14 | 9.07 |
| | TALCS (test) | 10.77 | 11.07 | 8.09 | 10.52 | 12.74 | 10.94 | 10.46 | 8.33 | 8.37 | 8.22 |
| | ASCEND (test) | 16.55 | 13.50 | 15.13 | 18.38 | 21.83 | 13.25 | 14.42 | 12.64 | 12.83 | 13.26 |
| Dialect | KeSpeech (test) | 11.48 | 9.72 | 7.43 | 10.45 | 5.51 | 7.67 | 6.40 | 5.87 | 14.65 | 9.18 |
| | WSYue-ASR-eval (short) | 75.42 | 35.07 | 8.17 | 7.34 | 53.17 | 60.06 | 57.43 | 25.39 | 9.04 | 8.33 |
| Singing | MIR-1K (test) | 57.70 | 95.87 | 35.85 | 39.51 | 38.35 | 45.00 | 42.62 | 30.81 | 18.47 | 17.24 |
| | openc-pop (test) | 6.98 | 8.03 | 2.84 | 8.07 | 5.17 | 3.47 | 2.75 | 1.21 | 3.10 | 2.39 |
| Non-Speech Vocalizations | MNV_17 | 4.95 | 4.65 | 4.76 | 4.92 | 4.68 | 5.54 | 4.56 | 4.73 | 4.01 | 4.31 |

</details>
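As a sanity check on how the detailed ASR results collapse into the summary table, each dimension score appears to be the unweighted mean of its constituent test-set CERs, and Overall the mean over all 29 test sets. This aggregation rule is inferred from the published numbers, not from an official description; it reproduces the summary to within rounding of the per-set values. A sketch for the MOSS-Audio-8B-Instruct row:

```python
# Reconstruct summary ASR numbers for MOSS-Audio-8B-Instruct from its
# per-test-set CERs. The aggregation rule (unweighted means) is inferred
# from the published numbers, not from an official description.
dialect = [9.18, 8.33]  # KeSpeech (test), WSYue-ASR-eval (short)
all_test_sets = [
    1.82, 2.97, 2.95, 2.91, 2.82,          # Acoustic Environment (Clean)
    3.20,                                  # Acoustic Environment (Noisy)
    0.69, 4.80,                            # Whisper
    36.82, 11.25,                          # Far-Field / Near-Field
    24.36,                                 # Multi-Speaker
    17.42, 13.10,                          # Age
    5.84, 8.94, 11.52, 9.72, 39.76, 39.27, # Health Condition
    7.86, 7.52,                            # Semantic Content
    9.07, 8.22, 13.26,                     # Code-Switching
    9.18, 8.33,                            # Dialect
    17.24, 2.39,                           # Singing
    4.31,                                  # Non-Speech Vocalizations
]
dialect_cer = sum(dialect) / len(dialect)              # ~8.76 in the summary
overall_cer = sum(all_test_sets) / len(all_test_sets)  # ~11.30 in the summary
```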
### Timestamp ASR (AAS↓)

| Model | AISHELL-1 (zh) | LibriSpeech (en) |
|---|---:|---:|
| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |
| Gemini-3.1-Pro | 708.24 | 871.19 |
| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |
| **MOSS-Audio-8B-Instruct** | **35.77** | **131.61** |

## Quickstart

### Environment Setup

We recommend Python 3.12 with a clean Conda environment. The commands below are enough for local inference.

#### Recommended setup

```bash
git clone https://github.com/OpenMOSS/MOSS-Audio.git
cd MOSS-Audio
conda create -n moss-audio python=3.12 -y
conda activate moss-audio
conda install -c conda-forge "ffmpeg=7" -y
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime]"
```

#### Optional: FlashAttention 2

If your GPU supports FlashAttention 2, you can replace the last install command with:

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e ".[torch-runtime,flash-attn]"
```

### Basic Usage

Download the model first:

```bash
huggingface-cli download OpenMOSS-Team/MOSS-Audio --local-dir ./weights/MOSS-Audio
huggingface-cli download OpenMOSS-Team/MOSS-Audio-Instruct --local-dir ./weights/MOSS-Audio-Instruct
```

Then edit `MODEL_PATH` / `AUDIO_PATH` in `infer.py` as needed, and run:

```bash
python infer.py
```

The default prompt in `infer.py` is `Describe this audio.` You can directly edit that line if you want to try transcription, audio QA, or speech captioning.

### Gradio App

Start the Gradio demo with:

```bash
python app.py
```

### SGLang Serving

If you want to serve MOSS-Audio with SGLang, see the full guide in `moss_audio_usage_guide.md`. The shortest setup is:

```bash
git clone -b moss-audio https://github.com/OpenMOSS/sglang.git
cd sglang
pip install -e "python[all]"
pip install nvidia-cudnn-cu12==9.16.0.29
cd ..
sglang serve --model-path ./weights/MOSS-Audio --trust-remote-code
```

If you use the default `torch==2.9.1+cu128` runtime, installing `nvidia-cudnn-cu12==9.16.0.29` is recommended before starting `sglang serve`.

## More Information

- **MOSI.AI**: [https://mosi.cn](https://mosi.cn)
- **OpenMOSS**: [https://www.open-moss.com](https://www.open-moss.com)

## License

Models in MOSS-Audio are licensed under the Apache License 2.0.

## Citation

```bibtex
@misc{mossaudio2026,
  title={MOSS-Audio Technical Report},
  author={OpenMOSS Team},
  year={2026},
  howpublished={\url{https://github.com/OpenMOSS/MOSS-Audio}},
  note={GitHub repository}
}
```