Improve model card and add metadata for UniAudio 2.0

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +85 -3
README.md CHANGED
@@ -1,3 +1,85 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: audio-to-audio
+ tags:
+ - audio
+ - speech
+ - music
+ - audio-generation
+ ---
+
+ # UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
+
+ UniAudio 2.0 is a unified audio foundation model for speech, sound, and music. It uses **ReasoningCodec** (reasoning tokens and reconstruction tokens) and a unified autoregressive architecture trained on 100B text tokens and 60B audio tokens.
+
+ - **Paper:** [UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization](https://huggingface.co/papers/2602.04683)
+ - **Project Page:** [Demo 🎶](https://dongchaoyang.top/UniAudio2Demo/)
+ - **Code:** [GitHub Repository](https://github.com/yangdongchao/UniAudio2)
+
+ ## Supported Tasks
+
+ - **Speech:** TTS (EN/ZH/Yue), Audio-Instructed TTS, InstructTTS, ASR, Dysarthric Speech Recognition, S2S Q&A, S2T Q&A
+ - **Sound:** Text-to-Sound Generation, Audio Captioning, Audio Question Answering
+ - **Music:** Song Generation (EN/ZH) and Recognition, Text-to-Music Generation, Music Question Answering
+
+ ## Installation
+
+ ```bash
+ # Clone the repo
+ git clone https://github.com/yangdongchao/UniAudio2
+ cd UniAudio2
+
+ # Create environment (Python 3.10)
+ conda create -n uniaudio2 python=3.10
+ conda activate uniaudio2
+
+ # Editable install
+ pip install -e .
+ ```
+
+ ## Sample Usage
+
+ All tasks are run via the `multi_task_inference.py` script. You need to download the checkpoints and update the paths in `tools/tokenizer/ReasoningCodec_film/codec_infer_config.yaml`.
+
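Updating the checkpoint paths in the YAML config can be scripted. A minimal stdlib sketch of a line-based `key: value` rewriter; the key names used in the demo (`codec_ckpt`, `sample_rate`) are hypothetical examples, not the actual schema of `codec_infer_config.yaml`, so check the real file for its entries:

```python
# Sketch: rewrite `key: value` lines in a YAML-style config without external
# dependencies. The keys in the demo below are hypothetical examples only.

def set_config_paths(text: str, updates: dict) -> str:
    """Replace the value of any `key: value` line whose key appears in `updates`."""
    out = []
    for line in text.splitlines():
        key = line.split(":", 1)[0].strip()
        if ":" in line and key in updates:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}{key}: {updates[key]}")  # keep original indentation
        else:
            out.append(line)
    return "\n".join(out)

example = "codec_ckpt: /old/path.ckpt\nsample_rate: 24000"
print(set_config_paths(example, {"codec_ckpt": "/data/ckpts/codec.ckpt"}))
```

In practice you would read the real config with `open(...).read()`, patch it, and write it back.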
+ ### Understanding (Audio → Text) - ASR Example
+
+ ```bash
+ python multi_task_inference.py \
+     --task ASR \
+     --audio samples/p225_002.wav \
+     --output_dir ./ASR_output \
+     --llm_train_config <LLM_CONFIG> \
+     --exp_dir <EXP_DIR> \
+     --resume <RESUME> \
+     --text_tokenizer_path tools/tokenizer/Text2ID/llama3_2_tokenizer \
+     --prompt_text "Transcribe the provided audio recording into accurate text." \
+     --audio_tokenizer_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
+     --codec_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
+     --codec_ckpt <CODEC_CKPT>
+ ```
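To transcribe a whole directory of WAV files, the command above can be assembled programmatically. A hedged sketch; the `<...>` placeholders must be filled with your real paths, exactly as in the example above, before passing the list to `subprocess.run`:

```python
# Sketch: build the multi_task_inference.py ASR command for each WAV file in a
# directory. The <...> placeholders must be replaced with real paths.
from pathlib import Path

def build_asr_command(audio: str, output_dir: str) -> list[str]:
    """Assemble the argument list for one ASR invocation."""
    return [
        "python", "multi_task_inference.py",
        "--task", "ASR",
        "--audio", audio,
        "--output_dir", output_dir,
        "--llm_train_config", "<LLM_CONFIG>",
        "--exp_dir", "<EXP_DIR>",
        "--resume", "<RESUME>",
        "--text_tokenizer_path", "tools/tokenizer/Text2ID/llama3_2_tokenizer",
        "--prompt_text", "Transcribe the provided audio recording into accurate text.",
        "--audio_tokenizer_config", "tools/tokenizer/ReasoningCodec_film/infer_config.yaml",
        "--codec_config", "tools/tokenizer/ReasoningCodec_film/infer_config.yaml",
        "--codec_ckpt", "<CODEC_CKPT>",
    ]

for wav in sorted(Path("samples").glob("*.wav")):
    cmd = build_asr_command(str(wav), "./ASR_output")
    print(" ".join(cmd))  # or: subprocess.run(cmd, check=True)
```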
+
+ ### Generation (Text → Audio) - TTS Example
+
+ ```bash
+ python multi_task_inference.py \
+     --task TTS \
+     --stage all \
+     --text "Hello, this is a test." \
+     --output_dir ./TTS_output \
+     --llm_train_config <LLM_CONFIG> \
+     --exp_dir <EXP_DIR> \
+     --resume <RESUME> \
+     --text_tokenizer_path tools/tokenizer/Text2ID/llama3_2_tokenizer \
+     --prompt_text "Convert the given text into natural speech." \
+     --audio_tokenizer_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
+     --codec_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
+     --codec_ckpt <CODEC_CKPT> \
+     --codec_steps 10
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @article{uniaudio2,
+   title={UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization},
+   author={Dongchao Yang and Yuanyuan Wang and Dading Chong and Songxiang Liu and Xixin Wu and Helen Meng},
+   year={2026}
+ }
+ ```