Improve model card and add metadata for UniAudio 2.0
Hi! I'm Niels from the Hugging Face community science team. I've updated the model card for UniAudio 2.0 to include:
- Metadata for the `audio-to-audio` pipeline tag and the MIT license.
- Links to the research paper, project demo page, and official GitHub repository.
- A summary of supported tasks across speech, sound, and music.
- Sample usage instructions for both understanding (ASR) and generation (TTS) based on the official documentation.
- The BibTeX citation for the paper.
README.md
---
license: mit
pipeline_tag: audio-to-audio
tags:
- audio
- speech
- music
- audio-generation
---

# UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

UniAudio 2.0 is a unified audio foundation model for speech, sound, and music. It uses **ReasoningCodec** (reasoning tokens and reconstruction tokens) and a unified autoregressive architecture trained on 100B text and 60B audio tokens.

- **Paper:** [UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization](https://huggingface.co/papers/2602.04683)
- **Project Page:** [Demo 🎶](https://dongchaoyang.top/UniAudio2Demo/)
- **Code:** [GitHub Repository](https://github.com/yangdongchao/UniAudio2)

## Supported Tasks

- **Speech:** TTS (EN/ZH/Yue), Audio-Instructed TTS, InstructTTS, ASR, Dysarthric Speech Recognition, S2S Q&A, S2T Q&A
- **Sound:** Text-to-Sound, Audio Captioning, Audio Question Answering
- **Music:** Song Generation (EN/ZH) and Recognition, Text-to-Music Generation, Music Question Answering

## Installation

```bash
# Clone the repo
git clone https://github.com/yangdongchao/UniAudio2
cd UniAudio2

# Create environment (Python 3.10)
conda create -n uniaudio2 python=3.10
conda activate uniaudio2

# Editable install
pip install -e .
```

## Sample Usage

All tasks run through the `multi_task_inference.py` script. Before running inference, download the checkpoints and update the paths in `tools/tokenizer/ReasoningCodec_film/codec_infer_config.yaml`.

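Since both examples below reference files inside the repository checkout, it can help to confirm those paths exist before launching a run. The following pre-flight check is our own sketch, not part of the UniAudio2 repo; run it from the repository root.

```shell
# Hypothetical pre-flight check (not part of UniAudio2): report any of the
# paths referenced by the inference commands that are missing from this checkout.
missing=""
for f in tools/tokenizer/ReasoningCodec_film/codec_infer_config.yaml \
         tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
         tools/tokenizer/Text2ID/llama3_2_tokenizer; do
  [ -e "$f" ] || missing="$missing $f"
done
[ -z "$missing" ] && echo "all paths present" || echo "missing:$missing"
```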
### Understanding (Audio → Text) - ASR Example

```bash
python multi_task_inference.py \
    --task ASR \
    --audio samples/p225_002.wav \
    --output_dir ./ASR_output \
    --llm_train_config <LLM_CONFIG> \
    --exp_dir <EXP_DIR> \
    --resume <RESUME> \
    --text_tokenizer_path tools/tokenizer/Text2ID/llama3_2_tokenizer \
    --prompt_text "Transcribe the provided audio recording into accurate text." \
    --audio_tokenizer_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
    --codec_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
    --codec_ckpt <CODEC_CKPT>
```
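To transcribe a directory of recordings rather than a single file, the ASR command can be wrapped in a loop. The sketch below is our own (the `WAV_DIR` and `DRY_RUN` variables are not part of the repo) and it abbreviates the flag list for readability; append the same tokenizer and checkpoint flags shown above before running for real.

```shell
# Hypothetical batch-ASR loop (our sketch, not part of UniAudio2).
# DRY_RUN=1 only prints the commands; set DRY_RUN=0 and add the remaining
# flags from the single-file example above to actually run inference.
WAV_DIR=${WAV_DIR:-samples}
DRY_RUN=${DRY_RUN:-1}

queued=0
for wav in "$WAV_DIR"/*.wav; do
  [ -e "$wav" ] || continue   # glob matched nothing
  cmd="python multi_task_inference.py --task ASR --audio $wav --output_dir ./ASR_output"
  queued=$((queued + 1))
  if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else eval "$cmd"; fi
done
echo "queued $queued file(s)"
```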

### Generation (Text → Audio) - TTS Example

```bash
python multi_task_inference.py \
    --task TTS \
    --stage all \
    --text "Hello, this is a test." \
    --output_dir ./TTS_output \
    --llm_train_config <LLM_CONFIG> --exp_dir <EXP_DIR> --resume <RESUME> \
    --text_tokenizer_path tools/tokenizer/Text2ID/llama3_2_tokenizer \
    --prompt_text "Convert the given text into natural speech." \
    --audio_tokenizer_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
    --codec_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
    --codec_ckpt <CODEC_CKPT> --codec_steps 10
```

## Citation

```bibtex
@article{uniaudio2,
  title={UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization},
  author={Dongchao Yang and Yuanyuan Wang and Dading Chong and Songxiang Liu and Xixin Wu and Helen Meng},
  year={2026}
}
```