Improve model card and add metadata for UniAudio 2.0

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +85 -3
README.md CHANGED
@@ -1,3 +1,85 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: audio-to-audio
+ tags:
+ - audio
+ - speech
+ - music
+ - audio-generation
+ ---
+
+ # UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization
+
+ UniAudio 2.0 is a unified audio foundation model for speech, sound, and music. It uses **ReasoningCodec** (reasoning tokens and reconstruction tokens) and a unified autoregressive architecture trained on 100B text tokens and 60B audio tokens.
+
+ - **Paper:** [UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization](https://huggingface.co/papers/2602.04683)
+ - **Project Page:** [Demo 🎶](https://dongchaoyang.top/UniAudio2Demo/)
+ - **Code:** [GitHub Repository](https://github.com/yangdongchao/UniAudio2)
+
+ ## Supported Tasks
+
+ - **Speech:** TTS (EN/ZH/Yue), Audio-Instructed TTS, InstructTTS, ASR, Dysarthric Speech Recognition, S2S Q&A, S2T Q&A
+ - **Sound:** Text-to-Sound Generation, Audio Captioning, Audio Question Answering
+ - **Music:** Song Generation (EN/ZH) and Recognition, Text-to-Music Generation, Music Question Answering
+
+ ## Installation
+
+ ```bash
+ # Clone the repo
+ git clone https://github.com/yangdongchao/UniAudio2
+ cd UniAudio2
+
+ # Create environment (Python 3.10)
+ conda create -n uniaudio2 python=3.10
+ conda activate uniaudio2
+
+ # Editable install
+ pip install -e .
+ ```
+
+ ## Sample Usage
+
+ All tasks are run via the `multi_task_inference.py` script. You need to download the checkpoints and update the paths in `tools/tokenizer/ReasoningCodec_film/codec_infer_config.yaml`.
+
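Updating the checkpoint paths in the YAML config can be scripted. A minimal stdlib sketch of a line-based `key: value` rewriter; the key names used in the demo (`codec_ckpt`, `sample_rate`) are hypothetical examples, not the actual schema of `codec_infer_config.yaml`, so check the real file for its entries:

```python
# Sketch: rewrite `key: value` lines in a YAML-style config without external
# dependencies. The keys in the demo below are hypothetical examples only.

def set_config_paths(text: str, updates: dict) -> str:
    """Replace the value of any `key: value` line whose key appears in `updates`."""
    out = []
    for line in text.splitlines():
        key = line.split(":", 1)[0].strip()
        if ":" in line and key in updates:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(f"{indent}{key}: {updates[key]}")  # keep original indentation
        else:
            out.append(line)
    return "\n".join(out)

example = "codec_ckpt: /old/path.ckpt\nsample_rate: 24000"
print(set_config_paths(example, {"codec_ckpt": "/data/ckpts/codec.ckpt"}))
```

In practice you would read the real config with `open(...).read()`, patch it, and write it back.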
+ ### Understanding (Audio → Text) - ASR Example
+
+ ```bash
+ python multi_task_inference.py \
+     --task ASR \
+     --audio samples/p225_002.wav \
+     --output_dir ./ASR_output \
+     --llm_train_config <LLM_CONFIG> \
+     --exp_dir <EXP_DIR> \
+     --resume <RESUME> \
+     --text_tokenizer_path tools/tokenizer/Text2ID/llama3_2_tokenizer \
+     --prompt_text "Transcribe the provided audio recording into accurate text." \
+     --audio_tokenizer_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
+     --codec_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
+     --codec_ckpt <CODEC_CKPT>
+ ```
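To transcribe a whole directory of WAV files, the command above can be assembled programmatically. A hedged sketch; the `<...>` placeholders must be filled with your real paths, exactly as in the example above, before passing the list to `subprocess.run`:

```python
# Sketch: build the multi_task_inference.py ASR command for each WAV file in a
# directory. The <...> placeholders must be replaced with real paths.
from pathlib import Path

def build_asr_command(audio: str, output_dir: str) -> list[str]:
    """Assemble the argument list for one ASR invocation."""
    return [
        "python", "multi_task_inference.py",
        "--task", "ASR",
        "--audio", audio,
        "--output_dir", output_dir,
        "--llm_train_config", "<LLM_CONFIG>",
        "--exp_dir", "<EXP_DIR>",
        "--resume", "<RESUME>",
        "--text_tokenizer_path", "tools/tokenizer/Text2ID/llama3_2_tokenizer",
        "--prompt_text", "Transcribe the provided audio recording into accurate text.",
        "--audio_tokenizer_config", "tools/tokenizer/ReasoningCodec_film/infer_config.yaml",
        "--codec_config", "tools/tokenizer/ReasoningCodec_film/infer_config.yaml",
        "--codec_ckpt", "<CODEC_CKPT>",
    ]

for wav in sorted(Path("samples").glob("*.wav")):
    cmd = build_asr_command(str(wav), "./ASR_output")
    print(" ".join(cmd))  # or: subprocess.run(cmd, check=True)
```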
+
+ ### Generation (Text → Audio) - TTS Example
+
+ ```bash
+ python multi_task_inference.py \
+     --task TTS \
+     --stage all \
+     --text "Hello, this is a test." \
+     --output_dir ./TTS_output \
+     --llm_train_config <LLM_CONFIG> \
+     --exp_dir <EXP_DIR> \
+     --resume <RESUME> \
+     --text_tokenizer_path tools/tokenizer/Text2ID/llama3_2_tokenizer \
+     --prompt_text "Convert the given text into natural speech." \
+     --audio_tokenizer_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
+     --codec_config tools/tokenizer/ReasoningCodec_film/infer_config.yaml \
+     --codec_ckpt <CODEC_CKPT> \
+     --codec_steps 10
+ ```
+
+ ## Citation
+
+ ```bibtex
+ @article{uniaudio2,
+   title={UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization},
+   author={Dongchao Yang and Yuanyuan Wang and Dading Chong and Songxiang Liu and Xixin Wu and Helen Meng},
+   year={2026}
+ }
+ ```