zhichen committed
Commit a9df9a8 · verified · 1 Parent(s): 4e6c79c

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +69 -3
  2. README_EN.md +241 -0
README.md CHANGED
@@ -1,3 +1,69 @@
- ---
- license: apache-2.0
- ---
+ Chinese | [English](README_EN.md)
+
+ <p align="center">
+ <picture>
+ <img alt="Ultravox" src="https://zfmrfvimiaqahezndsse.supabase.co/storage/v1/object/public/images/custom/Introducing%20Ultravox%20Wide.jpg">
+ </picture>
+ </p>
+
+ <h3 align="center">
+ A fast multimodal LLM designed for real-time voice interaction
+ </h3>
+
+
+ # Overview
+
+ Ultravox is a new kind of multimodal LLM that understands both text and human speech without a separate automatic speech recognition (ASR) stage. Building on research such as [AudioLM](https://arxiv.org/abs/2209.03143), [SeamlessM4T](https://ai.meta.com/blog/seamless-m4t/), [Gazelle](https://tincans.ai/slm), and [SpeechGPT](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt), Ultravox can extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional embedding space used by the LLM.
+
+ Official ultravox repository: [https://github.com/fixie-ai/ultravox](https://github.com/fixie-ai/ultravox)
+
+ ultravox-cn repository: [https://github.com/seanzhang-zhichen/ultravox-cn](https://github.com/seanzhang-zhichen/ultravox-cn)
+
+ Because the official models offer limited Chinese support, we trained a Chinese-friendly speech multimodal model based on Qwen2.5-7B-Instruct and whisper-large-v3-turbo.
+
+ ### Architecture
+
+ [![Architecture diagram](https://raw.githubusercontent.com/fixie-ai/ultravox/main/docs/assets/Ultravox%20Model%20Architecture.svg)](https://docs.google.com/presentation/d/1ey81xuuMzrJaBwztb_Rq24Cit37GQokD2aAes_KkGVI/edit)
+
+
+ ### Results
+
+ ![ultravox-cn-web](docs/assets/ultravox-cn-web.png)
+
+ ### Models
+
+ - Hugging Face: [https://huggingface.co/zhichen/ultravox-cn](https://huggingface.co/zhichen/ultravox-cn)
+ - ModelScope: [https://modelscope.cn/models/seanzhang/ultravox-cn](https://modelscope.cn/models/seanzhang/ultravox-cn)
+
+ ## Environment Setup
+
+ Install `just` and set up the environment:
+
+ ```bash
+ git clone https://github.com/seanzhang-zhichen/ultravox-cn.git
+ cd ultravox-cn
+ sudo apt-get install just
+ conda create -n ultravox python=3.11
+ conda activate ultravox
+ just install
+ ```
+
+ ## Model Preparation
+
+ Before running the demo, prepare the following models:
+
+ - Qwen2.5-7B-Instruct
+ - whisper-large-v3-turbo
+ - seanzhang/ultravox-cn
+
+ Once the models are ready, edit seanzhang/ultravox-cn/config.json: set audio_model_id to your local whisper-large-v3-turbo path and text_model_id to your local Qwen2.5-7B-Instruct path.
+
+ ![config.json](docs/assets/config.png)
+
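As a rough sketch (the paths below are placeholders for your local directories; leave every other field in the file untouched), the two entries in config.json would look like:

```json
{
  "audio_model_id": "/path/to/whisper-large-v3-turbo",
  "text_model_id": "/path/to/Qwen2.5-7B-Instruct"
}
```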
+ ### Web Demo
+
+ ```bash
+ python ultravox/tools/gradio_demo.py --model_path seanzhang/ultravox-cn  # or a local path
+ ```
+
README_EN.md ADDED
@@ -0,0 +1,241 @@
+ [Chinese](README.md) | English
+
+ <p align="center">
+ <picture>
+ <img alt="Ultravox" src="https://zfmrfvimiaqahezndsse.supabase.co/storage/v1/object/public/images/custom/Introducing%20Ultravox%20Wide.jpg">
+ </picture>
+ </p>
+
+ <h3 align="center">
+ A fast multimodal LLM designed for real-time voice interactions
+ </h3>
+
+ _Latest News_
+ * 2025/02 — [Ultravox 0.5](https://github.com/fixie-ai/ultravox/releases/tag/v0.5) available
+ * 2024/11 — [Ultravox 0.4.1](https://github.com/fixie-ai/ultravox/releases/tag/v0.4.1) available
+ * 2024/08 — [Ultravox 0.4](https://github.com/fixie-ai/ultravox/releases/tag/v0.4) available
+ * 2024/08 — [Ultravox 0.3](https://github.com/fixie-ai/ultravox/releases/tag/v0.3) available
+ * 2024/08 — Preview of Ultravox APIs available; more information [here](https://fixie-ai.github.io/ultradox/)
+
+ _Key Links_
+ * [Ultravox Realtime](https://ultravox.ai) — Build real-time Voice AI agents on top of the Ultravox model
+ * [Hugging Face](https://huggingface.co/fixie-ai) — Our Hugging Face page
+
+ ---
+
+ # About
+
+ Ultravox is a new kind of multimodal LLM that can understand text as well as human speech, without the need for a separate Automatic Speech Recognition (ASR) stage. Building on research like [AudioLM](https://arxiv.org/abs/2209.03143), [SeamlessM4T](https://ai.meta.com/blog/seamless-m4t/), [Gazelle](https://tincans.ai/slm), [SpeechGPT](https://github.com/0nutation/SpeechGPT/tree/main/speechgpt), and others, Ultravox is able to extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. We've trained versions on Llama 3, Mistral, and Gemma. This direct coupling allows Ultravox to respond much more quickly than systems that combine separate ASR and LLM components. In the future, this will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.
+
+ Ultravox currently takes in audio and emits streaming text. As we evolve the model, we'll train it to emit a stream of speech tokens that can then be converted directly into raw audio by an appropriate unit vocoder.
+
+ Our default model is built on top of Llama 3.3 70B. We also have an 8B variant available on Hugging Face.
+
+ Ultravox can be trained against any open-weight model. See below for more details on training.
+
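The projector idea above can be illustrated with a toy sketch: frames from an audio encoder are mapped into an LLM-sized embedding space by a small MLP. All dimensions and the two-layer shape here are illustrative assumptions, not the actual Ultravox architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not the real Ultravox configuration):
# 1280-dim audio-encoder frames projected into a 4096-dim LLM space.
AUDIO_DIM, HIDDEN_DIM, LLM_DIM = 1280, 2048, 4096

W1 = rng.standard_normal((AUDIO_DIM, HIDDEN_DIM)) * 0.02
W2 = rng.standard_normal((HIDDEN_DIM, LLM_DIM)) * 0.02

def project(audio_frames: np.ndarray) -> np.ndarray:
    """Map audio-encoder frames to pseudo token embeddings for the LLM."""
    hidden = np.maximum(audio_frames @ W1, 0.0)  # linear layer + ReLU
    return hidden @ W2                           # into the LLM embedding space

frames = rng.standard_normal((50, AUDIO_DIM))    # stand-in for ~1 s of encoder output
embeddings = project(frames)
print(embeddings.shape)  # (50, 4096)
```

The resulting rows can be spliced into the LLM's input sequence in place of text token embeddings, which is what lets the model consume audio without an ASR step.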
+ ### Demo
+
+ See Ultravox in action on our [demo page](https://demo.ultravox.ai). You can build your own voice-to-voice agents on our Realtime platform at ultravox.ai.
+
+
+ ### Discord
+
+ Join us on our Discord server [here](https://discord.gg/Qw6KHxv8YB).
+
+ ### Jobs
+
+ If you're interested in working on Ultravox full-time, we're hiring! Check out our jobs page [here](https://careers.fixie.ai).
+
+ ### Inference Server
+
+ You can try out Ultravox using your own audio content (as a WAV file) by spinning up an Ultravox instance on our partner, BaseTen: [https://www.baseten.co/library/ultravox/](https://www.baseten.co/library/ultravox/). They offer free credits to get started.
+
+ If you're interested in running Ultravox in a real-time capacity, we offer a set of managed APIs as well. You can learn more about getting access to those [here](https://docs.ultravox.ai).
+
+ ### Model
+
+ You can download the latest weights from the [Ultravox Hugging Face page](https://huggingface.co/fixie-ai/).
+
+ ### Architecture
+
+ [![architecture diagram](https://raw.githubusercontent.com/fixie-ai/ultravox/main/docs/assets/Ultravox%20Model%20Architecture.svg)](https://docs.google.com/presentation/d/1ey81xuuMzrJaBwztb_Rq24Cit37GQokD2aAes_KkGVI/edit)
+
+ # Contributing
+
+ Read on if you're interested in training your own version of Ultravox.
+
+ ## Environment Setup (Mac)
+
+ Install the basic tools:
+
+ - [`Homebrew`](https://brew.sh) is a package manager for macOS that also mostly works for Linux. If you're running Debian or Ubuntu Linux, you can alternatively get by with apt.
+ - [`Just`](https://just.systems/man/en/) simplifies our shell workflows. It frequently functions as our interface to all the other tools.
+
+ ```bash
+ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
+ brew update
+ brew install just
+ ```
+
+ We recommend using pyenv to manage Python versions, since we use Poetry:
+
+ ```bash
+ brew install xz
+ brew install pyenv
+ pyenv init
+ pyenv install 3.11
+ pyenv global 3.11
+
+ # Optional
+ pyenv shell 3.11
+ ```
+
+ > **Note**: Using conda together with Poetry is NOT recommended.
+
+ After creating a virtual environment, install the required packages using `just` and `poetry`:
+
+ ```bash
+ just install
+ ```
+
+ We're using Poetry to manage the Python virtual environment. You can inspect your environment with `poetry env info`.
+
+ ### Mosaic Environment Setup (Fixie Internal)
+
+ If you want to use [Mosaic](https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/getting_started.html) for training, you need to set up a few things to run on the Mosaic Platform.
+
+ 1. Install and log in to the Mosaic CLI:
+
+ ```bash
+ pip install --upgrade mosaicml-cli
+
+ mcli init
+
+ mcli set api-key <new-value>
+ ```
+
+ 2. Set API keys for the tools we use:
+
+ ```bash
+ # Hugging Face token for accessing gated data and models
+ mcli create secret env HF_TOKEN=hf_<your_token>
+ mcli create secret env HF_WRITE_TOKEN=hf_<your_token_with_write_access>
+
+ # WandB token for logging experiments
+ mcli create secret env WANDB_PROJECT=ultravox
+ mcli create secret env WANDB_API_KEY=<your_wandb_key>
+ ```
+
+ ## Training
+
+ Currently, we keep both the LLM and the audio encoder frozen and only train the adapter/projector. Training Ultravox v0.4 took 2-3 hours on 8xH100 GPUs for 14K training steps.
+
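The frozen-backbone setup described above can be sketched in PyTorch. The tiny `nn.Linear` modules below are stand-ins for the real audio encoder, projector, and LLM; only the mechanism (disabling gradients on the backbones and handing just the projector's parameters to the optimizer) reflects the text.

```python
import torch
import torch.nn as nn

# Tiny stand-in modules; the real components are a pretrained audio
# encoder and LLM (both frozen) with a trainable adapter/projector.
audio_encoder = nn.Linear(16, 8)
projector = nn.Linear(8, 8)
llm = nn.Linear(8, 4)

# Freeze the backbones: their weights receive no gradient updates.
for module in (audio_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

# Only the projector's parameters are handed to the optimizer.
trainable = [p for p in projector.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```

This is why adapter-only training is comparatively cheap: the optimizer state and gradients exist only for the small projector, not for the multi-billion-parameter backbones.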
+ ### Use-Cases for Training Ultravox
+
+ Why would you want to (re-)train Ultravox? Here are a few scenarios:
+
+ 1. You want to use a different LLM or audio encoder backbone.
+
+    a. In this case, you need to re-train the adapter. You can use `release_config.yaml`, which contains the config for our latest release, and you should be able to simply change the base LLM or encoder by specifying `--text-model <hf-model-id-for-llm>` and/or `--audio-model <hf-model-id-for-encoder>`.
+
+ 2. You want to improve the knowledge of the model.
+
+    a. We suggest either using RAG on the fly (no training needed) or fine-tuning the LLM backbone instead. Fine-tuning the LLM backbone does not require re-training Ultravox (i.e., the existing adapter will still work).
+
+ 3. You want to use your own audio data, for example to add support for a new language.
+
+    a. First, prepare your dataset: at a bare minimum, the samples should have an `audio` field and a text `continuation` field.
+
+    b. Take a look at [`ds_tool.py`](ultravox/tools/ds_tool/ds_tool.py) and [`continuation.jinja`](ultravox/tools/ds_tool/continuation.jinja), as well as [our variant of Common Voice](https://huggingface.co/datasets/fixie-ai/common_voice_17_0/viewer/fr), which was created using `ds_tool` to add the `continuation` field.
+
+    c. Add your dataset to the dataset mix in `release_config.yaml` and train.
+
+ There's no one-size-fits-all. If you need help, you can find us on our Discord server [here](https://discord.gg/Qw6KHxv8YB).
+
+
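For scenario 3a above, a minimal sample with the two required fields might be assembled like this. The `make_sample` helper and the synthetic sine wave are hypothetical; real data would come from your audio corpus, with the `continuation` text produced by `ds_tool`.

```python
import math

def make_sample(waveform, sampling_rate, continuation):
    """Bare-minimum training sample: an `audio` field plus a text `continuation`."""
    return {
        "audio": {"array": waveform, "sampling_rate": sampling_rate},
        "continuation": continuation,
    }

# A synthetic 0.5 s, 16 kHz sine wave standing in for real speech audio.
sr = 16000
wave = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 2)]

sample = make_sample(wave, sr, "and that is how the story ends.")
print(sorted(sample.keys()))  # ['audio', 'continuation']
```

A dataset of such samples can then be referenced from the dataset mix in `release_config.yaml`.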
+ ### How to Train
+
+ We do most of our training on the [MosaicML platform](https://docs.mosaicml.com), and therefore most of our tooling and docs are Mosaic-related. However, you can do the same training on your own GPU without much difficulty. Here we assume you have the environment set up (run `just install`). You can also take a look at [`setup.sh`](setup.sh).
+
+ To kick off a training run:
+
+ ```bash
+ poetry run python -m ultravox.training.train --config_path ultravox/training/configs/release_config.yaml
+ ```
+
+ For DDP training, make sure to use `torchrun`. We also recommend prefetching weights in advance:
+
+ ```bash
+ TRAIN_ARGS="--config_path ultravox/training/configs/release_config.yaml"
+ poetry run python -m ultravox.training.helpers.prefetch_weights $TRAIN_ARGS
+ poetry run torchrun --nproc_per_node=8 -m ultravox.training.train $TRAIN_ARGS
+ ```
+
+ For a debug run, you can use smaller models, datasets, or batch sizes. Here's a config that uses TinyLlama as the LLM backbone:
+
+ ```bash
+ poetry run python -m ultravox.training.train --config_path ultravox/training/configs/asr_tinyllama_100s.yaml --batch_size 1 --report_logs_to tensorboard
+ ```
+
+ We use [SimpleParsing](https://github.com/lebrice/simpleparsing/) for configs. Configs are composable (i.e., you can specify zero or many configs), and `meta_config.yaml` is always used as the default.
+ See [`config_base.py`](ultravox/training/config_base.py) to find the parameters you can modify, such as `--text-model`, `--device`, and `--exp-name`.
+
+
+ ### MosaicML Training (Fixie Internal)
+
+ Before running any training jobs, set up [SSH authentication with MosaicML](https://docs.mosaicml.com/projects/mcli/en/latest/resources/secrets/ssh.html#page-secrets-ssh):
+
+ 1. Generate an SSH key:
+
+ ```bash
+ ssh-keygen -f ~/.ssh/mclid_id_rsa
+ ```
+
+ 2. Add the public key to your GitHub account.
+
+ 3. Upload the private key to MosaicML (this allows MosaicML to clone the repository and run jobs):
+
+ ```bash
+ mcli create secret git-ssh ~/.ssh/mclid_id_rsa
+ ```
+
+ Then you can run the following command to kick off a training job:
+
+ ```bash
+ mcli run -f mcloud_train.yaml --follow
+ ```
+
+ Other useful commands:
+
+ ```bash
+ mcli get clusters
+
+ mcli util r7z2
+ mcli get runs
+ mcli get runs --cluster r7z2
+
+ mcli run -f mcloud_eval.yaml --follow
+ ```
+
+ For interactive runs, you can use:
+
+ ```bash
+ just mcloud --image mosaicml/composer:latest --max-duration 1
+ ```
+
+ IMPORTANT: Make sure to monitor your jobs and stop the machine when you're done with any job, especially interactive ones!
+
+ ### Running Evaluations
+
+ For inference or evaluations, you can use:
+
+ ```bash
+ just eval --config_path ultravox/evaluation/configs/eval_config.yaml
+ ```
+
+ where `eval_config.yaml` is a config file that specifies the model, datasets, and configurations to use for inference or evaluation. If your dataset is not already defined in ultravox, you need to create a config file for it in `ultravox/data/configs/` (with an appropriate `eval_config` field specifying evaluation metrics and arguments) and register it in `ultravox/data/registry.py`. Please refer to the examples in `ultravox/data/configs/`.
+
+ ## Misc
+
+ The [Justfile](Justfile) is a good resource for finding popular commands. Here are a few:
+
+ ```bash
+ just update   # update dependencies
+ just format   # run formatting (black, isort, autoflake)
+ just test     # run tests
+ just python   # activate venv and run python
+ ```