Noblhyon committed on
Commit 7b0b7a4 · verified · 1 Parent(s): a224795

Add comprehensive model card and documentation

Files changed (1): README.md (+216, −0)
---
license: mit
pipeline_tag: any-to-any
library_name: mini-omni2
tags:
- multimodal
- speech-to-speech
- vision-language
- audio-processing
- real-time
- conversational-ai
- qwen2
- whisper
- clip
base_model: Qwen/Qwen2-0.5B
datasets:
- Open-Orca/OpenOrca
language:
- en
---

# Mini-Omni2 by Noblhyon

<!-- <p align="center">
<img src="./data/figures/title.png" width="100%"/>
</p> -->


<p align="center">
🤗 <a href="https://huggingface.co/gpt-omni/mini-omni2">Hugging Face</a> | 📖 <a href="https://github.com/gpt-omni/mini-omni2">GitHub</a>
| 📑 <a href="https://arxiv.org/abs/2410.11190">Technical report</a>
</p>

Mini-Omni2 is an **omni-interactive** model: it can **understand image, audio and text inputs and hold end-to-end voice conversations with users**. It features **real-time voice output**, **omni-capable multimodal understanding**, and a flexible **interruption mechanism while speaking**.

<p align="center">
<img src="./data/figures/framework.jpeg" width="100%"/>
</p>


## Updates

- **2024.10:** Released the model, technical report, and inference and chat demo code.

## Features
✅ **Multimodal interaction**: understands images, speech and text, just like GPT-4o.

✅ **Real-time speech-to-speech** conversational capabilities. No extra ASR or TTS models required, just like [Mini-Omni](https://github.com/gpt-omni/mini-omni).

<!-- ✅ **Streaming audio output**: with first-chunk latency of audio stream less than 0.3s. -->

<!-- ✅ **Duplex interaction**: hearing while speaking, it can be interrupted by key words like "stop omni". -->


## Demo

NOTE: you need to unmute the video first.

https://github.com/user-attachments/assets/ad97ca7f-f8b4-40c3-a7e8-fa54b4edf155


## ToDo
- [ ] update interruption mechanism


## Install

Create a new conda environment and install the required packages:

```sh
conda create -n omni python=3.10
conda activate omni

git clone https://github.com/gpt-omni/mini-omni2.git
cd mini-omni2
pip install -r requirements.txt
```
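
After installing, a quick sanity check (assuming PyTorch is pulled in by `requirements.txt`) confirms the environment is working and whether a GPU is visible:

```python
# Optional sanity check; assumes PyTorch was installed via requirements.txt.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```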

## Quick start

**Interactive demo**

- start server

NOTE: you need to start the server before running the Streamlit or Gradio demo, with `API_URL` set to the server address.

```sh
sudo apt-get install ffmpeg
conda activate omni
cd mini-omni2
python3 server.py --ip '0.0.0.0' --port 60808
```
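
Before pointing the demo UI at the server, you can check that the server process is actually listening on the chosen port. This is a minimal, protocol-agnostic sketch (the port mirrors the command above; adjust the host if the server runs on another machine) and does not exercise the `/chat` API itself:

```python
# Protocol-agnostic check that the server process is listening before you
# launch the demo UI; it does not exercise the /chat API itself.
import socket

HOST, PORT = "127.0.0.1", 60808  # adjust if the server runs on another machine

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(3)
    try:
        s.connect((HOST, PORT))
        print(f"Something is listening on {HOST}:{PORT}")
    except OSError as exc:
        print(f"Could not reach {HOST}:{PORT}: {exc}")
```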


- run streamlit demo

NOTE: you need to run Streamlit **locally** with PyAudio installed.

```sh
pip install PyAudio==0.2.14
API_URL=http://0.0.0.0:60808/chat streamlit run webui/omni_streamlit.py
```
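
PyAudio is needed because the demo captures microphone audio on the local machine. If audio capture misbehaves, a small check like the following (the sample rate and buffer size are example values, not the demo's settings) verifies that PyAudio can open the default input device:

```python
# Minimal microphone check for the local machine running Streamlit.
# Sample rate and buffer size are examples; adjust for your input device.
import pyaudio

pa = pyaudio.PyAudio()
try:
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=1024)
    data = stream.read(1024)
    print(f"Captured {len(data)} bytes from the default input device")
    stream.stop_stream()
    stream.close()
finally:
    pa.terminate()
```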


**Local test**

```sh
conda activate omni
cd mini-omni2
# test run the preset audio samples and questions
python inference_vision.py
```

## Mini-Omni2 Overview

**1. Multimodal Modeling**:
We use multiple sequences as the input and output of the model. On the input side, we concatenate image, audio and text features to perform a series of comprehensive tasks, as shown in the figure below. On the output side, we use text-guided delayed parallel output to generate real-time speech responses.
<p align="center">
<img src="./data/figures/inputids.png" width="100%"/>
</p>
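
To make "text-guided delayed parallel output" concrete, here is a toy sketch of such a decoding layout: the text stream leads, and each audio-codebook stream is shifted right by an increasing delay, so that a single autoregressive step emits one token for every stream in parallel. The number of streams, the delays and the pad token below are illustrative only, not the model's actual configuration:

```python
# Toy illustration of a delayed parallel decoding layout (not the actual
# Mini-Omni2 configuration): one text stream plus several audio-codebook
# streams, each audio stream shifted right by an increasing delay so that
# one autoregressive step emits a token per stream.
PAD = "<pad>"

def delayed_parallel_layout(text_tokens, audio_codebooks, delay_per_stream=1):
    streams = [list(text_tokens)]
    for i, codebook in enumerate(audio_codebooks, start=1):
        delay = i * delay_per_stream
        streams.append([PAD] * delay + list(codebook))
    # Right-pad every stream to the same length so columns align per step.
    length = max(len(s) for s in streams)
    return [s + [PAD] * (length - len(s)) for s in streams]

text = ["Hel", "lo", "!", "<eot>"]
audio = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]  # two toy codebook streams
for row in delayed_parallel_layout(text, audio):
    print(row)
```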

**2. Multi-stage Training**:
We propose an efficient alignment training method: the three training stages carry out encoder adaptation, modal alignment, and multimodal fine-tuning, respectively.
<p align="center">
<img src="./data/figures/training.jpeg" width="100%"/>
</p>

<!-- **3. Cases**:
Here are more cases of Mini-Omni2:
<p align="center">
<img src="./data/figures/samples.png" width="100%"/>
</p> -->

## FAQ

**1. Does the model support other languages?**

No, the model is trained only on English. However, since we use Whisper as the audio encoder, the model can understand other languages supported by Whisper (such as Chinese), but the output is English only.

**2. Error: cannot run Streamlit in the local browser with a remote Streamlit server**

You need to start Streamlit **locally** with PyAudio installed.

## Acknowledgements

- [Qwen2](https://github.com/QwenLM/Qwen2/) as the LLM backbone.
- [litGPT](https://github.com/Lightning-AI/litgpt/) for training and inference.
- [whisper](https://github.com/openai/whisper/) for audio encoding.
- [clip](https://github.com/openai/CLIP) for image encoding.
- [snac](https://github.com/hubertsiuzdak/snac/) for audio decoding.
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice) for generating synthetic speech.
- [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [MOSS](https://github.com/OpenMOSS/MOSS/tree/main) for alignment.

<!-- ## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=gpt-omni/mini-omni2&type=Date)](https://star-history.com/#gpt-omni/mini-omni2&Date) -->

## Model Files Description

- **lit_model.pth**: Main LitGPT model weights
- **small.pt**: Whisper `small` audio encoder checkpoint
- **ViT-B-32.pt**: CLIP ViT-B/32 vision encoder weights for image encoding
- **tokenizer.json**: Tokenizer vocabulary and configuration
- **tokenizer_config.json**: Tokenizer configuration parameters
- **model_config.yaml**: Model architecture and training configuration

## Usage

```python
from huggingface_hub import hf_hub_download

# Download model files
model_path = hf_hub_download(repo_id="Noblhyon/mini-omni2", filename="lit_model.pth")
config_path = hf_hub_download(repo_id="Noblhyon/mini-omni2", filename="model_config.yaml")
tokenizer_path = hf_hub_download(repo_id="Noblhyon/mini-omni2", filename="tokenizer.json")

# Load your model here
# (Add your specific loading code)
```
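
The inference code expects the weights, encoders, tokenizer and config to sit side by side, so it can be easier to pull the whole repository at once. The sketch below is an illustration, not the project's official loading path: it downloads a snapshot and checks that the main checkpoint is readable, leaving actual model construction to the mini-omni2 code.

```python
# Download the full repository snapshot and inspect the main checkpoint.
# Actual model construction is handled by the mini-omni2 inference code;
# this only verifies that the files are present and readable.
import torch
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="Noblhyon/mini-omni2")
print("Snapshot downloaded to:", ckpt_dir)

state_dict = torch.load(f"{ckpt_dir}/lit_model.pth", map_location="cpu")
print("Tensors in lit_model.pth:", len(state_dict))
name, value = next(iter(state_dict.items()))
print("Example entry:", name, getattr(value, "shape", type(value)))
```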

## Repository Structure

```
mini-omni2/
├── lit_model.pth           # Main model weights (LitGPT format)
├── small.pt                # Whisper audio encoder
├── ViT-B-32.pt             # CLIP vision encoder
├── tokenizer.json          # Tokenizer vocabulary
├── tokenizer_config.json   # Tokenizer config
├── model_config.yaml       # Model configuration
└── data/                   # Figures and demo files
    ├── figures/
    └── omni2-demo.mp4
```
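
To cross-check this layout against what is actually hosted, you can list the repository contents directly from the Hub:

```python
# List every file in the Hub repository to cross-check the tree above.
from huggingface_hub import list_repo_files

for name in sorted(list_repo_files("Noblhyon/mini-omni2")):
    print(name)
```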

## Citation

If you use this model, please cite the Mini-Omni2 technical report:

```bibtex
@article{xie2024miniomni2,
  title={Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities},
  author={Xie, Zhifei and Wu, Changqiao},
  journal={arXiv preprint arXiv:2410.11190},
  year={2024},
  url={https://arxiv.org/abs/2410.11190}
}
```