Audio-Text-to-Text
Transformers
Safetensors
English
Chinese
qwen2
text-generation
speech-language-model
streaming
audio
multimodal
qwen2.5-omni
text-generation-inference
Instructions to use zhifeixie/AudioInteraction with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use zhifeixie/AudioInteraction with Transformers:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("zhifeixie/AudioInteraction")
model = AutoModelForCausalLM.from_pretrained("zhifeixie/AudioInteraction")
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
@@ -3,28 +3,25 @@ language:
 - en
 - zh
 license: apache-2.0
 pipeline_tag: audio-text-to-text
 tags:
 - speech-language-model
 - streaming
 - audio
 - multimodal
 - qwen2.5-omni
-datasets:
-- zhifeixie/StreamAudio-2M
-base_model:
-- Qwen/Qwen2.5-Omni-3B
 ---
 # Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model

-[**Code**](https://github.com/xzf-thu/Mini-Omni3) <!-- TODO: confirm repo URL -->

 Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.

 This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.

-The release contains the Mini-Omni3 language-model weights (sharded safetensors), a chunk-wise audio encoder adapted from Qwen2.5-Omni, and the matching tokenizer and model config.
-
 ## Model Details

 - **Model name:** Mini-Omni3
@@ -40,14 +37,19 @@ The release contains the Mini-Omni3 language-model weights (sharded safetensors)

 ```text
 Mini-Omni3/
-├──
-
-
-├──
-├──
-├──
-├──
-
 ```

 ## Intended Use
@@ -68,23 +70,23 @@ pip install -r requirements.txt

 ### Download the checkpoint

 ```python
 from huggingface_hub import snapshot_download

-
-    repo_id="zhifeixie/Mini-Omni3",
-    repo_type="model",
-)
-print(local_dir)
 ```

 ### Python Usage

 ```python
 from src.miniomni3.generate.run import run_inference

 run_inference(
-    checkpoint_dir=
     audio_paths=["/path/to/audio.wav"],  # offline mode: one round per path
     device="cuda:0",  # or "mps" / "cpu"
 )
@@ -93,7 +95,7 @@ run_inference(
 For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:

 ```python
-run_inference(checkpoint_dir=
 ```

 ## Streaming Protocol

@@ -3,28 +3,25 @@ language:
 - en
 - zh
 license: apache-2.0
+library_name: transformers
 pipeline_tag: audio-text-to-text
+datasets:
+- zhifeixie/Mini-Omni3-Data # TODO: replace with the actual dataset repo id
 tags:
 - speech-language-model
 - streaming
 - audio
 - multimodal
 - qwen2.5-omni
 ---
 # Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model

+[**Code**](https://github.com/xzf-thu/Mini-Omni3) | [**Model**](https://huggingface.co/zhifeixie/Mini-Omni3) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Mini-Omni3-Data) <!-- TODO: confirm code repo URL and dataset repo id -->

 Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.

 This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.

 ## Model Details

 - **Model name:** Mini-Omni3
@@ -40,14 +37,19 @@ The release contains the Mini-Omni3 language-model weights (sharded safetensors)

 ```text
 Mini-Omni3/
+├── model-00001-of-00004.safetensors   # LM weights, sharded (≈4 GB each)
+├── model-00002-of-00004.safetensors
+├── model-00003-of-00004.safetensors
+├── model-00004-of-00004.safetensors
+├── model.safetensors.index.json       # Shard index consumed by safetensors loader
+├── config.json                        # Top-level model config
+├── generation_config.json             # Generation defaults
+├── model_config.yaml                  # GPT config consumed by Config.from_file
+├── hyperparameters.yaml               # Training-time hyperparameters (reference)
+├── tokenizer.json                     # Tokenizer
+├── tokenizer_config.json
+├── MiniOmni3_ChunkwisedEncoder.pth    # Audio encoder weights (Qwen2.5-Omni audio tower)
+└── qwen25OmniConfig/                  # Audio-encoder config (nested: thinker_config.audio_config)
 ```

 ## Intended Use
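
The file tree added in this hunk lists a `model.safetensors.index.json` shard index next to four weight shards. As a rough sketch of what that index is for, the snippet below groups tensor names by the shard that stores them. It assumes the index follows the usual Hugging Face sharded-checkpoint layout with a top-level `weight_map` key; the helper names `group_by_shard` and `load_index` and the tensor names in the example are made up for illustration and are not part of the Mini-Omni3 code.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_by_shard(index: dict) -> Dict[str, List[str]]:
    """Map each shard file name to the tensor names it stores."""
    shards = defaultdict(list)
    # Assumption: "weight_map" maps tensor name -> shard file name.
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


def load_index(checkpoint_dir: str) -> dict:
    """Read the shard index from a downloaded checkpoint directory."""
    path = Path(checkpoint_dir) / "model.safetensors.index.json"
    return json.loads(path.read_text())


# Tiny hand-written index; the tensor names are hypothetical.
example_index = {
    "weight_map": {
        "layer0.weight": "model-00001-of-00004.safetensors",
        "layer0.bias": "model-00001-of-00004.safetensors",
        "layer1.weight": "model-00002-of-00004.safetensors",
    }
}
print(group_by_shard(example_index))
```

With a real download, `group_by_shard(load_index("checkpoints"))` would give the same kind of mapping for the actual shards, provided the index has the assumed layout.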
@@ -68,23 +70,23 @@ pip install -r requirements.txt

 ### Download the checkpoint

+From the `Mini-Omni3` project root, pull the weights into `checkpoints/`:
+
 ```python
 from huggingface_hub import snapshot_download

+snapshot_download(repo_id="zhifeixie/Mini-Omni3", local_dir="checkpoints")
 ```

+`snapshot_download` is the recommended path: it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
+
 ### Python Usage

 ```python
 from src.miniomni3.generate.run import run_inference

 run_inference(
+    checkpoint_dir="checkpoints",
     audio_paths=["/path/to/audio.wav"],  # offline mode: one round per path
     device="cuda:0",  # or "mps" / "cpu"
 )
@@ -93,7 +95,7 @@ run_inference(
 For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:

 ```python
+run_inference(checkpoint_dir="checkpoints", rounds=5, device="cuda:0")
 ```

 ## Streaming Protocol
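
The README text in this diff describes the model alternating between a LISTENING state (one chunk in, `KEEP_SILENCE` or `TEXT_BEGIN` out) and a SPEAKING state (generate until `TEXT_END`, then return to listening). A minimal sketch of that control loop follows. It is not the repository's implementation: `stream_session`, `listen_step`, `speak_step`, and `max_tokens` are hypothetical names, and representing the control tokens as plain strings is an assumption made only for illustration.

```python
from typing import Callable, Iterable, Iterator, List

# Control tokens named in the README; plain strings are an assumption.
KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


def stream_session(
    chunks: Iterable[object],
    listen_step: Callable[[object], str],
    speak_step: Callable[[], str],
    max_tokens: int = 256,
) -> Iterator[List[str]]:
    """Yield one text turn each time the model decides to reply."""
    for chunk in chunks:
        # LISTENING: consume exactly one chunk, get a control decision.
        decision = listen_step(chunk)
        if decision == KEEP_SILENCE:
            continue
        if decision != TEXT_BEGIN:
            raise ValueError(f"unexpected control token: {decision!r}")
        # SPEAKING: generate until TEXT_END (or the hypothetical token cap),
        # then fall through to the next chunk, i.e. back to LISTENING.
        turn: List[str] = []
        for _ in range(max_tokens):
            token = speak_step()
            if token == TEXT_END:
                break
            turn.append(token)
        yield turn


# Toy stand-ins for the model: reply only when the chunk is "speech".
replies = iter(["Hello", "there", TEXT_END])
turns = list(
    stream_session(
        chunks=["noise", "noise", "speech"],
        listen_step=lambda c: TEXT_BEGIN if c == "speech" else KEEP_SILENCE,
        speak_step=lambda: next(replies),
    )
)
print(turns)  # [['Hello', 'there']]
```

Keeping the decision inside the per-chunk loop mirrors the README's point that no external VAD or turn-taking heuristic is involved: the only thing that ends the silence is the model's own `TEXT_BEGIN`.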