zhifeixie committed on
Commit
e0c9e81
·
verified ·
1 Parent(s): c0caf87

Update README.md

Files changed (1)
  1. README.md +15 -15
README.md CHANGED
@@ -6,7 +6,7 @@ license: apache-2.0
6
  library_name: transformers
7
  pipeline_tag: audio-text-to-text
8
  datasets:
9
- - zhifeixie/Mini-Omni3-Data # TODO: replace with the actual dataset repo id
10
  tags:
11
  - speech-language-model
12
  - streaming
@@ -14,17 +14,17 @@ tags:
14
  - multimodal
15
  - qwen2.5-omni
16
  ---
17
- # Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model
18
 
19
- [**Code**](https://github.com/xzf-thu/Mini-Omni3) | [**Model**](https://huggingface.co/zhifeixie/Mini-Omni3) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Mini-Omni3-Data) <!-- TODO: confirm code repo URL and dataset repo id -->
20
 
21
- Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
22
 
23
  This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
24
 
25
  ## Model Details
26
 
27
- - **Model name:** Mini-Omni3
28
  - **Task:** Streaming audio-conditioned text generation (audio in, text out)
29
  - **Audio encoder:** Qwen2.5-Omni audio tower (chunk-wise)
30
  - **Audio framing:** 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
@@ -36,7 +36,7 @@ This design lets the model handle both spoken questions ("answer it") and ambien
36
  ## Repository Contents
37
 
38
  ```text
39
- Mini-Omni3/
40
  ├── model-00001-of-00004.safetensors # LM weights, sharded (≈4 GB each)
41
  ├── model-00002-of-00004.safetensors
42
  ├── model-00003-of-00004.safetensors
@@ -54,28 +54,28 @@ Mini-Omni3/
54
 
55
  ## Intended Use
56
 
57
- Mini-Omni3 is intended for streaming conversational agents that need to react to audio as it arrives, for example voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript.
58
 
59
  ## Quick Start
60
 
61
  ### Installation
62
 
63
  ```bash
64
- git clone https://github.com/xzf-thu/Mini-Omni3.git # TODO: confirm repo URL
65
- cd Mini-Omni3
66
- conda create -n mini-omni3 python=3.10 -y
67
- conda activate mini-omni3
68
  pip install -r requirements.txt
69
  ```
70
 
71
  ### Download the checkpoint
72
 
73
- From the `Mini-Omni3` project root, pull the weights into `checkpoints/`:
74
 
75
  ```python
76
  from huggingface_hub import snapshot_download
77
 
78
- snapshot_download(repo_id="zhifeixie/Mini-Omni3", local_dir="checkpoints")
79
  ```
80
 
81
  `snapshot_download` is the recommended path: it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
@@ -150,7 +150,7 @@ Candidate metrics:
150
  <!-- TODO: replace with the real arxiv id and year once published. -->
151
  ```bibtex
152
  @misc{xie_miniomni3,
153
- title = {Mini-Omni3: Streaming Audio-In, Text-Out Conversational Modeling},
154
  author = {Zhifei Xie and collaborators},
155
  year = {2026},
156
  note = {Preprint in preparation}
@@ -159,4 +159,4 @@ Candidate metrics:
159
 
160
  ## Acknowledgements
161
 
162
- Mini-Omni3 builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.
 
6
  library_name: transformers
7
  pipeline_tag: audio-text-to-text
8
  datasets:
9
+ - zhifeixie/StreamAudio-2M
10
  tags:
11
  - speech-language-model
12
  - streaming
 
14
  - multimodal
15
  - qwen2.5-omni
16
  ---
17
+ # Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model
18
 
19
+ [**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Audio-Interaction-Data) <!-- TODO: confirm code repo URL and dataset repo id -->
20
 
21
+ Audio-Interaction is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
22
 
23
  This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
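The LISTENING/SPEAKING alternation described above can be sketched as a small loop. This is only an illustration of the control flow, not the project's inference code: `stream_session`, `decide`, `next_token`, and the toy callables are hypothetical names, and the control tokens are modeled as plain strings.

```python
from typing import Callable, Iterable, Iterator, List

# Control tokens named in the README, modeled here as plain strings (assumption).
KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


def stream_session(
    chunks: Iterable[object],
    decide: Callable[[object], str],
    next_token: Callable[[List[str]], str],
    max_tokens: int = 64,
) -> Iterator[List[str]]:
    """Yield one text turn each time the (hypothetical) model chooses to speak."""
    for chunk in chunks:
        # LISTENING: consume one encoder-output chunk, emit one control token.
        if decide(chunk) != TEXT_BEGIN:
            continue  # KEEP_SILENCE: stay in LISTENING for the next chunk.

        # SPEAKING: generate text autoregressively until TEXT_END.
        turn: List[str] = []
        while len(turn) < max_tokens:
            token = next_token(turn)
            if token == TEXT_END:
                break
            turn.append(token)
        yield turn
        # The loop then returns to LISTENING with the next chunk.


# Toy stand-ins for the model, used only to exercise the loop.
def toy_decide(chunk: object) -> str:
    return TEXT_BEGIN if chunk == "question" else KEEP_SILENCE


def toy_next_token(prefix: List[str]) -> str:
    reply = ["hello", "there", TEXT_END]
    return reply[len(prefix)]


turns = list(stream_session(["noise", "question", "noise"], toy_decide, toy_next_token))
print(turns)  # [['hello', 'there']]
```

The `max_tokens` cap is an added safeguard for the sketch so that a turn always terminates; the README does not state whether the real model uses one.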
24
 
25
  ## Model Details
26
 
27
+ - **Model name:** Audio-Interaction
28
  - **Task:** Streaming audio-conditioned text generation (audio in, text out)
29
  - **Audio encoder:** Qwen2.5-Omni audio tower (chunk-wise)
30
  - **Audio framing:** 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
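The framing arithmetic in the last bullet can be checked with a short sketch. It assumes right-side zero padding and a plain Python list of samples; the function names are hypothetical and the project's own preprocessing may differ.

```python
SAMPLE_RATE = 16_000      # Hz, from the model card
CHUNK_SAMPLES = 6_400     # 0.4 s * 16 kHz
FRAMES_PER_CHUNK = 10     # encoder-output frames per chunk, from the model card


def pad_to_chunk_boundary(samples: list[float]) -> list[float]:
    """Zero-pad on the right to a multiple of CHUNK_SAMPLES (padding side is an assumption)."""
    remainder = len(samples) % CHUNK_SAMPLES
    if remainder == 0:
        return list(samples)
    return list(samples) + [0.0] * (CHUNK_SAMPLES - remainder)


def split_chunks(samples: list[float]) -> list[list[float]]:
    """Split padded audio into consecutive 0.4-second chunks."""
    padded = pad_to_chunk_boundary(samples)
    return [padded[i:i + CHUNK_SAMPLES] for i in range(0, len(padded), CHUNK_SAMPLES)]


one_second = [0.0] * SAMPLE_RATE           # 16,000 samples
chunks = split_chunks(one_second)          # padded to 19,200 samples
print(len(chunks))                         # 3
print(len(chunks) * FRAMES_PER_CHUNK)      # 30
```

Ten frames per 0.4-second chunk works out to one encoder-output frame per 40 ms of audio, or 25 frames per second.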
 
36
  ## Repository Contents
37
 
38
  ```text
39
+ Audio-Interaction/
40
  ├── model-00001-of-00004.safetensors # LM weights, sharded (≈4 GB each)
41
  ├── model-00002-of-00004.safetensors
42
  ├── model-00003-of-00004.safetensors
 
54
 
55
  ## Intended Use
56
 
57
+ Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives, for example voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript.
58
 
59
  ## Quick Start
60
 
61
  ### Installation
62
 
63
  ```bash
64
+ git clone https://github.com/xzf-thu/Audio-Interaction.git # TODO: confirm repo URL
65
+ cd Audio-Interaction
66
+ conda create -n Audio-Interaction python=3.10 -y
67
+ conda activate Audio-Interaction
68
  pip install -r requirements.txt
69
  ```
70
 
71
  ### Download the checkpoint
72
 
73
+ From the `Audio-Interaction` project root, pull the weights into `checkpoints/`:
74
 
75
  ```python
76
  from huggingface_hub import snapshot_download
77
 
78
+ snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")
79
  ```
80
 
81
  `snapshot_download` is the recommended path: it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
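After an interrupted or partial download, it can help to confirm that every weight shard is present before loading. The following is a minimal sketch, assuming the four-shard naming pattern shown under Repository Contents and the `checkpoints` directory used above; `missing_shards` is a hypothetical helper, not part of the project.

```python
from pathlib import Path


def missing_shards(checkpoint_dir: str, total: int = 4) -> list[str]:
    """Return expected shard filenames that are absent from checkpoint_dir.

    The count of four and the filename pattern are inferred from the
    repository listing (model-00001-of-00004.safetensors, ...).
    """
    root = Path(checkpoint_dir)
    expected = [
        f"model-{index:05d}-of-{total:05d}.safetensors"
        for index in range(1, total + 1)
    ]
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    absent = missing_shards("checkpoints")
    if absent:
        print("Missing shards:", ", ".join(absent))
    else:
        print("All shards present.")
```

If any shards are reported missing, rerunning `snapshot_download` with the same `local_dir` should fetch the remaining files.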
 
150
  <!-- TODO: replace with the real arxiv id and year once published. -->
151
  ```bibtex
152
  @misc{xie_miniomni3,
153
+ title = {Audio-Interaction: Streaming Audio-In, Text-Out Conversational Modeling},
154
  author = {Zhifei Xie and collaborators},
155
  year = {2026},
156
  note = {Preprint in preparation}
 
159
 
160
  ## Acknowledgements
161
 
162
+ Audio-Interaction builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.