zhifeixie/AudioInteraction

@@ -6,7 +6,7 @@ license: apache-2.0
 library_name: transformers
 pipeline_tag: audio-text-to-text
 datasets:
-- zhifeixie/Mini-Omni3-Data  # TODO: replace with the actual dataset repo id
 tags:
 - speech-language-model
 - streaming
@@ -14,17 +14,17 @@ tags:
 - multimodal
 - qwen2.5-omni
 ---
-# Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model
-[**Code**](https://github.com/xzf-thu/Mini-Omni3) | [**Model**](https://huggingface.co/zhifeixie/Mini-Omni3) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Mini-Omni3-Data) <!-- TODO: confirm code repo URL and dataset repo id -->
-Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
 This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
 ## Model Details
-- **Model name:** Mini-Omni3
 - **Task:** Streaming audio-conditioned text generation (audio in, text out)
 - **Audio encoder:** Qwen2.5-Omni audio tower (chunk-wise)
 - **Audio framing:** 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
@@ -36,7 +36,7 @@ This design lets the model handle both spoken questions ("answer it") and ambien
 ## Repository Contents
 ```text
-Mini-Omni3/
 ├── model-00001-of-00004.safetensors      # LM weights, sharded (≈4 GB each)
 ├── model-00002-of-00004.safetensors
 ├── model-00003-of-00004.safetensors
@@ -54,28 +54,28 @@ Mini-Omni3/
 ## Intended Use
-Mini-Omni3 is intended for streaming conversational agents that need to react to audio as it arrives — for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript.
 ## Quick Start
 ### Installation
 ```bash
-git clone https://github.com/xzf-thu/Mini-Omni3.git  # TODO: confirm repo URL
-cd Mini-Omni3
-conda create -n mini-omni3 python=3.10 -y
-conda activate mini-omni3
 pip install -r requirements.txt
 ```
 ### Download the checkpoint
-From the `Mini-Omni3` project root, pull the weights into `checkpoints/`:
 ```python
 from huggingface_hub import snapshot_download
-snapshot_download(repo_id="zhifeixie/Mini-Omni3", local_dir="checkpoints")
 ```
 `snapshot_download` is the recommended path — it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
@@ -150,7 +150,7 @@ Candidate metrics:
 <!-- TODO: replace with the real arxiv id and year once published. -->
 ```bibtex
 @misc{xie_miniomni3,
-  title  = {Mini-Omni3: Streaming Audio-In, Text-Out Conversational Modeling},
   author = {Zhifei Xie and collaborators},
   year   = {2026},
   note   = {Preprint in preparation}
@@ -159,4 +159,4 @@ Candidate metrics:
 ## Acknowledgements
-Mini-Omni3 builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.

 library_name: transformers
 pipeline_tag: audio-text-to-text
 datasets:
+- zhifeixie/StreamAudio-2M
 tags:
 - speech-language-model
 - streaming
 - multimodal
 - qwen2.5-omni
 ---
+# Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model
+[**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Audio-Interaction-Data) <!-- TODO: confirm code repo URL and dataset repo id -->
+Audio-Interaction is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
 This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
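The LISTENING/SPEAKING alternation described above can be sketched as a small control loop. This is a minimal illustration, not the repository's inference code: `run_session`, `decide`, `next_token`, and `max_tokens` are hypothetical names, and the two callbacks stand in for the model's forward passes. Only the control tokens `KEEP_SILENCE`, `TEXT_BEGIN`, and `TEXT_END` come from the description above.

```python
from typing import Callable, Iterable, List

# Control tokens named in the model description.
KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


def run_session(
    chunks: Iterable[object],
    decide: Callable[[object], str],          # hypothetical: LISTENING-step model call
    next_token: Callable[[List[str]], str],   # hypothetical: SPEAKING-step model call
    max_tokens: int = 64,                     # assumed safety cap, not from the source
) -> List[List[str]]:
    """Return one list of text tokens per turn the model chooses to take."""
    replies: List[List[str]] = []
    for chunk in chunks:
        # LISTENING: consume one encoder-output chunk, emit one control token.
        if decide(chunk) == KEEP_SILENCE:
            continue
        # TEXT_BEGIN: enter SPEAKING and generate until TEXT_END (or the cap).
        reply: List[str] = []
        while len(reply) < max_tokens:
            token = next_token(reply)
            if token == TEXT_END:
                break
            reply.append(token)
        replies.append(reply)
        # Falling through to the next loop iteration returns to LISTENING.
    return replies


# Stub callbacks so the loop can be run without a model.
def demo_decide(chunk: object) -> str:
    return TEXT_BEGIN if chunk == "question" else KEEP_SILENCE


def demo_next_token(reply_so_far: List[str]) -> str:
    script = ["Sure", "thing", TEXT_END]
    return script[len(reply_so_far)]


replies = run_session(["noise", "question", "noise"], demo_decide, demo_next_token)
print(replies)
```

In this sketch the decision to speak is made once per chunk, so no separate VAD or turn-taking component appears in the loop; how the real model batches or caches these steps is not specified here.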
 ## Model Details
+- **Model name:** Audio-Interaction
 - **Task:** Streaming audio-conditioned text generation (audio in, text out)
 - **Audio encoder:** Qwen2.5-Omni audio tower (chunk-wise)
 - **Audio framing:** 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
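The framing arithmetic above can be checked with a short sketch. It assumes right-side zero padding and a plain list of samples; neither the padding side nor the padding value is stated in this card, and `pad_to_chunk_boundary` is a hypothetical helper, not a function from the repository.

```python
from typing import List

SAMPLE_RATE = 16_000      # Hz, from the model card
CHUNK_SAMPLES = 6_400     # 0.4 s * 16 kHz, from the model card
FRAMES_PER_CHUNK = 10     # encoder-output frames per chunk, from the model card


def pad_to_chunk_boundary(waveform: List[float]) -> List[float]:
    """Zero-pad on the right (an assumption) to a multiple of CHUNK_SAMPLES."""
    remainder = len(waveform) % CHUNK_SAMPLES
    if remainder == 0:
        return list(waveform)
    return list(waveform) + [0.0] * (CHUNK_SAMPLES - remainder)


one_second = [0.0] * SAMPLE_RATE
padded = pad_to_chunk_boundary(one_second)
num_chunks = len(padded) // CHUNK_SAMPLES
num_frames = num_chunks * FRAMES_PER_CHUNK
ms_per_frame = 1000 * CHUNK_SAMPLES / SAMPLE_RATE / FRAMES_PER_CHUNK

print(len(padded), num_chunks, num_frames, ms_per_frame)
```

Under these assumptions, one second of audio pads to three 0.4-second chunks, and each encoder-output frame spans 40 ms of input.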
 ## Repository Contents
 ```text
+Audio-Interaction/
 ├── model-00001-of-00004.safetensors      # LM weights, sharded (≈4 GB each)
 ├── model-00002-of-00004.safetensors
 ├── model-00003-of-00004.safetensors
 ## Intended Use
+Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives — for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript.
 ## Quick Start
 ### Installation
 ```bash
+git clone https://github.com/xzf-thu/Audio-Interaction.git  # TODO: confirm repo URL
+cd Audio-Interaction
+conda create -n Audio-Interaction python=3.10 -y
+conda activate Audio-Interaction
 pip install -r requirements.txt
 ```
 ### Download the checkpoint
+From the `Audio-Interaction` project root, pull the weights into `checkpoints/`:
 ```python
 from huggingface_hub import snapshot_download
+snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")
 ```
 `snapshot_download` is the recommended path — it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
 <!-- TODO: replace with the real arxiv id and year once published. -->
 ```bibtex
 @misc{xie_miniomni3,
+  title  = {Audio-Interaction: Streaming Audio-In, Text-Out Conversational Modeling},
   author = {Zhifei Xie and collaborators},
   year   = {2026},
   note   = {Preprint in preparation}
 ## Acknowledgements
+Audio-Interaction builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.