Update README.md
README.md
CHANGED
@@ -6,7 +6,7 @@ license: apache-2.0
 library_name: transformers
 pipeline_tag: audio-text-to-text
 datasets:
-- zhifeixie/
 tags:
 - speech-language-model
 - streaming
@@ -14,17 +14,17 @@ tags:
 - multimodal
 - qwen2.5-omni
 ---
-#

-[**Code**](https://github.com/xzf-thu/

-

 This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.

 ## Model Details

-- **Model name:**
 - **Task:** Streaming audio-conditioned text generation (audio in, text out)
 - **Audio encoder:** Qwen2.5-Omni audio tower (chunk-wise)
 - **Audio framing:** 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
@@ -36,7 +36,7 @@ This design lets the model handle both spoken questions ("answer it") and ambien
 ## Repository Contents

 ```text
-
 ├── model-00001-of-00004.safetensors # LM weights, sharded (≈4 GB each)
 ├── model-00002-of-00004.safetensors
 ├── model-00003-of-00004.safetensors
@@ -54,28 +54,28 @@ Mini-Omni3/

 ## Intended Use

-

 ## Quick Start

 ### Installation

 ```bash
-git clone https://github.com/xzf-thu/
-cd
-conda create -n
-conda activate
 pip install -r requirements.txt
 ```

 ### Download the checkpoint

-From the `

 ```python
 from huggingface_hub import snapshot_download

-snapshot_download(repo_id="zhifeixie/
 ```

 `snapshot_download` is the recommended path: it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
@@ -150,7 +150,7 @@ Candidate metrics:
 <!-- TODO: replace with the real arxiv id and year once published. -->
 ```bibtex
 @misc{xie_miniomni3,
-title = {
 author = {Zhifei Xie and collaborators},
 year = {2026},
 note = {Preprint in preparation}
@@ -159,4 +159,4 @@ Candidate metrics:

 ## Acknowledgements

-
@@ -6,7 +6,7 @@ license: apache-2.0
 library_name: transformers
 pipeline_tag: audio-text-to-text
 datasets:
+- zhifeixie/StreamAudio-2M
 tags:
 - speech-language-model
 - streaming
@@ -14,17 +14,17 @@ tags:
 - multimodal
 - qwen2.5-omni
 ---
+# Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model

+[**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Audio-Interaction-Data) <!-- TODO: confirm code repo URL and dataset repo id -->

+Audio-Interaction is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.

 This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.

 ## Model Details

+- **Model name:** Audio-Interaction
 - **Task:** Streaming audio-conditioned text generation (audio in, text out)
 - **Audio encoder:** Qwen2.5-Omni audio tower (chunk-wise)
 - **Audio framing:** 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
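
The LISTENING/SPEAKING alternation and the 0.4-second framing described in the README text above can be illustrated with a small sketch. This is not the repository's implementation: `pad_to_chunk_boundary`, `split_chunks`, `run_session`, and the `decide`/`generate` callables are hypothetical names that stand in for the model, and zero-padding is an assumption (the README only says "padded"). The constants and the token names are the only parts taken from the README.

```python
from typing import Callable, Iterable, List, Sequence

SAMPLE_RATE = 16_000    # Hz, as stated in the README
CHUNK_SAMPLES = 6_400   # 0.4 s at 16 kHz, as stated in the README

# Control tokens named in the README.
KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


def pad_to_chunk_boundary(samples: Sequence[float]) -> List[float]:
    """Pad with zeros (an assumption) up to a multiple of CHUNK_SAMPLES."""
    remainder = len(samples) % CHUNK_SAMPLES
    padded = list(samples)
    if remainder:
        padded.extend([0.0] * (CHUNK_SAMPLES - remainder))
    return padded


def split_chunks(samples: Sequence[float]) -> List[List[float]]:
    """Split padded audio into consecutive 0.4-second chunks."""
    padded = pad_to_chunk_boundary(samples)
    return [padded[i:i + CHUNK_SAMPLES]
            for i in range(0, len(padded), CHUNK_SAMPLES)]


def run_session(
    chunks: Iterable[Sequence[float]],
    decide: Callable[[Sequence[float]], str],
    generate: Callable[[Sequence[float]], Iterable[str]],
) -> List[str]:
    """Toy listen/speak loop.

    decide(chunk)   -> KEEP_SILENCE or TEXT_BEGIN (stand-in for the LISTENING step)
    generate(chunk) -> tokens of one text turn, terminated by TEXT_END
                       (stand-in for the SPEAKING step)
    """
    replies: List[str] = []
    for chunk in chunks:
        # LISTENING: one decision per incoming chunk.
        if decide(chunk) == KEEP_SILENCE:
            continue
        # SPEAKING: collect tokens until TEXT_END, then return to listening.
        turn: List[str] = []
        for token in generate(chunk):
            if token == TEXT_END:
                break
            turn.append(token)
        replies.append(" ".join(turn))
    return replies
```

With a stub `decide` that triggers on loud chunks and a stub `generate` that yields a fixed turn, `run_session(split_chunks(audio), decide, generate)` returns one reply string per chunk that triggered `TEXT_BEGIN`. In the real model both decisions come from the language model itself rather than from separate callables.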
@@ -36,7 +36,7 @@ This design lets the model handle both spoken questions ("answer it") and ambien
 ## Repository Contents

 ```text
+Audio-Interaction/
 ├── model-00001-of-00004.safetensors # LM weights, sharded (≈4 GB each)
 ├── model-00002-of-00004.safetensors
 ├── model-00003-of-00004.safetensors
@@ -54,28 +54,28 @@ Mini-Omni3/

 ## Intended Use

+Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives: for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript.

 ## Quick Start

 ### Installation

 ```bash
+git clone https://github.com/xzf-thu/Audio-Interaction.git # TODO: confirm repo URL
+cd Audio-Interaction
+conda create -n Audio-Interaction python=3.10 -y
+conda activate Audio-Interaction
 pip install -r requirements.txt
 ```

 ### Download the checkpoint

+From the `Audio-Interaction` project root, pull the weights into `checkpoints/`:

 ```python
 from huggingface_hub import snapshot_download

+snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")
 ```

 `snapshot_download` is the recommended path: it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
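
After the download step in the README text above, it can be useful to confirm that every weight shard arrived. The sketch below is an illustration only: `expected_shard_names` and `missing_shards` are hypothetical helpers, and the shard count of four is inferred from the `-of-00004` suffix in the repository listing rather than stated outright.

```python
from pathlib import Path
from typing import List

NUM_SHARDS = 4  # assumption: inferred from the "-of-00004" file-name suffix


def expected_shard_names(num_shards: int = NUM_SHARDS) -> List[str]:
    """Build the shard file names in the pattern shown in the README listing."""
    return [
        f"model-{index:05d}-of-{num_shards:05d}.safetensors"
        for index in range(1, num_shards + 1)
    ]


def missing_shards(checkpoint_dir: str, num_shards: int = NUM_SHARDS) -> List[str]:
    """Return the expected shard names that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [
        name for name in expected_shard_names(num_shards)
        if not (root / name).is_file()
    ]
```

Calling `missing_shards("checkpoints")` from the project root returns an empty list when all shards are present, and otherwise names the files to re-fetch.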
@@ -150,7 +150,7 @@ Candidate metrics:
 <!-- TODO: replace with the real arxiv id and year once published. -->
 ```bibtex
 @misc{xie_miniomni3,
+title = {Audio-Interaction: Streaming Audio-In, Text-Out Conversational Modeling},
 author = {Zhifei Xie and collaborators},
 year = {2026},
 note = {Preprint in preparation}
@@ -159,4 +159,4 @@ Candidate metrics:

 ## Acknowledgements

+Audio-Interaction builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.