---
language:
- en
- zh
license: apache-2.0
library_name: transformers
pipeline_tag: audio-text-to-text
datasets:
- zhifeixie/StreamAudio-2M
tags:
- speech-language-model
- streaming
- audio
- multimodal
- qwen2.5-omni
---
# Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model
[**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Audio-Interaction-Data) <!-- TODO: confirm code repo URL and dataset repo id -->
Audio-Interaction is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
## Model Details
- **Model name:** Audio-Interaction
- **Task:** Streaming audio-conditioned text generation (audio in, text out)
- **Audio encoder:** Qwen2.5-Omni audio tower (chunk-wise)
- **Audio framing:** 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk
- **Decoding states:** LISTENING (emits `KEEP_SILENCE` / `TEXT_BEGIN`) and SPEAKING (emits text until `TEXT_END`)
- **Default sampling:** temperature 0.3, top-k 3
- **Default max new tokens:** 4096 per session
- **License:** Apache-2.0
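The default sampling settings (temperature 0.3, top-k 3) can be illustrated with a minimal pure-Python sketch. This is not the repository's sampler: `sample_top_k` and its arguments are hypothetical names, and the logits are made-up values chosen only for the example.

```python
import math
import random


def sample_top_k(logits, temperature=0.3, top_k=3, rng=random):
    """Sample one token id from the top_k highest logits after temperature scaling.

    Illustrative only: the name and signature are assumptions, not the project's API.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep the indices of the top_k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scale, then softmax over the kept logits (shifted for stability).
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one of the kept indices in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]


# Toy logits over a 5-token vocabulary (made-up values).
toy_logits = [2.0, 0.5, 1.8, -1.0, 1.0]
token_id = sample_top_k(toy_logits, rng=random.Random(0))
```

With a low temperature and a small `top_k`, most of the probability mass lands on the highest-scoring token, so decoding stays close to greedy while still allowing some variation.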
## Repository Contents
```text
Audio-Interaction/
├── model-00001-of-00004.safetensors   # LM weights, sharded (≈4 GB each)
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json       # Shard index consumed by safetensors loader
├── config.json                        # Top-level model config
├── generation_config.json             # Generation defaults
├── model_config.yaml                  # GPT config consumed by Config.from_file
├── hyperparameters.yaml               # Training-time hyperparameters (reference)
├── tokenizer.json                     # Tokenizer
├── tokenizer_config.json
├── MiniOmni3_ChunkwisedEncoder.pth    # Audio encoder weights (Qwen2.5-Omni audio tower)
└── qwen25OmniConfig/                  # Audio-encoder config (nested: thinker_config.audio_config)
```
## Intended Use
Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives: for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript.
## Quick Start
### Installation
```bash
git clone https://github.com/xzf-thu/Audio-Interaction.git # TODO: confirm repo URL
cd Audio-Interaction
conda create -n Audio-Interaction python=3.10 -y
conda activate Audio-Interaction
pip install -r requirements.txt
```
### Download the checkpoint
From the `Audio-Interaction` project root, pull the weights into `checkpoints/`:
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints")
```
`snapshot_download` is the recommended path: it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
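After the download, it can be useful to confirm that the entries listed under Repository Contents are present. The helper below is a sketch, not part of the project: `missing_checkpoint_entries` is a hypothetical name, and the expected names are copied from the listing on this card.

```python
from pathlib import Path

# Names copied from the Repository Contents listing on this card.
EXPECTED_FILES = [
    "model-00001-of-00004.safetensors",
    "model-00002-of-00004.safetensors",
    "model-00003-of-00004.safetensors",
    "model-00004-of-00004.safetensors",
    "model.safetensors.index.json",
    "config.json",
    "generation_config.json",
    "model_config.yaml",
    "hyperparameters.yaml",
    "tokenizer.json",
    "tokenizer_config.json",
    "MiniOmni3_ChunkwisedEncoder.pth",
]
EXPECTED_DIRS = ["qwen25OmniConfig"]


def missing_checkpoint_entries(checkpoint_dir="checkpoints"):
    """Return the expected files and directories that are absent (hypothetical helper)."""
    root = Path(checkpoint_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list from `missing_checkpoint_entries("checkpoints")` suggests the snapshot is complete. The check looks only at names, not at file sizes or hashes.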
### Python Usage
```python
from src.miniomni3.generate.run import run_inference

run_inference(
    checkpoint_dir="checkpoints",
    audio_paths=["/path/to/audio.wav"],  # offline mode: one round per path
    device="cuda:0",                     # or "mps" / "cpu"
)
```
For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:
```python
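from src.miniomni3.generate.run import run_inference

# No audio_paths: run_inference prompts for an audio path on each round.
run_inference(checkpoint_dir="checkpoints", rounds=5, device="cuda:0")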
```
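The `device` argument in the examples above takes `"cuda:0"`, `"mps"`, or `"cpu"`. A small helper can pick among them at run time. This is a sketch that assumes PyTorch is the backend installed by `requirements.txt`; `pick_device` is a hypothetical name, not a function from the repository.

```python
def pick_device():
    """Return "cuda:0", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query; fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"
    # torch.backends.mps is absent on older PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
```

The result can then be passed as `device=device` to `run_inference`.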
## Streaming Protocol
A single session looks like:
```text
[system prompt tokens]
┌─── LISTENING ───┐
│ AUDIO_BEGIN PAD*10 ASSISTANT → KEEP_SILENCE          (keep listening)
│ AUDIO_BEGIN PAD*10 ASSISTANT → TEXT_BEGIN EMOTION    (start replying)
└─────────────────┘
┌─── SPEAKING ────┐
│ … text tokens … TEXT_END                             (reply finished)
└─────────────────┘
┌─── LISTENING ───┐  (next audio chunk)
…
```
The model is trained to emit at most one `TEXT_BEGIN` per audio chunk. Each assistant turn begins with `TEXT_BEGIN`, followed by an emotion token, the reply tokens, and `TEXT_END`. Turns starting with `KEEP_SILENCE` indicate the model chose not to respond to that chunk.
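The LISTENING/SPEAKING alternation can be sketched as a small control loop. This is an illustration of the protocol as described here, not the repository's implementation: `run_session` and `next_token` are hypothetical names, `next_token` stands in for one decoding step of the model, and counting every emitted token against the 4096-token session budget is an assumption.

```python
KEEP_SILENCE, TEXT_BEGIN, TEXT_END = "KEEP_SILENCE", "TEXT_BEGIN", "TEXT_END"


def run_session(audio_chunks, next_token, max_new_tokens=4096):
    """Alternate between LISTENING and SPEAKING over a sequence of audio chunks.

    `next_token(context)` is a hypothetical stand-in for one model decoding step.
    Returns one entry per processed chunk: None for silence, else the reply tokens
    (the first reply token is the emotion token).
    """
    context = []              # everything the model has consumed or emitted so far
    replies = []
    budget = max_new_tokens   # assumed to count every emitted token
    for chunk in audio_chunks:
        if budget <= 0:
            break
        # LISTENING: consume one chunk, then emit a single decision token.
        context.append(("audio", chunk))
        decision = next_token(context)
        context.append(("token", decision))
        budget -= 1
        if decision != TEXT_BEGIN:
            replies.append(None)          # KEEP_SILENCE: move on to the next chunk
            continue
        # SPEAKING: generate autoregressively until TEXT_END or the budget runs out.
        reply = []
        while budget > 0:
            token = next_token(context)
            context.append(("token", token))
            budget -= 1
            if token == TEXT_END:
                break
            reply.append(token)
        replies.append(reply)
    return replies


# Scripted stand-in for the model: silence, one reply, then silence again.
script = iter([KEEP_SILENCE, TEXT_BEGIN, "EMOTION", "hello", "there", TEXT_END, KEEP_SILENCE])
replies = run_session(["chunk0", "chunk1", "chunk2"], lambda context: next(script))
```

Because each chunk yields exactly one decision token in the LISTENING state, at most one `TEXT_BEGIN` can occur per chunk, matching the constraint stated above.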
## Training Summary
<!-- TODO: fill in once details are public.
Suggested fields:
- Pretraining base
- SFT / instruction-tuning data
- Streaming-objective data construction (how KEEP_SILENCE / TEXT_BEGIN supervision was generated)
- Total tokens / hours of audio
- Hardware and duration
-->
## Evaluation
<!-- TODO: fill in once benchmarks are decided.
Candidate metrics:
- Spoken-QA accuracy on held-out audio prompts
- False-trigger rate on ambient / non-speech audio (lower is better)
- Response-onset latency in encoder chunks from end of question
- Text quality of replies (e.g. GPT-judge or human preference)
-->
## Limitations
- The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction.
- The model expects 16 kHz mono audio. Other inputs are resampled by `whisper.load_audio`, and audio is padded to 0.4-second boundaries before encoding.
- Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency.
- Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation.
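The framing arithmetic behind these limits can be written out in a few lines. The constants come from this card (16 kHz, 6400-sample boundaries, 10 encoder frames per chunk); the function names are hypothetical and the code is a sketch, not the project's preprocessing.

```python
SAMPLE_RATE = 16_000      # Hz
CHUNK_SAMPLES = 6_400     # 0.4 s at 16 kHz
FRAMES_PER_CHUNK = 10     # encoder-output frames per chunk


def padded_length(num_samples):
    """Round a sample count up to the next multiple of CHUNK_SAMPLES."""
    return -(-num_samples // CHUNK_SAMPLES) * CHUNK_SAMPLES


def full_chunks(num_frames):
    """Count complete 10-frame chunks; a trailing partial chunk is dropped."""
    return num_frames // FRAMES_PER_CHUNK


chunk_seconds = CHUNK_SAMPLES / SAMPLE_RATE   # 0.4 s per decision step
one_second_padded = padded_length(16_000)     # 19_200 samples, i.e. 1.2 s
usable_chunks = full_chunks(27)               # 2 chunks; 7 leftover frames dropped
```

One second of input is therefore padded to three chunks, and the model makes one listen-or-reply decision per 0.4-second chunk.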
## Citation
<!-- TODO: replace with the real arxiv id and year once published. -->
```bibtex
@misc{xie_miniomni3,
title = {Audio-Interaction: Streaming Audio-In, Text-Out Conversational Modeling},
author = {Zhifei Xie and collaborators},
year = {2026},
note = {Preprint in preparation}
}
```
## Acknowledgements
Audio-Interaction builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project.