| language: | |
| - en | |
| - zh | |
| license: apache-2.0 | |
| library_name: transformers | |
| pipeline_tag: audio-text-to-text | |
| datasets: | |
| - zhifeixie/StreamAudio-2M | |
| tags: | |
| - speech-language-model | |
| - streaming | |
| - audio | |
| - multimodal | |
| - qwen2.5-omni | |
| # Audio-Interaction: Streaming Audio-In, Text-Out Conversational Model | |
| [**Code**](https://github.com/xzf-thu/Audio-Interaction) | [**Model**](https://huggingface.co/zhifeixie/Audio-Interaction) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Audio-Interaction-Data) <!-- TODO: confirm code repo URL and dataset repo id --> | |
| Audio-Interaction is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk. | |
| This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external voice activity detector (VAD) or turn-taking heuristic. | |
| ## Model Details | |
| - **Model name:** Audio-Interaction | |
| - **Task:** Streaming audio-conditioned text generation (audio in, text out) | |
| - **Audio encoder:** Qwen2.5-Omni audio tower (chunk-wise) | |
| - **Audio framing:** 16 kHz, padded to 0.4-second (6400-sample) boundaries; 10 encoder-output frames per chunk | |
| - **Decoding states:** LISTENING (emits `KEEP_SILENCE` / `TEXT_BEGIN`) and SPEAKING (emits text until `TEXT_END`) | |
| - **Default sampling:** temperature 0.3, top-k 3 | |
| - **Default max new tokens:** 4096 per session | |
| - **License:** Apache-2.0 | |
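The default sampling settings listed above (temperature 0.3, top-k 3) can be illustrated with a minimal, standard-library sketch. The function name `sample_top_k` and the toy logits are hypothetical and exist only for illustration; the repository's own sampling code may differ in detail.

```python
import math
import random


def sample_top_k(logits, temperature=0.3, top_k=3, rng=None):
    """Sample one token id from `logits` using temperature + top-k filtering."""
    rng = rng or random.Random()
    # Keep only the indices of the `top_k` largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scale, then apply a numerically stable softmax numerator.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 1.5, 0.1, -1.0, 1.9]  # hypothetical 5-token vocabulary
rng = random.Random(0)
draws = [sample_top_k(toy_logits, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # only ids among the three largest logits can appear
```

A low temperature combined with a small `top_k` keeps generation close to greedy decoding while still allowing some variation among the few most likely tokens.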
| ## Repository Contents | |
| ```text | |
| Audio-Interaction/ | |
| ├── model-00001-of-00004.safetensors # LM weights, sharded (≈4 GB each) | |
| ├── model-00002-of-00004.safetensors | |
| ├── model-00003-of-00004.safetensors | |
| ├── model-00004-of-00004.safetensors | |
| ├── model.safetensors.index.json # Shard index consumed by safetensors loader | |
| ├── config.json # Top-level model config | |
| ├── generation_config.json # Generation defaults | |
| ├── model_config.yaml # GPT config consumed by Config.from_file | |
| ├── hyperparameters.yaml # Training-time hyperparameters (reference) | |
| ├── tokenizer.json # Tokenizer | |
| ├── tokenizer_config.json | |
| ├── MiniOmni3_ChunkwisedEncoder.pth # Audio encoder weights (Qwen2.5-Omni audio tower) | |
| └── qwen25OmniConfig/ # Audio-encoder config (nested: thinker_config.audio_config) | |
| ``` | |
| ## Intended Use | |
| Audio-Interaction is intended for streaming conversational agents that need to react to audio as it arrives: for example, voice assistants that may interject mid-utterance, alarms that respond to ambient sound, or low-latency dialogue systems where waiting for a full utterance before replying is too slow. The model is not a transcription system; it produces a conversational reply (or silence) rather than a verbatim transcript. | |
| ## Quick Start | |
| ### Installation | |
| ```bash | |
| git clone https://github.com/xzf-thu/Audio-Interaction.git # TODO: confirm repo URL | |
| cd Audio-Interaction | |
| conda create -n Audio-Interaction python=3.10 -y | |
| conda activate Audio-Interaction | |
| pip install -r requirements.txt | |
| ``` | |
| ### Download the checkpoint | |
| From the `Audio-Interaction` project root, pull the weights into `checkpoints/`: | |
| ```python | |
| from huggingface_hub import snapshot_download | |
| snapshot_download(repo_id="zhifeixie/Audio-Interaction", local_dir="checkpoints") | |
| ``` | |
| `snapshot_download` is the recommended path: it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the Hugging Face repo or the web "Download" button if you want your run reflected in the stats. | |
| ### Python Usage | |
| ```python | |
| from src.miniomni3.generate.run import run_inference | |
| run_inference( | |
|     checkpoint_dir="checkpoints", | |
|     audio_paths=["/path/to/audio.wav"], # offline mode: one round per path | |
|     device="cuda:0", # or "mps" / "cpu" | |
| ) | |
| ``` | |
| For interactive use, omit `audio_paths`; `run_inference` then prompts for an audio path each round: | |
| ```python | |
| from src.miniomni3.generate.run import run_inference | |
| run_inference(checkpoint_dir="checkpoints", rounds=5, device="cuda:0") | |
| ``` | |
| ## Streaming Protocol | |
| A single session looks like: | |
| ```text | |
| [system prompt tokens] | |
| ┌─── LISTENING ───┐ | |
| │ AUDIO_BEGIN PAD*10 ASSISTANT → KEEP_SILENCE (keep listening) | |
| │ AUDIO_BEGIN PAD*10 ASSISTANT → TEXT_BEGIN EMOTION (start replying) | |
| └─────────────────┘ | |
| ┌─── SPEAKING ────┐ | |
| │ … text tokens … TEXT_END (reply finished) | |
| └─────────────────┘ | |
| ┌─── LISTENING ───┐ (next audio chunk) | |
| … | |
| ``` | |
| The model is trained to emit at most one `TEXT_BEGIN` per audio chunk. Each assistant turn begins with `TEXT_BEGIN`, followed by an emotion token, the reply tokens, and `TEXT_END`. Turns starting with `KEEP_SILENCE` indicate the model chose not to respond to that chunk. | |
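The protocol above can be sketched as a small two-state parser. This is an illustration only: special tokens are represented as plain strings, `parse_session` and the `Turn` record are hypothetical helpers that are not part of the repository, and the `<neutral>` emotion token is a made-up placeholder. The sketch also assumes the model's decisions have already been decoded into a flat token list, whereas the real loop interleaves audio chunks with generation.

```python
from dataclasses import dataclass, field

KEEP_SILENCE, TEXT_BEGIN, TEXT_END = "KEEP_SILENCE", "TEXT_BEGIN", "TEXT_END"


@dataclass
class Turn:
    chunk_index: int   # 0-based audio chunk that triggered the reply
    emotion: str       # token emitted right after TEXT_BEGIN
    tokens: list = field(default_factory=list)


def parse_session(stream):
    """Split a decoded token stream into reply turns, counting audio chunks."""
    turns, state, chunk, current = [], "LISTENING", -1, None
    it = iter(stream)
    for tok in it:
        if state == "LISTENING":
            chunk += 1  # exactly one decision per audio chunk
            if tok == KEEP_SILENCE:
                continue
            if tok != TEXT_BEGIN:
                raise ValueError(f"unexpected token while listening: {tok!r}")
            # The emotion token directly follows TEXT_BEGIN.
            current = Turn(chunk_index=chunk, emotion=next(it, ""))
            state = "SPEAKING"
        elif tok == TEXT_END:
            turns.append(current)
            state = "LISTENING"  # wait for the next audio chunk
        else:
            current.tokens.append(tok)
    return turns


stream = [
    KEEP_SILENCE,                                     # chunk 0: keep listening
    KEEP_SILENCE,                                     # chunk 1: keep listening
    TEXT_BEGIN, "<neutral>", "Hello", "!", TEXT_END,  # chunk 2: reply
    KEEP_SILENCE,                                     # chunk 3: keep listening
]
turns = parse_session(stream)
for turn in turns:
    print(turn.chunk_index, turn.emotion, turn.tokens)
```

Because the chunk counter advances only in the LISTENING state, the index stored on each turn records which audio chunk triggered the reply, which is the quantity a response-onset latency measurement would need.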
| ## Training Summary | |
| <!-- TODO: fill in once details are public. | |
| Suggested fields: | |
| - Pretraining base | |
| - SFT / instruction-tuning data | |
| - Streaming-objective data construction (how KEEP_SILENCE / TEXT_BEGIN supervision was generated) | |
| - Total tokens / hours of audio | |
| - Hardware and duration | |
| --> | |
| ## Evaluation | |
| <!-- TODO: fill in once benchmarks are decided. | |
| Candidate metrics: | |
| - Spoken-QA accuracy on held-out audio prompts | |
| - False-trigger rate on ambient / non-speech audio (lower is better) | |
| - Response-onset latency in encoder chunks from end of question | |
| - Text quality of replies (e.g. GPT-judge or human preference) | |
| --> | |
| ## Limitations | |
| - The model produces text, not speech. Pair it with a TTS system for end-to-end voice interaction. | |
| - The encoder expects 16 kHz mono audio; inputs in other formats are resampled by `whisper.load_audio`, and the waveform is then padded to 0.4-second boundaries before encoding. | |
| - Decisions are made at 0.4-second granularity (one encoder chunk), which sets a floor on response-onset latency. | |
| - Trailing partial audio chunks shorter than 10 encoder frames are dropped before generation. | |
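The framing arithmetic behind these limitations can be sketched as follows. The constants come from this card (16 kHz, 0.4-second chunks of 6400 samples, 10 encoder frames per chunk, which works out to 40 ms per encoder frame). The helper names are hypothetical, and zero-padding is an assumption, since the card does not state which value is used for padding.

```python
SAMPLE_RATE = 16_000     # Hz
CHUNK_SAMPLES = 6_400    # 0.4 s at 16 kHz
FRAMES_PER_CHUNK = 10    # encoder-output frames per chunk


def pad_to_chunk_boundary(samples, pad_value=0.0):
    """Right-pad a list of samples to the next multiple of CHUNK_SAMPLES."""
    remainder = len(samples) % CHUNK_SAMPLES
    if remainder == 0:
        return list(samples)
    return list(samples) + [pad_value] * (CHUNK_SAMPLES - remainder)


def group_frames(frames):
    """Group encoder frames into full chunks, dropping a trailing partial chunk."""
    full = len(frames) // FRAMES_PER_CHUNK
    return [frames[i * FRAMES_PER_CHUNK:(i + 1) * FRAMES_PER_CHUNK] for i in range(full)]


one_second = [0.0] * SAMPLE_RATE                  # 16 000 samples
padded = pad_to_chunk_boundary(one_second)
print(len(padded), len(padded) // CHUNK_SAMPLES)  # 19200 samples, 3 chunks
print(len(group_frames(list(range(25)))))         # 2 full chunks; 5 leftover frames dropped
```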
| ## Citation | |
| <!-- TODO: replace with the real arxiv id and year once published. --> | |
| ```bibtex | |
| @misc{xie_miniomni3, | |
| title = {Audio-Interaction: Streaming Audio-In, Text-Out Conversational Modeling}, | |
| author = {Zhifei Xie and collaborators}, | |
| year = {2026}, | |
| note = {Preprint in preparation} | |
| } | |
| ``` | |
| ## Acknowledgements | |
| Audio-Interaction builds on the Qwen2.5-Omni audio encoder. We thank the Qwen team and the maintainers of OpenAI Whisper for the audio-loading utilities used in this project. |