zhifeixie/AudioInteraction

@@ -3,28 +3,25 @@ language:
 - en
 - zh
 license: apache-2.0
 pipeline_tag: audio-text-to-text
 tags:
 - speech-language-model
 - streaming
 - audio
 - multimodal
 - qwen2.5-omni
-datasets:
-- zhifeixie/StreamAudio-2M
-base_model:
-- Qwen/Qwen2.5-Omni-3B
 ---
 # Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model
-[**Code**](https://github.com/xzf-thu/Mini-Omni3) <!-- TODO: confirm repo URL -->
 Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
 This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
-The release contains the Mini-Omni3 language-model weights (sharded safetensors), a chunk-wise audio encoder adapted from Qwen2.5-Omni, and the matching tokenizer and model config.
 ## Model Details
 - **Model name:** Mini-Omni3
@@ -40,14 +37,19 @@ The release contains the Mini-Omni3 language-model weights (sharded safetensors)
 ```text
 Mini-Omni3/
-├── MiniOmni3_LM_sharded/                  # Sharded safetensors of the LM weights
-│   ├── model.safetensors.index.json
-│   └── model-0000N-of-0000N.safetensors
-├── MiniOmni3_ChunkwisedEncoder.pth        # Audio encoder weights (Qwen2.5-Omni audio tower)
-├── qwen_2_5_omni_config/                  # Audio-encoder config (nested: thinker_config.audio_config)
-├── model_config.yaml                      # GPT config consumed by Config.from_file
-├── tokenizer.json                         # Tokenizer
-└── README.md                              # This card
 ```
 ## Intended Use
@@ -68,23 +70,23 @@ pip install -r requirements.txt
 ### Download the checkpoint
 ```python
 from huggingface_hub import snapshot_download
-local_dir = snapshot_download(
-    repo_id="zhifeixie/Mini-Omni3",
-    repo_type="model",
-)
-print(local_dir)
 ```
 ### Python Usage
 ```python
 from src.miniomni3.generate.run import run_inference
 run_inference(
-    checkpoint_dir=local_dir,           # the path snapshot_download returned
     audio_paths=["/path/to/audio.wav"], # offline mode: one round per path
     device="cuda:0",                    # or "mps" / "cpu"
 )
@@ -93,7 +95,7 @@ run_inference(
 For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:
 ```python
-run_inference(checkpoint_dir=local_dir, rounds=5, device="cuda:0")
 ```
 ## Streaming Protocol

@@ -3,28 +3,25 @@ language:
 - en
 - zh
 license: apache-2.0
+library_name: transformers
 pipeline_tag: audio-text-to-text
+datasets:
+- zhifeixie/Mini-Omni3-Data  # TODO: replace with the actual dataset repo id
 tags:
 - speech-language-model
 - streaming
 - audio
 - multimodal
 - qwen2.5-omni
 ---
 # Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model
+[**Code**](https://github.com/xzf-thu/Mini-Omni3) | [**Model**](https://huggingface.co/zhifeixie/Mini-Omni3) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Mini-Omni3-Data) <!-- TODO: confirm code repo URL and dataset repo id -->
 Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
 This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
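
The LISTENING/SPEAKING alternation described above can be sketched as a small control loop. This is a minimal illustration, not the repository's implementation: only the token names `KEEP_SILENCE`, `TEXT_BEGIN`, and `TEXT_END` come from the card, while `stream_session`, `listen_step`, `speak_step`, and `max_reply_tokens` are hypothetical names standing in for the real model calls. The sketch also ignores audio that arrives while the model is speaking.

```python
from typing import Callable, Iterable, List

# Token names taken from the model card; everything else here is hypothetical.
KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


def stream_session(
    chunks: Iterable[object],
    listen_step: Callable[[object], str],
    speak_step: Callable[[], str],
    max_reply_tokens: int = 256,
) -> List[str]:
    """Alternate between listening and speaking; return the text turns produced."""
    replies: List[str] = []
    for chunk in chunks:
        # LISTENING: consume one chunk and emit exactly one decision token.
        decision = listen_step(chunk)
        if decision == KEEP_SILENCE:
            continue
        if decision != TEXT_BEGIN:
            raise ValueError(f"unexpected listening decision: {decision!r}")
        # SPEAKING: generate text tokens until TEXT_END (or the safety cap),
        # then fall through to the next chunk, i.e. back to LISTENING.
        tokens: List[str] = []
        for _ in range(max_reply_tokens):
            token = speak_step()
            if token == TEXT_END:
                break
            tokens.append(token)
        replies.append("".join(tokens))
    return replies


def toy_listen(chunk: object) -> str:
    """Stand-in for the model's per-chunk decision: reply only to 'speech'."""
    return TEXT_BEGIN if chunk == "speech" else KEEP_SILENCE


def make_toy_speaker(script: List[str]) -> Callable[[], str]:
    """Stand-in for autoregressive decoding: replay a fixed token script."""
    tokens = iter(script)
    return lambda: next(tokens)


if __name__ == "__main__":
    speaker = make_toy_speaker(["Hi", "!", TEXT_END])
    print(stream_session(["noise", "speech", "noise"], toy_listen, speaker))
    # prints ['Hi!']
```

Keeping the turn-taking decision inside the same loop as generation is what removes the need for a separate VAD stage in this sketch: silence is just another token the listening step can emit.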
 ## Model Details
 - **Model name:** Mini-Omni3
 ```text
 Mini-Omni3/
+├── model-00001-of-00004.safetensors      # LM weights, sharded (≈4 GB each)
+├── model-00002-of-00004.safetensors
+├── model-00003-of-00004.safetensors
+├── model-00004-of-00004.safetensors
+├── model.safetensors.index.json          # Shard index consumed by safetensors loader
+├── config.json                           # Top-level model config
+├── generation_config.json                # Generation defaults
+├── model_config.yaml                     # GPT config consumed by Config.from_file
+├── hyperparameters.yaml                  # Training-time hyperparameters (reference)
+├── tokenizer.json                        # Tokenizer
+├── tokenizer_config.json
+├── MiniOmni3_ChunkwisedEncoder.pth       # Audio encoder weights (Qwen2.5-Omni audio tower)
+└── qwen25OmniConfig/                     # Audio-encoder config (nested: thinker_config.audio_config)
 ```
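
As a quick sanity check on the layout above, the shard index can be read to confirm that every shard it references is present. This is a sketch that assumes `model.safetensors.index.json` follows the usual Hugging Face sharded-checkpoint layout, with a top-level `weight_map` mapping tensor names to shard filenames; the helper names `shards_in_index` and `missing_shards` are made up for illustration.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List


def shards_in_index(index_path) -> Dict[str, int]:
    """Map each shard filename in the index to the number of tensors it holds.

    Assumes the index has a top-level "weight_map" of tensor name -> shard file.
    """
    index = json.loads(Path(index_path).read_text(encoding="utf-8"))
    return dict(Counter(index["weight_map"].values()))


def missing_shards(
    checkpoint_dir, index_name: str = "model.safetensors.index.json"
) -> List[str]:
    """Return the shard files named in the index that are absent on disk."""
    root = Path(checkpoint_dir)
    shards = shards_in_index(root / index_name)
    return sorted(name for name in shards if not (root / name).is_file())
```

Calling `missing_shards("checkpoints")` after a download should return an empty list if all shards arrived; a non-empty list points at an interrupted or partial download.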
 ## Intended Use
@@ -68,23 +70,23 @@ pip install -r requirements.txt
 ### Download the checkpoint
+From the `Mini-Omni3` project root, pull the weights into `checkpoints/`:
 ```python
 from huggingface_hub import snapshot_download
+snapshot_download(repo_id="zhifeixie/Mini-Omni3", local_dir="checkpoints")
 ```
+`snapshot_download` is the recommended path: it pulls every file and resumes on interruption. It is also the only way the download counter on this page advances, so please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
 ### Python Usage
 ```python
 from src.miniomni3.generate.run import run_inference
 run_inference(
+    checkpoint_dir="checkpoints",
     audio_paths=["/path/to/audio.wav"], # offline mode: one round per path
     device="cuda:0",                    # or "mps" / "cpu"
 )
@@ -93,7 +95,7 @@ run_inference(
 For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:
 ```python
+run_inference(checkpoint_dir="checkpoints", rounds=5, device="cuda:0")
 ```
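
The examples above hard-code `device="cuda:0"` and mention `"mps"` and `"cpu"` as alternatives. One way to choose among them at runtime is sketched below. `pick_device` is a hypothetical helper, not part of the Mini-Omni3 codebase, and it assumes PyTorch is the backend; it falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda:0", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to probe, so default to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda:0"
    # torch.backends.mps is absent on older PyTorch builds, hence getattr.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

The result can then be passed through as `run_inference(checkpoint_dir="checkpoints", rounds=5, device=pick_device())`.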
 ## Streaming Protocol