zhifeixie committed
Commit c0caf87 · verified · 1 Parent(s): c1144d4

Update README.md

Files changed (1)
  1. README.md +24 -22
README.md CHANGED
@@ -3,28 +3,25 @@ language:
3
  - en
4
  - zh
5
  license: apache-2.0
 
6
  pipeline_tag: audio-text-to-text
 
 
7
  tags:
8
  - speech-language-model
9
  - streaming
10
  - audio
11
  - multimodal
12
  - qwen2.5-omni
13
- datasets:
14
- - zhifeixie/StreamAudio-2M
15
- base_model:
16
- - Qwen/Qwen2.5-Omni-3B
17
  ---
18
  # Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model
19
 
20
- [**Code**](https://github.com/xzf-thu/Mini-Omni3) <!-- TODO: confirm repo URL -->
21
 
22
  Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
23
 
24
  This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
25
 
26
- The release contains the Mini-Omni3 language-model weights (sharded safetensors), a chunk-wise audio encoder adapted from Qwen2.5-Omni, and the matching tokenizer and model config.
27
-
28
  ## Model Details
29
 
30
  - **Model name:** Mini-Omni3
@@ -40,14 +37,19 @@ The release contains the Mini-Omni3 language-model weights (sharded safetensors)
40
 
41
  ```text
42
  Mini-Omni3/
43
- ├── MiniOmni3_LM_sharded/ # Sharded safetensors of the LM weights
44
- │   ├── model.safetensors.index.json
45
- │   └── model-0000N-of-0000N.safetensors
46
- ├── MiniOmni3_ChunkwisedEncoder.pth # Audio encoder weights (Qwen2.5-Omni audio tower)
47
- ├── qwen_2_5_omni_config/ # Audio-encoder config (nested: thinker_config.audio_config)
48
- ├── model_config.yaml # GPT config consumed by Config.from_file
49
- ├── tokenizer.json # Tokenizer
50
- └── README.md # This card
 
 
 
 
 
51
  ```
52
 
53
  ## Intended Use
@@ -68,23 +70,23 @@ pip install -r requirements.txt
68
 
69
  ### Download the checkpoint
70
 
 
 
71
  ```python
72
  from huggingface_hub import snapshot_download
73
 
74
- local_dir = snapshot_download(
75
- repo_id="zhifeixie/Mini-Omni3",
76
- repo_type="model",
77
- )
78
- print(local_dir)
79
  ```
80
 
 
 
81
  ### Python Usage
82
 
83
  ```python
84
  from src.miniomni3.generate.run import run_inference
85
 
86
  run_inference(
87
- checkpoint_dir=local_dir, # the path snapshot_download returned
88
  audio_paths=["/path/to/audio.wav"], # offline mode: one round per path
89
  device="cuda:0", # or "mps" / "cpu"
90
  )
@@ -93,7 +95,7 @@ run_inference(
93
  For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:
94
 
95
  ```python
96
- run_inference(checkpoint_dir=local_dir, rounds=5, device="cuda:0")
97
  ```
98
 
99
  ## Streaming Protocol
 
3
  - en
4
  - zh
5
  license: apache-2.0
6
+ library_name: transformers
7
  pipeline_tag: audio-text-to-text
8
+ datasets:
9
+ - zhifeixie/Mini-Omni3-Data # TODO: replace with the actual dataset repo id
10
  tags:
11
  - speech-language-model
12
  - streaming
13
  - audio
14
  - multimodal
15
  - qwen2.5-omni
 
 
 
 
16
  ---
17
  # Mini-Omni3: Streaming Audio-In, Text-Out Conversational Model
18
 
19
+ [**Code**](https://github.com/xzf-thu/Mini-Omni3) | [**Model**](https://huggingface.co/zhifeixie/Mini-Omni3) | [**Dataset**](https://huggingface.co/datasets/zhifeixie/Mini-Omni3-Data) <!-- TODO: confirm code repo URL and dataset repo id -->
20
 
21
  Mini-Omni3 is a streaming speech-language model that listens to audio in real time and decides, at each audio chunk, whether to keep listening or to start replying with text. The model alternates between a **LISTENING** state, where it consumes one encoder-output chunk per step and emits either `KEEP_SILENCE` or `TEXT_BEGIN`, and a **SPEAKING** state, where it autoregressively generates a text turn until `TEXT_END` and then returns to listening for the next chunk.
22
 
23
  This design lets the model handle both spoken questions ("answer it") and ambient sounds ("decide based on the sound whether help is needed") within a single streaming session, without an external VAD or turn-taking heuristic.
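
The listen/speak alternation described above can be sketched as a small loop. This is only an illustration of the control flow, not the project's implementation: `stream_session`, `decide`, and `generate` are hypothetical names, and the two callables stand in for the model's per-chunk decision and its text generation.

```python
from typing import Callable, Iterable, Iterator, List

# Control tokens named in the description above.
KEEP_SILENCE = "KEEP_SILENCE"
TEXT_BEGIN = "TEXT_BEGIN"
TEXT_END = "TEXT_END"


def stream_session(
    chunks: Iterable[object],
    decide: Callable[[object], str],
    generate: Callable[[], Iterator[str]],
) -> List[str]:
    """Run the listen/speak loop over audio chunks and return the text turns.

    `decide` and `generate` are hypothetical stand-ins for the model:
    `decide(chunk)` returns KEEP_SILENCE or TEXT_BEGIN for one chunk, and
    `generate()` yields text tokens, ending the turn with TEXT_END.
    """
    turns: List[str] = []
    for chunk in chunks:
        # LISTENING: consume one chunk, then stay silent or start a turn.
        if decide(chunk) == KEEP_SILENCE:
            continue
        # SPEAKING: collect tokens until TEXT_END closes the turn.
        tokens: List[str] = []
        for token in generate():
            if token == TEXT_END:
                break
            tokens.append(token)
        turns.append("".join(tokens))
        # Falling through returns to LISTENING for the next chunk.
    return turns


# Toy session: only the chunk labelled "question" triggers a reply.
toy_turns = stream_session(
    ["noise", "question", "noise"],
    decide=lambda c: TEXT_BEGIN if c == "question" else KEEP_SILENCE,
    generate=lambda: iter(["Hi", "!", TEXT_END]),
)
```

In the toy session the two "noise" chunks produce no output, which is the behaviour that removes the need for an external VAD.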
24
 
 
 
25
  ## Model Details
26
 
27
  - **Model name:** Mini-Omni3
 
37
 
38
  ```text
39
  Mini-Omni3/
40
+ ├── model-00001-of-00004.safetensors # LM weights, sharded (≈4 GB each)
41
+ ├── model-00002-of-00004.safetensors
42
+ ├── model-00003-of-00004.safetensors
43
+ ├── model-00004-of-00004.safetensors
44
+ ├── model.safetensors.index.json # Shard index consumed by safetensors loader
45
+ ├── config.json # Top-level model config
46
+ ├── generation_config.json # Generation defaults
47
+ ├── model_config.yaml # GPT config consumed by Config.from_file
48
+ ├── hyperparameters.yaml # Training-time hyperparameters (reference)
49
+ ├── tokenizer.json # Tokenizer
50
+ ├── tokenizer_config.json
51
+ ├── MiniOmni3_ChunkwisedEncoder.pth # Audio encoder weights (Qwen2.5-Omni audio tower)
52
+ └── qwen25OmniConfig/ # Audio-encoder config (nested: thinker_config.audio_config)
53
  ```
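
Because the LM weights are split across several shards, a quick check that every shard named in the index is present can catch an incomplete download. The sketch below assumes `model.safetensors.index.json` follows the usual sharded-safetensors layout, with a `weight_map` object mapping tensor names to shard file names; `missing_shards` is a hypothetical helper, not part of this repository.

```python
import json
from pathlib import Path
from typing import List


def missing_shards(checkpoint_dir: str) -> List[str]:
    """Return shard files named in the index that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # Assumption: "weight_map" maps each tensor name to the shard storing it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]
```

An empty list from `missing_shards("checkpoints")` means every shard referenced by the index is on disk.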
54
 
55
  ## Intended Use
 
70
 
71
  ### Download the checkpoint
72
 
73
+ From the `Mini-Omni3` project root, pull the weights into `checkpoints/`:
74
+
75
  ```python
76
  from huggingface_hub import snapshot_download
77
 
78
+ snapshot_download(repo_id="zhifeixie/Mini-Omni3", local_dir="checkpoints")
 
 
 
 
79
  ```
80
 
81
+ `snapshot_download` is the recommended path: it pulls every file, resumes on interruption, and is the only way the download counter on this page advances. Please avoid `git clone` of the HF repo or the web "Download" button if you want your run reflected in the stats.
82
+
83
  ### Python Usage
84
 
85
  ```python
86
  from src.miniomni3.generate.run import run_inference
87
 
88
  run_inference(
89
+ checkpoint_dir="checkpoints",
90
  audio_paths=["/path/to/audio.wav"], # offline mode: one round per path
91
  device="cuda:0", # or "mps" / "cpu"
92
  )
 
95
  For interactive use, omit `audio_paths` and `run_inference` will prompt for an audio path each round:
96
 
97
  ```python
98
+ run_inference(checkpoint_dir="checkpoints", rounds=5, device="cuda:0")
99
  ```
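
For the offline mode, where each entry in `audio_paths` is one round, the list can be built from a folder of recordings. This is a minimal sketch: `collect_wavs` is a hypothetical helper, and it assumes the inputs are `.wav` files, as in the example path above.

```python
from pathlib import Path
from typing import List


def collect_wavs(audio_dir: str) -> List[str]:
    """Return the .wav files directly inside audio_dir, sorted by name."""
    return sorted(str(p) for p in Path(audio_dir).glob("*.wav"))
```

The result can then be passed as `audio_paths=collect_wavs("recordings")`, where `recordings` is a placeholder directory name.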
100
 
101
  ## Streaming Protocol