tamarher committed on
Commit f4742ec · verified · 1 Parent(s): 3e547f0

Update model card: Speaker 0/1 format, cloning, upstream defaults

Files changed (1): README.md +28 -9
README.md CHANGED
@@ -10,6 +10,7 @@ tags:
  - tts
  - speech
  - voice-conditioned
  - long-form
  - diffusion
  - apple-silicon
@@ -19,9 +20,7 @@ tags:

  # VibeVoice — MLX

- VibeVoice converted and quantized for native MLX inference on Apple Silicon.
-
- A hybrid LLM + diffusion architecture built for long-form speech and voice-conditioned generation. Works in greedy or sampled mode, and produces natural-sounding output at scale.

  ## Variants

@@ -31,20 +30,40 @@ A hybrid LLM + diffusion architecture built for long-form speech and voice-condi

  ## How to Get Started

- Via [mlx-speech](https://github.com/appautomaton/mlx-speech):
-
  ```bash
- python scripts/generate_vibevoice.py \
  --text "Hello from VibeVoice." \
  --output outputs/vibevoice.wav
  ```

- ```python
- from mlx_speech.generation import VibeVoiceModel

- model = VibeVoiceModel.from_path("mlx-int8")
  ```

  ## Model Details

  VibeVoice uses a 9B-parameter hybrid architecture combining a Qwen2 language model backbone with a continuous diffusion acoustic decoder. Converted to MLX with explicit weight remapping — no PyTorch at inference time.
 
10
  - tts
11
  - speech
12
  - voice-conditioned
13
+ - multi-speaker
14
  - long-form
15
  - diffusion
16
  - apple-silicon
 
20
 
21
  # VibeVoice — MLX
22
 
23
+ VibeVoice Large converted and quantized for native MLX inference on Apple Silicon. Hybrid LLM + diffusion architecture for long-form speech, multi-speaker dialogue, and voice cloning.
 
 
24
 
25
  ## Variants
26
 
 
30
 
31
  ## How to Get Started
32
 
33
+ **Single speaker:**
 
34
  ```bash
35
+ python scripts/generate/vibevoice.py \
36
  --text "Hello from VibeVoice." \
37
  --output outputs/vibevoice.wav
38
  ```
39
 
40
+ **Multi-speaker dialogue** — speaker labels are 0-based:
41
+ ```bash
42
+ python scripts/generate/vibevoice.py \
43
+ --text "Speaker 0: Have you tried VibeVoice?
44
+ Speaker 1: Not yet. Does it need PyTorch?
45
+ Speaker 0: No. Pure MLX, runs locally on Apple Silicon.
46
+ Speaker 1: That is impressive." \
47
+ --output outputs/dialogue.wav
48
+ ```
49
 
50
+ **Voice cloning** — one reference WAV per speaker:
51
+ ```bash
52
+ python scripts/generate/vibevoice.py \
53
+ --text "Speaker 0: This is cloned from the reference." \
54
+ --reference-audio-speaker0 ref_speaker0.wav \
55
+ --output outputs/clone.wav
56
  ```
57
 
58
+ Up to 4 speakers supported: `--reference-audio-speaker0` through `--reference-audio-speaker3`.
59
+
60
+ **Default generation settings** (matching upstream):
61
+ - Greedy decoding (deterministic)
62
+ - Seed: 42
63
+ - Diffusion steps: 20
64
+
65
+ Add `--no-greedy` to enable temperature + top-p sampling.
66
+
67
  ## Model Details
68
 
69
  VibeVoice uses a 9B-parameter hybrid architecture combining a Qwen2 language model backbone with a continuous diffusion acoustic decoder. Converted to MLX with explicit weight remapping — no PyTorch at inference time.
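
The `Speaker N:` script format and the per-speaker reference flags in the updated README can also be assembled programmatically. The following is a minimal sketch, not part of mlx-speech: `format_dialogue` and `build_command` are hypothetical helpers, and the sketch assumes that speaker labels are limited to 0 through 3, mirroring the four `--reference-audio-speakerN` flags. Only the script path and the `--text`, `--output`, `--reference-audio-speakerN`, and `--no-greedy` flags are taken from the README.

```python
import shlex
from typing import Dict, List, Optional, Sequence, Tuple

# Script path and flag names come from the README diff; everything else
# in this file is a hypothetical helper written for illustration.
SCRIPT = "scripts/generate/vibevoice.py"
MAX_SPEAKERS = 4  # assumption: labels 0..3, matching --reference-audio-speaker0..3

Turn = Tuple[int, str]  # (0-based speaker index, spoken text)


def _check_speaker(speaker: int) -> None:
    """Reject speaker indices outside the assumed 0..3 range."""
    if not 0 <= speaker < MAX_SPEAKERS:
        raise ValueError(
            f"speaker index must be in 0..{MAX_SPEAKERS - 1}, got {speaker}"
        )


def format_dialogue(turns: Sequence[Turn]) -> str:
    """Render turns as newline-separated 'Speaker N: text' lines."""
    lines = []
    for speaker, text in turns:
        _check_speaker(speaker)
        lines.append(f"Speaker {speaker}: {text.strip()}")
    return "\n".join(lines)


def build_command(
    turns: Sequence[Turn],
    output: str,
    references: Optional[Dict[int, str]] = None,
    greedy: bool = True,
) -> List[str]:
    """Build an argv list for the generation script without running it."""
    argv = ["python", SCRIPT, "--text", format_dialogue(turns)]
    refs = references or {}
    for speaker in sorted(refs):
        _check_speaker(speaker)
        # One reference WAV per speaker, as described in the README.
        argv += [f"--reference-audio-speaker{speaker}", refs[speaker]]
    if not greedy:
        # README: --no-greedy switches to temperature + top-p sampling.
        argv.append("--no-greedy")
    argv += ["--output", output]
    return argv


if __name__ == "__main__":
    demo_turns = [
        (0, "Have you tried VibeVoice?"),
        (1, "Not yet. Does it need PyTorch?"),
    ]
    demo_argv = build_command(
        demo_turns,
        "outputs/dialogue.wav",
        references={0: "ref_speaker0.wav"},
    )
    # shlex.join quotes the multi-line --text value for copy-pasting into a shell.
    print(shlex.join(demo_argv))
```

Building an argv list rather than a shell string means the multi-line `--text` value needs no manual quoting; the list can be passed to `subprocess.run(argv, check=True)` from a directory where the script path resolves.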