appautomaton/vibevoice-mlx

@@ -10,6 +10,7 @@ tags:
 - tts
 - speech
 - voice-conditioned
 - long-form
 - diffusion
 - apple-silicon
@@ -19,9 +20,7 @@ tags:
 # VibeVoice — MLX
-VibeVoice converted and quantized for native MLX inference on Apple Silicon.
-A hybrid LLM + diffusion architecture built for long-form speech and voice-conditioned generation. Works in greedy or sampled mode, and produces natural-sounding output at scale.
 ## Variants
@@ -31,20 +30,40 @@ A hybrid LLM + diffusion architecture built for long-form speech and voice-condi
 ## How to Get Started
-Via [mlx-speech](https://github.com/appautomaton/mlx-speech):
 ```bash
-python scripts/generate_vibevoice.py \
   --text "Hello from VibeVoice." \
   --output outputs/vibevoice.wav
 ```
-```python
-from mlx_speech.generation import VibeVoiceModel
-model = VibeVoiceModel.from_path("mlx-int8")
 ```
 ## Model Details
 VibeVoice uses a 9B-parameter hybrid architecture combining a Qwen2 language model backbone with a continuous diffusion acoustic decoder. Converted to MLX with explicit weight remapping — no PyTorch at inference time.

 - tts
 - speech
 - voice-conditioned
+- multi-speaker
 - long-form
 - diffusion
 - apple-silicon
 # VibeVoice — MLX
+VibeVoice Large converted and quantized for native MLX inference on Apple Silicon. Hybrid LLM + diffusion architecture for long-form speech, multi-speaker dialogue, and voice cloning.
 ## Variants
 ## How to Get Started
+**Single speaker:**
 ```bash
+python scripts/generate/vibevoice.py \
   --text "Hello from VibeVoice." \
   --output outputs/vibevoice.wav
 ```
+**Multi-speaker dialogue** — speaker labels are 0-based:
+```bash
+python scripts/generate/vibevoice.py \
+  --text "Speaker 0: Have you tried VibeVoice?
+Speaker 1: Not yet. Does it need PyTorch?
+Speaker 0: No. Pure MLX, runs locally on Apple Silicon.
+Speaker 1: That is impressive." \
+  --output outputs/dialogue.wav
+```
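The `Speaker N:` transcript passed to `--text` above can also be produced from structured turns. The following is a minimal Python sketch, not part of mlx-speech: `format_dialogue` and `MAX_SPEAKERS` are hypothetical names introduced here, and the limit of 4 is taken from the speaker count this card documents.

```python
from typing import List, Tuple

# Assumption taken from this card: speakers are labelled 0 through 3.
MAX_SPEAKERS = 4


def format_dialogue(turns: List[Tuple[int, str]]) -> str:
    """Render (speaker_index, utterance) pairs as 'Speaker N: ...' lines."""
    lines = []
    for speaker, utterance in turns:
        # Labels are 0-based, so valid indices are 0 .. MAX_SPEAKERS - 1.
        if not 0 <= speaker < MAX_SPEAKERS:
            raise ValueError(
                f"speaker index must be 0..{MAX_SPEAKERS - 1}, got {speaker}"
            )
        lines.append(f"Speaker {speaker}: {utterance.strip()}")
    # One turn per line, matching the multi-line --text argument.
    return "\n".join(lines)


script = format_dialogue([
    (0, "Have you tried VibeVoice?"),
    (1, "Not yet. Does it need PyTorch?"),
])
print(script)
```

The resulting string can be passed as the value of `--text`. Rejecting out-of-range indices early avoids sending the script a label it may not recognize.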
+**Voice cloning** — one reference WAV per speaker:
+```bash
+python scripts/generate/vibevoice.py \
+  --text "Speaker 0: This is cloned from the reference." \
+  --reference-audio-speaker0 ref_speaker0.wav \
+  --output outputs/clone.wav
 ```
+Up to 4 speakers supported: `--reference-audio-speaker0` through `--reference-audio-speaker3`.
+**Default generation settings** (matching upstream):
+- Greedy decoding (deterministic)
+- Seed: 42
+- Diffusion steps: 20
+Add `--no-greedy` to enable temperature + top-p sampling.
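For scripted runs, the flags described above can be assembled programmatically. This is a hedged sketch rather than an mlx-speech API: `build_command` is a hypothetical helper, and it uses only the script path and flags shown in this card (`--text`, `--output`, `--reference-audio-speaker0` through `--reference-audio-speaker3`, `--no-greedy`). It assumes the command is run from a checkout of the mlx-speech repository.

```python
import shlex
from typing import Dict, List, Optional

# Assumption taken from this card: reference flags exist for speakers 0..3.
MAX_SPEAKERS = 4


def build_command(
    text: str,
    output: str,
    references: Optional[Dict[int, str]] = None,
    greedy: bool = True,
) -> List[str]:
    """Build an argv list for scripts/generate/vibevoice.py (hypothetical helper)."""
    cmd = ["python", "scripts/generate/vibevoice.py", "--text", text]
    # One reference WAV per speaker, keyed by 0-based speaker index.
    for speaker in sorted(references or {}):
        if not 0 <= speaker < MAX_SPEAKERS:
            raise ValueError(
                f"speaker index must be 0..{MAX_SPEAKERS - 1}, got {speaker}"
            )
        cmd += [f"--reference-audio-speaker{speaker}", references[speaker]]
    cmd += ["--output", output]
    # Greedy decoding is the documented default; --no-greedy enables sampling.
    if not greedy:
        cmd.append("--no-greedy")
    return cmd


cmd = build_command(
    "Speaker 0: This is cloned from the reference.",
    "outputs/clone.wav",
    references={0: "ref_speaker0.wav"},
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with multi-line dialogue text.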
 ## Model Details
 VibeVoice uses a 9B-parameter hybrid architecture combining a Qwen2 language model backbone with a continuous diffusion acoustic decoder. Converted to MLX with explicit weight remapping — no PyTorch at inference time.