Text-to-Speech
MLX
Safetensors
Chinese
English
tts
speech
voice-conditioned
multi-speaker
long-form
diffusion
apple-silicon
quantized
8bit
How to use appautomaton/vibevoice-mlx with MLX:
```bash
# Download the model from the Hub
pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir vibevoice-mlx appautomaton/vibevoice-mlx
```
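The same download can also be done from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed; `download_model` is a hypothetical helper name introduced here, while `snapshot_download` with its `repo_id` and `local_dir` parameters is the library's own function.

```python
from pathlib import Path

REPO_ID = "appautomaton/vibevoice-mlx"  # repository named on this page


def download_model(local_dir="vibevoice-mlx", repo_id=REPO_ID):
    """Download every file of the model repository into local_dir."""
    # Imported lazily so the module can be imported without the package installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return Path(path)


if __name__ == "__main__":
    print(download_model())
```

Running the file directly downloads the repository into `vibevoice-mlx/` and prints the resulting path.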
Update model card: Speaker 0/1 format, cloning, upstream defaults
README.md
CHANGED
@@ -10,6 +10,7 @@ tags:
 - tts
 - speech
 - voice-conditioned
+- multi-speaker
 - long-form
 - diffusion
 - apple-silicon
@@ -19,9 +20,7 @@ tags:
 
 # VibeVoice — MLX
 
-VibeVoice converted and quantized for native MLX inference on Apple Silicon.
-
-A hybrid LLM + diffusion architecture built for long-form speech and voice-conditioned generation. Works in greedy or sampled mode, and produces natural-sounding output at scale.
+VibeVoice Large converted and quantized for native MLX inference on Apple Silicon. Hybrid LLM + diffusion architecture for long-form speech, multi-speaker dialogue, and voice cloning.
 
 ## Variants
 
@@ -31,20 +30,40 @@ A hybrid LLM + diffusion architecture built for long-form speech and voice-condi
 
 ## How to Get Started
 
-
-
+**Single speaker:**
 ```bash
-python scripts/
+python scripts/generate/vibevoice.py \
 --text "Hello from VibeVoice." \
 --output outputs/vibevoice.wav
 ```
 
-
-
+**Multi-speaker dialogue** — speaker labels are 0-based:
+```bash
+python scripts/generate/vibevoice.py \
+--text "Speaker 0: Have you tried VibeVoice?
+Speaker 1: Not yet. Does it need PyTorch?
+Speaker 0: No. Pure MLX, runs locally on Apple Silicon.
+Speaker 1: That is impressive." \
+--output outputs/dialogue.wav
+```
 
-
+**Voice cloning** — one reference WAV per speaker:
+```bash
+python scripts/generate/vibevoice.py \
+--text "Speaker 0: This is cloned from the reference." \
+--reference-audio-speaker0 ref_speaker0.wav \
+--output outputs/clone.wav
 ```
 
+Up to 4 speakers supported: `--reference-audio-speaker0` through `--reference-audio-speaker3`.
+
+**Default generation settings** (matching upstream):
+- Greedy decoding (deterministic)
+- Seed: 42
+- Diffusion steps: 20
+
+Add `--no-greedy` to enable temperature + top-p sampling.
+
 ## Model Details
 
 VibeVoice uses a 9B-parameter hybrid architecture combining a Qwen2 language model backbone with a continuous diffusion acoustic decoder. Converted to MLX with explicit weight remapping — no PyTorch at inference time.
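The multi-speaker and voice-cloning invocations added to the README in this change follow a regular pattern: each turn is a `Speaker N: ...` line with a 0-based label, and each cloned voice adds one `--reference-audio-speakerN` flag, for up to 4 speakers. The sketch below assembles such a command line from Python. The helper names `format_dialogue` and `build_command` are hypothetical and not part of the repository; only the script path and the flags are taken from the README text.

```python
import shlex

MAX_SPEAKERS = 4  # README: up to 4 speakers, labels 0 through 3
SCRIPT = "scripts/generate/vibevoice.py"  # script path as shown in the README


def _check_speaker(speaker):
    """Reject labels outside the 0-based range the README describes."""
    if not 0 <= speaker < MAX_SPEAKERS:
        raise ValueError(f"speaker label must be 0..{MAX_SPEAKERS - 1}, got {speaker}")


def format_dialogue(turns):
    """Turn (speaker_index, text) pairs into 'Speaker N: text' lines."""
    lines = []
    for speaker, text in turns:
        _check_speaker(speaker)
        lines.append(f"Speaker {speaker}: {text.strip()}")
    return "\n".join(lines)


def build_command(text, output, references=None, greedy=True):
    """Build an argv list; references maps speaker index -> reference WAV path."""
    cmd = ["python", SCRIPT, "--text", text]
    for speaker in sorted(references or {}):
        _check_speaker(speaker)
        cmd += [f"--reference-audio-speaker{speaker}", str(references[speaker])]
    if not greedy:
        cmd.append("--no-greedy")  # README: enables temperature + top-p sampling
    cmd += ["--output", output]
    return cmd


if __name__ == "__main__":
    script = format_dialogue([(0, "Have you tried VibeVoice?"), (1, "Not yet.")])
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(build_command(script, "outputs/dialogue.wav")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, assuming the working directory is a checkout that contains the script. Building an argv list rather than a shell string avoids quoting problems with the multi-line `--text` value.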