Foradc Claude Sonnet 4.6 committed on
Commit
8b035dd
·
1 Parent(s): 737927a

docs: update README for 6-engine Voxtral release


- 6 engines (was 5), 8 pills (was 7)
- Add Voxtral row in engine table with FR★ quality note
- Add Voxtral server setup section (vLLM-Omni, HF token, narrator ref)
- Add VOXTRAL_URL env var note
- Update project structure with voxtral_server.py and make_narrator_reference.py
- Add Voxtral model in auto-download table (~8GB, gated)
- Add voxtral/mistral/vllm to GitHub topics
- Add Mistral AI and arXiv:2508.17494 credits

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
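The `VOXTRAL_URL` env var note added by this commit amounts to a one-line lookup in the backend. A minimal sketch, assuming the documented default of `http://localhost:8000`; the helper name is hypothetical, not taken from the repo:

```python
import os

def resolve_voxtral_url(env=None):
    """Return the vLLM-Omni base URL the backend should target.

    Falls back to the default documented in the README when
    VOXTRAL_URL is unset or blank; trailing slashes are trimmed
    so paths can be appended safely.
    """
    env = os.environ if env is None else env
    url = env.get("VOXTRAL_URL", "").strip()
    return url.rstrip("/") if url else "http://localhost:8000"

print(resolve_voxtral_url({}))  # → http://localhost:8000
print(resolve_voxtral_url({"VOXTRAL_URL": "http://10.0.0.5:8000/"}))  # → http://10.0.0.5:8000
```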

Files changed (1)
  1. README.md +37 -7
README.md CHANGED
@@ -10,7 +10,7 @@ pinned: false
 
 # 🎙 Boovore — Multi-Engine TTS Studio
 
-**Boovore** is a self-hosted, GPU-accelerated Text-to-Speech studio with 5 best-in-class engines and a built-in audiobook generator. Run it on any CUDA machine (tested on RTX 3090) via a clean, dark-mode web UI.
+**Boovore** is a self-hosted, GPU-accelerated Text-to-Speech studio with 6 best-in-class engines and a built-in audiobook generator. Run it on any CUDA machine (tested on RTX 3090) via a clean, dark-mode web UI.
 
 > **Name**: Boovore = *Book* + *Devour* — built to devour books in audio.
 
@@ -27,6 +27,9 @@ pinned: false
 | **F5-TTS** | ★★★★ | ⚡⚡ | French voice cloning |
 | **Fish-Speech 1.5** | ★★★★★ | ⚡⚡ | Multilingual voice cloning (fishaudio) |
 | **Qwen3-TTS** | ★★★★★ | ⚡ | Clone · Custom · Voice Design |
+| **Voxtral 4B** | ★★★★★ | ⚡⚡ | French-first, 68% win vs ElevenLabs (Mistral AI) |
+
+> **Voxtral** uses vLLM-Omni (`mistralai/Voxtral-4B-TTS-2603`) with voice cloning via a reference WAV. Start it separately with `python3 voxtral_server.py`.
 
 ---
 
@@ -38,10 +41,12 @@ In your Space → **Settings → Variables and secrets**, set:
 |---|---|---|
 | `kokoro,f5` | CPU (free tier) | Kokoro · F5-TTS |
 | `kokoro,f5,chatterbox` | GPU T4 (~6 GB) | + Chatterbox |
-| `all` | GPU A10G / A100 | All 5 engines + Qwen3 |
+| `all` | GPU A10G / A100 | All engines + Qwen3 |
 
 > Default is `all` — on free CPU tier, set `kokoro,f5` to avoid crashes.
 
+For **Voxtral**, also set `VOXTRAL_URL` to point to your vLLM-Omni server (default: `http://localhost:8000`).
+
 ---
 
 ## 🚀 Quick Start (Vast.ai / GPU server)
@@ -70,7 +75,26 @@ pip3 install -e . --no-deps
 huggingface-cli download fishaudio/fish-speech-1.5 --local-dir /root/fish-speech-model
 ```
 
-### 2. Start the server
+### 2. (Optional) Start Voxtral TTS server
+
+Voxtral requires a separate vLLM-Omni process (~8 GB VRAM). It needs a Hugging Face token — accept the CC BY-NC license at [mistralai/Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603) first.
+
+```bash
+pip install "vllm[audio]>=0.18.0" httpx soundfile
+export HF_TOKEN=hf_xxxx
+nohup python3 voxtral_server.py >> /root/voxtral.log 2>&1 &
+# Wait 5-10 min for model download + load (first run only)
+```
+
+Optionally generate a narrator reference WAV (for voice cloning):
+
+```bash
+# While the Qwen3 server is running:
+python3 make_narrator_reference.py
+# Output: /workspace/narrator_reference.wav
+```
+
+### 3. Start the main server
 
 ```bash
 nohup python3 server.py --port 7860 >> /root/server.log 2>&1 &
@@ -88,7 +112,7 @@ ssh -p <PORT> root@<HOST> -L 7860:localhost:7860 -N
 
 ## 📖 Features
 
-- **TTS Studio** — one-click engine selector (7 pills), single generate button
+- **TTS Studio** — one-click engine selector (8 pills), single generate button
 - **Audiobook Generator** — import `.txt` / `.pdf` / `.epub`, auto-detect chapters, batch generate with any engine, download per chapter or merge into one WAV
 - **Voice Cloning** — upload a reference audio clip (Chatterbox, F5-TTS, Fish-Speech, Qwen3)
 - **Real-time metrics** — TTFA, RTF, duration, buffer
@@ -100,8 +124,11 @@ ssh -p <PORT> root@<HOST> -L 7860:localhost:7860 -N
 ## 🗂 Project Structure
 
 ```
-server.py — FastAPI backend (5 engines)
-index.html — UI single-page (vanilla JS, aucune dépendance frontend)
+server.py — FastAPI backend (6 engines)
+index.html — single-page UI (vanilla JS, no frontend deps)
+voxtral_server.py — vLLM-Omni server manager (start/stop/status)
+make_narrator_reference.py — generate narrator reference WAV via Qwen3
+narrator_reference.wav — (generated) voice-clone reference for Voxtral
 requirements.txt
 Dockerfile
 ```
@@ -126,12 +153,13 @@ Dockerfile
 | `SWivid/F5-TTS` | ~1.2 GB | F5-TTS |
 | `resemble-ai/chatterbox` | ~1.5 GB | Chatterbox |
 | `fishaudio/fish-speech-1.5` | ~1.4 GB | Fish-Speech |
+| `mistralai/Voxtral-4B-TTS-2603` | ~8 GB (BF16) | Voxtral (gated — HF token required) |
 
 ---
 
 ## 🏷️ GitHub Topics
 
-`text-to-speech` `tts` `voice-cloning` `audiobook` `french-tts` `kokoro` `f5-tts` `fish-speech` `chatterbox` `qwen3` `fastapi` `cuda` `self-hosted` `gpu` `french` `multilingual`
+`text-to-speech` `tts` `voice-cloning` `audiobook` `french-tts` `kokoro` `f5-tts` `fish-speech` `chatterbox` `qwen3` `voxtral` `mistral` `vllm` `fastapi` `cuda` `self-hosted` `gpu` `french` `multilingual`
 
 ---
 
@@ -142,6 +170,8 @@ Dockerfile
 - [Chatterbox](https://github.com/resemble-ai/chatterbox) — ResembleAI
 - [F5-TTS](https://github.com/SWivid/F5-TTS) — SWivid
 - [Kokoro](https://github.com/hexgrad/kokoro) — hexgrad
+- [Voxtral](https://mistral.ai) — Mistral AI (`mistralai/Voxtral-4B-TTS-2603`, CC BY-NC)
+- French prosody preprocessing inspired by [arXiv:2508.17494](https://arxiv.org/abs/2508.17494)
 
 ---
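The Audiobook Generator rows in the diff promise "download per chapter or merge into one WAV". That merge step can be sketched with Python's stdlib `wave` module — a minimal illustration, not the repo's actual implementation (the function name and format check are assumptions):

```python
import wave

def merge_wavs(chapter_paths, out_path):
    """Concatenate per-chapter WAV files into a single WAV.

    All inputs must share the same channel count, sample width and
    sample rate; the first file's parameters define the output.
    """
    with wave.open(str(chapter_paths[0]), "rb") as first:
        params = first.getparams()
    with wave.open(str(out_path), "wb") as out:
        out.setparams(params)  # nframes is fixed up automatically on close
        for path in chapter_paths:
            with wave.open(str(path), "rb") as chunk:
                # Compare (nchannels, sampwidth, framerate) only
                if chunk.getparams()[:3] != params[:3]:
                    raise ValueError(f"mismatched WAV format: {path}")
                out.writeframes(chunk.readframes(chunk.getnframes()))
```

Engines that emit different sample rates would need a resampling pass first; this sketch simply rejects mismatched inputs.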