CherithCutestory committed on
Commit 9e71d18 · 1 Parent(s): 5fe511c

First chatterbox engine container

Files changed (4)
  1. Dockerfile +28 -0
  2. README.md +70 -6
  3. app.py +312 -0
  4. requirements.txt +9 -0
Dockerfile ADDED
@@ -0,0 +1,28 @@
+ FROM python:3.11-slim
+
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     libsndfile1 \
+     ffmpeg \
+     git \
+     rubberband-cli \
+     librubberband-dev \
+     && rm -rf /var/lib/apt/lists/*
+
+ RUN useradd -m -u 1000 user
+ WORKDIR /app
+
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY . .
+
+ RUN chown -R user:user /app
+ USER user
+
+ ENV PYTHONUNBUFFERED=1
+ ENV HF_HOME=/app/.cache/huggingface
+
+ EXPOSE 7860
+
+ CMD ["sh", "-c", "OMP_NUM_THREADS=4 exec uvicorn app:app --host 0.0.0.0 --port 7860"]
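Since `app.py` loads the model inside the FastAPI lifespan hook, the container presumably does not answer on port 7860 until loading has finished. A client or deploy script could poll `GET /health` until it reports `model_loaded`. The following is a minimal sketch under that assumption; `fetch_health` and `wait_until_ready` are hypothetical helper names, not part of this commit.

```python
import json
import time
import urllib.request


def fetch_health(base_url):
    """GET <base_url>/health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_ready(fetch, attempts=30, delay=2.0):
    """Call fetch() until it reports model_loaded, or give up after `attempts` tries.

    `fetch` is any zero-argument callable returning the /health payload as a dict.
    """
    for _ in range(attempts):
        try:
            if fetch().get("model_loaded"):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            pass
        time.sleep(delay)
    return False
```

Against a locally running container this could be called as `wait_until_ready(lambda: fetch_health("http://localhost:7860"))`. Passing the fetcher in as a callable keeps the retry loop testable without a network.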
README.md CHANGED
@@ -1,11 +1,75 @@
  ---
- title: Vlengine Chatterbox
- emoji: 🚀
- colorFrom: gray
- colorTo: blue
+ title: VoxLibris Chatterbox TTS Engine
+ emoji: 🗣️
+ colorFrom: purple
+ colorTo: indigo
  sdk: docker
+ app_port: 7860
  pinned: false
- license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # VoxLibris Chatterbox TTS Engine
+
+ A HuggingFace Space that serves [Chatterbox TTS](https://github.com/resemble-ai/chatterbox)
+ as a REST API, implementing the
+ [VoxLibris TTS Engine API Contract](https://github.com/your-repo/docs/tts-api-contract.md).
+
+ ## Endpoints
+
+ ### POST /GetEngineDetails
+
+ Returns engine capabilities, supported emotions, and voice cloning support.
+
+ ### POST /ConvertTextToSpeech
+
+ Converts text to speech with voice cloning. Requires a `voice_to_clone_sample`
+ (base64-encoded WAV). Supports emotion-driven expressiveness via the exaggeration
+ parameter, mapped automatically from VoxLibris emotions.
+
+ ### GET /health
+
+ Returns model loading status.
+
+ ## Authentication
+
+ Set the `API_KEY` secret in your HuggingFace Space settings.
+ Requests must include an `Authorization: Bearer <your-key>` header.
+ Leave `API_KEY` unset to disable authentication.
+
+ ## Voice Cloning
+
+ Chatterbox is a voice-cloning TTS engine: every request requires a reference
+ voice sample. Send a base64-encoded WAV file in the `voice_to_clone_sample`
+ field. A clear speech sample of 6-15 seconds works best.
+
+ ## Emotion Support
+
+ Chatterbox controls expressiveness through its `exaggeration` parameter (0.0-1.0).
+ The engine automatically maps VoxLibris emotions to appropriate exaggeration levels:
+
+ | Emotion  | Exaggeration | Description               |
+ |----------|--------------|---------------------------|
+ | neutral  | 0.50         | Normal, conversational    |
+ | calm     | 0.40         | Subdued, relaxed          |
+ | happy    | 0.70         | Cheerful, upbeat          |
+ | sad      | 0.60         | Somber, downcast          |
+ | angry    | 0.85         | Intense, forceful         |
+ | fear     | 0.75         | Tense, urgent             |
+ | excited  | 0.90         | High energy, enthusiastic |
+ | surprise | 0.80         | Startled, astonished      |
+
+ The `intensity` parameter (1-100) scales the base value by `intensity / 50`; the result is clamped to 0.0-1.0.
+
+ ## Limits
+
+ - Maximum 300 characters per request (longer text is truncated, at a word boundary where possible)
+ - Output: 24 kHz mono 16-bit WAV
+
+ ## Deployment
+
+ 1. Create a new HuggingFace Space with the **Docker** SDK
+ 2. Upload the contents of this folder
+ 3. Set the `API_KEY` secret in Space settings (optional)
+ 4. The model downloads automatically on first startup (~500 MB)
+ 5. Requires GPU (T4 minimum recommended)
+ 6. Register the Space URL in VoxLibris Settings under TTS Engine Management
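The request shape described in the README can be assembled on the client side roughly as follows. This is a minimal sketch using only the standard library; `build_tts_request` is a hypothetical helper, and the field names are taken from the `ConvertRequest` model in `app.py` rather than from the VoxLibris contract itself.

```python
import base64
import json


def build_tts_request(text, wav_bytes, emotion="neutral", intensity=50, api_key=None):
    """Return (headers, body) for POST /ConvertTextToSpeech.

    Hypothetical helper: field names follow the ConvertRequest model in app.py.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        # The server compares the whole header value against "Bearer <API_KEY>".
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "input_text": text,
        # The reference voice is sent as a base64-encoded WAV file.
        "voice_to_clone_sample": base64.b64encode(wav_bytes).decode("ascii"),
        "emotion_set": [emotion],
        "intensity": intensity,
    }
    return headers, json.dumps(payload).encode("utf-8")


# Placeholder bytes standing in for a real WAV file.
headers, body = build_tts_request("Hello there.", b"RIFF" + b"\x00" * 40, "happy", 70, "secret")
```

The resulting `body` and `headers` could then be sent with, for example, `urllib.request.Request(url, data=body, headers=headers, method="POST")`, where `url` is the deployment-specific Space URL followed by `/ConvertTextToSpeech`. On success the response body is the WAV audio itself (`audio/wav`), not JSON.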
app.py ADDED
@@ -0,0 +1,312 @@
+ import os
+ os.environ.setdefault("OMP_NUM_THREADS", "4")
+
+ import io
+ import base64
+ import tempfile
+ import logging
+ import wave
+ import numpy as np
+ import torch
+ import pyrubberband as pyrb
+ from contextlib import asynccontextmanager
+ from pathlib import Path
+ from fastapi import FastAPI, Request, HTTPException
+ from fastapi.responses import Response, JSONResponse, HTMLResponse
+ from pydantic import BaseModel, Field
+ from typing import Optional
+
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger("chatterbox-engine")
+
+ BEARER_TOKEN = os.environ.get("API_KEY", "")
+ SAMPLE_RATE = 24000
+ BIT_DEPTH = 16
+ CHANNELS = 1
+ MAX_SECONDS = 30
+ MAX_CHARS = 300
+
+ EMOTION_EXAGGERATION_MAP = {
+     "neutral": 0.5,
+     "happy": 0.7,
+     "sad": 0.6,
+     "angry": 0.85,
+     "fear": 0.75,
+     "fearful": 0.75,
+     "surprise": 0.8,
+     "disgust": 0.7,
+     "excited": 0.9,
+     "calm": 0.4,
+     "confused": 0.5,
+     "anxious": 0.75,
+     "hopeful": 0.6,
+     "melancholy": 0.55,
+ }
+
+ EMOTION_CFG_MAP = {
+     "neutral": 0.5,
+     "happy": 0.3,
+     "sad": 0.6,
+     "angry": 0.3,
+     "fear": 0.4,
+     "fearful": 0.4,
+     "surprise": 0.3,
+     "disgust": 0.5,
+     "excited": 0.2,
+     "calm": 0.7,
+     "confused": 0.5,
+     "anxious": 0.4,
+     "hopeful": 0.4,
+     "melancholy": 0.6,
+ }
+
+ CANONICAL_EMOTIONS = [
+     "neutral", "happy", "sad", "angry", "fear",
+     "surprise", "disgust", "excited", "calm", "confused",
+     "anxious", "hopeful", "melancholy", "fearful",
+ ]
+
+ tts_model = None
+
+
+ def load_model():
+     global tts_model
+     from chatterbox.tts import ChatterboxTTS
+
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     logger.info(f"Loading Chatterbox TTS model on {device}...")
+     tts_model = ChatterboxTTS.from_pretrained(device=device)
+     logger.info("Chatterbox TTS model loaded successfully.")
+
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     load_model()
+     yield
+
+
+ app = FastAPI(title="Chatterbox TTS Engine", lifespan=lifespan)
+
+
+ def verify_auth(request: Request):
+     if not BEARER_TOKEN:
+         return
+     auth = request.headers.get("Authorization", "")
+     if auth != f"Bearer {BEARER_TOKEN}":
+         raise HTTPException(status_code=401, detail="Unauthorized")
+
+
+ def numpy_to_wav_bytes(audio_np: np.ndarray, sample_rate: int) -> bytes:
+     audio_np = np.clip(audio_np, -1.0, 1.0)
+     audio_int16 = (audio_np * 32767).astype(np.int16)
+
+     buf = io.BytesIO()
+     with wave.open(buf, "wb") as wf:
+         wf.setnchannels(CHANNELS)
+         wf.setsampwidth(2)
+         wf.setframerate(sample_rate)
+         wf.writeframes(audio_int16.tobytes())
+     return buf.getvalue()
+
+
+ class ConvertRequest(BaseModel):
+     input_text: str
+     builtin_voice_id: Optional[str] = None
+     voice_to_clone_sample: Optional[str] = None
+     random_seed: Optional[int] = None
+     emotion_set: list[str] = Field(default_factory=lambda: ["neutral"])
+     intensity: int = Field(default=50, ge=1, le=100)
+     volume: int = Field(default=75, ge=1, le=100)
+     speed_adjust: float = Field(default=0.0, ge=-5.0, le=5.0)
+     pitch_adjust: float = Field(default=0.0, ge=-5.0, le=5.0)
+
+
+ @app.post("/GetEngineDetails")
+ async def get_engine_details(request: Request):
+     verify_auth(request)
+
+     return {
+         "engine_id": "chatterbox",
+         "engine_name": "Chatterbox TTS",
+         "sample_rate": SAMPLE_RATE,
+         "bit_depth": BIT_DEPTH,
+         "channels": CHANNELS,
+         "max_seconds_per_conversion": MAX_SECONDS,
+         "supports_voice_cloning": True,
+         "builtin_voices": [],
+         "supported_emotions": CANONICAL_EMOTIONS,
+         "extra_properties": {
+             "model": "ResembleAI/chatterbox",
+             "max_characters": MAX_CHARS,
+         }
+     }
+
+
+ @app.post("/ConvertTextToSpeech")
+ async def convert_text_to_speech(request: Request):
+     verify_auth(request)
+
+     try:
+         body = await request.json()
+         req = ConvertRequest(**body)
+     except Exception as e:
+         return JSONResponse(
+             status_code=400,
+             content={"error": str(e), "error_code": "INVALID_REQUEST"}
+         )
+
+     if not req.input_text.strip():
+         return JSONResponse(
+             status_code=400,
+             content={"error": "Input text is empty", "error_code": "INVALID_REQUEST"}
+         )
+
+     if not req.voice_to_clone_sample:
+         return JSONResponse(
+             status_code=400,
+             content={
+                 "error": "Chatterbox requires a voice sample for cloning. "
+                          "Please provide a voice_to_clone_sample.",
+                 "error_code": "CLONING_NOT_SUPPORTED"
+             }
+         )
+
+     if req.random_seed is not None and req.random_seed > 0:
+         torch.manual_seed(req.random_seed)
+         if torch.cuda.is_available():
+             torch.cuda.manual_seed(req.random_seed)
+
+     temp_files = []
+
+     try:
+         try:
+             wav_bytes = base64.b64decode(req.voice_to_clone_sample, validate=True)
+         except Exception:
+             return JSONResponse(
+                 status_code=400,
+                 content={
+                     "error": "Invalid voice_to_clone_sample: not valid base64",
+                     "error_code": "INVALID_REQUEST"
+                 }
+             )
+
+         if len(wav_bytes) < 44:
+             return JSONResponse(
+                 status_code=400,
+                 content={
+                     "error": "Invalid voice_to_clone_sample: file too small to be valid audio",
+                     "error_code": "INVALID_REQUEST"
+                 }
+             )
+
+         tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+         tmp.write(wav_bytes)
+         tmp.close()
+         speaker_wav_path = tmp.name
+         temp_files.append(tmp.name)
+
+         text = req.input_text
+         if len(text) > MAX_CHARS:
+             truncated = text[:MAX_CHARS]
+             last_space = truncated.rfind(' ')
+             if last_space > MAX_CHARS * 0.6:
+                 truncated = truncated[:last_space]
+             text = truncated
+             logger.warning(f"Text truncated to {len(text)} characters")
+
+         dominant_emotion = req.emotion_set[0].lower() if req.emotion_set else "neutral"
+         base_exaggeration = EMOTION_EXAGGERATION_MAP.get(dominant_emotion, 0.5)
+         intensity_factor = req.intensity / 50.0
+         exaggeration = min(1.0, max(0.0, base_exaggeration * intensity_factor))
+
+         cfg_weight = EMOTION_CFG_MAP.get(dominant_emotion, 0.5)
+
+         temperature = 0.8
+
+         logger.info(
+             f"Generating with Chatterbox: emotion={dominant_emotion}, "
+             f"exaggeration={exaggeration:.2f}, cfg={cfg_weight:.2f}, "
+             f"text_len={len(text)}"
+         )
+
+         wav = tts_model.generate(
+             text,
+             audio_prompt_path=speaker_wav_path,
+             exaggeration=exaggeration,
+             temperature=temperature,
+             cfg_weight=cfg_weight,
+         )
+
+         audio_np = wav.squeeze().cpu().numpy().astype(np.float32)
+
+         if req.speed_adjust != 0.0:
+             speed_factor = 1.0 + (req.speed_adjust / 100.0)
+             speed_factor = max(0.5, min(2.0, speed_factor))
+             if abs(speed_factor - 1.0) > 0.01:
+                 audio_np = pyrb.time_stretch(audio_np, SAMPLE_RATE, speed_factor)
+
+         if req.pitch_adjust != 0.0:
+             semitones = req.pitch_adjust * 0.24
+             audio_np = pyrb.pitch_shift(audio_np, SAMPLE_RATE, semitones)
+
+         vol_factor = req.volume / 75.0
+         audio_np = audio_np * vol_factor
+
+         wav_bytes_out = numpy_to_wav_bytes(audio_np, SAMPLE_RATE)
+
+         return Response(content=wav_bytes_out, media_type="audio/wav")
+
+     except Exception as e:
+         logger.exception("TTS generation failed")
+         return JSONResponse(
+             status_code=500,
+             content={
+                 "error": "Audio generation failed",
+                 "error_code": "GENERATION_FAILED",
+                 "details": str(e)
+             }
+         )
+     finally:
+         for f in temp_files:
+             try:
+                 os.unlink(f)
+             except OSError:
+                 pass
+
+
+ @app.get("/", response_class=HTMLResponse)
+ async def root():
+     html_path = Path(__file__).parent / "index.html"
+     if html_path.exists():
+         return HTMLResponse(content=html_path.read_text())
+     return HTMLResponse(content="""
+     <html>
+     <head><title>Chatterbox TTS Engine</title></head>
+     <body style="font-family: sans-serif; max-width: 800px; margin: 40px auto; padding: 20px;">
+     <h1>Chatterbox TTS Engine</h1>
+     <p>VoxLibris-compatible TTS engine powered by <a href="https://github.com/resemble-ai/chatterbox">Chatterbox TTS</a>.</p>
+     <h2>Endpoints</h2>
+     <ul>
+     <li><code>POST /GetEngineDetails</code> - Get engine capabilities</li>
+     <li><code>POST /ConvertTextToSpeech</code> - Convert text to speech</li>
+     <li><code>GET /health</code> - Health check</li>
+     </ul>
+     <h2>Features</h2>
+     <ul>
+     <li>Voice cloning from reference audio</li>
+     <li>Emotion-driven expressiveness via exaggeration control</li>
+     <li>Speed and pitch adjustment via pyrubberband</li>
+     </ul>
+     </body>
+     </html>
+     """)
+
+
+ @app.get("/health")
+ async def health():
+     return {"status": "ok", "model_loaded": tts_model is not None}
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
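The arithmetic that `convert_text_to_speech` applies to the request parameters can be worked through in isolation. The sketch below mirrors the formulas from the handler as small pure functions; the function names are hypothetical, and the emotion table is only a subset of `EMOTION_EXAGGERATION_MAP`, copied here for illustration.

```python
# Subset of EMOTION_EXAGGERATION_MAP from app.py, copied for illustration.
EMOTION_EXAGGERATION = {"neutral": 0.5, "calm": 0.4, "angry": 0.85, "excited": 0.9}


def exaggeration_for(emotion, intensity):
    """Base value for the emotion, scaled by intensity / 50 and clamped to [0, 1]."""
    base = EMOTION_EXAGGERATION.get(emotion.lower(), 0.5)
    return min(1.0, max(0.0, base * (intensity / 50.0)))


def speed_factor_for(speed_adjust):
    """speed_adjust is treated as a percentage; the factor is clamped to [0.5, 2.0]."""
    return max(0.5, min(2.0, 1.0 + speed_adjust / 100.0))


def semitones_for(pitch_adjust):
    """Each pitch_adjust unit corresponds to 0.24 semitones."""
    return pitch_adjust * 0.24


def volume_gain_for(volume):
    """A volume of 75 is unity gain; samples are clipped to [-1, 1] afterwards."""
    return volume / 75.0


print(exaggeration_for("calm", 50))    # intensity 50 leaves the base value unchanged
print(exaggeration_for("angry", 100))  # 0.85 * 2.0 exceeds 1.0, so it is clamped
```

Given the validated ranges in `ConvertRequest`, `speed_adjust` in [-5, 5] yields a stretch factor of roughly 0.95 to 1.05, and `pitch_adjust` in [-5, 5] yields a shift of about ±1.2 semitones. Any intensity at or above 59 drives `angry` (0.85) to the 1.0 ceiling.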
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ chatterbox-tts>=0.1.0
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ fastapi>=0.104.0
+ uvicorn[standard]>=0.24.0
+ numpy
+ pydantic>=2.0.0
+ pyrubberband>=0.3.0
+ soundfile>=0.12.0