# Chatterbox TTS FastAPI

This project provides a **FastAPI**-based web service for the Chatterbox TTS text-to-speech system, designed to be compatible with OpenAI's TTS API format.

## Features

- **OpenAI-compatible API**: Mirrors the endpoint structure of OpenAI's text-to-speech API
- **FastAPI Performance**: High-performance async API with automatic documentation
- **Type Safety**: Full Pydantic validation for requests and responses
- **Interactive Documentation**: Automatic Swagger UI and ReDoc generation
- **Automatic text chunking**: Automatically breaks long text into manageable chunks to handle character limits
- **Voice cloning**: Conditions generated speech on the configured `voice-sample.mp3` reference file
- **Async Support**: Non-blocking request handling with better concurrency
- **Error handling**: Comprehensive error handling with appropriate HTTP status codes
- **Health monitoring**: Health check endpoint for monitoring service status
- **Environment-based configuration**: Fully configurable via environment variables
- **Docker support**: Ready for containerized deployment

## Setup

### Prerequisites

1. Ensure you have the Chatterbox TTS package installed:

   ```bash
   pip install chatterbox-tts
   ```

2. Install FastAPI and other required dependencies:

   ```bash
   pip install fastapi "uvicorn[standard]" torchaudio requests python-dotenv
   ```

3. Ensure you have a `voice-sample.mp3` file in the project root directory for voice conditioning

### Configuration

Copy the example environment file and customize it:

```bash
cp .env.example .env
nano .env  # Edit with your preferred settings
```

Key environment variables:

- `PORT=4123` - API server port
- `EXAGGERATION=0.5` - Default emotion intensity (0.25-2.0)
- `CFG_WEIGHT=0.5` - Default pace control (0.0-1.0)
- `TEMPERATURE=0.8` - Default sampling temperature (0.05-5.0)
- `VOICE_SAMPLE_PATH=./voice-sample.mp3` - Path to voice sample file
- `DEVICE=auto` - Device selection (auto/cuda/mps/cpu)

See `.env.example` for all available options.

### Running the API

Start the API server:

```bash
# Method 1: Direct uvicorn (recommended for development)
uvicorn app.main:app --host 0.0.0.0 --port 4123

# Method 2: Using the main script
python main.py

# Method 3: With auto-reload for development
uvicorn app.main:app --host 0.0.0.0 --port 4123 --reload
```

The server will:

- Automatically detect the best available device (CUDA, MPS, or CPU)
- Load the Chatterbox TTS model asynchronously
- Start the FastAPI server on `http://localhost:4123` (or your configured port)
- Provide interactive documentation at `/docs` and `/redoc`

### API Documentation

Once running, you can access:

- **Interactive API Docs (Swagger UI)**: http://localhost:4123/docs
- **Alternative Documentation (ReDoc)**: http://localhost:4123/redoc
- **OpenAPI Schema**: http://localhost:4123/openapi.json

## API Endpoints

### 1. Text-to-Speech Generation

**POST** `/v1/audio/speech`

Generate speech from text using the Chatterbox TTS model.

**Request Body (Pydantic Model):**

```json
{
  "input": "Text to convert to speech",
  "voice": "alloy", // Ignored - uses voice-sample.mp3
  "response_format": "wav", // Ignored - always returns WAV
  "speed": 1.0, // Ignored - use model's built-in parameters
  "exaggeration": 0.7, // Optional - override default (0.25-2.0)
  "cfg_weight": 0.4, // Optional - override default (0.0-1.0)
  "temperature": 0.9 // Optional - override default (0.05-5.0)
}
```

**Validation:**

- `input`: Required, 1-3000 characters, automatically trimmed
- `exaggeration`: Optional, 0.25-2.0 range validation
- `cfg_weight`: Optional, 0.0-1.0 range validation
- `temperature`: Optional, 0.05-5.0 range validation

**Response:**

- Content-Type: `audio/wav`
- Binary audio data in WAV format via StreamingResponse

**Example:**

```bash
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, this is a test of the text to speech system."}' \
  --output speech.wav
```

**With custom parameters:**

```bash
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Dramatic speech!", "exaggeration": 1.2, "cfg_weight": 0.3}' \
  --output dramatic.wav
```

### 2. Health Check

**GET** `/health`

Check if the API is running and the model is loaded.

**Response (HealthResponse model):**

```json
{
  "status": "healthy",
  "model_loaded": true,
  "device": "cuda",
  "config": {
    "max_chunk_length": 280,
    "max_total_length": 3000,
    "voice_sample_path": "./voice-sample.mp3",
    "default_exaggeration": 0.5,
    "default_cfg_weight": 0.5,
    "default_temperature": 0.8
  }
}
```
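Since the model loads asynchronously at startup, clients can poll this endpoint before sending synthesis requests. The helper below is an illustrative sketch, not part of the API; `wait_until_ready` and its parameters are our own names, and the health-fetching callable is injected so the logic is easy to test:

```python
import time

def wait_until_ready(fetch_health, timeout_s=90.0, interval_s=2.0):
    """Poll a health-fetching callable until the model reports loaded.

    fetch_health should return the /health JSON as a dict, e.g.
    lambda: requests.get("http://localhost:4123/health").json()
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            health = fetch_health()
            if health.get("status") == "healthy" and health.get("model_loaded"):
                return health
        except Exception:
            pass  # server may still be starting up; keep polling
        time.sleep(interval_s)
    raise TimeoutError("TTS service did not become ready in time")
```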

### 3. List Models

**GET** `/v1/models`

List available models (OpenAI API compatibility).

**Response (ModelsResponse model):**

```json
{
  "object": "list",
  "data": [
    {
      "id": "chatterbox-tts-1",
      "object": "model",
      "created": 1677649963,
      "owned_by": "resemble-ai"
    }
  ]
}
```

### 4. Configuration Info

**GET** `/config`

Get current configuration (useful for debugging).

**Response (ConfigResponse model):**

```json
{
  "server": {
    "host": "0.0.0.0",
    "port": 4123
  },
  "model": {
    "device": "cuda",
    "voice_sample_path": "./voice-sample.mp3",
    "model_cache_dir": "./models"
  },
  "defaults": {
    "exaggeration": 0.5,
    "cfg_weight": 0.5,
    "temperature": 0.8,
    "max_chunk_length": 280,
    "max_total_length": 3000
  }
}
```

### 5. API Documentation Endpoints

**GET** `/docs` - Interactive Swagger UI documentation  
**GET** `/redoc` - Alternative ReDoc documentation  
**GET** `/openapi.json` - OpenAPI schema specification

## Text Processing

### Automatic Chunking

The API automatically handles long text inputs by:

1. **Character limit**: Splits text longer than the configured chunk size (default: 280 characters)
2. **Sentence preservation**: Attempts to split at sentence boundaries (`.`, `!`, `?`)
3. **Fallback splitting**: If sentences are too long, splits at commas, semicolons, or other natural breaks
4. **Audio concatenation**: Seamlessly combines audio from multiple chunks
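The splitting strategy above can be sketched roughly as follows. This is a simplified illustration of the approach, not the server's actual implementation (`chunk_text` is our own name; the 280-character default matches the documented limit):

```python
import re

def chunk_text(text: str, max_len: int = 280) -> list[str]:
    """Split text into chunks of at most max_len characters,
    preferring sentence boundaries, then commas/semicolons."""
    if len(text) <= max_len:
        return [text]
    # First pass: split at sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        # Fallback: break overly long sentences at natural pauses.
        # (A single clause longer than max_len would still exceed the
        # limit in this sketch; the server imposes a hard total limit.)
        if len(sentence) <= max_len:
            parts = [sentence]
        else:
            parts = re.split(r"(?<=[,;])\s*", sentence)
        for part in parts:
            if current and len(current) + len(part) + 1 > max_len:
                chunks.append(current)
                current = part
            else:
                current = f"{current} {part}".strip()
    if current:
        chunks.append(current)
    return chunks
```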

### Maximum Limits

- **Soft limit**: Configurable characters per chunk (default: 280)
- **Hard limit**: Configurable total characters (default: 3000)
- **Automatic processing**: No manual intervention required

## Error Handling

FastAPI provides enhanced error handling with automatic validation:

- **422 Unprocessable Entity**: Invalid input validation (Pydantic errors)
- **400 Bad Request**: Business logic errors (text too long, etc.)
- **500 Internal Server Error**: Model or processing errors

**Error Response Format:**

```json
{
  "error": {
    "message": "Missing required field: 'input'",
    "type": "invalid_request_error"
  }
}
```

**Validation Error Example:**

```json
{
  "detail": [
    {
      "type": "greater_equal",
      "loc": ["body", "exaggeration"],
      "msg": "Input should be greater than or equal to 0.25",
      "input": 0.1
    }
  ]
}
```
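A client can branch on these status codes and flatten either error shape into a readable message. The helper below is an illustrative sketch (the `describe_error` name is ours, not part of the API):

```python
def describe_error(status_code: int, body: dict) -> str:
    """Turn an error response from the API into a one-line summary."""
    if status_code == 422:
        # Pydantic validation errors arrive as a list under "detail".
        issues = body.get("detail", [])
        return "; ".join(
            f"{'.'.join(str(p) for p in issue.get('loc', []))}: {issue.get('msg', '')}"
            for issue in issues
        ) or "validation error"
    if status_code in (400, 500):
        # Business-logic and server errors use the {"error": {...}} envelope.
        return body.get("error", {}).get("message", "request failed")
    return f"unexpected status {status_code}"
```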

## Testing

Use the enhanced test script to verify the API functionality:

```bash
python tests/test_api.py
```

The test script will:

- Test health check endpoint
- Test models endpoint
- Test API documentation endpoints (new!)
- Generate speech for various text lengths
- Test custom parameter validation
- Test error handling with validation
- Save generated audio files as `test_output_*.wav`

## Configuration

You can configure the API through environment variables, typically set in a `.env` file copied from `.env.example`:

```bash
# Server Configuration
PORT=4123
HOST=0.0.0.0

# TTS Model Settings
EXAGGERATION=0.5          # Emotion intensity (0.25-2.0)
CFG_WEIGHT=0.5            # Pace control (0.0-1.0)
TEMPERATURE=0.8           # Sampling temperature (0.05-5.0)

# Text Processing
MAX_CHUNK_LENGTH=280      # Characters per chunk
MAX_TOTAL_LENGTH=3000     # Total character limit

# Voice and Model Settings
VOICE_SAMPLE_PATH=./voice-sample.mp3
DEVICE=auto               # auto/cuda/mps/cpu
MODEL_CACHE_DIR=./models
```
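Internally, settings like these are typically read with `os.getenv` plus `python-dotenv`, falling back to the documented defaults. A minimal sketch of that pattern (the `load_settings` helper is illustrative, not the app's actual code; the variable names and defaults match the table above):

```python
import os

def load_settings(env=None):
    """Read configuration with the documented defaults as fallbacks.

    In the real app, python-dotenv's load_dotenv() would populate
    os.environ from the .env file before this runs.
    """
    env = os.environ if env is None else env
    return {
        "port": int(env.get("PORT", "4123")),
        "host": env.get("HOST", "0.0.0.0"),
        "exaggeration": float(env.get("EXAGGERATION", "0.5")),
        "cfg_weight": float(env.get("CFG_WEIGHT", "0.5")),
        "temperature": float(env.get("TEMPERATURE", "0.8")),
        "max_chunk_length": int(env.get("MAX_CHUNK_LENGTH", "280")),
        "max_total_length": int(env.get("MAX_TOTAL_LENGTH", "3000")),
        "voice_sample_path": env.get("VOICE_SAMPLE_PATH", "./voice-sample.mp3"),
        "device": env.get("DEVICE", "auto"),
        "model_cache_dir": env.get("MODEL_CACHE_DIR", "./models"),
    }
```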

### Parameter Effects

**Exaggeration (0.25-2.0):**

- `0.3-0.4`: Very neutral, professional
- `0.5`: Neutral (default)
- `0.7-0.8`: More expressive
- `1.0+`: Very dramatic (may be unstable)

**CFG Weight (0.0-1.0):**

- `0.2-0.3`: Faster speech
- `0.5`: Balanced (default)
- `0.7-0.8`: Slower, more deliberate

**Temperature (0.05-5.0):**

- `0.4-0.6`: More consistent
- `0.8`: Balanced (default)
- `1.0+`: More creative/random
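When scripting requests, it can help to validate these parameters client-side against the ranges above, so out-of-range values fail fast instead of round-tripping as a 422. A hedged sketch (`build_payload` is our helper, not an endpoint of the API):

```python
def build_payload(text, exaggeration=None, cfg_weight=None, temperature=None):
    """Build a /v1/audio/speech request body, rejecting out-of-range
    values early using the documented parameter ranges."""
    ranges = {
        "exaggeration": (exaggeration, 0.25, 2.0),
        "cfg_weight": (cfg_weight, 0.0, 1.0),
        "temperature": (temperature, 0.05, 5.0),
    }
    payload = {"input": text.strip()}
    for name, (value, lo, hi) in ranges.items():
        if value is not None:
            if not lo <= value <= hi:
                raise ValueError(f"{name} must be in [{lo}, {hi}], got {value}")
            payload[name] = value
    return payload
```

The resulting dict can be passed directly as the `json=` argument to `requests.post`.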

## Docker Deployment

For Docker deployment, see [DOCKER_README.md](DOCKER_README.md) for complete instructions.

**Quick start with Docker Compose:**

```bash
cp .env.example .env  # Customize as needed
docker compose up -d
```

**Quick start with Docker:**

```bash
docker build -t chatterbox-tts .
docker run -d -p 4123:4123 \
  -v ./voice-sample.mp3:/app/voice-sample.mp3:ro \
  -e EXAGGERATION=0.7 \
  chatterbox-tts
```

## Performance Notes

**FastAPI Benefits:**

- **Async performance**: Better handling of concurrent requests
- **Faster JSON serialization**: ~25% faster than Flask
- **Type validation**: Prevents invalid requests at the API level
- **Auto documentation**: No manual API doc maintenance

**Hardware Recommendations:**

- **Model loading**: The model is loaded once at startup (can take 30-60 seconds)
- **First request**: May be slower due to initial model warm-up
- **Subsequent requests**: Should be faster due to model caching
- **Memory usage**: Varies by device (GPU recommended for best performance)
- **Concurrent requests**: FastAPI async support allows better multi-request handling

## Integration Examples

### Python with requests

```python
import requests

# Basic request
response = requests.post(
    "http://localhost:4123/v1/audio/speech",
    json={"input": "Hello world!"}
)

with open("output.wav", "wb") as f:
    f.write(response.content)

# With custom parameters and validation
response = requests.post(
    "http://localhost:4123/v1/audio/speech",
    json={
        "input": "Exciting news!",
        "exaggeration": 0.8,
        "cfg_weight": 0.4,
        "temperature": 1.0
    }
)

# Handle validation errors
if response.status_code == 422:
    print("Validation error:", response.json())
```

### JavaScript/Node.js

```javascript
const response = await fetch('http://localhost:4123/v1/audio/speech', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    input: 'Hello world!',
    exaggeration: 0.7,
  }),
});

if (response.status === 422) {
  const error = await response.json();
  console.log('Validation error:', error);
} else {
  const audioBuffer = await response.arrayBuffer();
  // Save or play the audio buffer
}
```

### cURL

```bash
# Basic usage
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text here"}' \
  --output output.wav

# With custom parameters
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Dramatic text!", "exaggeration": 1.0, "cfg_weight": 0.3}' \
  --output dramatic.wav

# Test the interactive documentation
curl http://localhost:4123/docs
```

## Development Features

### FastAPI Development Tools

- **Auto-reload**: Use `--reload` flag for development
- **Interactive testing**: Use `/docs` for live API testing
- **Type hints**: Full IDE support with Pydantic models
- **Validation**: Automatic request/response validation
- **OpenAPI**: Machine-readable API specification

### Development Mode

```bash
# Start with auto-reload
uvicorn app.main:app --host 0.0.0.0 --port 4123 --reload

# Or with verbose logging
uvicorn app.main:app --host 0.0.0.0 --port 4123 --log-level debug
```

## Troubleshooting

### Common Issues

1. **Model not loading**: Ensure Chatterbox TTS is properly installed
2. **Voice sample missing**: Verify `voice-sample.mp3` exists at the configured path
3. **CUDA out of memory**: Try using CPU device (`DEVICE=cpu`)
4. **Slow performance**: GPU recommended; ensure CUDA/MPS is available
5. **Port conflicts**: Change `PORT` environment variable to an available port
6. **Uvicorn not found**: Install with `pip install "uvicorn[standard]"`

### FastAPI Specific Issues

**Startup Issues:**

```bash
# Check if uvicorn is installed
uvicorn --version

# Run with verbose logging
uvicorn app.main:app --host 0.0.0.0 --port 4123 --log-level debug

# Alternative startup method
python main.py
```

**Validation Errors:**

Visit `/docs` to see the interactive API documentation and test your requests.

### Checking Configuration

```bash
# Check if API is running
curl http://localhost:4123/health

# View current configuration
curl http://localhost:4123/config

# Check API documentation
curl http://localhost:4123/openapi.json

# Test with simple text
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Test"}' \
  --output test.wav
```

## Migration from Flask

If you're migrating from the previous Flask version:

1. **Dependencies**: Update to `fastapi` and `uvicorn` instead of `flask`
2. **Startup**: Use `uvicorn app.main:app` instead of `python api.py`
3. **Documentation**: Visit `/docs` for interactive API testing
4. **Validation**: Error responses now use HTTP 422 for validation errors
5. **Performance**: Expect 25-40% better performance for JSON responses

All existing API endpoints and request/response formats remain compatible.