# Chatterbox TTS FastAPI

This project provides a **FastAPI**-based web service for the Chatterbox TTS text-to-speech system, designed to be compatible with OpenAI's TTS API format.

## Features

- **OpenAI-compatible API**: Mirrors the endpoint structure of OpenAI's text-to-speech API
- **FastAPI Performance**: High-performance async API with automatic documentation
- **Type Safety**: Full Pydantic validation for requests and responses
- **Interactive Documentation**: Automatic Swagger UI and ReDoc generation
- **Automatic text chunking**: Automatically breaks long text into manageable chunks to handle character limits
- **Voice cloning**: Conditions generated speech on the configured `voice-sample.mp3` reference file
- **Async Support**: Non-blocking request handling with better concurrency
- **Error handling**: Comprehensive error handling with appropriate HTTP status codes
- **Health monitoring**: Health check endpoint for monitoring service status
- **Environment-based configuration**: Fully configurable via environment variables
- **Docker support**: Ready for containerized deployment

## Setup

### Prerequisites

1. Ensure you have the Chatterbox TTS package installed:

   ```bash
   pip install chatterbox-tts
   ```

2. Install FastAPI and other required dependencies:

   ```bash
   pip install fastapi "uvicorn[standard]" torchaudio requests python-dotenv
   ```

3. Ensure you have a `voice-sample.mp3` file in the project root directory for voice conditioning

### Configuration

Copy the example environment file and customize it:

```bash
cp .env.example .env
nano .env  # Edit with your preferred settings
```

Key environment variables:

- `PORT=4123` - API server port
- `EXAGGERATION=0.5` - Default emotion intensity (0.25-2.0)
- `CFG_WEIGHT=0.5` - Default pace control (0.0-1.0)
- `TEMPERATURE=0.8` - Default sampling temperature (0.05-5.0)
- `VOICE_SAMPLE_PATH=./voice-sample.mp3` - Path to voice sample file
- `DEVICE=auto` - Device selection (auto/cuda/mps/cpu)

See `.env.example` for all available options.

### Running the API

Start the API server:

```bash
# Method 1: Direct uvicorn (recommended for development)
uvicorn app.main:app --host 0.0.0.0 --port 4123

# Method 2: Using the main script
python main.py

# Method 3: With auto-reload for development
uvicorn app.main:app --host 0.0.0.0 --port 4123 --reload
```

The server will:

- Automatically detect the best available device (CUDA, MPS, or CPU)
- Load the Chatterbox TTS model asynchronously
- Start the FastAPI server on `http://localhost:4123` (or your configured port)
- Provide interactive documentation at `/docs` and `/redoc`

### API Documentation

Once running, you can access:

- **Interactive API Docs (Swagger UI)**: http://localhost:4123/docs
- **Alternative Documentation (ReDoc)**: http://localhost:4123/redoc
- **OpenAPI Schema**: http://localhost:4123/openapi.json

## API Endpoints

### 1. Text-to-Speech Generation

**POST** `/v1/audio/speech`

Generate speech from text using the Chatterbox TTS model.

**Request Body (Pydantic Model):**

```json
{
  "input": "Text to convert to speech",
  "voice": "alloy", // Ignored - uses voice-sample.mp3
  "response_format": "wav", // Ignored - always returns WAV
  "speed": 1.0, // Ignored - use model's built-in parameters
  "exaggeration": 0.7, // Optional - override default (0.25-2.0)
  "cfg_weight": 0.4, // Optional - override default (0.0-1.0)
  "temperature": 0.9 // Optional - override default (0.05-5.0)
}
```

**Validation:**

- `input`: Required, 1-3000 characters, automatically trimmed
- `exaggeration`: Optional, 0.25-2.0 range validation
- `cfg_weight`: Optional, 0.0-1.0 range validation
- `temperature`: Optional, 0.05-5.0 range validation

**Response:**

- Content-Type: `audio/wav`
- Binary audio data in WAV format via StreamingResponse

**Example:**

```bash
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, this is a test of the text to speech system."}' \
  --output speech.wav
```

**With custom parameters:**

```bash
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Dramatic speech!", "exaggeration": 1.2, "cfg_weight": 0.3}' \
  --output dramatic.wav
```

### 2. Health Check

**GET** `/health`

Check if the API is running and the model is loaded.

**Response (HealthResponse model):**

```json
{
  "status": "healthy",
  "model_loaded": true,
  "device": "cuda",
  "config": {
    "max_chunk_length": 280,
    "max_total_length": 3000,
    "voice_sample_path": "./voice-sample.mp3",
    "default_exaggeration": 0.5,
    "default_cfg_weight": 0.5,
    "default_temperature": 0.8
  }
}
```
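Since the model loads asynchronously at startup, clients can poll this endpoint before sending synthesis requests. The helper below is an illustrative sketch, not part of the API; `wait_until_ready` and its parameters are our own names, and the health-fetching callable is injected so the logic is easy to test:

```python
import time

def wait_until_ready(fetch_health, timeout_s=90.0, interval_s=2.0):
    """Poll a health-fetching callable until the model reports loaded.

    fetch_health should return the /health JSON as a dict, e.g.
    lambda: requests.get("http://localhost:4123/health").json()
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            health = fetch_health()
            if health.get("status") == "healthy" and health.get("model_loaded"):
                return health
        except Exception:
            pass  # server may still be starting up; keep polling
        time.sleep(interval_s)
    raise TimeoutError("TTS service did not become ready in time")
```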

### 3. List Models

**GET** `/v1/models`

List available models (OpenAI API compatibility).

**Response (ModelsResponse model):**

```json
{
  "object": "list",
  "data": [
    {
      "id": "chatterbox-tts-1",
      "object": "model",
      "created": 1677649963,
      "owned_by": "resemble-ai"
    }
  ]
}
```

### 4. Configuration Info

**GET** `/config`

Get current configuration (useful for debugging).

**Response (ConfigResponse model):**

```json
{
  "server": {
    "host": "0.0.0.0",
    "port": 4123
  },
  "model": {
    "device": "cuda",
    "voice_sample_path": "./voice-sample.mp3",
    "model_cache_dir": "./models"
  },
  "defaults": {
    "exaggeration": 0.5,
    "cfg_weight": 0.5,
    "temperature": 0.8,
    "max_chunk_length": 280,
    "max_total_length": 3000
  }
}
```

### 5. API Documentation Endpoints

**GET** `/docs` - Interactive Swagger UI documentation  
**GET** `/redoc` - Alternative ReDoc documentation  
**GET** `/openapi.json` - OpenAPI schema specification

## Text Processing

### Automatic Chunking

The API automatically handles long text inputs by:

1. **Character limit**: Splits text longer than the configured chunk size (default: 280 characters)
2. **Sentence preservation**: Attempts to split at sentence boundaries (`.`, `!`, `?`)
3. **Fallback splitting**: If sentences are too long, splits at commas, semicolons, or other natural breaks
4. **Audio concatenation**: Seamlessly combines audio from multiple chunks
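The splitting strategy above can be sketched roughly as follows. This is a simplified illustration of the approach, not the server's actual implementation (`chunk_text` is our own name; the 280-character default matches the documented limit):

```python
import re

def chunk_text(text: str, max_len: int = 280) -> list[str]:
    """Split text into chunks of at most max_len characters,
    preferring sentence boundaries, then commas/semicolons."""
    if len(text) <= max_len:
        return [text]
    # First pass: split at sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        # Fallback: break overly long sentences at natural pauses.
        # (A single clause longer than max_len would still exceed the
        # limit in this sketch; the server imposes a hard total limit.)
        if len(sentence) <= max_len:
            parts = [sentence]
        else:
            parts = re.split(r"(?<=[,;])\s*", sentence)
        for part in parts:
            if current and len(current) + len(part) + 1 > max_len:
                chunks.append(current)
                current = part
            else:
                current = f"{current} {part}".strip()
    if current:
        chunks.append(current)
    return chunks
```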

### Maximum Limits

- **Soft limit**: Configurable characters per chunk (default: 280)
- **Hard limit**: Configurable total characters (default: 3000)
- **Automatic processing**: No manual intervention required

## Error Handling

FastAPI provides enhanced error handling with automatic validation:

- **422 Unprocessable Entity**: Invalid input validation (Pydantic errors)
- **400 Bad Request**: Business logic errors (text too long, etc.)
- **500 Internal Server Error**: Model or processing errors

**Error Response Format:**

```json
{
  "error": {
    "message": "Missing required field: 'input'",
    "type": "invalid_request_error"
  }
}
```

**Validation Error Example:**

```json
{
  "detail": [
    {
      "type": "greater_equal",
      "loc": ["body", "exaggeration"],
      "msg": "Input should be greater than or equal to 0.25",
      "input": 0.1
    }
  ]
}
```
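A client can branch on these status codes and flatten either error shape into a readable message. The helper below is an illustrative sketch (the `describe_error` name is ours, not part of the API):

```python
def describe_error(status_code: int, body: dict) -> str:
    """Turn an error response from the API into a one-line summary."""
    if status_code == 422:
        # Pydantic validation errors arrive as a list under "detail".
        issues = body.get("detail", [])
        return "; ".join(
            f"{'.'.join(str(p) for p in issue.get('loc', []))}: {issue.get('msg', '')}"
            for issue in issues
        ) or "validation error"
    if status_code in (400, 500):
        # Business-logic and server errors use the {"error": {...}} envelope.
        return body.get("error", {}).get("message", "request failed")
    return f"unexpected status {status_code}"
```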

## Testing

Use the enhanced test script to verify the API functionality:

```bash
python tests/test_api.py
```

The test script will:

- Test health check endpoint
- Test models endpoint
- Test API documentation endpoints (new!)
- Generate speech for various text lengths
- Test custom parameter validation
- Test error handling with validation
- Save generated audio files as `test_output_*.wav`

## Configuration

You can configure the API through environment variables, typically set in a `.env` file copied from `.env.example`:

```bash
# Server Configuration
PORT=4123
HOST=0.0.0.0

# TTS Model Settings
EXAGGERATION=0.5          # Emotion intensity (0.25-2.0)
CFG_WEIGHT=0.5            # Pace control (0.0-1.0)
TEMPERATURE=0.8           # Sampling temperature (0.05-5.0)

# Text Processing
MAX_CHUNK_LENGTH=280      # Characters per chunk
MAX_TOTAL_LENGTH=3000     # Total character limit

# Voice and Model Settings
VOICE_SAMPLE_PATH=./voice-sample.mp3
DEVICE=auto               # auto/cuda/mps/cpu
MODEL_CACHE_DIR=./models
```
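Internally, settings like these are typically read with `os.getenv` plus `python-dotenv`, falling back to the documented defaults. A minimal sketch of that pattern (the `load_settings` helper is illustrative, not the app's actual code; the variable names and defaults match the table above):

```python
import os

def load_settings(env=None):
    """Read configuration with the documented defaults as fallbacks.

    In the real app, python-dotenv's load_dotenv() would populate
    os.environ from the .env file before this runs.
    """
    env = os.environ if env is None else env
    return {
        "port": int(env.get("PORT", "4123")),
        "host": env.get("HOST", "0.0.0.0"),
        "exaggeration": float(env.get("EXAGGERATION", "0.5")),
        "cfg_weight": float(env.get("CFG_WEIGHT", "0.5")),
        "temperature": float(env.get("TEMPERATURE", "0.8")),
        "max_chunk_length": int(env.get("MAX_CHUNK_LENGTH", "280")),
        "max_total_length": int(env.get("MAX_TOTAL_LENGTH", "3000")),
        "voice_sample_path": env.get("VOICE_SAMPLE_PATH", "./voice-sample.mp3"),
        "device": env.get("DEVICE", "auto"),
        "model_cache_dir": env.get("MODEL_CACHE_DIR", "./models"),
    }
```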

### Parameter Effects

**Exaggeration (0.25-2.0):**

- `0.3-0.4`: Very neutral, professional
- `0.5`: Neutral (default)
- `0.7-0.8`: More expressive
- `1.0+`: Very dramatic (may be unstable)

**CFG Weight (0.0-1.0):**

- `0.2-0.3`: Faster speech
- `0.5`: Balanced (default)
- `0.7-0.8`: Slower, more deliberate

**Temperature (0.05-5.0):**

- `0.4-0.6`: More consistent
- `0.8`: Balanced (default)
- `1.0+`: More creative/random
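When scripting requests, it can help to validate these parameters client-side against the ranges above, so out-of-range values fail fast instead of round-tripping as a 422. A hedged sketch (`build_payload` is our helper, not an endpoint of the API):

```python
def build_payload(text, exaggeration=None, cfg_weight=None, temperature=None):
    """Build a /v1/audio/speech request body, rejecting out-of-range
    values early using the documented parameter ranges."""
    ranges = {
        "exaggeration": (exaggeration, 0.25, 2.0),
        "cfg_weight": (cfg_weight, 0.0, 1.0),
        "temperature": (temperature, 0.05, 5.0),
    }
    payload = {"input": text.strip()}
    for name, (value, lo, hi) in ranges.items():
        if value is not None:
            if not lo <= value <= hi:
                raise ValueError(f"{name} must be in [{lo}, {hi}], got {value}")
            payload[name] = value
    return payload
```

The resulting dict can be passed directly as the `json=` argument to `requests.post`.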

## Docker Deployment

For Docker deployment, see [DOCKER_README.md](DOCKER_README.md) for complete instructions.

**Quick start with Docker Compose:**

```bash
cp .env.example .env  # Customize as needed
docker compose up -d
```

**Quick start with Docker:**

```bash
docker build -t chatterbox-tts .
docker run -d -p 4123:4123 \
  -v ./voice-sample.mp3:/app/voice-sample.mp3:ro \
  -e EXAGGERATION=0.7 \
  chatterbox-tts
```

## Performance Notes

**FastAPI Benefits:**

- **Async performance**: Better handling of concurrent requests
- **Faster JSON serialization**: ~25% faster than Flask
- **Type validation**: Prevents invalid requests at the API level
- **Auto documentation**: No manual API doc maintenance

**Hardware Recommendations:**

- **Model loading**: The model is loaded once at startup (can take 30-60 seconds)
- **First request**: May be slower due to initial model warm-up
- **Subsequent requests**: Should be faster due to model caching
- **Memory usage**: Varies by device (GPU recommended for best performance)
- **Concurrent requests**: FastAPI async support allows better multi-request handling

## Integration Examples

### Python with requests

```python
import requests

# Basic request
response = requests.post(
    "http://localhost:4123/v1/audio/speech",
    json={"input": "Hello world!"}
)

with open("output.wav", "wb") as f:
    f.write(response.content)

# With custom parameters and validation
response = requests.post(
    "http://localhost:4123/v1/audio/speech",
    json={
        "input": "Exciting news!",
        "exaggeration": 0.8,
        "cfg_weight": 0.4,
        "temperature": 1.0
    }
)

# Handle validation errors
if response.status_code == 422:
    print("Validation error:", response.json())
```

### JavaScript/Node.js

```javascript
const response = await fetch('http://localhost:4123/v1/audio/speech', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    input: 'Hello world!',
    exaggeration: 0.7,
  }),
});

if (response.status === 422) {
  const error = await response.json();
  console.log('Validation error:', error);
} else {
  const audioBuffer = await response.arrayBuffer();
  // Save or play the audio buffer
}
```

### cURL

```bash
# Basic usage
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text here"}' \
  --output output.wav

# With custom parameters
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Dramatic text!", "exaggeration": 1.0, "cfg_weight": 0.3}' \
  --output dramatic.wav

# Test the interactive documentation
curl http://localhost:4123/docs
```

## Development Features

### FastAPI Development Tools

- **Auto-reload**: Use `--reload` flag for development
- **Interactive testing**: Use `/docs` for live API testing
- **Type hints**: Full IDE support with Pydantic models
- **Validation**: Automatic request/response validation
- **OpenAPI**: Machine-readable API specification

### Development Mode

```bash
# Start with auto-reload
uvicorn app.main:app --host 0.0.0.0 --port 4123 --reload

# Or with verbose logging
uvicorn app.main:app --host 0.0.0.0 --port 4123 --log-level debug
```

## Troubleshooting

### Common Issues

1. **Model not loading**: Ensure Chatterbox TTS is properly installed
2. **Voice sample missing**: Verify `voice-sample.mp3` exists at the configured path
3. **CUDA out of memory**: Try using CPU device (`DEVICE=cpu`)
4. **Slow performance**: GPU recommended; ensure CUDA/MPS is available
5. **Port conflicts**: Change `PORT` environment variable to an available port
6. **Uvicorn not found**: Install with `pip install "uvicorn[standard]"`

### FastAPI Specific Issues

**Startup Issues:**

```bash
# Check if uvicorn is installed
uvicorn --version

# Run with verbose logging
uvicorn app.main:app --host 0.0.0.0 --port 4123 --log-level debug

# Alternative startup method
python main.py
```

**Validation Errors:**

Visit `/docs` to see the interactive API documentation and test your requests.

### Checking Configuration

```bash
# Check if API is running
curl http://localhost:4123/health

# View current configuration
curl http://localhost:4123/config

# Check API documentation
curl http://localhost:4123/openapi.json

# Test with simple text
curl -X POST http://localhost:4123/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Test"}' \
  --output test.wav
```

## Migration from Flask

If you're migrating from the previous Flask version:

1. **Dependencies**: Update to `fastapi` and `uvicorn` instead of `flask`
2. **Startup**: Use `uvicorn app.main:app` instead of `python api.py`
3. **Documentation**: Visit `/docs` for interactive API testing
4. **Validation**: Error responses now use HTTP 422 for validation errors
5. **Performance**: Expect 25-40% better performance for JSON responses

All existing API endpoints and request/response formats remain compatible.