Spaces:

Agents-MCP-Hackathon
/

pdf_explainer

Sleeping

File size: 8,556 Bytes

# Enhanced Chatterbox TTS API

This package contains the modular components of the Enhanced Chatterbox TTS API with GPU-accelerated processing, intelligent text chunking, and server-side audio concatenation.

## Features

- **GPU-Accelerated Processing**: Leverage server GPU for parallel chunk processing
- **Intelligent Text Chunking**: Smart text splitting that respects sentence and paragraph boundaries
- **Server-Side Concatenation**: Seamless audio merging with fade effects and silence control
- **Voice Cloning**: Optional voice prompt for personalized speech generation
- **Multiple Response Formats**: Streaming audio, complete files, or JSON with base64 encoding
- **Scalable Architecture**: Handles texts of any length efficiently

## Structure

```
api/
├── __init__.py          # Package initialization and exports
├── config.py            # Modal app configuration and container image setup
├── models.py            # Pydantic request/response models (enhanced with full-text support)
├── audio_utils.py       # Audio processing utilities and helper functions
├── text_processing.py   # Server-side text chunking and audio concatenation
├── tts_service.py       # Main TTS service class with all API endpoints
├── test_api.py          # Comprehensive API testing suite
└── README.md           # This file
```

## Components

### config.py

- Modal app configuration with GPU support (A10G)
- Container image setup with required dependencies
- Centralized configuration management
- Memory snapshot and scaling configuration

### models.py

- `TTSRequest`: Standard request model for TTS generation
- `FullTextTTSRequest`: Enhanced request model for full-text processing with chunking parameters
- `TTSResponse`: Standard response model for JSON endpoints
- `FullTextTTSResponse`: Enhanced response with processing information
- `HealthResponse`: Response model for health checks
- All models include proper type hints, validation, and documentation

### text_processing.py

- `TextChunker`: Intelligent server-side text chunking with configurable parameters
- `AudioConcatenator`: Server-side audio concatenation with fade effects and silence control
- Optimized for GPU processing and large text handling

### audio_utils.py

- `AudioUtils`: Static utility class for audio operations
- Buffer management for audio data
- Temporary file handling with automatic cleanup
- Reusable audio processing functions

### tts_service.py

- `ChatterboxTTSService`: Main service class with all endpoints
- GPU-accelerated TTS model loading and inference
- Multiple API endpoints for different use cases
- Comprehensive error handling and validation
- New full-text processing endpoints with parallel chunk processing

### test_api.py

- Comprehensive testing suite for all API endpoints
- Tests for basic generation, voice cloning, file uploads, and full-text processing
- Performance benchmarking and validation scripts

## API Endpoints

### Standard Endpoints

#### `GET /health`

Health check endpoint to verify model status and service availability.

```bash
curl -X GET "YOUR-ENDPOINT/health"
```

#### `POST /generate_audio`

Generate speech audio from text with optional voice cloning (streaming response).

```bash
curl -X POST "YOUR-ENDPOINT/generate_audio" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world!"}' \
  --output output.wav
```

#### `POST /generate_json`

Generate speech and return JSON with base64 encoded audio.

```bash
curl -X POST "YOUR-ENDPOINT/generate_json" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world!"}'
```

#### `POST /generate_with_file`

Generate speech with file upload for voice cloning.

```bash
curl -X POST "YOUR-ENDPOINT/generate_with_file" \
  -F "text=Hello world!" \
  -F "voice_prompt=@voice_sample.wav" \
  --output output.wav
```

### Enhanced Full-Text Endpoints

#### `POST /generate_full_text_audio`

🆕 Generate speech from full text with server-side chunking and parallel processing.

```bash
curl -X POST "YOUR-ENDPOINT/generate_full_text_audio" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your very long text here...",
    "max_chunk_size": 800,
    "silence_duration": 0.5,
    "fade_duration": 0.1,
    "overlap_sentences": 0
  }' \
  --output full_text_output.wav
```

#### `POST /generate_full_text_json`

🆕 Generate speech from full text and return JSON with processing information.

```bash
curl -X POST "YOUR-ENDPOINT/generate_full_text_json" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your very long text here...",
    "max_chunk_size": 800,
    "silence_duration": 0.5
  }'
```

### Legacy Endpoints

#### `POST /generate`

Legacy endpoint for backward compatibility.

```bash
curl -X POST "YOUR-ENDPOINT/generate?prompt=Hello%20world!" \
  --output legacy_output.wav
```

## Request Parameters

### FullTextTTSRequest Parameters

- **`text`** (required): The text to convert to speech (any length)
- **`voice_prompt_base64`** (optional): Base64 encoded voice prompt for cloning
- **`max_chunk_size`** (optional, default: 800): Maximum characters per chunk
- **`silence_duration`** (optional, default: 0.5): Silence between chunks in seconds
- **`fade_duration`** (optional, default: 0.1): Fade in/out duration in seconds
- **`overlap_sentences`** (optional, default: 0): Sentences to overlap between chunks

## Response Headers

Enhanced endpoints include additional headers with processing information:

- **`X-Audio-Duration`**: Duration of generated audio in seconds
- **`X-Chunks-Processed`**: Number of text chunks processed
- **`X-Total-Characters`**: Total characters in the input text

## Usage

```python
from api import app, ChatterboxTTSService

# The app is automatically configured and ready to deploy
# The service class contains all the endpoints
```

### Python Client Example

```python
import requests

# Generate audio from long text
response = requests.post(
    "YOUR-ENDPOINT/generate_full_text_audio",
    json={
        "text": "Your long document text here...",
        "max_chunk_size": 800,
        "silence_duration": 0.5
    }
)

if response.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(response.content)
    print("Audio generated successfully!")
```

## Performance Characteristics

### Standard Processing

- **Text Length**: Up to ~1000 characters optimal
- **Processing Time**: ~2-5 seconds per request
- **Use Case**: Short texts, real-time applications

### Full-Text Processing

- **Text Length**: Unlimited (automatically chunked)
- **Processing Time**: ~5-15 seconds for long documents
- **Parallelization**: Up to 4 concurrent chunks
- **Use Case**: Documents, articles, books

## Deployment

```bash
# Deploy the enhanced API
modal deploy tts_service.py

# Test the deployment
python test_api.py
```

````

## Benefits of Enhanced Architecture

1. **GPU Acceleration**: Server-side processing leverages GPU resources for faster inference
2. **Intelligent Chunking**: Smart text splitting that preserves sentence integrity
3. **Parallel Processing**: Multiple chunks processed simultaneously for better performance
4. **Scalability**: Handles texts of any length without client-side limitations
5. **Separation of Concerns**: Each file has a specific responsibility
6. **Maintainability**: Easier to update and modify individual components
7. **Testability**: Components can be tested in isolation
8. **Reusability**: Components can be imported and used in other projects
9. **Readability**: Smaller files are easier to understand and navigate

## Testing

Run the comprehensive test suite:

```bash
cd api/
python test_api.py
````

The test suite includes:

- Health check validation
- Basic text-to-speech generation
- JSON response testing
- Voice cloning functionality
- File upload testing
- Full-text processing validation
- Performance benchmarking

## Environment Variables

Set these environment variables for testing:

```bash
HEALTH_ENDPOINT=https://your-modal-endpoint.modal.run/health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint.modal.run/generate_audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint.modal.run/generate_with_file
GENERATE_ENDPOINT=https://your-modal-endpoint.modal.run/generate
FULL_TEXT_TTS_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_audio
FULL_TEXT_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_json
```