Spaces:
Sleeping
Sleeping
| # Enhanced Chatterbox TTS API | |
| This package contains the modular components of the Enhanced Chatterbox TTS API with GPU-accelerated processing, intelligent text chunking, and server-side audio concatenation. | |
| ## Features | |
| - **GPU-Accelerated Processing**: Leverage server GPU for parallel chunk processing | |
| - **Intelligent Text Chunking**: Smart text splitting that respects sentence and paragraph boundaries | |
| - **Server-Side Concatenation**: Seamless audio merging with fade effects and silence control | |
| - **Voice Cloning**: Optional voice prompt for personalized speech generation | |
| - **Multiple Response Formats**: Streaming audio, complete files, or JSON with base64 encoding | |
| - **Scalable Architecture**: Handles texts of any length efficiently | |
| ## Structure | |
| ``` | |
| api/ | |
| βββ __init__.py # Package initialization and exports | |
| βββ config.py # Modal app configuration and container image setup | |
| βββ models.py # Pydantic request/response models (enhanced with full-text support) | |
| βββ audio_utils.py # Audio processing utilities and helper functions | |
| βββ text_processing.py # Server-side text chunking and audio concatenation | |
| βββ tts_service.py # Main TTS service class with all API endpoints | |
| βββ test_api.py # Comprehensive API testing suite | |
| βββ README.md # This file | |
| ``` | |
| ## Components | |
| ### config.py | |
| - Modal app configuration with GPU support (A10G) | |
| - Container image setup with required dependencies | |
| - Centralized configuration management | |
| - Memory snapshot and scaling configuration | |
| ### models.py | |
| - `TTSRequest`: Standard request model for TTS generation | |
| - `FullTextTTSRequest`: Enhanced request model for full-text processing with chunking parameters | |
| - `TTSResponse`: Standard response model for JSON endpoints | |
| - `FullTextTTSResponse`: Enhanced response with processing information | |
| - `HealthResponse`: Response model for health checks | |
| - All models include proper type hints, validation, and documentation | |
| ### text_processing.py | |
| - `TextChunker`: Intelligent server-side text chunking with configurable parameters | |
| - `AudioConcatenator`: Server-side audio concatenation with fade effects and silence control | |
| - Optimized for GPU processing and large text handling | |
| ### audio_utils.py | |
| - `AudioUtils`: Static utility class for audio operations | |
| - Buffer management for audio data | |
| - Temporary file handling with automatic cleanup | |
| - Reusable audio processing functions | |
| ### tts_service.py | |
| - `ChatterboxTTSService`: Main service class with all endpoints | |
| - GPU-accelerated TTS model loading and inference | |
| - Multiple API endpoints for different use cases | |
| - Comprehensive error handling and validation | |
| - New full-text processing endpoints with parallel chunk processing | |
| ### test_api.py | |
| - Comprehensive testing suite for all API endpoints | |
| - Tests for basic generation, voice cloning, file uploads, and full-text processing | |
| - Performance benchmarking and validation scripts | |
| ## API Endpoints | |
| ### Standard Endpoints | |
| #### `GET /health` | |
| Health check endpoint to verify model status and service availability. | |
| ```bash | |
| curl -X GET "YOUR-ENDPOINT/health" | |
| ``` | |
| #### `POST /generate_audio` | |
| Generate speech audio from text with optional voice cloning (streaming response). | |
| ```bash | |
| curl -X POST "YOUR-ENDPOINT/generate_audio" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text": "Hello world!"}' \ | |
| --output output.wav | |
| ``` | |
| #### `POST /generate_json` | |
| Generate speech and return JSON with base64 encoded audio. | |
| ```bash | |
| curl -X POST "YOUR-ENDPOINT/generate_json" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"text": "Hello world!"}' | |
| ``` | |
| #### `POST /generate_with_file` | |
| Generate speech with file upload for voice cloning. | |
| ```bash | |
| curl -X POST "YOUR-ENDPOINT/generate_with_file" \ | |
| -F "text=Hello world!" \ | |
| -F "voice_prompt=@voice_sample.wav" \ | |
| --output output.wav | |
| ``` | |
| ### Enhanced Full-Text Endpoints | |
| #### `POST /generate_full_text_audio` | |
| π Generate speech from full text with server-side chunking and parallel processing. | |
| ```bash | |
| curl -X POST "YOUR-ENDPOINT/generate_full_text_audio" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "text": "Your very long text here...", | |
| "max_chunk_size": 800, | |
| "silence_duration": 0.5, | |
| "fade_duration": 0.1, | |
| "overlap_sentences": 0 | |
| }' \ | |
| --output full_text_output.wav | |
| ``` | |
| #### `POST /generate_full_text_json` | |
| π Generate speech from full text and return JSON with processing information. | |
| ```bash | |
| curl -X POST "YOUR-ENDPOINT/generate_full_text_json" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "text": "Your very long text here...", | |
| "max_chunk_size": 800, | |
| "silence_duration": 0.5 | |
| }' | |
| ``` | |
| ### Legacy Endpoints | |
| #### `POST /generate` | |
| Legacy endpoint for backward compatibility. | |
| ```bash | |
| curl -X POST "YOUR-ENDPOINT/generate?prompt=Hello%20world!" \ | |
| --output legacy_output.wav | |
| ``` | |
| ## Request Parameters | |
| ### FullTextTTSRequest Parameters | |
| - **`text`** (required): The text to convert to speech (any length) | |
| - **`voice_prompt_base64`** (optional): Base64 encoded voice prompt for cloning | |
| - **`max_chunk_size`** (optional, default: 800): Maximum characters per chunk | |
| - **`silence_duration`** (optional, default: 0.5): Silence between chunks in seconds | |
| - **`fade_duration`** (optional, default: 0.1): Fade in/out duration in seconds | |
| - **`overlap_sentences`** (optional, default: 0): Sentences to overlap between chunks | |
| ## Response Headers | |
| Enhanced endpoints include additional headers with processing information: | |
| - **`X-Audio-Duration`**: Duration of generated audio in seconds | |
| - **`X-Chunks-Processed`**: Number of text chunks processed | |
| - **`X-Total-Characters`**: Total characters in the input text | |
| ## Usage | |
| ```python | |
| from api import app, ChatterboxTTSService | |
| # The app is automatically configured and ready to deploy | |
| # The service class contains all the endpoints | |
| ``` | |
| ### Python Client Example | |
| ```python | |
| import requests | |
| # Generate audio from long text | |
| response = requests.post( | |
| "YOUR-ENDPOINT/generate_full_text_audio", | |
| json={ | |
| "text": "Your long document text here...", | |
| "max_chunk_size": 800, | |
| "silence_duration": 0.5 | |
| } | |
| ) | |
| if response.status_code == 200: | |
| with open("output.wav", "wb") as f: | |
| f.write(response.content) | |
| print("Audio generated successfully!") | |
| ``` | |
| ## Performance Characteristics | |
| ### Standard Processing | |
| - **Text Length**: Up to ~1000 characters optimal | |
| - **Processing Time**: ~2-5 seconds per request | |
| - **Use Case**: Short texts, real-time applications | |
| ### Full-Text Processing | |
| - **Text Length**: Unlimited (automatically chunked) | |
| - **Processing Time**: ~5-15 seconds for long documents | |
| - **Parallelization**: Up to 4 concurrent chunks | |
| - **Use Case**: Documents, articles, books | |
| ## Deployment | |
| ```bash | |
| # Deploy the enhanced API | |
| modal deploy tts_service.py | |
| # Test the deployment | |
| python test_api.py | |
| ``` | |
| ```` | |
| ## Benefits of Enhanced Architecture | |
| 1. **GPU Acceleration**: Server-side processing leverages GPU resources for faster inference | |
| 2. **Intelligent Chunking**: Smart text splitting that preserves sentence integrity | |
| 3. **Parallel Processing**: Multiple chunks processed simultaneously for better performance | |
| 4. **Scalability**: Handles texts of any length without client-side limitations | |
| 5. **Separation of Concerns**: Each file has a specific responsibility | |
| 6. **Maintainability**: Easier to update and modify individual components | |
| 7. **Testability**: Components can be tested in isolation | |
| 8. **Reusability**: Components can be imported and used in other projects | |
| 9. **Readability**: Smaller files are easier to understand and navigate | |
| ## Testing | |
| Run the comprehensive test suite: | |
| ```bash | |
| cd api/ | |
| python test_api.py | |
| ```` | |
| The test suite includes: | |
| - Health check validation | |
| - Basic text-to-speech generation | |
| - JSON response testing | |
| - Voice cloning functionality | |
| - File upload testing | |
| - Full-text processing validation | |
| - Performance benchmarking | |
| ## Environment Variables | |
| Set these environment variables for testing: | |
| ```bash | |
| HEALTH_ENDPOINT=https://your-modal-endpoint.modal.run/health | |
| GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint.modal.run/generate_audio | |
| GENERATE_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_json | |
| GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint.modal.run/generate_with_file | |
| GENERATE_ENDPOINT=https://your-modal-endpoint.modal.run/generate | |
| FULL_TEXT_TTS_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_audio | |
| FULL_TEXT_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_json | |
| ``` | |