# Enhanced Chatterbox TTS API
This package contains the modular components of the Enhanced Chatterbox TTS API with GPU-accelerated processing, intelligent text chunking, and server-side audio concatenation.

## Features

- GPU-Accelerated Processing: Leverage server GPU for parallel chunk processing
- Intelligent Text Chunking: Smart text splitting that respects sentence and paragraph boundaries
- Server-Side Concatenation: Seamless audio merging with fade effects and silence control
- Voice Cloning: Optional voice prompt for personalized speech generation
- Multiple Response Formats: Streaming audio, complete files, or JSON with base64 encoding
- Scalable Architecture: Handles texts of any length efficiently

## Structure

```
api/
├── __init__.py          # Package initialization and exports
├── config.py            # Modal app configuration and container image setup
├── models.py            # Pydantic request/response models (enhanced with full-text support)
├── audio_utils.py       # Audio processing utilities and helper functions
├── text_processing.py   # Server-side text chunking and audio concatenation
├── tts_service.py       # Main TTS service class with all API endpoints
├── test_api.py          # Comprehensive API testing suite
└── README.md            # This file
```
## Components

### config.py

- Modal app configuration with GPU support (A10G)
- Container image setup with required dependencies
- Centralized configuration management
- Memory snapshot and scaling configuration

### models.py

- `TTSRequest`: Standard request model for TTS generation
- `FullTextTTSRequest`: Enhanced request model for full-text processing with chunking parameters
- `TTSResponse`: Standard response model for JSON endpoints
- `FullTextTTSResponse`: Enhanced response with processing information
- `HealthResponse`: Response model for health checks
- All models include proper type hints, validation, and documentation

### text_processing.py

- `TextChunker`: Intelligent server-side text chunking with configurable parameters
- `AudioConcatenator`: Server-side audio concatenation with fade effects and silence control
- Optimized for GPU processing and large text handling
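To illustrate the idea behind sentence-aware chunking (a minimal sketch, not the actual `TextChunker` implementation), whole sentences can be greedily packed into chunks up to a size limit:

```python
import re

def chunk_text(text: str, max_chunk_size: int = 800) -> list[str]:
    """Greedily pack whole sentences into chunks of at most
    max_chunk_size characters. A single sentence longer than the
    limit becomes its own chunk rather than being cut mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting only after sentence-final punctuation is what keeps chunk boundaries from producing audible mid-sentence cuts in the synthesized audio.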

### audio_utils.py

- `AudioUtils`: Static utility class for audio operations
- Buffer management for audio data
- Temporary file handling with automatic cleanup
- Reusable audio processing functions
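Temporary-file handling with automatic cleanup is commonly done with a context manager; a hypothetical sketch of the pattern (the actual `AudioUtils` API may differ):

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temp_audio_file(suffix: str = ".wav"):
    """Yield a path to a fresh temporary file and delete it on exit,
    even if the body raises an exception."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # the caller opens/writes the file itself
    try:
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)
```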

### tts_service.py

- `ChatterboxTTSService`: Main service class with all endpoints
- GPU-accelerated TTS model loading and inference
- Multiple API endpoints for different use cases
- Comprehensive error handling and validation
- New full-text processing endpoints with parallel chunk processing

### test_api.py

- Comprehensive testing suite for all API endpoints
- Tests for basic generation, voice cloning, file uploads, and full-text processing
- Performance benchmarking and validation scripts

## API Endpoints

### Standard Endpoints

#### GET /health

Health check endpoint to verify model status and service availability.

```bash
curl -X GET "YOUR-ENDPOINT/health"
```

#### POST /generate_audio

Generate speech audio from text with optional voice cloning (streaming response).

```bash
curl -X POST "YOUR-ENDPOINT/generate_audio" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world!"}' \
  --output output.wav
```

#### POST /generate_json

Generate speech and return JSON with base64-encoded audio.

```bash
curl -X POST "YOUR-ENDPOINT/generate_json" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world!"}'
```

#### POST /generate_with_file

Generate speech with file upload for voice cloning.

```bash
curl -X POST "YOUR-ENDPOINT/generate_with_file" \
  -F "text=Hello world!" \
  -F "voice_prompt=@voice_sample.wav" \
  --output output.wav
```

### Enhanced Full-Text Endpoints

#### POST /generate_full_text_audio

Generate speech from full text with server-side chunking and parallel processing.

```bash
curl -X POST "YOUR-ENDPOINT/generate_full_text_audio" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your very long text here...",
    "max_chunk_size": 800,
    "silence_duration": 0.5,
    "fade_duration": 0.1,
    "overlap_sentences": 0
  }' \
  --output full_text_output.wav
```

#### POST /generate_full_text_json

Generate speech from full text and return JSON with processing information.

```bash
curl -X POST "YOUR-ENDPOINT/generate_full_text_json" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your very long text here...",
    "max_chunk_size": 800,
    "silence_duration": 0.5
  }'
```

### Legacy Endpoints

#### POST /generate

Legacy endpoint for backward compatibility.

```bash
curl -X POST "YOUR-ENDPOINT/generate?prompt=Hello%20world!" \
  --output legacy_output.wav
```

## Request Parameters

### FullTextTTSRequest Parameters

- `text` (required): The text to convert to speech (any length)
- `voice_prompt_base64` (optional): Base64-encoded voice prompt for cloning
- `max_chunk_size` (optional, default: 800): Maximum characters per chunk
- `silence_duration` (optional, default: 0.5): Silence between chunks in seconds
- `fade_duration` (optional, default: 0.1): Fade in/out duration in seconds
- `overlap_sentences` (optional, default: 0): Sentences to overlap between chunks
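Based on the parameter list above, the Pydantic model in `models.py` presumably looks something like this (a hypothetical sketch, not the file's actual contents):

```python
from typing import Optional
from pydantic import BaseModel

class FullTextTTSRequest(BaseModel):
    """Request model for full-text TTS, with defaults as documented."""
    text: str
    voice_prompt_base64: Optional[str] = None
    max_chunk_size: int = 800
    silence_duration: float = 0.5
    fade_duration: float = 0.1
    overlap_sentences: int = 0
```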

## Response Headers

Enhanced endpoints include additional headers with processing information:

- `X-Audio-Duration`: Duration of the generated audio in seconds
- `X-Chunks-Processed`: Number of text chunks processed
- `X-Total-Characters`: Total characters in the input text
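On the client side, these headers can be pulled into typed values with a small helper (illustrative; header names as documented above, missing headers map to `None`):

```python
def read_processing_headers(headers) -> dict:
    """Extract the processing-metadata headers into typed values.
    Works with any dict-like headers object (e.g. response.headers)."""
    def _get(name, cast):
        value = headers.get(name)
        return cast(value) if value is not None else None
    return {
        "audio_duration": _get("X-Audio-Duration", float),
        "chunks_processed": _get("X-Chunks-Processed", int),
        "total_characters": _get("X-Total-Characters", int),
    }
```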

## Usage

```python
from api import app, ChatterboxTTSService

# The app is automatically configured and ready to deploy.
# The service class contains all the endpoints.
```

### Python Client Example

```python
import requests

# Generate audio from long text
response = requests.post(
    "YOUR-ENDPOINT/generate_full_text_audio",
    json={
        "text": "Your long document text here...",
        "max_chunk_size": 800,
        "silence_duration": 0.5,
    },
)

if response.status_code == 200:
    with open("output.wav", "wb") as f:
        f.write(response.content)
    print("Audio generated successfully!")
```

## Performance Characteristics

### Standard Processing

- Text Length: Up to ~1000 characters optimal
- Processing Time: ~2-5 seconds per request
- Use Case: Short texts, real-time applications

### Full-Text Processing

- Text Length: Unlimited (automatically chunked)
- Processing Time: ~5-15 seconds for long documents
- Parallelization: Up to 4 concurrent chunks
- Use Case: Documents, articles, books
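The fan-out described above can be sketched with a thread pool (a hypothetical illustration of the pattern, not the service's actual code; the Modal deployment may parallelize differently):

```python
from concurrent.futures import ThreadPoolExecutor

def synthesize_all(chunks, synthesize_fn, max_workers=4):
    """Apply synthesize_fn to every chunk with at most max_workers
    chunks in flight at once; results come back in chunk order,
    which keeps concatenation straightforward."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(synthesize_fn, chunks))
```

Because `pool.map` preserves input order, the concatenation step can simply join the results end to end regardless of which chunk finished first.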

## Deployment

```bash
# Deploy the enhanced API
modal deploy tts_service.py

# Test the deployment
python test_api.py
```
## Benefits of Enhanced Architecture
1. **GPU Acceleration**: Server-side processing leverages GPU resources for faster inference
2. **Intelligent Chunking**: Smart text splitting that preserves sentence integrity
3. **Parallel Processing**: Multiple chunks processed simultaneously for better performance
4. **Scalability**: Handles texts of any length without client-side limitations
5. **Separation of Concerns**: Each file has a specific responsibility
6. **Maintainability**: Easier to update and modify individual components
7. **Testability**: Components can be tested in isolation
8. **Reusability**: Components can be imported and used in other projects
9. **Readability**: Smaller files are easier to understand and navigate
## Testing
Run the comprehensive test suite:
```bash
cd api/
python test_api.py
```

The test suite includes:
- Health check validation
- Basic text-to-speech generation
- JSON response testing
- Voice cloning functionality
- File upload testing
- Full-text processing validation
- Performance benchmarking

## Environment Variables

Set these environment variables for testing:

```bash
HEALTH_ENDPOINT=https://your-modal-endpoint.modal.run/health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint.modal.run/generate_audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint.modal.run/generate_with_file
GENERATE_ENDPOINT=https://your-modal-endpoint.modal.run/generate
FULL_TEXT_TTS_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_audio
FULL_TEXT_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_json
```