Enhanced Chatterbox TTS API

This package contains the modular components of the Enhanced Chatterbox TTS API, which provides GPU-accelerated processing, intelligent text chunking, and server-side audio concatenation.

Features

  • GPU-Accelerated Processing: Leverage server GPU for parallel chunk processing
  • Intelligent Text Chunking: Smart text splitting that respects sentence and paragraph boundaries
  • Server-Side Concatenation: Seamless audio merging with fade effects and silence control
  • Voice Cloning: Optional voice prompt for personalized speech generation
  • Multiple Response Formats: Streaming audio, complete files, or JSON with base64 encoding
  • Scalable Architecture: Handles long texts by chunking them on the server

Structure

api/
├── __init__.py          # Package initialization and exports
├── config.py            # Modal app configuration and container image setup
├── models.py            # Pydantic request/response models (enhanced with full-text support)
├── audio_utils.py       # Audio processing utilities and helper functions
├── text_processing.py   # Server-side text chunking and audio concatenation
├── tts_service.py       # Main TTS service class with all API endpoints
├── test_api.py          # Comprehensive API testing suite
└── README.md           # This file

Components

config.py

  • Modal app configuration with GPU support (A10G)
  • Container image setup with required dependencies
  • Centralized configuration management
  • Memory snapshot and scaling configuration

models.py

  • TTSRequest: Standard request model for TTS generation
  • FullTextTTSRequest: Enhanced request model for full-text processing with chunking parameters
  • TTSResponse: Standard response model for JSON endpoints
  • FullTextTTSResponse: Enhanced response with processing information
  • HealthResponse: Response model for health checks
  • All models include proper type hints, validation, and documentation

text_processing.py

  • TextChunker: Intelligent server-side text chunking with configurable parameters
  • AudioConcatenator: Server-side audio concatenation with fade effects and silence control
  • Optimized for GPU processing and large text handling
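
To make the idea of sentence-aware chunking concrete, here is a minimal sketch that greedily packs whole sentences into chunks of at most max_chunk_size characters. It is not the actual TextChunker implementation: the function name chunk_text and the naive regex sentence splitter are assumptions made for illustration only.

```python
import re


def chunk_text(text: str, max_chunk_size: int = 800) -> list[str]:
    """Pack whole sentences into chunks of at most max_chunk_size characters.

    Hypothetical helper for illustration; not the real TextChunker.
    """
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Last resort: hard-split a single sentence that exceeds the limit.
        while len(sentence) > max_chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chunk_size])
            sentence = sentence[max_chunk_size:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A real chunker would likely also honor paragraph breaks and the overlap_sentences setting, which this sketch leaves out.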

audio_utils.py

  • AudioUtils: Static utility class for audio operations
  • Buffer management for audio data
  • Temporary file handling with automatic cleanup
  • Reusable audio processing functions

tts_service.py

  • ChatterboxTTSService: Main service class with all endpoints
  • GPU-accelerated TTS model loading and inference
  • Multiple API endpoints for different use cases
  • Comprehensive error handling and validation
  • New full-text processing endpoints with parallel chunk processing

test_api.py

  • Comprehensive testing suite for all API endpoints
  • Tests for basic generation, voice cloning, file uploads, and full-text processing
  • Performance benchmarking and validation scripts

API Endpoints

Standard Endpoints

GET /health

Health check endpoint to verify model status and service availability.

curl -X GET "YOUR-ENDPOINT/health"

POST /generate_audio

Generate speech audio from text with optional voice cloning (streaming response).

curl -X POST "YOUR-ENDPOINT/generate_audio" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world!"}' \
  --output output.wav

POST /generate_json

Generate speech and return JSON with base64 encoded audio.

curl -X POST "YOUR-ENDPOINT/generate_json" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world!"}'
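
On the client side, the base64 payload has to be decoded before it can be written to a file. The sketch below shows one way to do that. The key name audio_base64 is an assumption, since the response schema is not listed here; check TTSResponse in models.py for the actual field name.

```python
import base64


def decode_audio(payload: dict, field: str = "audio_base64") -> bytes:
    """Return raw audio bytes from a parsed JSON response.

    `field` is an assumed key name, not confirmed by the API.
    """
    encoded = payload.get(field)
    if not encoded:
        raise KeyError(f"response has no '{field}' field")
    return base64.b64decode(encoded)


# Example with a stand-in payload instead of a live response:
sample = {"audio_base64": base64.b64encode(b"RIFF").decode("ascii")}
audio_bytes = decode_audio(sample)
```

With a live response, the same function could be called on response.json() and the result written to a .wav file in binary mode.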

POST /generate_with_file

Generate speech with file upload for voice cloning.

curl -X POST "YOUR-ENDPOINT/generate_with_file" \
  -F "text=Hello world!" \
  -F "voice_prompt=@voice_sample.wav" \
  --output output.wav

Enhanced Full-Text Endpoints

POST /generate_full_text_audio

🆕 Generate speech from full text with server-side chunking and parallel processing.

curl -X POST "YOUR-ENDPOINT/generate_full_text_audio" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your very long text here...",
    "max_chunk_size": 800,
    "silence_duration": 0.5,
    "fade_duration": 0.1,
    "overlap_sentences": 0
  }' \
  --output full_text_output.wav
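
The silence_duration and fade_duration values control how per-chunk audio is joined. The following is a simplified, dependency-free sketch of that idea using plain lists of float samples. It is not the AudioConcatenator implementation; the function names and the linear fade curve are assumptions, and a real service would more likely operate on arrays or tensors.

```python
def apply_fade(samples: list[float], fade_len: int) -> list[float]:
    """Linearly fade the first and last fade_len samples in and out."""
    out = list(samples)
    n = min(fade_len, len(out) // 2)  # never let the two fades overlap
    for i in range(n):
        gain = i / n
        out[i] *= gain
        out[-1 - i] *= gain
    return out


def concatenate(chunks: list[list[float]], sample_rate: int,
                silence_duration: float = 0.5,
                fade_duration: float = 0.1) -> list[float]:
    """Join faded chunks, inserting silence between consecutive chunks."""
    silence = [0.0] * int(round(silence_duration * sample_rate))
    fade_len = int(round(fade_duration * sample_rate))
    merged: list[float] = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            merged.extend(silence)
        merged.extend(apply_fade(chunk, fade_len))
    return merged
```

Fading each chunk to zero at its edges avoids audible clicks where a chunk meets the inserted silence.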

POST /generate_full_text_json

🆕 Generate speech from full text and return JSON with processing information.

curl -X POST "YOUR-ENDPOINT/generate_full_text_json" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your very long text here...",
    "max_chunk_size": 800,
    "silence_duration": 0.5
  }'

Legacy Endpoints

POST /generate

Legacy endpoint for backward compatibility.

curl -X POST "YOUR-ENDPOINT/generate?prompt=Hello%20world!" \
  --output legacy_output.wav

Request Parameters

FullTextTTSRequest Parameters

  • text (required): The text to convert to speech (any length)
  • voice_prompt_base64 (optional): Base64 encoded voice prompt for cloning
  • max_chunk_size (optional, default: 800): Maximum characters per chunk
  • silence_duration (optional, default: 0.5): Silence between chunks in seconds
  • fade_duration (optional, default: 0.1): Fade in/out duration in seconds
  • overlap_sentences (optional, default: 0): Sentences to overlap between chunks
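
As an illustration of how these parameters fit together, the sketch below builds a request body from the documented field names and defaults. The helper name build_full_text_payload is hypothetical, and the empty-text check is a client-side convenience rather than documented server behavior.

```python
import base64
from typing import Optional


def build_full_text_payload(text: str,
                            voice_prompt: Optional[bytes] = None,
                            max_chunk_size: int = 800,
                            silence_duration: float = 0.5,
                            fade_duration: float = 0.1,
                            overlap_sentences: int = 0) -> dict:
    """Assemble a JSON-serializable body for the full-text endpoints."""
    if not text.strip():
        raise ValueError("text is required")
    payload = {
        "text": text,
        "max_chunk_size": max_chunk_size,
        "silence_duration": silence_duration,
        "fade_duration": fade_duration,
        "overlap_sentences": overlap_sentences,
    }
    if voice_prompt is not None:
        # Raw audio bytes are sent as base64 text inside the JSON body.
        payload["voice_prompt_base64"] = base64.b64encode(voice_prompt).decode("ascii")
    return payload
```

The resulting dictionary can be passed as the json argument of requests.post.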

Response Headers

Enhanced endpoints include additional headers with processing information:

  • X-Audio-Duration: Duration of generated audio in seconds
  • X-Chunks-Processed: Number of text chunks processed
  • X-Total-Characters: Total characters in the input text
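
A client can read these headers alongside the audio body. The sketch below converts them to numbers, assuming the values are plain numeric strings (the exact formatting is not specified here). The helper name parse_processing_headers is hypothetical.

```python
from typing import Mapping


def parse_processing_headers(headers: Mapping[str, str]) -> dict:
    """Extract the processing headers; absent headers map to None."""

    def _get(name, cast):
        value = headers.get(name)
        return cast(value) if value is not None else None

    return {
        "audio_duration": _get("X-Audio-Duration", float),
        "chunks_processed": _get("X-Chunks-Processed", int),
        "total_characters": _get("X-Total-Characters", int),
    }
```

When used with the requests library, response.headers can be passed directly. Its lookups are case-insensitive, whereas a plain dict requires the exact header casing shown above.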

Usage

from api import app, ChatterboxTTSService

# The app is automatically configured and ready to deploy
# The service class contains all the endpoints

Python Client Example

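import requests

# Generate audio from long text
response = requests.post(
    "YOUR-ENDPOINT/generate_full_text_audio",
    json={
        "text": "Your long document text here...",
        "max_chunk_size": 800,
        "silence_duration": 0.5
    },
    timeout=120,  # long documents can take a while to synthesize
)

if response.status_code == 200:
    # The response body is the WAV audio itself
    with open("output.wav", "wb") as f:
        f.write(response.content)
    print("Audio generated successfully!")
else:
    print(f"Request failed with status {response.status_code}: {response.text}")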

Performance Characteristics

Standard Processing

  • Text Length: Up to ~1000 characters optimal
  • Processing Time: ~2-5 seconds per request
  • Use Case: Short texts, real-time applications

Full-Text Processing

  • Text Length: Unlimited (automatically chunked)
  • Processing Time: ~5-15 seconds for long documents
  • Parallelization: Up to 4 concurrent chunks
  • Use Case: Documents, articles, books
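
The cap of 4 concurrent chunks can be pictured with a bounded worker pool. The sketch below uses a thread pool as a stand-in; how the service actually schedules chunks on the GPU is not described here, and both synthesize_chunks and the synthesize callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def synthesize_chunks(chunks: list[str],
                      synthesize: Callable[[str], bytes],
                      max_workers: int = 4) -> list[bytes]:
    """Run synthesize over chunks with at most max_workers in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, even if later
        # chunks finish first, so the audio can be joined in sequence.
        return list(pool.map(synthesize, chunks))


# Stand-in for a real TTS call, used only to show the call shape:
def fake_synthesize(chunk: str) -> bytes:
    return chunk.upper().encode("utf-8")


results = synthesize_chunks(["a", "b", "c", "d", "e"], fake_synthesize)
```

Preserving input order matters here because the chunks are concatenated afterwards.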

Deployment

# Deploy the enhanced API
modal deploy tts_service.py

# Test the deployment
python test_api.py

Benefits of Enhanced Architecture

1. GPU Acceleration: Server-side processing leverages GPU resources for faster inference
2. Intelligent Chunking: Smart text splitting that preserves sentence integrity
3. Parallel Processing: Multiple chunks processed simultaneously for better performance
4. Scalability: Handles texts of any length without client-side limitations
5. Separation of Concerns: Each file has a specific responsibility
6. Maintainability: Easier to update and modify individual components
7. Testability: Components can be tested in isolation
8. Reusability: Components can be imported and used in other projects
9. Readability: Smaller files are easier to understand and navigate

Testing

Run the comprehensive test suite:

```bash
cd api/
python test_api.py
```

The test suite includes:

  • Health check validation
  • Basic text-to-speech generation
  • JSON response testing
  • Voice cloning functionality
  • File upload testing
  • Full-text processing validation
  • Performance benchmarking

Environment Variables

Set these environment variables for testing:

HEALTH_ENDPOINT=https://your-modal-endpoint.modal.run/health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint.modal.run/generate_audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint.modal.run/generate_with_file
GENERATE_ENDPOINT=https://your-modal-endpoint.modal.run/generate
FULL_TEXT_TTS_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_audio
FULL_TEXT_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_json
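
A test script can fail fast when any of these variables is missing. The sketch below shows one way to collect them. It is not taken from test_api.py, whose actual loading logic is not shown here, and load_endpoints is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

ENDPOINT_VARS = [
    "HEALTH_ENDPOINT",
    "GENERATE_AUDIO_ENDPOINT",
    "GENERATE_JSON_ENDPOINT",
    "GENERATE_WITH_FILE_ENDPOINT",
    "GENERATE_ENDPOINT",
    "FULL_TEXT_TTS_ENDPOINT",
    "FULL_TEXT_JSON_ENDPOINT",
]


def load_endpoints(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the endpoint URLs, raising if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in ENDPOINT_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in ENDPOINT_VARS}
```

Passing a mapping explicitly makes the helper easy to exercise without touching the real environment.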