Enhanced Chatterbox TTS API

This package contains the modular components of the Enhanced Chatterbox TTS API with GPU-accelerated processing, intelligent text chunking, and server-side audio concatenation.

Features

GPU-Accelerated Processing: Leverage server GPU for parallel chunk processing
Intelligent Text Chunking: Smart text splitting that respects sentence and paragraph boundaries
Server-Side Concatenation: Seamless audio merging with fade effects and silence control
Voice Cloning: Optional voice prompt for personalized speech generation
Multiple Response Formats: Streaming audio, complete files, or JSON with base64 encoding
Scalable Architecture: Handles texts of any length by chunking them on the server and processing the chunks in parallel

Structure

api/
├── __init__.py          # Package initialization and exports
├── config.py            # Modal app configuration and container image setup
├── models.py            # Pydantic request/response models (enhanced with full-text support)
├── audio_utils.py       # Audio processing utilities and helper functions
├── text_processing.py   # Server-side text chunking and audio concatenation
├── tts_service.py       # Main TTS service class with all API endpoints
├── test_api.py          # Comprehensive API testing suite
└── README.md            # This file

Components

config.py

Modal app configuration with GPU support (A10G)
Container image setup with required dependencies
Centralized configuration management
Memory snapshot and scaling configuration

models.py

TTSRequest: Standard request model for TTS generation
FullTextTTSRequest: Enhanced request model for full-text processing with chunking parameters
TTSResponse: Standard response model for JSON endpoints
FullTextTTSResponse: Enhanced response with processing information
HealthResponse: Response model for health checks
All models include proper type hints, validation, and documentation

text_processing.py

TextChunker: Intelligent server-side text chunking with configurable parameters
AudioConcatenator: Server-side audio concatenation with fade effects and silence control
Optimized for GPU processing and large text handling
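
The chunking idea can be illustrated with a minimal sketch. This is not the actual `TextChunker` implementation; the function name `chunk_text` and the regex-based sentence splitting are assumptions made for illustration, and only the `max_chunk_size` default of 800 comes from this README.

```python
import re


def chunk_text(text: str, max_chunk_size: int = 800) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chunk_size characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Fallback: a single sentence longer than the limit is hard-split.
        while len(sentence) > max_chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chunk_size])
            sentence = sentence[max_chunk_size:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chunk_size:
            current = candidate  # the sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the current chunk at a sentence boundary
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A fuller version would probably also treat blank lines as paragraph boundaries and support sentence overlap between chunks; both are left out here to keep the sketch short.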

audio_utils.py

AudioUtils: Static utility class for audio operations
Buffer management for audio data
Temporary file handling with automatic cleanup
Reusable audio processing functions

tts_service.py

ChatterboxTTSService: Main service class with all endpoints
GPU-accelerated TTS model loading and inference
Multiple API endpoints for different use cases
Comprehensive error handling and validation
New full-text processing endpoints with parallel chunk processing

test_api.py

Comprehensive testing suite for all API endpoints
Tests for basic generation, voice cloning, file uploads, and full-text processing
Performance benchmarking and validation scripts

API Endpoints

Standard Endpoints

`GET /health`

Health check endpoint to verify model status and service availability.

curl -X GET "YOUR-ENDPOINT/health"

`POST /generate_audio`

Generate speech audio from text with optional voice cloning (streaming response).

curl -X POST "YOUR-ENDPOINT/generate_audio" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world!"}' \
  --output output.wav

`POST /generate_json`

Generate speech and return JSON with base64 encoded audio.

curl -X POST "YOUR-ENDPOINT/generate_json" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world!"}'
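
On the client side, the base64 audio in the JSON body has to be decoded before it can be played. The helper below is a sketch only: the key name `audio_base64` is a hypothetical placeholder, since this README does not list the fields of `TTSResponse`; check `models.py` for the real name.

```python
import base64
from pathlib import Path


def save_audio_from_json(payload: dict, out_path: str, field: str = "audio_base64") -> int:
    """Decode the base64 audio in a JSON response body and write it to out_path.

    `field` is a hypothetical key name (assumption, not taken from the API).
    Returns the number of bytes written.
    """
    audio_bytes = base64.b64decode(payload[field], validate=True)
    return Path(out_path).write_bytes(audio_bytes)
```

With `requests`, this could be called as `save_audio_from_json(response.json(), "output.wav")` after a successful POST to this endpoint.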

`POST /generate_with_file`

Generate speech with file upload for voice cloning.

curl -X POST "YOUR-ENDPOINT/generate_with_file" \
  -F "text=Hello world!" \
  -F "voice_prompt=@voice_sample.wav" \
  --output output.wav

Enhanced Full-Text Endpoints

`POST /generate_full_text_audio`

🆕 Generate speech from full text with server-side chunking and parallel processing.

curl -X POST "YOUR-ENDPOINT/generate_full_text_audio" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your very long text here...",
    "max_chunk_size": 800,
    "silence_duration": 0.5,
    "fade_duration": 0.1,
    "overlap_sentences": 0
  }' \
  --output full_text_output.wav

`POST /generate_full_text_json`

🆕 Generate speech from full text and return JSON with processing information.

curl -X POST "YOUR-ENDPOINT/generate_full_text_json" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Your very long text here...",
    "max_chunk_size": 800,
    "silence_duration": 0.5
  }'

Legacy Endpoints

`POST /generate`

Legacy endpoint for backward compatibility.

curl -X POST "YOUR-ENDPOINT/generate?prompt=Hello%20world!" \
  --output legacy_output.wav

Request Parameters

FullTextTTSRequest Parameters

text (required): The text to convert to speech (any length)
voice_prompt_base64 (optional): Base64 encoded voice prompt for cloning
max_chunk_size (optional, default: 800): Maximum characters per chunk
silence_duration (optional, default: 0.5): Silence between chunks in seconds
fade_duration (optional, default: 0.1): Fade in/out duration in seconds
overlap_sentences (optional, default: 0): Sentences to overlap between chunks
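
To make `silence_duration` and `fade_duration` concrete, here is a rough sketch of how chunk audio might be merged. It is not the `AudioConcatenator` code: the function names are hypothetical, audio is modeled as plain lists of float samples rather than tensors or arrays, and a linear fade is assumed.

```python
def apply_fade(samples: list[float], fade_samples: int) -> list[float]:
    """Linearly fade the first and last fade_samples samples in and out."""
    out = list(samples)
    # Never let the fade-in and fade-out regions overlap on short chunks.
    n = min(fade_samples, len(out) // 2)
    for i in range(n):
        gain = i / n
        out[i] *= gain       # fade in from the start
        out[-1 - i] *= gain  # fade out toward the end
    return out


def concatenate_chunks(
    chunks: list[list[float]],
    sample_rate: int,
    silence_duration: float = 0.5,
    fade_duration: float = 0.1,
) -> list[float]:
    """Join per-chunk audio, fading each chunk and inserting silence between chunks."""
    silence = [0.0] * round(silence_duration * sample_rate)
    fade_samples = round(fade_duration * sample_rate)
    merged: list[float] = []
    for index, chunk in enumerate(chunks):
        if index > 0:
            merged.extend(silence)  # gap between consecutive chunks only
        merged.extend(apply_fade(chunk, fade_samples))
    return merged
```

The fades remove clicks at chunk edges, and the silence gives a natural pause where the text was split. `overlap_sentences` is not modeled in this sketch.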

Response Headers

Enhanced endpoints include additional headers with processing information:

X-Audio-Duration: Duration of generated audio in seconds
X-Chunks-Processed: Number of text chunks processed
X-Total-Characters: Total characters in the input text
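
A client can read these headers into typed values. The sketch below assumes the header values are plain numbers (a decimal for the duration, integers for the counts); the `ProcessingInfo` class and the function name are illustrative, not part of the API.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class ProcessingInfo:
    audio_duration: float   # seconds, from X-Audio-Duration
    chunks_processed: int   # from X-Chunks-Processed
    total_characters: int   # from X-Total-Characters


def parse_processing_headers(headers: Mapping[str, str]) -> ProcessingInfo:
    """Convert the processing headers of an enhanced endpoint into typed fields."""
    return ProcessingInfo(
        audio_duration=float(headers["X-Audio-Duration"]),
        chunks_processed=int(headers["X-Chunks-Processed"]),
        total_characters=int(headers["X-Total-Characters"]),
    )
```

With `requests`, `parse_processing_headers(response.headers)` works directly, because `response.headers` is a case-insensitive mapping.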

Usage

from api import app, ChatterboxTTSService

# The app is automatically configured and ready to deploy
# The service class contains all the endpoints

Python Client Example

import requests

# Generate audio from long text
response = requests.post(
    "YOUR-ENDPOINT/generate_full_text_audio",
    json={
        "text": "Your long document text here...",
        "max_chunk_size": 800,
        "silence_duration": 0.5
    }
)

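if response.status_code == 200:
    # The response body is the complete audio file
    with open("output.wav", "wb") as f:
        f.write(response.content)
    print("Audio generated successfully!")
else:
    print(f"Request failed with status {response.status_code}: {response.text}")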
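
For voice cloning, the request body also needs `voice_prompt_base64`. The helper below is one possible way to build the payload; the function name is hypothetical, and it assumes the server expects the raw bytes of the audio file encoded as base64. The parameter names and defaults mirror the FullTextTTSRequest parameters listed earlier.

```python
import base64
from pathlib import Path
from typing import Optional


def build_full_text_payload(
    text: str,
    voice_prompt_path: Optional[str] = None,
    max_chunk_size: int = 800,
    silence_duration: float = 0.5,
    fade_duration: float = 0.1,
    overlap_sentences: int = 0,
) -> dict:
    """Build the JSON body for a full-text request, optionally with a voice prompt."""
    payload = {
        "text": text,
        "max_chunk_size": max_chunk_size,
        "silence_duration": silence_duration,
        "fade_duration": fade_duration,
        "overlap_sentences": overlap_sentences,
    }
    if voice_prompt_path is not None:
        # Assumption: the prompt is sent as the base64 of the file's raw bytes.
        raw = Path(voice_prompt_path).read_bytes()
        payload["voice_prompt_base64"] = base64.b64encode(raw).decode("ascii")
    return payload
```

The result can be passed as `json=build_full_text_payload("Your long document text here...", "voice_sample.wav")` in the `requests.post` call.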

Performance Characteristics

Standard Processing

Text Length: Up to ~1000 characters optimal
Processing Time: ~2-5 seconds per request
Use Case: Short texts, real-time applications

Full-Text Processing

Text Length: Unlimited (automatically chunked)
Processing Time: ~5-15 seconds for long documents
Parallelization: Up to 4 concurrent chunks
Use Case: Documents, articles, books
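
The concurrency limit can be pictured with a small sketch. It uses a thread pool as a stand-in for whatever mechanism the service actually uses on the GPU; `synthesize_chunks` and the `synthesize` callable are hypothetical names, and only the limit of 4 comes from the figures above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

MAX_CONCURRENT_CHUNKS = 4  # matches "Up to 4 concurrent chunks"


def synthesize_chunks(
    chunks: Sequence[str],
    synthesize: Callable[[str], bytes],
) -> list[bytes]:
    """Run synthesize on every chunk, at most MAX_CONCURRENT_CHUNKS at a time."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_CHUNKS) as pool:
        # Executor.map returns results in input order, whatever order they finish in,
        # so the audio segments can be concatenated directly afterwards.
        return list(pool.map(synthesize, chunks))
```

Keeping the results in input order matters here, because the segments must be merged in the same order as the original text.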

Deployment

# Deploy the enhanced API
modal deploy tts_service.py

# Test the deployment
python test_api.py


## Benefits of Enhanced Architecture

1. **GPU Acceleration**: Server-side processing leverages GPU resources for faster inference
2. **Intelligent Chunking**: Smart text splitting that preserves sentence integrity
3. **Parallel Processing**: Multiple chunks processed simultaneously for better performance
4. **Scalability**: Handles texts of any length without client-side limitations
5. **Separation of Concerns**: Each file has a specific responsibility
6. **Maintainability**: Easier to update and modify individual components
7. **Testability**: Components can be tested in isolation
8. **Reusability**: Components can be imported and used in other projects
9. **Readability**: Smaller files are easier to understand and navigate

## Testing

Run the comprehensive test suite:

```bash
cd api/
python test_api.py
```

The test suite includes:

Health check validation
Basic text-to-speech generation
JSON response testing
Voice cloning functionality
File upload testing
Full-text processing validation
Performance benchmarking

Environment Variables

Set these environment variables for testing:

HEALTH_ENDPOINT=https://your-modal-endpoint.modal.run/health
GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint.modal.run/generate_audio
GENERATE_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_json
GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint.modal.run/generate_with_file
GENERATE_ENDPOINT=https://your-modal-endpoint.modal.run/generate
FULL_TEXT_TTS_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_audio
FULL_TEXT_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_json