Spaces:

Agents-MCP-Hackathon
/

pdf_explainer

Sleeping

App Files Files Community

pdf_explainer / api /README.md

spagestic

api updated

d1c4aa1 7 months ago

preview code

raw

history blame contribute delete

8.56 kB

	# Enhanced Chatterbox TTS API

	This package contains the modular components of the Enhanced Chatterbox TTS API with GPU-accelerated processing, intelligent text chunking, and server-side audio concatenation.

	## Features

	- GPU-Accelerated Processing: Leverage server GPU for parallel chunk processing
	- Intelligent Text Chunking: Smart text splitting that respects sentence and paragraph boundaries
	- Server-Side Concatenation: Seamless audio merging with fade effects and silence control
	- Voice Cloning: Optional voice prompt for personalized speech generation
	- Multiple Response Formats: Streaming audio, complete files, or JSON with base64 encoding
	- Scalable Architecture: Handles texts of any length efficiently

	## Structure

	```
	api/
	├── __init__.py # Package initialization and exports
	├── config.py # Modal app configuration and container image setup
	├── models.py # Pydantic request/response models (enhanced with full-text support)
	├── audio_utils.py # Audio processing utilities and helper functions
	├── text_processing.py # Server-side text chunking and audio concatenation
	├── tts_service.py # Main TTS service class with all API endpoints
	├── test_api.py # Comprehensive API testing suite
	└── README.md # This file
	```

	## Components

	### config.py

	- Modal app configuration with GPU support (A10G)
	- Container image setup with required dependencies
	- Centralized configuration management
	- Memory snapshot and scaling configuration

	### models.py

	- `TTSRequest`: Standard request model for TTS generation
	- `FullTextTTSRequest`: Enhanced request model for full-text processing with chunking parameters
	- `TTSResponse`: Standard response model for JSON endpoints
	- `FullTextTTSResponse`: Enhanced response with processing information
	- `HealthResponse`: Response model for health checks
	- All models include proper type hints, validation, and documentation

	### text_processing.py

	- `TextChunker`: Intelligent server-side text chunking with configurable parameters
	- `AudioConcatenator`: Server-side audio concatenation with fade effects and silence control
	- Optimized for GPU processing and large text handling

	### audio_utils.py

	- `AudioUtils`: Static utility class for audio operations
	- Buffer management for audio data
	- Temporary file handling with automatic cleanup
	- Reusable audio processing functions

	### tts_service.py

	- `ChatterboxTTSService`: Main service class with all endpoints
	- GPU-accelerated TTS model loading and inference
	- Multiple API endpoints for different use cases
	- Comprehensive error handling and validation
	- New full-text processing endpoints with parallel chunk processing

	### test_api.py

	- Comprehensive testing suite for all API endpoints
	- Tests for basic generation, voice cloning, file uploads, and full-text processing
	- Performance benchmarking and validation scripts

	## API Endpoints

	### Standard Endpoints

	#### `GET /health`

	Health check endpoint to verify model status and service availability.

	```bash
	curl -X GET "YOUR-ENDPOINT/health"
	```

	#### `POST /generate_audio`

	Generate speech audio from text with optional voice cloning (streaming response).

	```bash
	curl -X POST "YOUR-ENDPOINT/generate_audio" \
	-H "Content-Type: application/json" \
	-d '{"text": "Hello world!"}' \
	--output output.wav
	```

	#### `POST /generate_json`

	Generate speech and return JSON with base64 encoded audio.

	```bash
	curl -X POST "YOUR-ENDPOINT/generate_json" \
	-H "Content-Type: application/json" \
	-d '{"text": "Hello world!"}'
	```

	#### `POST /generate_with_file`

	Generate speech with file upload for voice cloning.

	```bash
	curl -X POST "YOUR-ENDPOINT/generate_with_file" \
	-F "text=Hello world!" \
	-F "voice_prompt=@voice_sample.wav" \
	--output output.wav
	```

	### Enhanced Full-Text Endpoints

	#### `POST /generate_full_text_audio`

	🆕 Generate speech from full text with server-side chunking and parallel processing.

	```bash
	curl -X POST "YOUR-ENDPOINT/generate_full_text_audio" \
	-H "Content-Type: application/json" \
	-d '{
	"text": "Your very long text here...",
	"max_chunk_size": 800,
	"silence_duration": 0.5,
	"fade_duration": 0.1,
	"overlap_sentences": 0
	}' \
	--output full_text_output.wav
	```

	#### `POST /generate_full_text_json`

	🆕 Generate speech from full text and return JSON with processing information.

	```bash
	curl -X POST "YOUR-ENDPOINT/generate_full_text_json" \
	-H "Content-Type: application/json" \
	-d '{
	"text": "Your very long text here...",
	"max_chunk_size": 800,
	"silence_duration": 0.5
	}'
	```

	### Legacy Endpoints

	#### `POST /generate`

	Legacy endpoint for backward compatibility.

	```bash
	curl -X POST "YOUR-ENDPOINT/generate?prompt=Hello%20world!" \
	--output legacy_output.wav
	```

	## Request Parameters

	### FullTextTTSRequest Parameters

	- `text` (required): The text to convert to speech (any length)
	- `voice_prompt_base64` (optional): Base64 encoded voice prompt for cloning
	- `max_chunk_size` (optional, default: 800): Maximum characters per chunk
	- `silence_duration` (optional, default: 0.5): Silence between chunks in seconds
	- `fade_duration` (optional, default: 0.1): Fade in/out duration in seconds
	- `overlap_sentences` (optional, default: 0): Sentences to overlap between chunks

	## Response Headers

	Enhanced endpoints include additional headers with processing information:

	- `X-Audio-Duration`: Duration of generated audio in seconds
	- `X-Chunks-Processed`: Number of text chunks processed
	- `X-Total-Characters`: Total characters in the input text

	## Usage

	```python
	from api import app, ChatterboxTTSService

	# The app is automatically configured and ready to deploy
	# The service class contains all the endpoints
	```

	### Python Client Example

	```python
	import requests

	# Generate audio from long text
	response = requests.post(
	"YOUR-ENDPOINT/generate_full_text_audio",
	json={
	"text": "Your long document text here...",
	"max_chunk_size": 800,
	"silence_duration": 0.5
	}
	)

	if response.status_code == 200:
	with open("output.wav", "wb") as f:
	f.write(response.content)
	print("Audio generated successfully!")
	```

	## Performance Characteristics

	### Standard Processing

	- Text Length: Up to ~1000 characters optimal
	- Processing Time: ~2-5 seconds per request
	- Use Case: Short texts, real-time applications

	### Full-Text Processing

	- Text Length: Unlimited (automatically chunked)
	- Processing Time: ~5-15 seconds for long documents
	- Parallelization: Up to 4 concurrent chunks
	- Use Case: Documents, articles, books

	## Deployment

	```bash
	# Deploy the enhanced API
	modal deploy tts_service.py

	# Test the deployment
	python test_api.py
	```

	````

	## Benefits of Enhanced Architecture

	1. GPU Acceleration: Server-side processing leverages GPU resources for faster inference
	2. Intelligent Chunking: Smart text splitting that preserves sentence integrity
	3. Parallel Processing: Multiple chunks processed simultaneously for better performance
	4. Scalability: Handles texts of any length without client-side limitations
	5. Separation of Concerns: Each file has a specific responsibility
	6. Maintainability: Easier to update and modify individual components
	7. Testability: Components can be tested in isolation
	8. Reusability: Components can be imported and used in other projects
	9. Readability: Smaller files are easier to understand and navigate

	## Testing

	Run the comprehensive test suite:

	```bash
	cd api/
	python test_api.py
	````

	The test suite includes:

	- Health check validation
	- Basic text-to-speech generation
	- JSON response testing
	- Voice cloning functionality
	- File upload testing
	- Full-text processing validation
	- Performance benchmarking

	## Environment Variables

	Set these environment variables for testing:

	```bash
	HEALTH_ENDPOINT=https://your-modal-endpoint.modal.run/health
	GENERATE_AUDIO_ENDPOINT=https://your-modal-endpoint.modal.run/generate_audio
	GENERATE_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_json
	GENERATE_WITH_FILE_ENDPOINT=https://your-modal-endpoint.modal.run/generate_with_file
	GENERATE_ENDPOINT=https://your-modal-endpoint.modal.run/generate
	FULL_TEXT_TTS_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_audio
	FULL_TEXT_JSON_ENDPOINT=https://your-modal-endpoint.modal.run/generate_full_text_json
	```