DeepCritical / docs /implementation /TTS_MODAL_IMPLEMENTATION.md

Joseph Pollack

adds youtube video

e6507c6 16 days ago

TTS Modal GPU Implementation

Overview

The TTS (Text-to-Speech) service runs the Kokoro 82M model on Modal's GPU infrastructure. This document describes the implementation and its configuration.

Implementation Details

Modal GPU Function Pattern

The implementation follows Modal's recommended pattern for GPU functions:

Module-Level Function Definition: Modal functions must be defined at module level and attached to an app instance
Lazy Initialization: The function is set up on first use via _setup_modal_function()
GPU Configuration: GPU type is set at function definition time (requires app restart to change)

Key Files

src/services/tts_modal.py - Modal GPU executor for Kokoro TTS
src/services/audio_processing.py - Unified audio service wrapper
src/utils/config.py - Configuration settings
src/app.py - UI integration with settings accordion

Configuration Options

All TTS configuration is available in src/utils/config.py:

tts_model: str = "hexgrad/Kokoro-82M"  # Model ID
tts_voice: str = "af_heart"  # Voice ID
tts_speed: float = 1.0  # Speed multiplier (0.5-2.0)
tts_gpu: str = "T4"  # GPU type (T4, A10, A100, etc.)
tts_timeout: int = 60  # Timeout in seconds
enable_audio_output: bool = True  # Enable/disable TTS
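
As a rough illustration of how these fields could be range-checked, the sketch below mirrors them in a standard-library dataclass. The class name `TTSSettings` and the validation rules are assumptions made for this example. The real `src/utils/config.py` may use a different settings mechanism and may not validate at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSSettings:
    """Hypothetical mirror of the TTS fields listed above."""
    tts_model: str = "hexgrad/Kokoro-82M"
    tts_voice: str = "af_heart"
    tts_speed: float = 1.0
    tts_gpu: str = "T4"
    tts_timeout: int = 60
    enable_audio_output: bool = True

    def __post_init__(self) -> None:
        # The documented speed range is 0.5 to 2.0 inclusive.
        if not 0.5 <= self.tts_speed <= 2.0:
            raise ValueError(f"tts_speed must be in [0.5, 2.0], got {self.tts_speed}")
        # Assumption: a timeout of zero or less is never meaningful.
        if self.tts_timeout <= 0:
            raise ValueError(f"tts_timeout must be positive, got {self.tts_timeout}")


defaults = TTSSettings()
print(defaults.tts_voice, defaults.tts_speed)
```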

UI Configuration

TTS settings are available in the Settings accordion:

Voice Dropdown: Select from 20+ Kokoro voices (af_heart, af_bella, am_michael, etc.)
Speed Slider: Adjust speech speed (0.5x to 2.0x)
GPU Dropdown: Select GPU type (T4, A10, A100, L4, L40S) - visible only if Modal credentials are configured
Enable Audio Output: Toggle TTS generation
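
One way the GPU dropdown's visibility rule could be decided is sketched below. The helper names are hypothetical. The environment variable names are the ones listed in the Testing section, and the GPU list is the one shown in the dropdown description above.

```python
import os
from typing import List, Mapping

GPU_CHOICES = ["T4", "A10", "A100", "L4", "L40S"]


def modal_credentials_configured(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if both Modal token variables are set and non-empty."""
    return bool(env.get("MODAL_TOKEN_ID")) and bool(env.get("MODAL_TOKEN_SECRET"))


def visible_gpu_choices(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the dropdown choices, or an empty list when it should be hidden."""
    return list(GPU_CHOICES) if modal_credentials_configured(env) else []


print(visible_gpu_choices({"MODAL_TOKEN_ID": "id", "MODAL_TOKEN_SECRET": "secret"}))
print(visible_gpu_choices({}))
```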

Modal Function Implementation

The Modal GPU function is defined as:

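import numpy as np  # Needed at module level for the return annotation

@app.function(
    image=tts_image,  # Image with Kokoro dependencies
    gpu="T4",  # GPU type (from settings.tts_gpu), fixed at definition time
    timeout=60,  # Timeout in seconds (from settings.tts_timeout)
)
def kokoro_tts_function(text: str, voice: str, speed: float) -> tuple[int, np.ndarray]:
    """Modal GPU function for Kokoro TTS."""
    # Imported inside the function so the import resolves in the Modal container
    from kokoro import KModel, KPipeline

    model = KModel().to("cuda").eval()
    # The first character of the voice ID is the language code ("a" in "af_heart")
    pipeline = KPipeline(lang_code=voice[0])
    pack = pipeline.load_voice(voice)

    for _, ps, _ in pipeline(text, voice, speed):
        ref_s = pack[len(ps) - 1]  # Reference style entry selected by phoneme count
        audio = model(ps, ref_s, speed)
        # Returns after the first segment only; 24000 is the sample rate in Hz.
        # .cpu() is a no-op if the tensor is already on the CPU.
        return (24000, audio.cpu().numpy())
    # Falls through (returning None) if the pipeline yields no segments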

Usage Flow

The user submits a query with audio output enabled
The research agent processes the query and generates a text response
AudioService.generate_audio_output() is called with:
- Response text
- Voice (from the UI dropdown or the settings default)
- Speed (from the UI slider or the settings default)
TTSService.synthesize_async() calls the Modal GPU function
Modal executes Kokoro TTS on the GPU
The audio tuple (sample_rate, audio_array) is returned
The audio is displayed in the Gradio Audio component
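
The flow above can be sketched end to end with stubs. Only the names `AudioService.generate_audio_output()` and `TTSService.synthesize_async()` come from this document. The signatures, the `fake_modal_tts` stand-in for the remote GPU call, and the use of `asyncio.to_thread` are assumptions made to keep the example self-contained.

```python
import asyncio
from typing import List, Optional, Tuple

Audio = Tuple[int, List[float]]  # (sample_rate, samples)


def fake_modal_tts(text: str, voice: str, speed: float) -> Audio:
    """Hypothetical stand-in for the blocking remote call to the GPU function."""
    return (24000, [0.0] * len(text))


class TTSService:
    async def synthesize_async(self, text: str, voice: str, speed: float) -> Audio:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(fake_modal_tts, text, voice, speed)


class AudioService:
    def __init__(self, tts: TTSService, default_voice: str = "af_heart",
                 default_speed: float = 1.0) -> None:
        self.tts = tts
        self.default_voice = default_voice
        self.default_speed = default_speed

    async def generate_audio_output(self, text: str, voice: Optional[str] = None,
                                    speed: Optional[float] = None) -> Optional[Audio]:
        if not text.strip():
            return None  # nothing to speak
        # UI values win; settings defaults fill in when the UI passed nothing.
        chosen_voice = voice or self.default_voice
        chosen_speed = speed if speed is not None else self.default_speed
        return await self.tts.synthesize_async(text, chosen_voice, chosen_speed)


result = asyncio.run(AudioService(TTSService()).generate_audio_output("Hello"))
print(result[0], len(result[1]))
```

The returned `(sample_rate, samples)` tuple has the same shape as the value the Gradio Audio component receives in the last step.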

Dependencies

Installed via uv add --optional:

gradio-client>=1.0.0 - For STT/OCR API calls
soundfile>=0.12.0 - For audio file I/O
Pillow>=10.0.0 - For image processing

Kokoro is installed into the Modal image from source:

git+https://github.com/hexgrad/kokoro.git

GPU Types

Modal supports various GPU types:

T4: Cheapest, good for testing (default)
A10: Good balance of cost/performance
A100: Fastest, most expensive
L4: NVIDIA L4 GPU
L40S: NVIDIA L40S GPU

Note: GPU type is set at function definition time. Changes to settings.tts_gpu require app restart.

Error Handling

If Modal credentials are not configured: the TTS service is unavailable (graceful degradation)
If the Kokoro import fails: ConfigurationError is raised
If synthesis fails: returns None, logs a warning, and continues without audio
If the GPU is unavailable: Modal will queue the request or fail with a clear error message
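
The "returns None, logs a warning, continues" behaviour for synthesis failures might look roughly like the wrapper below. The function name `safe_synthesize` and the logger name are hypothetical, and the real service may catch narrower exception types.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("tts")

Audio = Tuple[int, List[float]]  # (sample_rate, samples)


def safe_synthesize(synthesize: Callable[[str], Audio], text: str) -> Optional[Audio]:
    """Call the synthesizer, degrading to None instead of raising on failure."""
    try:
        return synthesize(text)
    except Exception as exc:  # broad on purpose: audio is optional output
        logger.warning("TTS synthesis failed; continuing without audio: %s", exc)
        return None


def broken_synthesizer(text: str) -> Audio:
    """Hypothetical synthesizer that always fails, used to show the fallback."""
    raise RuntimeError("GPU unavailable")


def working_synthesizer(text: str) -> Audio:
    """Hypothetical synthesizer that returns silent audio."""
    return (24000, [0.0] * 10)


print(safe_synthesize(broken_synthesizer, "hi"))
print(safe_synthesize(working_synthesizer, "hi")[0])
```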

Configuration Connection

Settings → Implementation: settings.tts_voice, settings.tts_speed used as defaults
UI → Implementation: UI dropdowns/sliders passed to research_agent() function
Implementation → Modal: Voice and speed passed to Modal GPU function
GPU Configuration: Set at function definition time (requires restart to change)

Testing

To test TTS:

Ensure Modal credentials are configured (MODAL_TOKEN_ID, MODAL_TOKEN_SECRET)
Enable audio output in settings
Submit a query
Check audio output component for generated speech
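
Beyond listening to the result, a quick sanity check on the returned tuple can catch empty or malformed audio. The helper below is a hypothetical example that is not part of the project. It computes the clip duration as the number of samples divided by the sample rate.

```python
from typing import Sequence, Tuple


def audio_duration_seconds(audio: Tuple[int, Sequence[float]]) -> float:
    """Validate a (sample_rate, samples) tuple and return its duration in seconds."""
    sample_rate, samples = audio
    if sample_rate <= 0:
        raise ValueError(f"sample rate must be positive, got {sample_rate}")
    if len(samples) == 0:
        raise ValueError("audio contains no samples")
    return len(samples) / sample_rate


# 48000 samples at 24000 Hz correspond to a two-second clip.
duration = audio_duration_seconds((24000, [0.0] * 48000))
print(duration)
```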

References

Kokoro TTS Space - Reference implementation
Modal GPU Documentation - Modal GPU usage
Kokoro GitHub - Source code