Joseph Pollack

TTS Modal GPU Implementation

Overview

The TTS (Text-to-Speech) service runs the Kokoro 82M model on Modal's GPU infrastructure. This document describes the implementation details and configuration.

Implementation Details

Modal GPU Function Pattern

The implementation follows Modal's recommended pattern for GPU functions:

  1. Module-Level Function Definition: Modal functions must be defined at module level and attached to an app instance
  2. Lazy Initialization: The function is set up on first use via _setup_modal_function()
  3. GPU Configuration: GPU type is set at function definition time (requires app restart to change)

Key Files

  • src/services/tts_modal.py - Modal GPU executor for Kokoro TTS
  • src/services/audio_processing.py - Unified audio service wrapper
  • src/utils/config.py - Configuration settings
  • src/app.py - UI integration with settings accordion

Configuration Options

All TTS configuration is available in src/utils/config.py:

tts_model: str = "hexgrad/Kokoro-82M"  # Model ID
tts_voice: str = "af_heart"  # Voice ID
tts_speed: float = 1.0  # Speed multiplier (0.5-2.0)
tts_gpu: str = "T4"  # GPU type (T4, A10, A100, etc.)
tts_timeout: int = 60  # Timeout in seconds
enable_audio_output: bool = True  # Enable/disable TTS
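
The documented ranges can be enforced when the settings are constructed. The following is a minimal sketch using a standard-library dataclass; `TTSSettings` and `ALLOWED_GPUS` are hypothetical names, and the project's actual settings class in src/utils/config.py may validate differently or not at all.

```python
from dataclasses import dataclass

# GPU names listed in this document; the real allow-list may differ
ALLOWED_GPUS = ("T4", "A10", "A100", "L4", "L40S")


@dataclass(frozen=True)
class TTSSettings:
    """Hypothetical container mirroring the TTS fields shown above."""

    tts_model: str = "hexgrad/Kokoro-82M"
    tts_voice: str = "af_heart"
    tts_speed: float = 1.0
    tts_gpu: str = "T4"
    tts_timeout: int = 60
    enable_audio_output: bool = True

    def __post_init__(self) -> None:
        # Reject values outside the documented speed range
        if not 0.5 <= self.tts_speed <= 2.0:
            raise ValueError(f"tts_speed must be within 0.5-2.0, got {self.tts_speed}")
        if self.tts_gpu not in ALLOWED_GPUS:
            raise ValueError(f"unsupported tts_gpu: {self.tts_gpu!r}")
        if self.tts_timeout <= 0:
            raise ValueError("tts_timeout must be positive")
```

For example, `TTSSettings(tts_speed=3.0)` raises `ValueError`, while `TTSSettings()` yields the defaults listed above.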

UI Configuration

TTS settings are available in the Settings accordion:

  • Voice Dropdown: Select from 20+ Kokoro voices (af_heart, af_bella, am_michael, etc.)
  • Speed Slider: Adjust speech speed (0.5x to 2.0x)
  • GPU Dropdown: Select GPU type (T4, A10, A100, L4, L40S); visible only if Modal credentials are configured
  • Enable Audio Output: Toggle TTS generation

Modal Function Implementation

The Modal GPU function is defined as:

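import numpy as np

@app.function(
    image=tts_image,  # Image with Kokoro dependencies
    gpu="T4",  # GPU type (from settings.tts_gpu)
    timeout=60,  # Timeout in seconds (from settings.tts_timeout)
)
def kokoro_tts_function(text: str, voice: str, speed: float) -> tuple[int, np.ndarray]:
    """Modal GPU function for Kokoro TTS."""
    # Imported here so Kokoro is only required inside the Modal container
    from kokoro import KModel, KPipeline

    model = KModel().to("cuda").eval()
    # The first letter of the voice ID is its language code (e.g. "a" in af_heart)
    pipeline = KPipeline(lang_code=voice[0])
    pack = pipeline.load_voice(voice)

    chunks = []
    for _, ps, _ in pipeline(text, voice, speed):
        ref_s = pack[len(ps) - 1]
        audio = model(ps, ref_s, speed)
        # Move the tensor off the GPU first; .numpy() fails on CUDA tensors
        chunks.append(audio.cpu().numpy())
    if not chunks:
        raise ValueError("Kokoro produced no audio for the given text")
    # Join all chunks so long inputs are not cut off after the first segment
    return (24000, np.concatenate(chunks))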

Usage Flow

  1. User submits query with audio output enabled
  2. Research agent processes query and generates text response
  3. AudioService.generate_audio_output() is called with:
    • Response text
    • Voice (from UI dropdown or settings default)
    • Speed (from UI slider or settings default)
  4. TTSService.synthesize_async() calls Modal GPU function
  5. Modal executes Kokoro TTS on GPU
  6. Audio tuple (sample_rate, audio_array) is returned
  7. Audio is displayed in Gradio Audio component
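
Steps 3 to 6 of the flow above can be sketched as a small asynchronous function. This is an illustration only: the standalone `generate_audio_output` below, its parameters, and `fake_synthesize` are assumptions, not the actual signatures of `AudioService` or `TTSService`, and a plain list stands in for the NumPy audio array.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple

# (sample_rate, samples); a plain list stands in for the NumPy array
Audio = Tuple[int, List[float]]
Synthesizer = Callable[[str, str, float], Awaitable[Audio]]


async def generate_audio_output(
    text: str,
    synthesize: Synthesizer,
    voice: str = "af_heart",
    speed: float = 1.0,
    enabled: bool = True,
) -> Optional[Audio]:
    """Skip synthesis when disabled or empty, otherwise await the backend."""
    if not enabled or not text.strip():
        return None
    return await synthesize(text, voice, speed)


async def fake_synthesize(text: str, voice: str, speed: float) -> Audio:
    """Hypothetical stand-in for the remote Modal call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return 24000, [0.0] * len(text)
```

Calling `asyncio.run(generate_audio_output("hello", fake_synthesize))` returns a `(sample_rate, samples)` tuple of the kind a Gradio Audio component accepts once the samples are a NumPy array.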

Dependencies

Installed via uv add --optional:

  • gradio-client>=1.0.0 - For STT/OCR API calls
  • soundfile>=0.12.0 - For audio file I/O
  • Pillow>=10.0.0 - For image processing

Kokoro is installed in the Modal image from source:

  • git+https://github.com/hexgrad/kokoro.git

GPU Types

Modal supports various GPU types:

  • T4: Cheapest, good for testing (default)
  • A10: Good balance of cost/performance
  • A100: Fastest, most expensive
  • L4: NVIDIA L4 GPU
  • L40S: NVIDIA L40S GPU

Note: The GPU type is fixed at function definition time. Changes to settings.tts_gpu take effect only after an app restart.
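
The reason a restart is needed can be shown with a toy decorator. In this sketch, `gpu_function` and the `settings` dictionary are hypothetical stand-ins for Modal's decorator and the project settings; the point is only that decorator arguments are evaluated once, when the module is imported.

```python
from typing import Callable, Dict

settings: Dict[str, str] = {"tts_gpu": "T4"}  # hypothetical stand-in for settings


def gpu_function(gpu: str) -> Callable:
    """Toy decorator that records the GPU type once, at decoration time."""
    def decorate(fn: Callable) -> Callable:
        fn.gpu = gpu  # fixed when the decorated module is imported
        return fn
    return decorate


@gpu_function(gpu=settings["tts_gpu"])  # evaluated once, when this line runs
def synthesize(text: str) -> str:
    return f"{text} on {synthesize.gpu}"


settings["tts_gpu"] = "A100"  # a later change does not reach the function
```

After the last line runs, `synthesize.gpu` is still `"T4"`; only re-importing the module, which in practice means restarting the app, picks up the new value.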

Error Handling

  • If Modal credentials are not configured: the TTS service is unavailable (graceful degradation)
  • If the Kokoro import fails: ConfigurationError is raised
  • If synthesis fails: returns None, logs a warning, and continues without audio
  • If the GPU is unavailable: Modal queues the request or fails with a clear error message
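
The first three rules above can be sketched as a wrapper around the synthesis call. The names here are assumptions: `safe_synthesize` is hypothetical, and the local `ConfigurationError` class only stands in for the project's own exception of that name.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("tts")
Audio = Tuple[int, List[float]]  # a list stands in for the NumPy array


class ConfigurationError(Exception):
    """Stand-in for the project's ConfigurationError."""


def safe_synthesize(
    text: str,
    synthesize: Optional[Callable[[str], Audio]],
) -> Optional[Audio]:
    """Return audio, or None when TTS is unavailable or synthesis fails."""
    if synthesize is None:
        # No backend was configured (e.g. missing Modal credentials)
        logger.info("TTS unavailable; continuing without audio")
        return None
    try:
        return synthesize(text)
    except ConfigurationError:
        raise  # setup problems should surface instead of being hidden
    except Exception as exc:
        logger.warning("TTS synthesis failed: %s", exc)
        return None
```

Re-raising `ConfigurationError` while swallowing other failures keeps setup mistakes visible without letting a transient synthesis error break the text response.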

Configuration Connection

  1. Settings → Implementation: settings.tts_voice, settings.tts_speed used as defaults
  2. UI → Implementation: UI dropdowns/sliders passed to research_agent() function
  3. Implementation → Modal: Voice and speed passed to Modal GPU function
  4. GPU Configuration: Set at function definition time (requires restart to change)
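
The precedence between UI values and settings defaults can be sketched as follows. `resolve_tts_params` is a hypothetical helper, and clamping the speed to the slider range is an assumption rather than documented behavior.

```python
from typing import Optional, Tuple

DEFAULT_VOICE = "af_heart"  # mirrors settings.tts_voice
DEFAULT_SPEED = 1.0         # mirrors settings.tts_speed


def resolve_tts_params(
    ui_voice: Optional[str] = None,
    ui_speed: Optional[float] = None,
) -> Tuple[str, float]:
    """Prefer UI values and fall back to the settings defaults."""
    voice = ui_voice or DEFAULT_VOICE
    speed = DEFAULT_SPEED if ui_speed is None else ui_speed
    # Assumption: keep the speed within the 0.5-2.0 slider range
    speed = min(2.0, max(0.5, speed))
    return voice, speed
```

With no UI input the helper returns the defaults; a UI voice such as `"am_michael"` overrides the default while an out-of-range speed is pulled back to the nearest bound.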

Testing

To test TTS:

  1. Ensure Modal credentials are configured (MODAL_TOKEN_ID, MODAL_TOKEN_SECRET)
  2. Enable audio output in settings
  3. Submit a query
  4. Check audio output component for generated speech
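
Step 1 can be checked programmatically before submitting a query. This sketch inspects only the two environment variables named above; `modal_credentials_configured` is a hypothetical helper, and it does not detect credentials that Modal may read from other sources such as a local config file.

```python
import os
from typing import Mapping, Optional


def modal_credentials_configured(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when both Modal token variables are set and non-empty."""
    env = os.environ if env is None else env
    return bool(env.get("MODAL_TOKEN_ID")) and bool(env.get("MODAL_TOKEN_SECRET"))
```

Passing a mapping explicitly, as in `modal_credentials_configured({"MODAL_TOKEN_ID": "id"})`, makes the check easy to exercise in unit tests without touching the real environment.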
