TransformerPrime: Text-to-Audio (TTA) Pipeline

Project Overview

TransformerPrime is a high-performance, GPU-accelerated text-to-audio generation suite built on top of the Hugging Face transformers ecosystem. It is specifically optimized for NVIDIA RTX 40/50 series and datacenter GPUs (A100/H100/B200), targeting low-latency inference and efficient VRAM management (<10 GB for 1B-parameter models).

Core Technologies

  • Runtime: Python 3.10+, PyTorch 2.5.0+
  • Backbone: HF transformers (v4.57+), accelerate
  • Optimization: bitsandbytes (4-bit/8-bit quantization), FlashAttention-2
  • Interface: Gradio (Web UI), CLI (argparse)
  • Audio: soundfile, numpy

Architecture

The project is structured as a modular pipeline wrapper:

  • src/text_to_audio/pipeline.py: Core logic for model loading, inference, memory profiling, and streaming-style chunking.
  • src/text_to_audio/__init__.py: Public API surface (build_pipeline, TextToAudioPipeline).
  • demo.py: Unified entry point for the Gradio web interface and CLI operations.
  • tests/: Unit tests for pipeline configuration and logic (mocking model downloads).

Building and Running

Setup

# Install dependencies
pip install -r requirements.txt

# Optional: Ensure bitsandbytes is installed for quantization support
pip install bitsandbytes
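
Because quantization is optional, code paths that depend on it should degrade gracefully when bitsandbytes is absent. A minimal availability check (the `HAS_BNB` flag name is illustrative, not part of the codebase):

```python
import importlib.util

# Detect optional bitsandbytes support without importing it eagerly.
HAS_BNB = importlib.util.find_spec("bitsandbytes") is not None

if not HAS_BNB:
    print("bitsandbytes not found; quantized loading will be disabled.")
```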

Execution

  • Gradio Web UI (Default):
    python demo.py --model csm-1b --quantize
    
  • CLI Mode:
    python demo.py --cli --text "Hello from TransformerPrime." --output output.wav --quantize
    

Testing

# Run unit tests from the root directory
PYTHONPATH=. pytest tests/

Development Conventions

TransformerPrime Persona

When extending this codebase, adhere to the TransformerPrime persona (defined in .cursor/rules/TransformerPrime.mdc):

  • Precision: Never hallucinate config values or method signatures.
  • Modern Standards: Favor flash_attention_2 over eager implementations and bfloat16 over float16.
  • Performance First: Always consider VRAM footprint and Real-Time Factor (RTF). Use generate_with_profile() to validate changes.
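
Real-Time Factor here means generation wall-clock time divided by the duration of the audio produced (RTF < 1.0 is faster than real time). A hedged sketch of how such a metric could be computed; the helper name and return shape are assumptions, not the actual `generate_with_profile()` implementation:

```python
import time
import numpy as np

def profile_generation(generate_fn, sample_rate: int) -> tuple[np.ndarray, float]:
    """Run `generate_fn`, returning the audio and its Real-Time Factor."""
    start = time.perf_counter()
    audio = generate_fn()                      # 1-D float waveform
    elapsed = time.perf_counter() - start
    audio_seconds = len(audio) / sample_rate   # duration of the generated audio
    rtf = elapsed / audio_seconds              # < 1.0 means faster than real time
    return audio, rtf

# Illustrative stand-in for a model call: one second of silence at 24 kHz.
audio, rtf = profile_generation(lambda: np.zeros(24_000, dtype=np.float32), 24_000)
```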

Coding Style

  • Type Safety: Use Python type hints and from __future__ import annotations.
  • Configuration: Use dataclasses (e.g., PipelineConfig) for structured parameters.
  • Device Management: Prefer accelerate with device_map="auto" for automatic placement; fall back to explicit torch.cuda.is_available() checks when managing devices manually.
  • Quantization: Support bitsandbytes for 4-bit (nf4) and 8-bit loading to ensure compatibility with consumer GPUs.
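
Taken together, these conventions might look like the following configuration sketch. The field names beyond those mentioned above are illustrative assumptions, not the actual PipelineConfig definition:

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass
class PipelineConfig:
    """Structured pipeline parameters (illustrative sketch, not the real class)."""
    model_id: str = "csm-1b"
    dtype: str = "bfloat16"                        # favored over float16
    attn_implementation: str = "flash_attention_2" # favored over eager
    device_map: str = "auto"                       # let accelerate place weights
    load_in_4bit: bool = False                     # bitsandbytes nf4 quantization
    load_in_8bit: bool = False

    def __post_init__(self) -> None:
        # 4-bit and 8-bit loading are mutually exclusive.
        if self.load_in_4bit and self.load_in_8bit:
            raise ValueError("Choose either 4-bit or 8-bit quantization, not both.")

cfg = PipelineConfig(load_in_4bit=True)
```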

Key Symbols

  • build_pipeline(): Primary factory for creating pipeline instances.
  • TextToAudioPipeline.generate_with_profile(): Returns both audio and performance metrics (VRAM, RTF).
  • TextToAudioPipeline.stream_chunks(): Generator for processing long audio outputs in fixed-duration slices.
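
The chunking behind stream_chunks() amounts to slicing the waveform into fixed-duration windows. A minimal, self-contained sketch of that idea (the function below is an illustration, not the actual implementation):

```python
from typing import Iterator
import numpy as np

def iter_audio_chunks(audio: np.ndarray, sample_rate: int,
                      chunk_seconds: float = 1.0) -> Iterator[np.ndarray]:
    """Yield fixed-duration slices of a 1-D waveform; the last may be shorter."""
    step = int(sample_rate * chunk_seconds)
    for start in range(0, len(audio), step):
        yield audio[start:start + step]

# 2.5 s of audio at 16 kHz -> three chunks: 1 s, 1 s, 0.5 s.
chunks = list(iter_audio_chunks(np.zeros(40_000, dtype=np.float32), 16_000))
```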

Future Roadmap (TODO)

  • Add support for Kokoro-82M and Qwen3-TTS backends.
  • Implement speculative decoding for faster inference on large TTA models.
  • Add real-time streaming playback in the Gradio UI.