# TransformerPrime: Text-to-Audio (TTA) Pipeline

## Project Overview

**TransformerPrime** is a high-performance, GPU-accelerated text-to-audio generation suite built on top of the Hugging Face `transformers` ecosystem. It is specifically optimized for NVIDIA RTX 40/50 series and datacenter GPUs (A100/H100/B200), targeting low-latency inference and efficient VRAM management (<10 GB for 1B-parameter models).

### Core Technologies

- **Runtime:** Python 3.10+, PyTorch 2.5.0+
- **Backbone:** HF `transformers` (v4.57+), `accelerate`
- **Optimization:** `bitsandbytes` (4-bit/8-bit quantization), FlashAttention-2
- **Interface:** Gradio (Web UI), CLI (argparse)
- **Audio:** `soundfile`, `numpy`

### Architecture

The project is structured as a modular pipeline wrapper:

- `src/text_to_audio/pipeline.py`: Core logic for model loading, inference, memory profiling, and streaming-style chunking.
- `src/text_to_audio/__init__.py`: Public API surface (`build_pipeline`, `TextToAudioPipeline`).
- `demo.py`: Unified entry point for the Gradio web interface and CLI operations.
- `tests/`: Unit tests for pipeline configuration and logic (mocking model downloads).

---

## Building and Running

### Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Optional: Ensure bitsandbytes is installed for quantization support
pip install bitsandbytes
```

### Execution

- **Gradio Web UI (Default):**

  ```bash
  python demo.py --model csm-1b --quantize
  ```

- **CLI Mode:**

  ```bash
  python demo.py --cli --text "Hello from TransformerPrime." --output output.wav --quantize
  ```

### Testing

```bash
# Run unit tests from the root directory
PYTHONPATH=. pytest tests/
```

---

## Development Conventions

### TransformerPrime Persona

When extending this codebase, adhere to the **TransformerPrime** persona (defined in `.cursor/rules/TransformerPrime.mdc`):

- **Precision:** Never hallucinate config values or method signatures.
- **Modern Standards:** Favor `flash_attention_2` over eager attention implementations and `bfloat16` over `float16`.
- **Performance First:** Always consider VRAM footprint and Real-Time Factor (RTF). Use `generate_with_profile()` to validate changes.

### Coding Style

- **Type Safety:** Use Python type hints and `from __future__ import annotations`.
- **Configuration:** Use `dataclasses` (e.g., `PipelineConfig`) for structured parameters.
- **Device Management:** Use `accelerate` or `torch.cuda.is_available()` to handle device placement automatically (`device_map="auto"`).
- **Quantization:** Support `bitsandbytes` for 4-bit (`nf4`) and 8-bit loading to ensure compatibility with consumer GPUs.

### Key Symbols

- `build_pipeline()`: Primary factory for creating pipeline instances.
- `TextToAudioPipeline.generate_with_profile()`: Returns both audio and performance metrics (VRAM, RTF).
- `TextToAudioPipeline.stream_chunks()`: Generator for processing long audio outputs in fixed-duration slices.

---

## Future Roadmap (TODO)

- [ ] Add support for Kokoro-82M and Qwen3-TTS backends.
- [ ] Implement speculative decoding for faster inference on large TTA models.
- [ ] Add real-time streaming playback in the Gradio UI.
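---

## Appendix: Illustrative Sketches

The quantization and device-placement conventions under Coding Style (`nf4` 4-bit loading, `device_map="auto"`, `flash_attention_2`) can be sketched with the standard `transformers` + `bitsandbytes` configuration below. This is a hedged fragment, not the project's actual loading code: the model id is a placeholder, and the real logic lives in `src/text_to_audio/pipeline.py`.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, per the Coding Style conventions
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModel.from_pretrained(
    "some-org/some-tta-model",       # placeholder model id, not a real checkpoint
    quantization_config=bnb_config,  # falls back to full precision if omitted
    device_map="auto",               # let accelerate place weights across devices
    attn_implementation="flash_attention_2",
)
```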
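The RTF metric that `generate_with_profile()` reports is conventionally the wall-clock generation time divided by the duration of the audio produced, so RTF < 1.0 means faster than real time. The helper below is a minimal sketch of that arithmetic; the function name and signature are illustrative, not the pipeline's actual API.

```python
def real_time_factor(num_samples: int, sample_rate: int, gen_seconds: float) -> float:
    """RTF = generation time / audio duration. RTF < 1.0 means faster than real time."""
    audio_seconds = num_samples / sample_rate
    return gen_seconds / audio_seconds

# Example: 5 s of audio at 24 kHz generated in 2 s of wall-clock time
rtf = real_time_factor(num_samples=120_000, sample_rate=24_000, gen_seconds=2.0)
print(f"RTF: {rtf:.2f}")  # RTF: 0.40 -> 2.5x faster than real time
```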
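The fixed-duration slicing that `TextToAudioPipeline.stream_chunks()` performs can be sketched as a plain generator over a sample buffer. This is an assumption about its shape based on the Key Symbols description, not the actual implementation (which lives in `src/text_to_audio/pipeline.py` and likely operates on tensors).

```python
from typing import Iterator, Sequence

def stream_chunks(samples: Sequence[float], sample_rate: int,
                  chunk_seconds: float = 1.0) -> Iterator[Sequence[float]]:
    """Yield fixed-duration slices of an audio buffer; the last chunk may be shorter."""
    step = int(sample_rate * chunk_seconds)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# 2.5 s of silence at 8 kHz -> three chunks of 8000, 8000, and 4000 samples
chunks = list(stream_chunks([0.0] * 20_000, sample_rate=8_000, chunk_seconds=1.0))
print([len(c) for c in chunks])  # [8000, 8000, 4000]
```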