# TransformerPrime: Text-to-Audio (TTA) Pipeline

## Project Overview

**TransformerPrime** is a high-performance, GPU-accelerated text-to-audio generation suite built on top of the Hugging Face `transformers` ecosystem. It is specifically optimized for NVIDIA RTX 40/50 series and datacenter GPUs (A100/H100/B200), targeting low-latency inference and efficient VRAM management (<10 GB for 1B-parameter models).

### Core Technologies

- **Runtime:** Python 3.10+, PyTorch 2.5.0+
- **Backbone:** HF `transformers` (v4.57+), `accelerate`
- **Optimization:** `bitsandbytes` (4-bit/8-bit quantization), FlashAttention-2
- **Interface:** Gradio (Web UI), CLI (argparse)
- **Audio:** `soundfile`, `numpy`

### Architecture

The project is structured as a modular pipeline wrapper:

- `src/text_to_audio/pipeline.py`: Core logic for model loading, inference, memory profiling, and streaming-style chunking.
- `src/text_to_audio/__init__.py`: Public API surface (`build_pipeline`, `TextToAudioPipeline`).
- `demo.py`: Unified entry point for the Gradio web interface and CLI operations.
- `tests/`: Unit tests for pipeline configuration and logic (mocking model downloads).

---

## Building and Running

### Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Optional: Ensure bitsandbytes is installed for quantization support
pip install bitsandbytes
```

### Execution

- **Gradio Web UI (Default):**

  ```bash
  python demo.py --model csm-1b --quantize
  ```

- **CLI Mode:**

  ```bash
  python demo.py --cli --text "Hello from TransformerPrime." --output output.wav --quantize
  ```

### Testing

```bash
# Run unit tests from the root directory
PYTHONPATH=. pytest tests/
```

---

## Development Conventions

### TransformerPrime Persona

When extending this codebase, adhere to the **TransformerPrime** persona (defined in `.cursor/rules/TransformerPrime.mdc`):

- **Precision:** Never hallucinate config values or method signatures.
- **Modern Standards:** Favor `flash_attention_2` over eager attention implementations and `bfloat16` over `float16`.
- **Performance First:** Always consider VRAM footprint and Real-Time Factor (RTF). Use `generate_with_profile()` to validate changes.

### Coding Style

- **Type Safety:** Use Python type hints and `from __future__ import annotations`.
- **Configuration:** Use `dataclasses` (e.g., `PipelineConfig`) for structured parameters.
- **Device Management:** Use `accelerate` or `torch.cuda.is_available()` to handle device placement automatically (`device_map="auto"`).
- **Quantization:** Support `bitsandbytes` for 4-bit (`nf4`) and 8-bit loading to ensure compatibility with consumer GPUs.

### Key Symbols

- `build_pipeline()`: Primary factory for creating pipeline instances.
- `TextToAudioPipeline.generate_with_profile()`: Returns both audio and performance metrics (VRAM, RTF).
- `TextToAudioPipeline.stream_chunks()`: Generator for processing long audio outputs in fixed-duration slices.

---

## Future Roadmap (TODO)

- [ ] Add support for Kokoro-82M and Qwen3-TTS backends.
- [ ] Implement speculative decoding for faster inference on large TTA models.
- [ ] Add real-time streaming playback in the Gradio UI.
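---

## Appendix: Illustrative Sketches

The quantization and device-placement conventions under Coding Style (`nf4` 4-bit loading, `device_map="auto"`, `flash_attention_2`) can be sketched with the standard `transformers` + `bitsandbytes` configuration below. This is a hedged fragment, not the project's actual loading code: the model id is a placeholder, and the real logic lives in `src/text_to_audio/pipeline.py`.

```python
import torch
from transformers import AutoModel, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, per the Coding Style conventions
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModel.from_pretrained(
    "some-org/some-tta-model",       # placeholder model id, not a real checkpoint
    quantization_config=bnb_config,  # falls back to full precision if omitted
    device_map="auto",               # let accelerate place weights across devices
    attn_implementation="flash_attention_2",
)
```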
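The RTF metric that `generate_with_profile()` reports is conventionally the wall-clock generation time divided by the duration of the audio produced, so RTF < 1.0 means faster than real time. The helper below is a minimal sketch of that arithmetic; the function name and signature are illustrative, not the pipeline's actual API.

```python
def real_time_factor(num_samples: int, sample_rate: int, gen_seconds: float) -> float:
    """RTF = generation time / audio duration. RTF < 1.0 means faster than real time."""
    audio_seconds = num_samples / sample_rate
    return gen_seconds / audio_seconds

# Example: 5 s of audio at 24 kHz generated in 2 s of wall-clock time
rtf = real_time_factor(num_samples=120_000, sample_rate=24_000, gen_seconds=2.0)
print(f"RTF: {rtf:.2f}")  # RTF: 0.40 -> 2.5x faster than real time
```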
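The fixed-duration slicing that `TextToAudioPipeline.stream_chunks()` performs can be sketched as a plain generator over a sample buffer. This is an assumption about its shape based on the Key Symbols description, not the actual implementation (which lives in `src/text_to_audio/pipeline.py` and likely operates on tensors).

```python
from typing import Iterator, Sequence

def stream_chunks(samples: Sequence[float], sample_rate: int,
                  chunk_seconds: float = 1.0) -> Iterator[Sequence[float]]:
    """Yield fixed-duration slices of an audio buffer; the last chunk may be shorter."""
    step = int(sample_rate * chunk_seconds)
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# 2.5 s of silence at 8 kHz -> three chunks of 8000, 8000, and 4000 samples
chunks = list(stream_chunks([0.0] * 20_000, sample_rate=8_000, chunk_seconds=1.0))
print([len(c) for c in chunks])  # [8000, 8000, 4000]
```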