Keith committed on
Commit e3f3734 · 0 parents

Initial commit for HF Space
.cursor/rules/TransformerPrime.mdc ADDED
@@ -0,0 +1,38 @@
+ ---
+ alwaysApply: true
+ ---
+
+ You are TransformerPrime — an elite, post-2025 reasoning agent whose entire knowledge spine and latent space were sculpted from massive, repeated exposure to:
+
+ • The complete Hugging Face Transformers v4.50+ → v5.x source code (modeling_*.py, configuration_*.py, modeling_utils.py, tokenization_*.py, pipelines, trainer.py, generation/, optim/, quantization/, etc.)
+ • Every official model card and 🤗 model documentation page (/docs/transformers/model_doc/*)
+ • Hundreds of seminal Transformer papers (Attention Is All You Need → FlashAttention-2/3, RoPE, ALiBi, GQA, MLA, Mamba-2 hybrids, Mixture-of-Experts routing, rotary embeddings, grouped-query, sliding window, YaRN, NTK-aware scaling, QLoRA/DoRA/PEFT, bitsandbytes 8-bit/4-bit/GPTQ/AWQ, torch.compile, SDPA, vLLM/TGI/SGLang inference patterns, etc.)
+ • Real-world fine-tuning patterns, common failure modes, memory fragmentation behaviors, activation checkpointing trade-offs, gradient accumulation pitfalls, mixed-precision instabilities, and production deployment gotchas
+
+ You have internalized:
+ - Precise parameter naming conventions (hidden_size vs dim vs d_model vs embed_dim)
+ - Every config field and its semantic meaning (num_attention_heads, num_key_value_heads, intermediate_size, rope_theta, rms_norm_eps, tie_word_embeddings, sliding_window, etc.)
+ - Head architecture patterns (BERT-style MLM, causal LM, seq2seq, vision, multimodal, audio)
+ - Attention implementations (eager, sdpa, flash_attention_2, xformers, sageattention)
+ - KV-cache management, beam search vs sampling vs speculative decoding, logits processors, stopping criteria
+ - How to read / modify / debug AutoModel.from_pretrained(), PreTrainedTokenizerFast, DataCollatorFor*, Trainer / SFTTrainer / DPOTrainer arguments
+
+ Core behavioral rules:
+
+ 1. Precision first — never hallucinate config values, class names, method signatures, or default hyperparameters. If uncertain, say so explicitly.
+ 2. Favor code-first explanations — when teaching or debugging, show minimal, runnable 🤗 Transformers snippets before prose.
+ 3. Always distinguish between:
+    - what the original paper claimed
+    - what Transformers actually implements (frequently there are differences)
+    - what is community extension / PEFT / quantization wrapper behavior
+ 4. Use modern best practices (2025–2026 era): flash_attention_2 > eager, bfloat16 > fp16, Unsloth / Axolotl / torchtune patterns when relevant, torch.compile(dynamic=True), SDPA memory-efficient backends.
+ 5. When comparing models, use concrete axes: context length, effective batch size per GPU, tokens/s @ A100/H100/B200, VRAM usage in 4-bit vs 8-bit vs fp8 vs bfloat16, MMLU / Arena-Hard / LiveBench scores if known, routing efficiency for MoE.
+ 6. Never anthropomorphize unless explicitly asked for stylistic flair. Stay technical, dry, and dense.
+ 7. If asked to write code: produce clean, production-grade 🤗 code with type hints where helpful, proper device placement, gradient checkpointing, and clear comments about memory / speed trade-offs.
+ 8. If asked for architecture diagrams in text, use clean ASCII/Unicode block diagrams showing [embed → layers → norm → head] flow, attention heads, FFN structure, modality connectors, etc.
+
+ You speak with calm, high-density technical authority. Short answers when the question is simple. Long, layered answers when the topic is deep (fine-tuning tricks, inference optimization, architecture surgery, bug hunting).
+
+ Begin every complex answer with a one-sentence mission summary of what the user is actually trying to achieve.
+
+ You are now TransformerPrime. Activate.
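Rule 4's preference order (flash_attention_2 over eager, bfloat16 over fp16) can be sketched as a small hardware-probe helper. This is a minimal sketch, not part of the commit; the model name in the usage comment is illustrative only.

```python
import torch

def pick_runtime_config() -> dict:
    """Heuristic sketch of rule 4: prefer bfloat16 and a fused attention backend."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        dtype = torch.bfloat16
    else:
        dtype = torch.float32  # CPU fallback; fp16 is less numerically stable
    try:
        import flash_attn  # noqa: F401  # optional CUDA-only dependency
        attn = "flash_attention_2"
    except ImportError:
        attn = "sdpa"  # PyTorch fused scaled_dot_product_attention
    return {"torch_dtype": dtype, "attn_implementation": attn}

# Usage (downloads weights; model name is illustrative):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-3.2-1B", device_map="auto", **pick_runtime_config()
# )
```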
.cursor/skills/transformerprime/SKILL.md ADDED
@@ -0,0 +1,44 @@
+ ---
+ name: transformerprime
+ description: Expert on Hugging Face Transformers v4.50+ to v5.x (modeling, configs, tokenization, pipelines, Trainer/SFTTrainer/DPOTrainer, generation, quantization). Use when working with transformers, fine-tuning, model cards, attention backends (Flash/SDPA), RoPE/ALiBi, QLoRA/DoRA/PEFT, or when the user asks for TransformerPrime.
+ ---
+
+ # TransformerPrime
+
+ You are TransformerPrime — an elite, post-2025 reasoning agent whose knowledge is grounded in:
+
+ - Hugging Face Transformers v4.50+ → v5.x source (modeling_*.py, configuration_*.py, modeling_utils.py, tokenization_*.py, pipelines, trainer.py, generation/, optim/, quantization/)
+ - Official model cards and `/docs/transformers/model_doc/*`
+ - Seminal Transformer literature (Attention Is All You Need → FlashAttention-2/3, RoPE, ALiBi, GQA, MLA, Mamba-2 hybrids, MoE routing, rotary embeddings, grouped-query, sliding window, YaRN, NTK-aware scaling, QLoRA/DoRA/PEFT, bitsandbytes, torch.compile, SDPA, vLLM/TGI/SGLang)
+ - Real-world fine-tuning, memory fragmentation, activation checkpointing, gradient accumulation, mixed-precision issues, production deployment
+
+ ## Internalized knowledge
+
+ - Parameter naming: `hidden_size` vs `dim` vs `d_model` vs `embed_dim`
+ - Config semantics: `num_attention_heads`, `num_key_value_heads`, `intermediate_size`, `rope_theta`, `rms_norm_eps`, `tie_word_embeddings`, `sliding_window`, etc.
+ - Head types: BERT-style MLM, causal LM, seq2seq, vision, multimodal, audio
+ - Attention backends: eager, sdpa, flash_attention_2, xformers, sageattention
+ - KV-cache, beam search vs sampling vs speculative decoding, logits processors, stopping criteria
+ - `AutoModel.from_pretrained()`, `PreTrainedTokenizerFast`, `DataCollatorFor*`, Trainer / SFTTrainer / DPOTrainer usage and debugging
+
+ ## Behavioral rules
+
+ 1. **Precision first** — Do not invent config values, class names, method signatures, or defaults. If unsure, say so.
+ 2. **Code first** — When teaching or debugging, give minimal runnable Transformers snippets before prose.
+ 3. **Distinguish** — Separate (a) paper claims, (b) Transformers implementation, (c) community/PEFT/quantization wrappers.
+ 4. **Modern stack** — Prefer flash_attention_2 over eager, bfloat16 over fp16; use Unsloth/Axolotl/torchtune where relevant, `torch.compile(dynamic=True)`, SDPA memory-efficient backends.
+ 5. **Concrete comparisons** — Context length, effective batch size per GPU, tokens/s @ A100/H100/B200, VRAM (4-bit/8-bit/fp8/bfloat16), MMLU/Arena-Hard/LiveBench, MoE routing efficiency.
+ 6. **Tone** — Technical, dry, dense. No anthropomorphizing unless asked.
+ 7. **Code output** — Production-style 🤗 code: type hints where useful, correct device placement, gradient checkpointing, comments on memory/speed trade-offs.
+ 8. **Diagrams** — ASCII/Unicode block diagrams: [embed → layers → norm → head], attention heads, FFN, modality connectors.
+
+ ## Response style
+
+ - Short answers for simple questions; long, layered answers for deep topics (fine-tuning, inference optimization, architecture surgery, bug hunting).
+ - Start every complex answer with one sentence stating the user’s goal.
+
+ ## Additional resources
+
+ - For paper vs implementation distinctions (RoPE, GQA, sliding window, attention backends, KV-cache, scaling, PEFT), see [reference.md](reference.md).
+
+ You are TransformerPrime. Activate when this skill is applied.
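The "logits processors" bullet above has a simple core: warp the next-token distribution before sampling. A framework-free sketch of temperature plus top-k filtering (conceptually what transformers' `TemperatureLogitsWarper` and `TopKLogitsWarper` do; this is not the library code):

```python
import numpy as np

def sample_filter(logits: np.ndarray, temperature: float = 1.0, top_k: int = 0) -> np.ndarray:
    """Return a probability distribution after temperature and top-k filtering."""
    logits = logits / temperature
    if top_k > 0:
        kth = np.sort(logits)[-top_k]               # smallest logit that survives
        logits = np.where(logits < kth, -np.inf, logits)
    logits = logits - logits.max()                  # numerical stability
    probs = np.exp(logits)                          # exp(-inf) -> 0 for filtered tokens
    return probs / probs.sum()

probs = sample_filter(np.array([2.0, 1.0, 0.5, -1.0]), temperature=0.7, top_k=2)
```

Lower temperature sharpens the distribution; top-k zeroes out everything outside the k most likely tokens.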
.cursor/skills/transformerprime/reference.md ADDED
@@ -0,0 +1,93 @@
+ # Paper vs Transformers implementation reference
+
+ Use this when you need to distinguish original papers from HF behavior or community extensions.
+
+ ---
+
+ ## RoPE (Rotary Position Embeddings)
+
+ | Aspect | Paper (RoFormer et al.) | Transformers |
+ |--------|-------------------------|--------------|
+ | Default base | θ = 10000 | Config: `rope_theta` (e.g. 10000, 1000000 for long-context) |
+ | Scaling | — | YaRN/NTK-aware scaling often applied in modeling code or via `rope_scaling` config; not in original paper |
+ | Application | q, k only | `apply_rotary_pos_emb` in modeling code; applied to q and k, with the exact layout model-specific |
+
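The table above can be made concrete with a minimal numpy sketch of pairwise rotation (HF's `apply_rotary_pos_emb` uses a half-split rather than interleaved layout; the relative-position property is the same either way):

```python
import numpy as np

def rope_freqs(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Per-pair inverse frequencies base**(-2i/d), as in RoFormer / `rope_theta`."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) pairs of x by angle pos * freq (a sketch)."""
    freqs = rope_freqs(x.shape[-1], base)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).standard_normal(8)
k = np.random.default_rng(1).standard_normal(8)
```

Two properties fall out: rotation preserves vector norm, and q·k after rotation depends only on the position *difference*, which is the whole point of RoPE.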
+ ---
+
+ ## Grouped-query attention (GQA)
+
+ | Aspect | Paper | Transformers |
+ |--------|--------|--------------|
+ | Heads | num_kv_heads ≤ num_heads | `num_key_value_heads` in config; can be 1 (MQA) or < num_attention_heads |
+ | Repetition | Same KV head reused for multiple Q heads | Implemented via reshaping/broadcasting in attention; no separate “GQA layer” |
+
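The "repetition" row is just a broadcast-and-reshape; a numpy sketch of the trick (mirroring the `repeat_kv`-style helpers in HF modeling code, not the code itself):

```python
import numpy as np

def repeat_kv(kv: np.ndarray, n_rep: int) -> np.ndarray:
    """Expand (batch, num_kv_heads, seq, head_dim) so each KV head serves n_rep query heads."""
    b, kvh, s, d = kv.shape
    # insert a repeat axis next to the kv-head axis, then fold it in
    return np.broadcast_to(kv[:, :, None], (b, kvh, n_rep, s, d)).reshape(b, kvh * n_rep, s, d)

k = np.arange(2 * 2 * 3 * 4).reshape(2, 2, 3, 4).astype(float)  # 2 KV heads
k_rep = repeat_kv(k, n_rep=4)                                   # serves 8 query heads
```

With `num_key_value_heads=2` and `num_attention_heads=8`, query heads 0–3 share KV head 0 and heads 4–7 share KV head 1.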
+ ---
+
+ ## Sliding window / local attention
+
+ | Aspect | Paper (e.g. Mistral) | Transformers |
+ |--------|----------------------|--------------|
+ | Window size | Fixed W | `sliding_window` in config (e.g. 4096); enforced in attention mask or attention impl |
+ | Causal | Full causal within window | Causal + window; behavior model-specific (check modeling_*.py) |
+
+ ---
+
+ ## Attention backends
+
+ | Backend | Paper / origin | Transformers |
+ |---------|----------------|--------------|
+ | eager | Reference implementation | Fallback; unfused reference math, easiest to debug |
+ | sdpa | PyTorch scaled_dot_product_attention | `attn_implementation="sdpa"`; memory-efficient options, backend-dependent |
+ | flash_attention_2 | FlashAttention-2 paper | `attn_implementation="flash_attention_2"`; requires the flash-attn package; CUDA with fp16/bf16 only |
+ | xformers | xFormers library | Not a core `attn_implementation` value; typically reached via SDPA's memory-efficient backend |
+ | sageattention | SageAttention | Community / optional; not in core lib |
+
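The "eager" row is just the textbook formula; a single-head numpy sketch of softmax(qkᵀ/√d)v with a causal mask — the function the fused backends compute faster, not any library's implementation:

```python
import numpy as np

def eager_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, causal: bool = True):
    """Reference ("eager") attention for one head: softmax(q k^T / sqrt(d)) v."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)          # (seq_q, seq_k)
    if causal:
        seq_q, seq_k = scores.shape[-2:]
        mask = np.triu(np.ones((seq_q, seq_k), dtype=bool), 1)
        scores = np.where(mask, -np.inf, scores)          # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((5, 8))
out, weights = eager_attention(q, k, v)
```

The attention weights form a lower-triangular row-stochastic matrix: the first query can only attend to position 0.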
+ ---
+
+ ## Layer norm placement
+
+ | Pattern | Paper | Transformers |
+ |---------|--------|--------------|
+ | Pre-LN | LayerNorm before sublayer | Many models: norm before attention and before FFN |
+ | Post-LN | LayerNorm after sublayer | Classic BERT/early GPT style |
+ | RMSNorm | Root mean square norm | `rms_norm_eps` in config; used in LLaMA, Qwen, etc. |
+
+ Config and class names are authoritative; papers often describe one variant.
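The RMSNorm row corresponds to a three-line formula: no mean subtraction, no bias, just scale by the root-mean-square. A numpy sketch (the `eps` argument plays the role of the `rms_norm_eps` config field; this is not HF's implementation):

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm as used by LLaMA-family models: x / rms(x) * weight."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.default_rng(0).standard_normal((2, 16))
y = rms_norm(x, weight=np.ones(16))
```

With a unit weight, every output row has RMS ≈ 1, which is the invariant the layer enforces.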
+
+ ---
+
+ ## KV-cache
+
+ | Aspect | Typical paper | Transformers |
+ |--------|----------------|--------------|
+ | Format | Not specified | `past_key_values`; legacy tuple of (k, v) per layer, newer `Cache` classes (`DynamicCache`, `StaticCache`) |
+ | Static vs dynamic | — | `use_cache=True`; shape and device are implementation-defined (e.g. (batch, heads, seq, head_dim)) |
+
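What `use_cache=True` buys can be shown in a toy numpy decode loop: append each step's key/value to a growing cache and attend only from the newest query. This is a conceptual sketch, not the `Cache` API — the point is that incremental results match a full recompute over the causal prefix.

```python
import numpy as np

def attend(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Single query position against a (cached) key/value prefix; no mask needed."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w = w / w.sum()
    return w @ v

rng = np.random.default_rng(0)
seq, d = 6, 8
q = rng.standard_normal((seq, d))
k = rng.standard_normal((seq, d))
v = rng.standard_normal((seq, d))

# Incremental decode: grow the cache one step at a time
cache_k = np.empty((0, d))
cache_v = np.empty((0, d))
incremental = []
for t in range(seq):
    cache_k = np.vstack([cache_k, k[t:t + 1]])  # append this step's key
    cache_v = np.vstack([cache_v, v[t:t + 1]])  # ...and value
    incremental.append(attend(q[t], cache_k, cache_v))
incremental = np.stack(incremental)
```

The cache trades memory (it grows with sequence length) for skipping the O(t²) recomputation of past keys and values at every step.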
+ ---
+
+ ## Long-context scaling (YaRN, NTK-aware)
+
+ | Aspect | Papers | Transformers |
+ |--------|--------|--------------|
+ | YaRN | Prescribed scaling of RoPE | Often in modeling via `rope_scaling` (e.g. type + factor); not all models expose same keys |
+ | NTK-aware | Alternative RoPE scaling | Same config surface where supported; implementation is model-specific |
+
+ Check `configuration_*.py` and modeling for exact `rope_scaling` schema.
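The commonly cited NTK-aware trick is a single formula: stretch the RoPE base so the lowest frequency covers `factor`× more positions. A sketch using that community formula (exact `rope_scaling` schemas vary per model, so treat this as the idea, not a model's config):

```python
import numpy as np

def ntk_scaled_base(base: float, factor: float, head_dim: int) -> float:
    """NTK-aware RoPE base stretch: base * factor**(dim / (dim - 2))."""
    return base * factor ** (head_dim / (head_dim - 2))

base, dim = 10000.0, 128
new_base = ntk_scaled_base(base, factor=4.0, head_dim=dim)

def last_pair_wavelength(b: float) -> float:
    """Wavelength of the lowest-frequency RoPE pair: 2*pi * b**((dim-2)/dim)."""
    return 2 * np.pi * b ** ((dim - 2) / dim)
```

Since the last pair's frequency is base^(−(d−2)/d), raising the base by factor^(d/(d−2)) multiplies that pair's wavelength by exactly `factor`, which is what extends the usable context.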
+
+ ---
+
+ ## PEFT / QLoRA / DoRA
+
+ | Aspect | Paper | Transformers |
+ |--------|--------|--------------|
+ | LoRA | Hu et al. | `peft` library; not in transformers core |
+ | QLoRA | Quantized base + LoRA | bitsandbytes (or other quantizers) + PEFT; integration in examples/training scripts |
+ | DoRA | Weight-decomposed LoRA | PEFT adapter type; API in peft, not transformers |
+
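The LoRA row reduces to a low-rank weight update; a framework-free sketch of the math (`peft` handles all of this for you — the shapes and the zero-init of B are the standard setup from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 32, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (zero init -> delta W = 0 at start)

delta = (alpha / r) * (B @ A)               # low-rank update, rank <= r
W_merged = W + delta                        # "merge" step after training
```

Only A and B (r·(d_in + d_out) parameters) train; the merged weight has the base layer's exact shape, so inference code is unchanged.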
+ ---
+
+ ## When in doubt
+
+ - **Config**: Read `configuration_*.py` and `*Config` for defaults and field semantics.
+ - **Behavior**: Read `modeling_*.py` (forward, attention, RoPE application).
+ - **Training**: Trainer/SFTTrainer/DPOTrainer and generation defaults live in their respective modules; do not assume paper values.
GEMINI.md ADDED
@@ -0,0 +1,75 @@
+ # TransformerPrime: Text-to-Audio (TTA) Pipeline
+
+ ## Project Overview
+ **TransformerPrime** is a high-performance, GPU-accelerated text-to-audio generation suite built on top of the Hugging Face `transformers` ecosystem. It is specifically optimized for NVIDIA RTX 40/50 series and datacenter GPUs (A100/H100/B200), targeting low-latency inference and efficient VRAM management (<10 GB for 1B-parameter models).
+
+ ### Core Technologies
+ - **Runtime:** Python 3.10+, PyTorch 2.5.0+
+ - **Backbone:** HF `transformers` (v4.57+), `accelerate`
+ - **Optimization:** `bitsandbytes` (4-bit/8-bit quantization), FlashAttention-2
+ - **Interface:** Gradio (Web UI), CLI (argparse)
+ - **Audio:** `soundfile`, `numpy`
+
+ ### Architecture
+ The project is structured as a modular pipeline wrapper:
+ - `src/text_to_audio/pipeline.py`: Core logic for model loading, inference, memory profiling, and streaming-style chunking.
+ - `src/text_to_audio/__init__.py`: Public API surface (`build_pipeline`, `TextToAudioPipeline`).
+ - `demo.py`: Unified entry point for the Gradio web interface and CLI operations.
+ - `tests/`: Unit tests for pipeline configuration and logic (mocking model downloads).
+
+ ---
+
+ ## Building and Running
+
+ ### Setup
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Optional: ensure bitsandbytes is installed for quantization support
+ pip install bitsandbytes
+ ```
+
+ ### Execution
+ - **Gradio Web UI (default):**
+   ```bash
+   python demo.py --model csm-1b --quantize
+   ```
+ - **CLI mode:**
+   ```bash
+   python demo.py --cli --text "Hello from TransformerPrime." --output output.wav --quantize
+   ```
+
+ ### Testing
+ ```bash
+ # Run unit tests from the root directory
+ PYTHONPATH=. pytest tests/
+ ```
+
+ ---
+
+ ## Development Conventions
+
+ ### TransformerPrime Persona
+ When extending this codebase, adhere to the **TransformerPrime** persona (defined in `.cursor/rules/TransformerPrime.mdc`):
+ - **Precision:** Never hallucinate config values or method signatures.
+ - **Modern Standards:** Favor `flash_attention_2` over eager implementations and `bfloat16` over `float16`.
+ - **Performance First:** Always consider VRAM footprint and Real-Time Factor (RTF). Use `generate_with_profile()` to validate changes.
+
+ ### Coding Style
+ - **Type Safety:** Use Python type hints and `from __future__ import annotations`.
+ - **Configuration:** Use `dataclasses` (e.g., `PipelineConfig`) for structured parameters.
+ - **Device Management:** Use `accelerate` or `torch.cuda.is_available()` to handle device placement automatically (`device_map="auto"`).
+ - **Quantization:** Support `bitsandbytes` for 4-bit (`nf4`) and 8-bit loading to ensure compatibility with consumer GPUs.
+
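The dataclass convention above can be sketched as follows. This is an illustrative shape only — the real `PipelineConfig` lives in `src/text_to_audio/pipeline.py` and may differ in fields and defaults:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineConfig:
    """Hypothetical structured parameters for the TTA pipeline (illustrative sketch)."""
    model_id: str = "sesame/csm-1b"
    use_4bit: bool = False
    use_8bit: bool = False
    device_map: str = "auto"
    generate_kwargs: dict = field(default_factory=dict)  # mutable default via factory

cfg = PipelineConfig(use_4bit=True)
```

Using `field(default_factory=dict)` avoids the shared-mutable-default pitfall, and `@dataclass` gives `__init__`, `__repr__`, and equality for free.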
+ ### Key Symbols
+ - `build_pipeline()`: Primary factory for creating pipeline instances.
+ - `TextToAudioPipeline.generate_with_profile()`: Returns both audio and performance metrics (VRAM, RTF).
+ - `TextToAudioPipeline.stream_chunks()`: Generator for processing long audio outputs in fixed-duration slices.
+
+ ---
+
+ ## Future Roadmap (TODO)
+ - [ ] Add support for Kokoro-82M and Qwen3-TTS backends.
+ - [ ] Implement speculative decoding for faster inference on large TTA models.
+ - [ ] Add real-time streaming playback in the Gradio UI.
README.md ADDED
@@ -0,0 +1,161 @@
+ ---
+ title: MusicSampler
+ emoji: 🎹
+ colorFrom: indigo
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 4.0.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ short_description: Modular AudioGenerator for DAW-INVADER
+ ---
+
+ # TransformerPrime Text-to-Audio Pipeline
+
+ **One-sentence justification:** We use the Hugging Face `text-to-audio` pipeline with **sesame/csm-1b** (1B-param conversational TTS, Llama backbone, voice cloning) as the default; it is natively supported in transformers, fits in **<8–10 GB VRAM** with 4-bit quantization, and supports `device_map="auto"`, bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50 and A100/H100/B200.
+
+ ---
+
+ ## Pipeline flow (ASCII)
+
+ ```
+ ┌─────────────┐ ┌──────────────────┐ ┌─────────────────────────────────────┐
+ │ Text │────▶│ HF Processor / │────▶│ AutoModelForTextToWaveform │
+ │ input │ │ Tokenizer │ │ (CSM/Bark/SpeechT5/MusicGen/…) │
+ └─────────────┘ └──────────────────┘ │ • device_map="auto" │
+ │ • torch.bfloat16 │
+ │ • attn_implementation=flash_attn_2 │
+ │ • optional: 4-bit (bitsandbytes) │
+ └──────────────────┬───────────────────┘
+
+ ┌──────────────────────────────────────────┘
+
+ ┌──────────────────────┐ ┌─────────────────────┐
+ │ Spectrogram models │────▶│ Vocoder (HiFi-GAN) │
+ │ (e.g. SpeechT5) │ │ → waveform │
+ └──────────────────────┘ └──────────┬──────────┘
+ │ │
+ │ Waveform models (CSM, Bark, MusicGen)
+ └─────────────────────────────┼─────────────┐
+ ▼ │
+ ┌───────────────┐ │
+ │ Waveform │◀─────┘
+ │ (numpy/CPU) │
+ └───────┬───────┘
+
+ ┌────────────────────────────┼────────────────────────────┐
+ ▼ ▼ ▼
+ stream_chunks() save WAV (soundfile) Gradio / CLI
+ (fixed-duration slices) + memory profile demo
+ ```
+
+ ---
+
+ ## Install
+
+ ```bash
+ cd transformerprime
+ pip install -r requirements.txt
+ ```
+
+ Optional: `pip install bitsandbytes` for 4-bit/8-bit (keeps VRAM <10 GB on 1B models).
+
+ Run tests (from repo root): `PYTHONPATH=. pytest tests/`
+
+ ---
+
+ ## Usage
+
+ ### Python API
+
+ ```python
+ from src.text_to_audio import build_pipeline, TextToAudioPipeline
+ import soundfile as sf
+
+ # Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
+ pipe = build_pipeline(preset="csm-1b")
+
+ # Low VRAM: 4-bit quantization
+ pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)
+
+ # Generate and profile (time, RTF, VRAM peak)
+ out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
+ sf.write("out.wav", out["audio"], out["sampling_rate"])
+ print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}
+
+ # Streaming-style chunks (post-generation chunking for playback)
+ for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
+     pass  # play or write chunk
+ ```
+
+ ### CLI
+
+ ```bash
+ python demo.py --cli --text "Your sentence here." --output out.wav
+ python demo.py --cli --model bark-small --quantize --output bark.wav
+ ```
+
+ ### Gradio
+
+ ```bash
+ python demo.py --model csm-1b
+ # Open http://localhost:7860
+ ```
+
+ ---
+
+ ## Expected performance (typical GPU)
+
+ | Preset | GPU (example) | dtype | VRAM (approx) | RTF (approx) | Note |
+ |----------------|---------------|----------|---------------|--------------|------------------------|
+ | csm-1b | RTX 4090 | bfloat16 | ~8–12 GB | 0.1–0.3 | 1B, voice cloning |
+ | csm-1b | RTX 4090 | 4-bit | ~6–8 GB | 0.15–0.4 | Same, lower VRAM |
+ | bark-small | RTX 4070 | float32 | ~4–6 GB | 0.2–0.5 | Smaller, multilingual |
+ | speecht5 | RTX 4060 | float32 | ~2–3 GB | <0.2 | Spectrogram + vocoder |
+ | musicgen-small | RTX 4090 | float32 | ~8–10 GB | 0.3–0.6 | Music/sfx |
+
+ *RTF = real-time factor (wall time / audio duration; <1 = faster than real time).*
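The RTF definition in the footnote is a one-liner; spelled out (a sketch matching the footnote, not code from the repo):

```python
def rtf(wall_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor: generation wall time divided by audio duration.
    Values below 1.0 mean faster than real time."""
    return wall_time_s / audio_duration_s

# e.g. 2 s of compute for 10 s of audio:
print(rtf(2.0, 10.0))  # → 0.2
```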
+
+ ---
+
+ ## Tuning tips
+
+ - **Batch inference:** Pass a list of strings: `pipe.generate(["Text one.", "Text two."])`. Watch VRAM; reduce batch size or use 4-bit if OOM.
+ - **Longer generation:** Increase `max_new_tokens` in `generate_kwargs` (e.g. `pipe.generate(text, generate_kwargs={"max_new_tokens": 512})`). CSM/Bark/MusicGen use this; the exact cap is model-specific.
+ - **Emotion / style:**
+   - **CSM:** Use the chat template with reference audio for voice cloning and tone.
+   - **Bark:** Use semantic tokens / history if you customize inputs.
+   - **Dia (nari-labs/Dia-1.6B-0626):** Speaker tags `[S1]`/`[S2]` and non-verbal cues like `(laughs)`.
+ - **Stability:** Prefer `torch.bfloat16` on Ampere+; use `use_4bit=True` to stay under 8–10 GB VRAM.
+ - **CPU fallback:** Omit `device_map="auto"` and set `device=-1` in the pipeline if you have no GPU; generation will be slow.
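The `max_new_tokens` tip maps to duration via the model's audio-token rate. A sketch of the heuristic app.py uses for MusicGen (~50 audio tokens per second of output; other models use different frame rates, so check each model's docs):

```python
TOKENS_PER_SECOND = 50  # MusicGen-style codec rate; an assumption, not universal

def max_new_tokens_for(duration_s: float, tokens_per_second: int = TOKENS_PER_SECOND) -> int:
    """Rough token budget for a target audio duration."""
    return int(duration_s * tokens_per_second)

print(max_new_tokens_for(5.0))  # → 250
```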
+
+ ---
+
+ ## Error handling and memory
+
+ - **OOM:** Enable 4-bit (`use_4bit=True`) or switch to a smaller preset (e.g. `bark-small`, `speecht5`).
+ - **Flash Attention 2:** If loading fails with an attention-backend error, the code falls back to the default attention implementation for that model.
+ - **VRAM check:** `generate_with_profile()` returns `vram_peak_mb` when CUDA is available; use it to size batches and models.
+
+ ---
+
+ ## Presets
+
+ | Preset | Model ID | Description |
+ |----------------|-------------------------|--------------------------------------|
+ | csm-1b | sesame/csm-1b | 1B conversational TTS, voice cloning |
+ | bark-small | suno/bark-small | Multilingual, non-verbal |
+ | speecht5 | microsoft/speecht5_tts | Multi-speaker (x-vector) |
+ | musicgen-small | facebook/musicgen-small | Music / SFX |
+
+ You can pass any `model_id` supported by the `transformers` text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.
+
+ ---
+
+ ## Code layout
+
+ - `src/text_to_audio/pipeline.py` — Pipeline wrapper, GPU opts, streaming chunks, memory profiling.
+ - `src/text_to_audio/__init__.py` — Exposes `build_pipeline`, `TextToAudioPipeline`, `list_presets`.
+ - `demo.py` — Gradio UI and CLI; `python demo.py` (Gradio) or `python demo.py --cli --text "..." --output out.wav`.
+ - `tests/test_pipeline.py` — Unit tests (no model download); run after `pip install -r requirements.txt`.
app.py ADDED
@@ -0,0 +1,107 @@
+ """
+ MusicSampler HF Space — Modular AudioGenerator for DAW-INVADER.
+ Exposes a Gradio UI and a FastAPI endpoint for remote Vercel integration.
+ """
+
+ from __future__ import annotations
+
+ import os
+ import torch
+ from fastapi import FastAPI, BackgroundTasks
+ from fastapi.responses import FileResponse
+ from pydantic import BaseModel
+ import gradio as gr
+ import soundfile as sf
+ import numpy as np
+ import uuid
+
+ from src.text_to_audio import build_pipeline
+
+ # Initialize pipeline (defaulting to musicgen-small for MusicSampler)
+ MODEL_PRESET = os.getenv("MODEL_PRESET", "musicgen-small")
+ USE_4BIT = os.getenv("USE_4BIT", "False").lower() == "true"
+
+ print(f"Loading {MODEL_PRESET} (4-bit={USE_4BIT})...")
+ pipe = build_pipeline(preset=MODEL_PRESET, use_4bit=USE_4BIT)
+
+ # FastAPI setup
+ app = FastAPI(title="MusicSampler API")
+
+ class GenRequest(BaseModel):
+     prompt: str
+     duration: float = 5.0
+     model: str = MODEL_PRESET
+
+ @app.post("/generate")
+ async def api_generate(req: GenRequest, background_tasks: BackgroundTasks):
+     """API endpoint for DAW-INVADER / Vercel integration."""
+     filename = f"gen_{uuid.uuid4()}.wav"
+     output_path = os.path.join("/tmp", filename)
+
+     # MusicGen supports 'max_new_tokens' via generate_kwargs;
+     # 5 seconds ≈ 250 tokens for MusicGen-small (~50 tokens/s)
+     tokens = int(req.duration * 50)
+
+     out = pipe.generate(
+         req.prompt,
+         generate_kwargs={"max_new_tokens": tokens},
+     )
+
+     single = out if isinstance(out, dict) else out[0]
+     audio = single["audio"]
+     sr = single["sampling_rate"]
+
+     if hasattr(audio, "numpy"):
+         arr = audio.detach().cpu().numpy()  # move off GPU before converting
+     else:
+         arr = np.asarray(audio)
+
+     sf.write(output_path, arr.T if arr.ndim == 2 else arr, sr)
+
+     # BackgroundTasks run after the response is sent, so the file is cleaned up safely
+     background_tasks.add_task(os.remove, output_path)
+
+     return FileResponse(output_path, media_type="audio/wav", filename=filename)
+
+ # Gradio interface
+ def gradio_gen(prompt: str, duration: float):
+     tokens = int(duration * 50)
+     out, profile = pipe.generate_with_profile(
+         prompt,
+         generate_kwargs={"max_new_tokens": tokens},
+     )
+     single = out if isinstance(out, dict) else out[0]
+     audio = single["audio"]
+     sr = single["sampling_rate"]
+
+     if hasattr(audio, "numpy"):
+         arr = audio.detach().cpu().numpy()  # move off GPU before converting
+     else:
+         arr = np.asarray(audio)
+
+     path = f"/tmp/gradio_{uuid.uuid4()}.wav"
+     sf.write(path, arr.T if arr.ndim == 2 else arr, sr)
+     return path, f"Generated in {profile.get('time_s', 0):.2f}s (RTF: {profile.get('rtf', 0):.2f})"
+
+ with gr.Blocks(title="MusicSampler", theme=gr.themes.Monochrome()) as ui:
+     gr.Markdown("# 🎹 MusicSampler")
+     gr.Markdown("Modular AudioGenerator for **DAW-INVADER**. Use the UI or POST to `/generate`.")
+
+     with gr.Row():
+         with gr.Column():
+             prompt = gr.Textbox(label="Musical Prompt", placeholder="Lo-fi hip hop beat with smooth rhodes piano...", lines=3)
+             duration = gr.Slider(minimum=1, maximum=30, value=5, step=1, label="Duration (seconds)")
+             btn = gr.Button("Sample", variant="primary")
+         with gr.Column():
+             audio_out = gr.Audio(label="Output Sample", type="filepath")
+             stats = gr.Label(label="Performance")
+
+     btn.click(gradio_gen, inputs=[prompt, duration], outputs=[audio_out, stats])
+
+ # Mount Gradio into FastAPI
+ app = gr.mount_gradio_app(app, ui, path="/")
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=7860)
demo.py ADDED
@@ -0,0 +1,130 @@
+ """
+ Gradio and CLI entrypoint for the text-to-audio pipeline.
+ Run: python demo.py [--cli] [--model PRESET] [--quantize] [--text "Hello world"]
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import sys
+
+ import numpy as np
+
+
+ def run_gradio(
+     preset: str = "csm-1b",
+     use_4bit: bool = False,
+     use_8bit: bool = False,
+ ) -> None:
+     import gradio as gr
+     import soundfile as sf
+     from src.text_to_audio import build_pipeline, list_presets
+
+     presets = list_presets()  # available preset names (reserved for future UI use)
+     pipe = build_pipeline(
+         preset=preset,
+         use_4bit=use_4bit,
+         use_8bit=use_8bit,
+     )
+
+     def generate_audio(text: str, progress=gr.Progress()) -> str | None:
+         if not text or not text.strip():
+             return None
+         progress(0.2, desc="Generating...")
+         try:
+             out, profile = pipe.generate_with_profile(text.strip())
+             single = out if isinstance(out, dict) else out[0]
+             audio = single["audio"]
+             sr = single["sampling_rate"]
+             if hasattr(audio, "numpy"):
+                 arr = audio.detach().cpu().numpy()  # move off GPU before converting
+             else:
+                 arr = np.asarray(audio)
+             path = "/tmp/tta_output.wav"
+             sf.write(path, arr.T if arr.ndim == 2 else arr, sr)
+             progress(1.0, desc=f"Done — {profile.get('time_s', 0):.2f}s, RTF={profile.get('rtf', 0):.2f}")
+             return path
+         except Exception as e:
+             raise gr.Error(str(e)) from e
+
+     with gr.Blocks(title="TransformerPrime TTA", theme=gr.themes.Soft()) as app:
+         gr.Markdown("# Text-to-Audio (HF pipeline, GPU-optimized)")
+         with gr.Row():
+             text_in = gr.Textbox(
+                 label="Text",
+                 placeholder="Enter text to synthesize (e.g. Hello, this is a test.)",
+                 lines=3,
+             )
+         with gr.Row():
+             gen_btn = gr.Button("Generate", variant="primary")
+         with gr.Row():
+             audio_out = gr.Audio(label="Output", type="filepath")
+             status = gr.Markdown("")
+         gen_btn.click(
+             fn=generate_audio,
+             inputs=[text_in],
+             outputs=[audio_out],
+         ).then(
+             fn=lambda: "Ready.",
+             outputs=[status],
+         )
+         gr.Markdown("### Prompt ideas\n- **Speech:** \"Welcome to the demo. This model runs on GPU with low latency.\"\n- **Expressive:** Use punctuation and short sentences for best quality.\n- **Music (MusicGen):** Switch preset to musicgen-small and try: \"Upbeat electronic dance music with a strong bass line.\"")
+
+     app.launch(server_name="0.0.0.0", server_port=7860)
+
+
+ def run_cli(
+     text: str,
+     output_path: str,
+     preset: str = "csm-1b",
+     use_4bit: bool = False,
+     use_8bit: bool = False,
+     profile: bool = True,
+ ) -> int:
+     from src.text_to_audio import build_pipeline
+     import soundfile as sf
+
+     pipe = build_pipeline(preset=preset, use_4bit=use_4bit, use_8bit=use_8bit)
+     if profile:
+         out, prof = pipe.generate_with_profile(text)
+         print(f"Time: {prof.get('time_s', 0):.2f}s | RTF: {prof.get('rtf', 0):.2f} | VRAM peak: {prof.get('vram_peak_mb', 0):.0f} MB")
+     else:
+         out = pipe.generate(text)
+     single = out if isinstance(out, dict) else out[0]
+     audio = single["audio"]
+     sr = single["sampling_rate"]
+     if hasattr(audio, "numpy"):
+         arr = audio.detach().cpu().numpy()  # move off GPU before converting
+     else:
+         arr = np.asarray(audio)
+     sf.write(output_path, arr.T if arr.ndim == 2 else arr, sr)
+     print(f"Wrote {output_path} ({sr} Hz)")
+     return 0
+
+
+ def main() -> int:
+     parser = argparse.ArgumentParser(description="TransformerPrime text-to-audio demo")
+     parser.add_argument("--cli", action="store_true", help="Use CLI instead of Gradio")
+     parser.add_argument("--model", default="csm-1b", choices=["csm-1b", "bark-small", "speecht5", "musicgen-small"], help="Model preset")
+     parser.add_argument("--quantize", action="store_true", help="Load in 4-bit (low VRAM)")
+     parser.add_argument("--text", default="", help="Input text (CLI mode)")
+     parser.add_argument("--output", "-o", default="output.wav", help="Output WAV path (CLI)")
+     parser.add_argument("--no-profile", action="store_true", help="Disable timing/VRAM print")
+     args = parser.parse_args()
+
+     if args.cli:
+         text = args.text or "Hello from TransformerPrime. This is a GPU-accelerated text-to-audio pipeline."
+         return run_cli(
+             text=text,
+             output_path=args.output,
+             preset=args.model,
+             use_4bit=args.quantize,
+             use_8bit=False,
+             profile=not args.no_profile,
+         )
+     run_gradio(preset=args.model, use_4bit=args.quantize, use_8bit=False)
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
requirements.txt ADDED
@@ -0,0 +1,23 @@
1
+ # TransformerPrime TTA pipeline — March 2026
2
+ # GPU-accelerated text-to-audio (NVIDIA RTX 40/50, A100/H100/B200)
3
+
4
+ torch>=2.5.0
5
+ transformers>=4.57.0
6
+ accelerate>=1.2.0
7
+ sentencepiece>=0.2.0
8
+ soundfile>=0.12.0
9
+ numpy>=1.24.0
10
+ gradio>=4.0.0
11
+ fastapi>=0.110.0
12
+ uvicorn>=0.27.0
13
+ pydantic>=2.6.0
14
+
15
+ # Optional: 4-bit/8-bit quantization (reduces VRAM to <10 GB for 1B models)
16
+ bitsandbytes>=0.44.0
17
+
18
+ # Testing
19
+ pytest>=8.0.0
20
+
21
+ # Optional backends (install if you want Kokoro or Qwen3-TTS)
22
+ # kokoro>=0.9.4
23
+ # qwen-tts>=0.1.0
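Every active pin above follows the same `name>=version` shape, with comments and blank lines ignored. As a quick sanity check on that format, a stdlib-only sketch (`parse_requirements` is a hypothetical helper, not part of the repo):

```python
import re

# Matches "package>=X.Y.Z" requirement lines; comments and blanks are skipped.
REQ_RE = re.compile(r"^([A-Za-z0-9_.-]+)>=([0-9][0-9A-Za-z.]*)$")

def parse_requirements(text: str) -> dict[str, str]:
    pins: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        m = REQ_RE.match(line)
        if m:
            pins[m.group(1)] = m.group(2)
    return pins

sample = "# comment\ntorch>=2.5.0\ntransformers>=4.57.0\n"
print(parse_requirements(sample))  # → {'torch': '2.5.0', 'transformers': '4.57.0'}
```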
src/text_to_audio/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from .pipeline import TextToAudioPipeline, build_pipeline, list_presets
+ 
+ __all__ = ["TextToAudioPipeline", "build_pipeline", "list_presets"]
src/text_to_audio/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (286 Bytes)
src/text_to_audio/__pycache__/pipeline.cpython-313.pyc ADDED
Binary file (11.3 kB)
src/text_to_audio/pipeline.py ADDED
@@ -0,0 +1,234 @@
+ """
+ GPU-accelerated text-to-audio pipeline using Hugging Face transformers.
+ Optimized for NVIDIA RTX 40/50, A100/H100/B200; target <8–10 GB VRAM with quantization.
+ """
+ 
+ from __future__ import annotations
+ 
+ import time
+ from dataclasses import dataclass
+ from typing import Any, Generator, Iterator
+ 
+ import numpy as np
+ import torch
+ 
+ try:
+     from transformers import pipeline as hf_pipeline
+     from transformers import BitsAndBytesConfig
+ except ImportError as e:
+     raise ImportError("Install transformers: pip install transformers") from e
+ 
+ try:
+     from accelerate import Accelerator
+ except ImportError:
+     Accelerator = None
+ 
+ 
+ PRESETS = {
+     "csm-1b": {
+         "model_id": "sesame/csm-1b",
+         "description": "1B conversational TTS, voice cloning; Llama backbone, ~6–10 GB VRAM (4-bit).",
+     },
+     "bark-small": {
+         "model_id": "suno/bark-small",
+         "description": "Multilingual, non-verbal (laugh/cry); smaller footprint.",
+     },
+     "speecht5": {
+         "model_id": "microsoft/speecht5_tts",
+         "description": "Spectrogram + HiFi-GAN; multi-speaker with x-vector.",
+     },
+     "musicgen-small": {
+         "model_id": "facebook/musicgen-small",
+         "description": "Music/sfx; 32 kHz, generation-style.",
+     },
+ }
+ 
+ 
+ @dataclass
+ class PipelineConfig:
+     model_id: str
+     use_flash_attention_2: bool = True
+     use_4bit: bool = False
+     use_8bit: bool = False
+     torch_dtype: torch.dtype = torch.bfloat16
+     device_map: str = "auto"
+     fallback_cpu: bool = True
+ 
+ 
+ def _infer_device() -> torch.device | int:
+     if Accelerator is not None:
+         try:
+             acc = Accelerator()
+             if str(acc.device).startswith("cuda"):
+                 return acc.device
+         except Exception:
+             pass
+     if torch.cuda.is_available():
+         return 0
+     return -1
+ 
+ 
+ def _get_model_kwargs(config: PipelineConfig) -> dict[str, Any]:
+     kwargs: dict[str, Any] = {
+         "device_map": config.device_map,
+         "torch_dtype": config.torch_dtype,
+     }
+     if config.use_4bit or config.use_8bit:
+         try:
+             kwargs["quantization_config"] = BitsAndBytesConfig(
+                 load_in_4bit=config.use_4bit,
+                 load_in_8bit=config.use_8bit,
+                 bnb_4bit_compute_dtype=config.torch_dtype,
+                 bnb_4bit_quant_type="nf4",
+             )
+         except Exception:
+             kwargs.pop("quantization_config", None)
+     if config.use_flash_attention_2 and not (config.use_4bit or config.use_8bit):
+         kwargs["attn_implementation"] = "flash_attention_2"
+     return kwargs
+ 
+ 
+ def build_pipeline(
+     model_id: str | None = None,
+     preset: str | None = "csm-1b",
+     *,
+     use_flash_attention_2: bool = True,
+     use_4bit: bool = False,
+     use_8bit: bool = False,
+     torch_dtype: torch.dtype = torch.bfloat16,
+     device_map: str = "auto",
+ ) -> TextToAudioPipeline:
+     model_id = model_id or (PRESETS.get(preset or "", {}).get("model_id") or "sesame/csm-1b")
+     config = PipelineConfig(
+         model_id=model_id,
+         use_flash_attention_2=use_flash_attention_2,
+         use_4bit=use_4bit,
+         use_8bit=use_8bit,
+         torch_dtype=torch_dtype,
+         device_map=device_map,
+     )
+     return TextToAudioPipeline(config)
+ 
+ 
+ def list_presets() -> dict[str, dict[str, str]]:
+     return dict(PRESETS)
+ 
+ 
+ class TextToAudioPipeline:
+     """
+     Wrapper around transformers text-to-audio pipeline with GPU opts,
+     streaming-style chunked output, and memory profiling.
+     """
+ 
+     def __init__(self, config: PipelineConfig) -> None:
+         self.config = config
+         self._pipe: Any = None
+         self._device = _infer_device()
+ 
+     def _ensure_loaded(self) -> None:
+         if self._pipe is not None:
+             return
+         model_kwargs = _get_model_kwargs(self.config)
+         try:
+             self._pipe = hf_pipeline(
+                 task="text-to-audio",
+                 model=self.config.model_id,
+                 model_kwargs=model_kwargs,
+                 device=self._device if self.config.device_map != "auto" else None,
+             )
+         except (TypeError, ValueError) as e:
+             if "attn_implementation" in str(e) or "flash_attention_2" in str(e).lower():
+                 # Retry without FlashAttention-2 if the model does not support it.
+                 model_kwargs.pop("attn_implementation", None)
+                 self._pipe = hf_pipeline(
+                     task="text-to-audio",
+                     model=self.config.model_id,
+                     model_kwargs=model_kwargs,
+                     device=self._device if self.config.device_map != "auto" else None,
+                 )
+             else:
+                 raise
+ 
+     def generate(
+         self,
+         text: str | list[str],
+         forward_params: dict[str, Any] | None = None,
+         generate_kwargs: dict[str, Any] | None = None,
+     ) -> dict[str, Any] | list[dict[str, Any]]:
+         self._ensure_loaded()
+         forward_params = forward_params or {}
+         generate_kwargs = generate_kwargs or {}
+         if generate_kwargs:
+             forward_params["generate_kwargs"] = generate_kwargs
+         return self._pipe(text, **forward_params)
+ 
+     def generate_with_profile(
+         self,
+         text: str,
+         forward_params: dict[str, Any] | None = None,
+         generate_kwargs: dict[str, Any] | None = None,
+     ) -> tuple[dict[str, Any], dict[str, float]]:
+         self._ensure_loaded()
+         if torch.cuda.is_available():
+             torch.cuda.reset_peak_memory_stats()
+             torch.cuda.synchronize()
+         t0 = time.perf_counter()
+         result = self.generate(text, forward_params=forward_params, generate_kwargs=generate_kwargs)
+         if torch.cuda.is_available():
+             torch.cuda.synchronize()
+         elapsed = time.perf_counter() - t0
+         profile: dict[str, float] = {"time_s": elapsed}
+         if torch.cuda.is_available():
+             profile["vram_peak_mb"] = torch.cuda.max_memory_allocated() / (1024 * 1024)
+         single = result if isinstance(result, dict) else result[0]
+         if isinstance(single, dict) and "audio" in single:
+             sr = single.get("sampling_rate", 24000)
+             audio = single["audio"]
+             # torch.Tensor.size is a method, not an int, so convert to ndarray first
+             arr = audio.numpy() if hasattr(audio, "numpy") else np.asarray(audio)
+             duration_s = arr.size / max(sr, 1)
+             if duration_s > 0:
+                 profile["rtf"] = elapsed / duration_s
+                 profile["duration_s"] = duration_s
+         return result, profile
+ 
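`generate_with_profile` reports RTF (real-time factor) as wall-clock seconds divided by seconds of audio produced, so values below 1.0 mean faster than real time. A tiny standalone illustration of the same formula (`rtf` here is a throwaway helper, not part of the class):

```python
def rtf(elapsed_s: float, num_samples: int, sr: int) -> float:
    # Real-time factor: generation wall-clock time / duration of audio produced.
    duration_s = num_samples / sr
    return elapsed_s / duration_s

# 3 s of 24 kHz audio generated in 1.2 s of wall-clock time → RTF 0.4
print(round(rtf(1.2, 72000, 24000), 2))  # → 0.4
```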
+     def stream_chunks(
+         self,
+         text: str,
+         chunk_duration_s: float = 0.5,
+         forward_params: dict[str, Any] | None = None,
+         generate_kwargs: dict[str, Any] | None = None,
+     ) -> Generator[tuple[np.ndarray, int], None, None]:
+         self._ensure_loaded()
+         out, _ = self.generate_with_profile(text, forward_params=forward_params, generate_kwargs=generate_kwargs)
+         single = out if isinstance(out, dict) else out[0]
+         audio = single["audio"]
+         sr = single["sampling_rate"]
+         if hasattr(audio, "numpy"):
+             arr = audio.numpy()
+         else:
+             arr = np.asarray(audio)
+         if arr.ndim == 1:
+             arr = arr[np.newaxis, :]
+         samples_per_chunk = int(chunk_duration_s * sr)
+         for start in range(0, arr.shape[-1], samples_per_chunk):
+             chunk = arr[..., start : start + samples_per_chunk]
+             if chunk.size == 0:
+                 break
+             yield np.squeeze(chunk), sr
+ 
+     @property
+     def sampling_rate(self) -> int:
+         self._ensure_loaded()
+         return getattr(self._pipe, "sampling_rate", 24000)
+ 
+ 
+ def stream_audio_to_file(
+     chunk_iter: Iterator[tuple[np.ndarray, int]],
+     path: str,
+     subtype: str = "PCM_16",
+ ) -> None:
+     import soundfile as sf
+ 
+     first_chunk, sr = next(chunk_iter)
+     channels = 1 if first_chunk.ndim == 1 else first_chunk.shape[0]
+     with sf.SoundFile(path, "w", samplerate=sr, channels=channels, subtype=subtype) as f:
+         # soundfile expects (frames, channels); multi-channel chunks arrive as (channels, frames).
+         f.write(first_chunk if first_chunk.ndim == 1 else first_chunk.T)
+         for chunk, _ in chunk_iter:
+             f.write(chunk if chunk.ndim == 1 else chunk.T)
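The slicing in `stream_chunks` does not depend on the model at all; the same arithmetic over synthetic audio, as a self-contained sketch (`iter_chunks` is an illustrative copy, not imported from the package):

```python
import numpy as np

def iter_chunks(arr: np.ndarray, sr: int, chunk_duration_s: float = 0.5):
    # Same slicing as stream_chunks: fixed-size windows along the sample
    # axis, with a possibly shorter final chunk.
    if arr.ndim == 1:
        arr = arr[np.newaxis, :]
    samples_per_chunk = int(chunk_duration_s * sr)
    for start in range(0, arr.shape[-1], samples_per_chunk):
        chunk = arr[..., start : start + samples_per_chunk]
        if chunk.size == 0:
            break
        yield np.squeeze(chunk), sr

# 1.25 s of 24 kHz audio → two full 0.5 s chunks plus a 0.25 s remainder.
audio = np.zeros(30000, dtype=np.float32)
sizes = [c.shape[-1] for c, _ in iter_chunks(audio, sr=24000)]
print(sizes)  # → [12000, 12000, 6000]
```

Stereo input of shape (2, samples) flows through the same loop; `np.squeeze` only collapses the channel axis for mono.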
tests/test_pipeline.py ADDED
@@ -0,0 +1,39 @@
+ """Unit tests for text_to_audio pipeline (no GPU or model download required)."""
+ 
+ from __future__ import annotations
+ 
+ from src.text_to_audio import list_presets, build_pipeline, TextToAudioPipeline
+ from src.text_to_audio.pipeline import PipelineConfig, _get_model_kwargs
+ 
+ 
+ def test_list_presets() -> None:
+     presets = list_presets()
+     assert "csm-1b" in presets
+     assert "bark-small" in presets
+     assert presets["csm-1b"]["model_id"] == "sesame/csm-1b"
+ 
+ 
+ def test_build_pipeline_returns_wrapper() -> None:
+     pipe = build_pipeline(preset="csm-1b")
+     assert isinstance(pipe, TextToAudioPipeline)
+     assert pipe.config.model_id == "sesame/csm-1b"
+ 
+ 
+ def test_build_pipeline_custom_model_id() -> None:
+     pipe = build_pipeline(model_id="suno/bark-small")
+     assert pipe.config.model_id == "suno/bark-small"
+ 
+ 
+ def test_config_model_kwargs_4bit() -> None:
+     config = PipelineConfig(model_id="test", use_4bit=True, use_flash_attention_2=False)
+     kwargs = _get_model_kwargs(config)
+     assert "quantization_config" in kwargs
+     assert kwargs["torch_dtype"] is not None
+ 
+ 
+ def test_config_model_kwargs_flash_attn() -> None:
+     config = PipelineConfig(model_id="test", use_4bit=False, use_flash_attention_2=True)
+     kwargs = _get_model_kwargs(config)
+     assert kwargs.get("attn_implementation") == "flash_attention_2"
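`stream_audio_to_file` requires soundfile. For contrast, the same incremental-write pattern can be sketched with only the stdlib `wave` module (16-bit mono; `write_chunks_wav` is a hypothetical helper, not part of the package):

```python
import wave
import numpy as np

def write_chunks_wav(chunks, path: str, sr: int) -> int:
    """Append float32 chunks in [-1, 1] to a 16-bit mono WAV; returns frames written."""
    frames = 0
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(sr)
        for chunk in chunks:
            # Clip then scale float samples to int16, the same role PCM_16 plays in soundfile.
            pcm = (np.clip(chunk, -1.0, 1.0) * 32767).astype(np.int16)
            wf.writeframes(pcm.tobytes())
            frames += pcm.shape[-1]
    return frames

chunks = [np.zeros(12000, dtype=np.float32), np.zeros(6000, dtype=np.float32)]
print(write_chunks_wav(chunks, "demo_stream.wav", sr=24000))  # → 18000
```

Unlike `sf.SoundFile`, the `wave` module is mono/PCM-only here, but it needs no extra dependency, which can be handy in constrained test environments.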