---
title: MusicSampler
emoji: 🎹
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Modular AudioGenerator for DAW-INVADER
---
# TransformerPrime Text-to-Audio Pipeline
**One-sentence justification:** We use the Hugging Face `text-to-audio` pipeline with **sesame/csm-1b** (1B-param conversational TTS, Llama backbone, voice cloning) as the default; it is natively supported in transformers, fits in **under 8–10 GB of VRAM** with 4-bit quantization, and supports `device_map="auto"`, bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50 and A100/H100/B200.
---
## Pipeline flow (ASCII)
```
┌─────────────┐    ┌──────────────────┐    ┌───────────────────────────────────────┐
│ Text input  │───▶│ HF Processor /   │───▶│ AutoModelForTextToWaveform            │
└─────────────┘    │ Tokenizer        │    │ (CSM/Bark/SpeechT5/MusicGen/…)        │
                   └──────────────────┘    │  • device_map="auto"                  │
                                           │  • torch.bfloat16                     │
                                           │  • attn_implementation=flash_attn_2   │
                                           │  • optional: 4-bit (bitsandbytes)     │
                                           └───────────────────┬───────────────────┘
                                                               │
                     ┌─────────────────────────────────────────┤
                     ▼                                         │
          ┌──────────────────────┐    ┌─────────────────────┐  │
          │ Spectrogram models   │───▶│ Vocoder (HiFi-GAN)  │  │ Waveform models
          │ (e.g. SpeechT5)      │    │  → waveform         │  │ (CSM, Bark, MusicGen)
          └──────────────────────┘    └──────────┬──────────┘  │
                                                 │             │
                                                 ▼             ▼
                                           ┌───────────────────────┐
                                           │ Waveform (numpy/CPU)  │
                                           └───────────┬───────────┘
                                                       │
                ┌──────────────────────────────────────┼─────────────────────────────┐
                ▼                                      ▼                             ▼
        stream_chunks()                      save WAV (soundfile)              Gradio / CLI
    (fixed-duration slices)                    + memory profile                    demo
```
---
## Install
```bash
cd c:\Users\Keith\transformerprime
pip install -r requirements.txt
```
Optional: `pip install bitsandbytes` for 4-bit/8-bit (keeps VRAM <10 GB on 1B models).
Run tests (from repo root): `PYTHONPATH=. pytest tests/`
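For reference, 4-bit loading in `transformers` is driven by a `BitsAndBytesConfig`. The sketch below shows the relevant knobs, assuming `build_pipeline(use_4bit=True)` constructs something equivalent internally (the exact plumbing lives in `src/text_to_audio/pipeline.py`):

```python
# Sketch: a typical 4-bit quantization config. Assumes bitsandbytes is
# installed; whether build_pipeline forwards model_kwargs exactly like
# this is an assumption for illustration.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 on Ampere+
)
model_kwargs = {"quantization_config": bnb_config, "device_map": "auto"}
```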
---
## Usage
### Python API
```python
from src.text_to_audio import build_pipeline, TextToAudioPipeline
import soundfile as sf

# Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
pipe = build_pipeline(preset="csm-1b")

# Low VRAM: 4-bit quantization
pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)

# Generate and profile (time, RTF, VRAM peak)
out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
sf.write("out.wav", out["audio"], out["sampling_rate"])
print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}

# Streaming-style chunks (post-generation chunking for playback)
for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
    pass  # play or write chunk
```
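`stream_chunks` does post-generation chunking: the full waveform is generated first, then sliced into fixed-duration pieces. The slicing logic is roughly this (a sketch, not the actual implementation):

```python
import numpy as np

def chunk_waveform(audio: np.ndarray, sampling_rate: int, chunk_duration_s: float = 0.5):
    """Yield fixed-duration slices of a mono waveform; the last chunk may be shorter."""
    step = max(1, int(sampling_rate * chunk_duration_s))
    for start in range(0, len(audio), step):
        yield audio[start:start + step], sampling_rate

# 2.2 s of silence at 24 kHz -> four full 0.5 s chunks plus a 0.2 s remainder
audio = np.zeros(int(24_000 * 2.2), dtype=np.float32)
chunks = list(chunk_waveform(audio, 24_000, 0.5))
```

Note that the last chunk may be shorter than `chunk_duration_s`, so playback code should not assume a fixed length.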
### CLI
```bash
python demo.py --cli --text "Your sentence here." --output out.wav
python demo.py --cli --model bark-small --quantize --output bark.wav
```
### Gradio
```bash
python demo.py --model csm-1b
# Open http://localhost:7860
```
---
## Expected performance (typical GPU)
| Preset         | GPU (example) | dtype    | VRAM (approx) | RTF (approx) | Note                  |
|----------------|---------------|----------|---------------|--------------|-----------------------|
| csm-1b         | RTX 4090      | bfloat16 | ~8–12 GB      | 0.1–0.3      | 1B, voice cloning     |
| csm-1b         | RTX 4090      | 4-bit    | ~6–8 GB       | 0.15–0.4     | Same, lower VRAM      |
| bark-small     | RTX 4070      | float32  | ~4–6 GB       | 0.2–0.5      | Smaller, multilingual |
| speecht5       | RTX 4060      | float32  | ~2–3 GB       | <0.2         | Spectrogram + vocoder |
| musicgen-small | RTX 4090      | float32  | ~8–10 GB      | 0.3–0.6      | Music/sfx             |
*RTF = real-time factor (wall time / audio duration; <1 = faster than real time).*
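For concreteness, RTF as used in the table is just generation wall time divided by audio duration:

```python
def rtf(time_s: float, duration_s: float) -> float:
    """Real-time factor: generation wall time / audio duration (<1 = faster than real time)."""
    if duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return time_s / duration_s

# e.g. ~1.2 s of wall time for 6 s of audio -> RTF ~0.2 (about 5x real time)
```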
---
## Tuning tips
- **Batch inference:** Pass a list of strings: `pipe.generate(["Text one.", "Text two."])`. Watch VRAM; reduce batch size or use 4-bit if OOM.
- **Longer generation:** Increase `max_new_tokens` in `generate_kwargs` (e.g. `pipe.generate(text, generate_kwargs={"max_new_tokens": 512})`). CSM/Bark/MusicGen use this; exact cap is model-specific.
- **Emotion / style:**
- **CSM:** Use chat template with reference audio for voice cloning and tone.
- **Bark:** Use semantic tokens / history if you customize inputs.
- **Dia (nari-labs/Dia-1.6B-0626):** Speaker tags `[S1]`/`[S2]` and non-verbal cues like `(laughs)`.
- **Stability:** Prefer `torch.bfloat16` on Ampere+; use `use_4bit=True` to stay under 8–10 GB VRAM.
- **CPU fallback:** Omit `device_map="auto"` and set `device=-1` in the pipeline if you have no GPU; generation will be slow.
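For the Dia tip above, a tiny hypothetical helper (not part of this repo's API) shows how `[S1]`/`[S2]` speaker tags and cues like `(laughs)` are interleaved into a single prompt string:

```python
def format_dia_dialogue(turns):
    """Alternate Dia speaker tags [S1]/[S2] over a list of utterances.

    Hypothetical convenience helper for illustration only.
    """
    tags = ("[S1]", "[S2]")
    return " ".join(f"{tags[i % 2]} {text.strip()}" for i, text in enumerate(turns))

prompt = format_dia_dialogue(["Hi there.", "Hello! (laughs)"])
# -> "[S1] Hi there. [S2] Hello! (laughs)"
```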
---
## Error handling and memory
- **OOM:** Enable 4-bit (`use_4bit=True`) or switch to a smaller preset (e.g. `bark-small`, `speecht5`).
- **Flash Attention 2:** If loading fails with an attn backend error, the code falls back to the default attention implementation for that model.
- **VRAM check:** `generate_with_profile()` returns `vram_peak_mb` when CUDA is available; use it to size batches and model.
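A common way to wire up the OOM advice is a retry-with-4-bit wrapper. This is a sketch that assumes CUDA OOM surfaces as a `RuntimeError` containing "out of memory"; a simulated loader stands in for `build_pipeline`:

```python
def load_with_oom_fallback(load_fn):
    """Try a full-precision load first; on CUDA OOM, retry with 4-bit enabled.

    load_fn(use_4bit=...) stands in for build_pipeline(preset=..., use_4bit=...).
    """
    try:
        return load_fn(use_4bit=False)
    except RuntimeError as exc:
        if "out of memory" not in str(exc).lower():
            raise  # unrelated error: re-raise unchanged
        return load_fn(use_4bit=True)

# Simulated loader: the full-precision attempt "OOMs", the 4-bit retry succeeds.
def fake_loader(use_4bit):
    if not use_4bit:
        raise RuntimeError("CUDA out of memory")
    return "pipeline(4bit)"

result = load_with_oom_fallback(fake_loader)
```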
---
## Presets
| Preset         | Model ID                 | Description                          |
|----------------|--------------------------|--------------------------------------|
| csm-1b         | sesame/csm-1b            | 1B conversational TTS, voice cloning |
| bark-small     | suno/bark-small          | Multilingual, non-verbal             |
| speecht5       | microsoft/speecht5_tts   | Multi-speaker (x-vector)             |
| musicgen-small | facebook/musicgen-small  | Music / SFX                          |
You can pass any `model_id` supported by the `transformers` text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.
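Conceptually, preset resolution is a plain lookup with an explicit `model_id` override; a sketch of that behavior (the real table lives in `src/text_to_audio/pipeline.py`):

```python
# Sketch of preset-to-model resolution; names mirror the table above.
PRESETS = {
    "csm-1b": "sesame/csm-1b",
    "bark-small": "suno/bark-small",
    "speecht5": "microsoft/speecht5_tts",
    "musicgen-small": "facebook/musicgen-small",
}

def resolve_model_id(preset=None, model_id=None):
    """An explicit model_id wins; otherwise the preset table is consulted."""
    if model_id:
        return model_id
    if preset not in PRESETS:
        raise KeyError(f"unknown preset: {preset!r}")
    return PRESETS[preset]
```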
---
## Code layout
- `src/text_to_audio/pipeline.py` β Pipeline wrapper, GPU opts, streaming chunks, memory profiling.
- `src/text_to_audio/__init__.py` β Exposes `build_pipeline`, `TextToAudioPipeline`, `list_presets`.
- `demo.py` β Gradio UI and CLI; `python demo.py` (Gradio) or `python demo.py --cli --text "..." --output out.wav`.
- `tests/test_pipeline.py` β Unit tests (no model download); run after `pip install -r requirements.txt`.