---

title: MusicSampler
emoji: 🎹
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Modular AudioGenerator for DAW-INVADER
---


# TransformerPrime Text-to-Audio Pipeline

**One-sentence justification:** We use the Hugging Face `text-to-audio` pipeline with **sesame/csm-1b** (a 1B-parameter conversational TTS model with a Llama backbone and voice cloning) as the default; it is natively supported in `transformers`, fits in under 10 GB of VRAM with 4-bit quantization, and supports `device_map="auto"`, bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50-series and A100/H100/B200 GPUs.

---

## Pipeline flow

```

┌─────────────┐     ┌──────────────────┐     ┌──────────────────────────────────────┐
│   Text      │────▶│  HF Processor /  │────▶│  AutoModelForTextToWaveform          │
│   input     │     │  Tokenizer       │     │  (CSM/Bark/SpeechT5/MusicGen/…)      │
└─────────────┘     └──────────────────┘     │  • device_map="auto"                 │
                                             │  • torch.bfloat16                    │
                                             │  • attn_implementation=flash_attn_2  │
                                             │  • optional: 4-bit (bitsandbytes)    │
                                             └──────────────────┬───────────────────┘
                                                                │
                    ┌───────────────────────────────────────────┘
                    ▼
          ┌──────────────────────┐     ┌─────────────────────┐
          │ Spectrogram models   │────▶│ Vocoder (HiFi-GAN)  │
          │ (e.g. SpeechT5)      │     │ → waveform          │
          └──────────────────────┘     └──────────┬──────────┘
                    │                             │
                    │  Waveform models (CSM, Bark, MusicGen)
                    └─────────────────────────────┼────────────┐
                                                  ▼            │
                                         ┌───────────────┐     │
                                         │  Waveform     │◀────┘
                                         │  (numpy/CPU)  │
                                         └───────┬───────┘
                                                 │
                    ┌────────────────────────────┼──────────────────────────┐
                    ▼                            ▼                          ▼
             stream_chunks()              save WAV (soundfile)        Gradio / CLI
             (fixed-duration slices)      + memory profile            demo

```

---

## Install

```bash
cd c:\Users\Keith\transformerprime
pip install -r requirements.txt
```

Optional: `pip install bitsandbytes` for 4-bit/8-bit (keeps VRAM <10 GB on 1B models).
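A back-of-envelope calculation shows why 4-bit quantization keeps a 1B-parameter model within budget. This is weight memory only; activations and the KV cache add overhead on top:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight-only memory footprint of a model in GB."""
    return num_params * bits_per_param / 8 / 1e9

# A 1B-parameter model: ~2 GB of weights in bfloat16 vs ~0.5 GB in 4-bit,
# leaving plenty of headroom for activations on an 8-10 GB card.
bf16_gb = weight_memory_gb(1e9, 16)  # 2.0
int4_gb = weight_memory_gb(1e9, 4)   # 0.5
```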

Run tests (from repo root): `PYTHONPATH=. pytest tests/`

---

## Usage

### Python API

```python
from src.text_to_audio import build_pipeline, TextToAudioPipeline
import soundfile as sf

# Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
pipe = build_pipeline(preset="csm-1b")

# Low VRAM: 4-bit quantization
pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)

# Generate and profile (time, RTF, VRAM peak)
out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
sf.write("out.wav", out["audio"], out["sampling_rate"])
print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}

# Streaming-style chunks (post-generation chunking for playback)
for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
    pass  # play or write chunk
```
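`stream_chunks` performs post-generation chunking rather than true incremental decoding. A minimal sketch of that slicing, assuming a mono numpy waveform (function and variable names here are illustrative, not the library's):

```python
import numpy as np

def chunk_waveform(audio, sampling_rate, chunk_duration_s=0.5):
    """Yield fixed-duration slices of a mono waveform; the last chunk may be shorter."""
    samples_per_chunk = int(sampling_rate * chunk_duration_s)
    for start in range(0, len(audio), samples_per_chunk):
        yield audio[start:start + samples_per_chunk], sampling_rate

# 2.25 s of a 440 Hz tone at 16 kHz -> five 0.5 s chunks, the last one 0.25 s long
sr = 16000
t = np.linspace(0.0, 2.25, int(sr * 2.25), endpoint=False)
audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
chunks = list(chunk_waveform(audio, sr, chunk_duration_s=0.5))
```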

### CLI

```bash
python demo.py --cli --text "Your sentence here." --output out.wav
python demo.py --cli --model bark-small --quantize --output bark.wav
```

### Gradio

```bash
python demo.py --model csm-1b
# Open http://localhost:7860
```

---

## Expected performance (typical GPU)

| Preset         | GPU (example) | dtype    | VRAM (approx) | RTF (approx) | Note                  |
|----------------|---------------|----------|---------------|--------------|-----------------------|
| csm-1b         | RTX 4090      | bfloat16 | ~8–12 GB      | 0.1–0.3      | 1B, voice cloning     |
| csm-1b         | RTX 4090      | 4-bit    | ~6–8 GB       | 0.15–0.4     | Same, lower VRAM      |
| bark-small     | RTX 4070      | float32  | ~4–6 GB       | 0.2–0.5      | Smaller, multilingual |
| speecht5       | RTX 4060      | float32  | ~2–3 GB       | <0.2         | Spectrogram + vocoder |
| musicgen-small | RTX 4090      | float32  | ~8–10 GB      | 0.3–0.6      | Music/sfx             |
*RTF = real-time factor (wall time / audio duration; <1 = faster than real time).*
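The RTF figures above can be reproduced from a profile with a one-liner (a sketch of the definition; `generate_with_profile` presumably computes it along these lines):

```python
def real_time_factor(wall_time_s, num_samples, sampling_rate):
    """RTF = wall-clock generation time / audio duration; < 1 means faster than real time."""
    return wall_time_s / (num_samples / sampling_rate)

# 3 s of audio (72,000 samples at 24 kHz) generated in 0.6 s of wall time -> RTF ~0.2
rtf = real_time_factor(0.6, 72_000, 24_000)
```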

---

## Tuning tips

- **Batch inference:** Pass a list of strings: `pipe.generate(["Text one.", "Text two."])`. Watch VRAM; reduce batch size or use 4-bit if OOM.
- **Longer generation:** Increase `max_new_tokens` in `generate_kwargs` (e.g. `pipe.generate(text, generate_kwargs={"max_new_tokens": 512})`). CSM/Bark/MusicGen use this; exact cap is model-specific.
- **Emotion / style:**  
  - **CSM:** Use chat template with reference audio for voice cloning and tone.  
  - **Bark:** Use semantic tokens / history if you customize inputs.  
  - **Dia (nari-labs/Dia-1.6B-0626):** Speaker tags `[S1]`/`[S2]` and non-verbal cues like `(laughs)`.
- **Stability:** Prefer `torch.bfloat16` on Ampere+; use `use_4bit=True` to stay under 8–10 GB VRAM.
- **CPU fallback:** Omit `device_map="auto"` and set `device=-1` in the pipeline if you have no GPU; generation will be slow.

---

## Error handling and memory

- **OOM:** Enable 4-bit (`use_4bit=True`) or switch to a smaller preset (e.g. `bark-small`, `speecht5`).
- **Flash Attention 2:** If loading fails with an attn backend error, the code falls back to the default attention implementation for that model.
- **VRAM check:** `generate_with_profile()` returns `vram_peak_mb` when CUDA is available; use it to size batches and model.
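The OOM fallback strategy can be wrapped generically. This is a hypothetical sketch (`with_fallback` is not part of the package) that tries configurations in order, from fastest to most memory-frugal; CUDA OOM surfaces as `torch.cuda.OutOfMemoryError`, a `RuntimeError` subclass:

```python
def with_fallback(build, configs):
    """Return the first successful result of build(**cfg), trying configs in order.

    `build` is any callable that raises on failure; `configs` is an ordered
    list of kwargs dicts, e.g. full precision first, then 4-bit, then a
    smaller preset.
    """
    last_err = None
    for cfg in configs:
        try:
            return build(**cfg)
        except (RuntimeError, MemoryError) as err:
            last_err = err
    raise last_err

# Example with a stub that "OOMs" unless 4-bit is requested
def stub_build(preset, use_4bit=False):
    if not use_4bit:
        raise RuntimeError("CUDA out of memory")
    return f"{preset} (4-bit)"

pipe = with_fallback(stub_build, [
    {"preset": "csm-1b"},
    {"preset": "csm-1b", "use_4bit": True},
])
# pipe == "csm-1b (4-bit)"
```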

---

## Presets

| Preset         | Model ID                | Description                          |
|----------------|-------------------------|--------------------------------------|
| csm-1b         | sesame/csm-1b           | 1B conversational TTS, voice cloning |
| bark-small     | suno/bark-small         | Multilingual, non-verbal             |
| speecht5       | microsoft/speecht5_tts  | Multi-speaker (x-vector)             |
| musicgen-small | facebook/musicgen-small | Music / SFX                          |

You can pass any `model_id` supported by the `transformers` text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.
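Under the hood a preset is just a short name for a model ID. A hypothetical sketch of that lookup (the real mapping lives in `src/text_to_audio` and is exposed via `list_presets`):

```python
# Illustrative mirror of the preset table above
PRESETS = {
    "csm-1b": "sesame/csm-1b",
    "bark-small": "suno/bark-small",
    "speecht5": "microsoft/speecht5_tts",
    "musicgen-small": "facebook/musicgen-small",
}

def resolve_model_id(preset=None, model_id=None):
    """An explicit model_id overrides the preset lookup."""
    if model_id is not None:
        return model_id
    return PRESETS[preset]
```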

---

## Code layout

- `src/text_to_audio/pipeline.py` — Pipeline wrapper, GPU opts, streaming chunks, memory profiling.
- `src/text_to_audio/__init__.py` — Exposes `build_pipeline`, `TextToAudioPipeline`, `list_presets`.
- `demo.py` — Gradio UI and CLI; `python demo.py` (Gradio) or `python demo.py --cli --text "..." --output out.wav`.
- `tests/test_pipeline.py` — Unit tests (no model download); run after `pip install -r requirements.txt`.