---

title: MusicSampler
emoji: 🎹
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Modular AudioGenerator for DAW-INVADER
---


# TransformerPrime Text-to-Audio Pipeline

**One-sentence justification:** We use the Hugging Face `text-to-audio` pipeline with **sesame/csm-1b** (a 1B-parameter conversational TTS model with a Llama backbone and voice cloning) as the default; it is natively supported in `transformers`, fits in under 10 GB of VRAM with 4-bit quantization, and supports `device_map="auto"`, bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50-series and A100/H100/B200 GPUs.

---

## Pipeline flow

```

┌─────────────┐     ┌──────────────────┐     ┌──────────────────────────────────────┐
│   Text      │────▶│  HF Processor /  │────▶│  AutoModelForTextToWaveform          │
│   input     │     │  Tokenizer       │     │  (CSM/Bark/SpeechT5/MusicGen/…)      │
└─────────────┘     └──────────────────┘     │  • device_map="auto"                 │
                                             │  • torch.bfloat16                    │
                                             │  • attn_implementation=flash_attn_2  │
                                             │  • optional: 4-bit (bitsandbytes)    │
                                             └──────────────────┬───────────────────┘
                                                                │
                    ┌───────────────────────────────────────────┘
                    ▼
          ┌──────────────────────┐     ┌─────────────────────┐
          │ Spectrogram models   │────▶│ Vocoder (HiFi-GAN)  │
          │ (e.g. SpeechT5)      │     │ → waveform          │
          └──────────────────────┘     └──────────┬──────────┘
                    │                             │
                    │  Waveform models (CSM, Bark, MusicGen)
                    └─────────────────────────────┼────────────┐
                                                  ▼            │
                                         ┌───────────────┐     │
                                         │  Waveform     │◀────┘
                                         │  (numpy/CPU)  │
                                         └───────┬───────┘
                                                 │
                    ┌────────────────────────────┼──────────────────────────┐
                    ▼                            ▼                          ▼
             stream_chunks()              save WAV (soundfile)        Gradio / CLI
             (fixed-duration slices)      + memory profile            demo

```

---

## Install

```bash
cd c:\Users\Keith\transformerprime
pip install -r requirements.txt
```

Optional: `pip install bitsandbytes` for 4-bit/8-bit (keeps VRAM <10 GB on 1B models).
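A back-of-envelope calculation shows why 4-bit quantization keeps a 1B-parameter model within budget. This is weight memory only; activations and the KV cache add overhead on top:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight-only memory footprint of a model in GB."""
    return num_params * bits_per_param / 8 / 1e9

# A 1B-parameter model: ~2 GB of weights in bfloat16 vs ~0.5 GB in 4-bit,
# leaving plenty of headroom for activations on an 8-10 GB card.
bf16_gb = weight_memory_gb(1e9, 16)  # 2.0
int4_gb = weight_memory_gb(1e9, 4)   # 0.5
```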

Run tests (from repo root): `PYTHONPATH=. pytest tests/`

---

## Usage

### Python API

```python
from src.text_to_audio import build_pipeline, TextToAudioPipeline
import soundfile as sf

# Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
pipe = build_pipeline(preset="csm-1b")

# Low VRAM: 4-bit quantization
pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)

# Generate and profile (time, RTF, VRAM peak)
out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
sf.write("out.wav", out["audio"], out["sampling_rate"])
print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}

# Streaming-style chunks (post-generation chunking for playback)
for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
    pass  # play or write chunk
```
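`stream_chunks` performs post-generation chunking rather than true incremental decoding. A minimal sketch of that slicing, assuming a mono numpy waveform (function and variable names here are illustrative, not the library's):

```python
import numpy as np

def chunk_waveform(audio, sampling_rate, chunk_duration_s=0.5):
    """Yield fixed-duration slices of a mono waveform; the last chunk may be shorter."""
    samples_per_chunk = int(sampling_rate * chunk_duration_s)
    for start in range(0, len(audio), samples_per_chunk):
        yield audio[start:start + samples_per_chunk], sampling_rate

# 2.25 s of a 440 Hz tone at 16 kHz -> five 0.5 s chunks, the last one 0.25 s long
sr = 16000
t = np.linspace(0.0, 2.25, int(sr * 2.25), endpoint=False)
audio = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)
chunks = list(chunk_waveform(audio, sr, chunk_duration_s=0.5))
```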

### CLI

```bash
python demo.py --cli --text "Your sentence here." --output out.wav
python demo.py --cli --model bark-small --quantize --output bark.wav
```

### Gradio

```bash
python demo.py --model csm-1b
# Open http://localhost:7860
```

---

## Expected performance (typical GPU)

| Preset         | GPU (example) | dtype    | VRAM (approx) | RTF (approx) | Note                  |
|----------------|---------------|----------|---------------|--------------|-----------------------|
| csm-1b         | RTX 4090      | bfloat16 | ~8–12 GB      | 0.1–0.3      | 1B, voice cloning     |
| csm-1b         | RTX 4090      | 4-bit    | ~6–8 GB       | 0.15–0.4     | Same, lower VRAM      |
| bark-small     | RTX 4070      | float32  | ~4–6 GB       | 0.2–0.5      | Smaller, multilingual |
| speecht5       | RTX 4060      | float32  | ~2–3 GB       | <0.2         | Spectrogram + vocoder |
| musicgen-small | RTX 4090      | float32  | ~8–10 GB      | 0.3–0.6      | Music/sfx             |
*RTF = real-time factor (wall time / audio duration; <1 = faster than real time).*
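The RTF figures above can be reproduced from a profile with a one-liner (a sketch of the definition; `generate_with_profile` presumably computes it along these lines):

```python
def real_time_factor(wall_time_s, num_samples, sampling_rate):
    """RTF = wall-clock generation time / audio duration; < 1 means faster than real time."""
    return wall_time_s / (num_samples / sampling_rate)

# 3 s of audio (72,000 samples at 24 kHz) generated in 0.6 s of wall time -> RTF ~0.2
rtf = real_time_factor(0.6, 72_000, 24_000)
```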

---

## Tuning tips

- **Batch inference:** Pass a list of strings: `pipe.generate(["Text one.", "Text two."])`. Watch VRAM; reduce batch size or use 4-bit if OOM.
- **Longer generation:** Increase `max_new_tokens` in `generate_kwargs` (e.g. `pipe.generate(text, generate_kwargs={"max_new_tokens": 512})`). CSM/Bark/MusicGen use this; exact cap is model-specific.
- **Emotion / style:**  
  - **CSM:** Use chat template with reference audio for voice cloning and tone.  
  - **Bark:** Use semantic tokens / history if you customize inputs.  
  - **Dia (nari-labs/Dia-1.6B-0626):** Speaker tags `[S1]`/`[S2]` and non-verbal cues like `(laughs)`.
- **Stability:** Prefer `torch.bfloat16` on Ampere+; use `use_4bit=True` to stay under 8–10 GB VRAM.
- **CPU fallback:** Omit `device_map="auto"` and set `device=-1` in the pipeline if you have no GPU; generation will be slow.

---

## Error handling and memory

- **OOM:** Enable 4-bit (`use_4bit=True`) or switch to a smaller preset (e.g. `bark-small`, `speecht5`).
- **Flash Attention 2:** If loading fails with an attn backend error, the code falls back to the default attention implementation for that model.
- **VRAM check:** `generate_with_profile()` returns `vram_peak_mb` when CUDA is available; use it to size batches and model.
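The OOM fallback strategy can be wrapped generically. This is a hypothetical sketch (`with_fallback` is not part of the package) that tries configurations in order, from fastest to most memory-frugal; CUDA OOM surfaces as `torch.cuda.OutOfMemoryError`, a `RuntimeError` subclass:

```python
def with_fallback(build, configs):
    """Return the first successful result of build(**cfg), trying configs in order.

    `build` is any callable that raises on failure; `configs` is an ordered
    list of kwargs dicts, e.g. full precision first, then 4-bit, then a
    smaller preset.
    """
    last_err = None
    for cfg in configs:
        try:
            return build(**cfg)
        except (RuntimeError, MemoryError) as err:
            last_err = err
    raise last_err

# Example with a stub that "OOMs" unless 4-bit is requested
def stub_build(preset, use_4bit=False):
    if not use_4bit:
        raise RuntimeError("CUDA out of memory")
    return f"{preset} (4-bit)"

pipe = with_fallback(stub_build, [
    {"preset": "csm-1b"},
    {"preset": "csm-1b", "use_4bit": True},
])
# pipe == "csm-1b (4-bit)"
```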

---

## Presets

| Preset         | Model ID                | Description                          |
|----------------|-------------------------|--------------------------------------|
| csm-1b         | sesame/csm-1b           | 1B conversational TTS, voice cloning |
| bark-small     | suno/bark-small         | Multilingual, non-verbal             |
| speecht5       | microsoft/speecht5_tts  | Multi-speaker (x-vector)             |
| musicgen-small | facebook/musicgen-small | Music / SFX                          |

You can pass any `model_id` supported by the `transformers` text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.
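Under the hood a preset is just a short name for a model ID. A hypothetical sketch of that lookup (the real mapping lives in `src/text_to_audio` and is exposed via `list_presets`):

```python
# Illustrative mirror of the preset table above
PRESETS = {
    "csm-1b": "sesame/csm-1b",
    "bark-small": "suno/bark-small",
    "speecht5": "microsoft/speecht5_tts",
    "musicgen-small": "facebook/musicgen-small",
}

def resolve_model_id(preset=None, model_id=None):
    """An explicit model_id overrides the preset lookup."""
    if model_id is not None:
        return model_id
    return PRESETS[preset]
```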

---

## Code layout

- `src/text_to_audio/pipeline.py` — Pipeline wrapper, GPU opts, streaming chunks, memory profiling.
- `src/text_to_audio/__init__.py` — Exposes `build_pipeline`, `TextToAudioPipeline`, `list_presets`.
- `demo.py` — Gradio UI and CLI; `python demo.py` (Gradio) or `python demo.py --cli --text "..." --output out.wav`.
- `tests/test_pipeline.py` — Unit tests (no model download); run after `pip install -r requirements.txt`.