---
title: MusicSampler
emoji: 🎹
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Modular AudioGenerator for DAW-INVADER
---

TransformerPrime Text-to-Audio Pipeline

One-sentence justification: We use the Hugging Face text-to-audio pipeline with sesame/csm-1b (1B-param conversational TTS, Llama backbone, voice cloning) as the default; it is natively supported in transformers, fits in <8–10 GB VRAM with 4-bit quantization, and supports device_map="auto", bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50 and A100/H100/B200.
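
The options named above map onto standard `transformers` loading arguments. A minimal sketch of assembling them with the same defaults (`model_load_kwargs` is a hypothetical helper for illustration, not part of this repo):

```python
def model_load_kwargs(use_4bit: bool = False, flash_attn: bool = False) -> dict:
    """Assemble the GPU loading options described above (hypothetical helper).

    The real build_pipeline() may name or combine these differently;
    this only shows which knobs the README's defaults correspond to.
    """
    kwargs = {
        "device_map": "auto",        # spread layers across available devices
        "torch_dtype": "bfloat16",   # stable half precision on Ampere and newer
    }
    if flash_attn:
        # only when the model/GPU pair supports it; see fallback note below
        kwargs["attn_implementation"] = "flash_attention_2"
    if use_4bit:
        # requires bitsandbytes; keeps a 1B model under ~8-10 GB VRAM
        kwargs["load_in_4bit"] = True
    return kwargs

# e.g. pipeline("text-to-audio", model="sesame/csm-1b", **model_load_kwargs())
```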


Pipeline flow (ASCII)

```
┌─────────────┐     ┌──────────────────┐     ┌───────────────────────────────────────┐
│   Text      │────▶│  HF Processor /  │────▶│  AutoModelForTextToWaveform           │
│   input     │     │  Tokenizer       │     │  (CSM/Bark/SpeechT5/MusicGen/…)       │
└─────────────┘     └──────────────────┘     │  • device_map="auto"                  │
                                             │  • torch.bfloat16                     │
                                             │  • attn_implementation=flash_attn_2   │
                                             │  • optional: 4-bit (bitsandbytes)     │
                                             └──────────────────┬────────────────────┘
                                                                │
                    ┌───────────────────────────────────────────┘
                    ▼
         ┌──────────────────────┐     ┌─────────────────────┐
         │ Spectrogram models   │────▶│ Vocoder (HiFi-GAN)  │
         │ (e.g. SpeechT5)      │     │ → waveform          │
         └──────────────────────┘     └──────────┬──────────┘
                    │                            │
                    │  Waveform models (CSM, Bark, MusicGen)
                    └────────────────────────────┼─────────────┐
                                                 ▼             │
                                        ┌───────────────┐      │
                                        │  Waveform     │◀────┘
                                        │  (numpy/CPU)  │
                                        └───────┬───────┘
                                                │
                    ┌───────────────────────────┼──────────────────────────┐
                    ▼                           ▼                          ▼
             stream_chunks()             save WAV (soundfile)        Gradio / CLI
             (fixed-duration slices)     + memory profile            demo
```

Install

```bash
cd transformerprime   # repo root
pip install -r requirements.txt
```

Optional: pip install bitsandbytes for 4-bit/8-bit (keeps VRAM <10 GB on 1B models).

Run tests (from repo root): PYTHONPATH=. pytest tests/


Usage

Python API

```python
from src.text_to_audio import build_pipeline, TextToAudioPipeline
import soundfile as sf

# Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
pipe = build_pipeline(preset="csm-1b")

# Low VRAM: 4-bit quantization
pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)

# Generate and profile (time, RTF, VRAM peak)
out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
sf.write("out.wav", out["audio"], out["sampling_rate"])
print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}

# Streaming-style chunks (post-generation chunking for playback)
for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
    pass  # play or write chunk
```
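
`stream_chunks()` slices an already-generated waveform rather than streaming tokens as they decode. A re-implementation sketch of that fixed-duration slicing (pure-Python stand-in, not the library code):

```python
def chunk_waveform(audio, sampling_rate, chunk_duration_s=0.5):
    """Yield fixed-duration slices of an already-generated waveform.

    Mirrors the post-generation chunking stream_chunks() describes;
    `audio` is any indexable sequence of samples (list or numpy array).
    """
    step = int(sampling_rate * chunk_duration_s)  # samples per chunk
    for start in range(0, len(audio), step):
        yield audio[start:start + step], sampling_rate

# 2 s of silence at 16 kHz in 0.5 s chunks -> 4 slices of 8000 samples
chunks = list(chunk_waveform([0.0] * 32000, 16000, chunk_duration_s=0.5))
```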

CLI

```bash
python demo.py --cli --text "Your sentence here." --output out.wav
python demo.py --cli --model bark-small --quantize --output bark.wav
```

Gradio

```bash
python demo.py --model csm-1b
# Open http://localhost:7860
```

Expected performance (typical GPU)

| Preset | GPU (example) | dtype | VRAM (approx) | RTF (approx) | Note |
|---|---|---|---|---|---|
| csm-1b | RTX 4090 | bfloat16 | ~8–12 GB | 0.1–0.3 | 1B, voice cloning |
| csm-1b | RTX 4090 | 4-bit | ~6–8 GB | 0.15–0.4 | Same, lower VRAM |
| bark-small | RTX 4070 | float32 | ~4–6 GB | 0.2–0.5 | Smaller, multilingual |
| speecht5 | RTX 4060 | float32 | ~2–3 GB | <0.2 | Spectrogram + vocoder |
| musicgen-small | RTX 4090 | float32 | ~8–10 GB | 0.3–0.6 | Music/sfx |

RTF = real-time factor (wall time / audio duration; <1 = faster than real time).
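
A worked instance of that definition (the timings here are hypothetical, not measurements):

```python
def real_time_factor(wall_time_s: float, audio_duration_s: float) -> float:
    """RTF = wall-clock generation time / duration of the generated audio.

    Values below 1 mean the model synthesizes faster than real time.
    """
    return wall_time_s / audio_duration_s

# 1.2 s of compute for 6 s of audio -> RTF 0.2 (5x faster than real time)
rtf = real_time_factor(1.2, 6.0)
```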


Tuning tips

  • Batch inference: Pass a list of strings: pipe.generate(["Text one.", "Text two."]). Watch VRAM; reduce batch size or use 4-bit if OOM.
  • Longer generation: Increase max_new_tokens in generate_kwargs (e.g. pipe.generate(text, generate_kwargs={"max_new_tokens": 512})). CSM/Bark/MusicGen use this; exact cap is model-specific.
  • Emotion / style:
    • CSM: Use chat template with reference audio for voice cloning and tone.
    • Bark: Use semantic tokens / history if you customize inputs.
    • Dia (nari-labs/Dia-1.6B-0626): Speaker tags [S1]/[S2] and non-verbal cues like (laughs).
  • Stability: Prefer torch.bfloat16 on Ampere+; use use_4bit=True to stay under 8–10 GB VRAM.
  • CPU fallback: Omit device_map="auto" and set device=-1 in the pipeline if you have no GPU; generation will be slow.

Error handling and memory

  • OOM: Enable 4-bit (use_4bit=True) or switch to a smaller preset (e.g. bark-small, speecht5).
  • Flash Attention 2: If loading fails with an attn backend error, the code falls back to the default attention implementation for that model.
  • VRAM check: generate_with_profile() returns vram_peak_mb when CUDA is available; use it to size batches and model.
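
One way to act on these points together is a preset fallback ladder: try the preferred preset and step down to cheaper ones on failure. A sketch assuming a `build_pipeline`-like callable (the loader and the stub below are stand-ins, not repo code):

```python
def load_with_fallback(loader, presets=("csm-1b", "bark-small", "speecht5")):
    """Try each preset in order; return (preset, pipeline) for the first that loads.

    `loader` stands in for build_pipeline(preset=...). Any failure (in
    practice you would catch torch.cuda.OutOfMemoryError specifically)
    moves on to the next, smaller preset.
    """
    last_error = None
    for preset in presets:
        try:
            return preset, loader(preset)
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"all presets failed: {last_error}")

# Stub loader that "OOMs" on the 1B model, for illustration:
def stub(preset):
    if preset == "csm-1b":
        raise MemoryError("CUDA out of memory")
    return f"pipeline<{preset}>"
```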

Presets

| Preset | Model ID | Description |
|---|---|---|
| csm-1b | sesame/csm-1b | 1B conversational TTS, voice cloning |
| bark-small | suno/bark-small | Multilingual, non-verbal |
| speecht5 | microsoft/speecht5_tts | Multi-speaker (x-vector) |
| musicgen-small | facebook/musicgen-small | Music / SFX |

You can pass any `model_id` supported by the transformers text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.
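
The preset table above amounts to a small registry. A sketch of how a preset and an explicit model_id can be resolved (function name hypothetical; the mapping mirrors the table):

```python
PRESETS = {
    "csm-1b": "sesame/csm-1b",
    "bark-small": "suno/bark-small",
    "speecht5": "microsoft/speecht5_tts",
    "musicgen-small": "facebook/musicgen-small",
}

def resolve_model_id(preset=None, model_id=None):
    """An explicit model_id wins; otherwise look the preset up in the table."""
    if model_id:
        return model_id
    if preset in PRESETS:
        return PRESETS[preset]
    raise ValueError(f"unknown preset: {preset!r}")
```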


Code layout

  • src/text_to_audio/pipeline.py β€” Pipeline wrapper, GPU opts, streaming chunks, memory profiling.
  • src/text_to_audio/__init__.py β€” Exposes build_pipeline, TextToAudioPipeline, list_presets.
  • demo.py β€” Gradio UI and CLI; python demo.py (Gradio) or python demo.py --cli --text "..." --output out.wav.
  • tests/test_pipeline.py β€” Unit tests (no model download); run after pip install -r requirements.txt.