---
title: MusicSampler
emoji: 🎹
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Modular AudioGenerator for DAW-INVADER
---
# TransformerPrime Text-to-Audio Pipeline
One-sentence justification: we use the Hugging Face text-to-audio pipeline with `sesame/csm-1b` (a 1B-parameter conversational TTS model with a Llama backbone and voice cloning) as the default because it is natively supported in `transformers`, fits in under 8–10 GB of VRAM with 4-bit quantization, and supports `device_map="auto"`, bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50 and A100/H100/B200 GPUs.
## Pipeline flow (ASCII)

```
+------------+     +------------------+     +----------------------------------------+
| Text input |---->| HF Processor /   |---->| AutoModelForTextToWaveform             |
+------------+     | Tokenizer        |     | (CSM/Bark/SpeechT5/MusicGen/...)       |
                   +------------------+     |  - device_map="auto"                   |
                                            |  - torch.bfloat16                      |
                                            |  - attn_implementation=flash_attn_2    |
                                            |  - optional: 4-bit (bitsandbytes)      |
                                            +-------------------+--------------------+
                                                                |
                        +---------------------------------------+
                        |                                       |
                        v                                       v
          +----------------------+          waveform models (CSM, Bark, MusicGen)
          | Spectrogram models   |                              |
          | (e.g. SpeechT5)      |                              |
          +----------+-----------+                              |
                     v                                          |
          +----------------------+                              |
          | Vocoder (HiFi-GAN)   |                              |
          | -> waveform          |                              |
          +----------+-----------+                              |
                     |                                          |
                     +--------------------+---------------------+
                                          v
                                  +---------------+
                                  |   Waveform    |
                                  |  (numpy/CPU)  |
                                  +-------+-------+
                                          |
               +--------------------------+--------------------------+
               v                          v                          v
        stream_chunks()         save WAV (soundfile)           Gradio / CLI
   (fixed-duration slices)      + memory profile               demo
```
## Install

```shell
cd c:\Users\Keith\transformerprime
pip install -r requirements.txt
```

Optional: `pip install bitsandbytes` for 4-bit/8-bit quantization (keeps VRAM under 10 GB on 1B models).

Run the tests from the repo root: `PYTHONPATH=. pytest tests/`
## Usage

### Python API

```python
from src.text_to_audio import build_pipeline, TextToAudioPipeline
import soundfile as sf

# Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
pipe = build_pipeline(preset="csm-1b")

# Low VRAM: 4-bit quantization
pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)

# Generate and profile (time, RTF, VRAM peak)
out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
sf.write("out.wav", out["audio"], out["sampling_rate"])
print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}

# Streaming-style chunks (post-generation chunking for playback)
for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
    pass  # play or write each chunk
```
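`stream_chunks` slices the finished waveform into fixed-duration pieces rather than streaming the model itself. A minimal sketch of that slicing logic, assuming a 1-D NumPy waveform (`chunk_waveform` is a hypothetical standalone helper for illustration, not the library's API):

```python
import numpy as np


def chunk_waveform(audio: np.ndarray, sampling_rate: int, chunk_duration_s: float = 0.5):
    """Yield (chunk, sampling_rate) pairs of at most chunk_duration_s seconds each.

    The final chunk may be shorter; no samples are padded or dropped.
    """
    samples_per_chunk = max(1, int(sampling_rate * chunk_duration_s))
    for start in range(0, len(audio), samples_per_chunk):
        yield audio[start:start + samples_per_chunk], sampling_rate
```

Because the chunking happens after generation, it adds no latency benefit; it only simplifies buffered playback and incremental writing.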
### CLI

```shell
python demo.py --cli --text "Your sentence here." --output out.wav
python demo.py --cli --model bark-small --quantize --output bark.wav
```
### Gradio

```shell
python demo.py --model csm-1b
# Open http://localhost:7860
```
## Expected performance (typical GPU)

| Preset | GPU (example) | dtype | VRAM (approx) | RTF (approx) | Note |
|---|---|---|---|---|---|
| csm-1b | RTX 4090 | bfloat16 | ~8–12 GB | 0.1–0.3 | 1B, voice cloning |
| csm-1b | RTX 4090 | 4-bit | ~6–8 GB | 0.15–0.4 | Same, lower VRAM |
| bark-small | RTX 4070 | float32 | ~4–6 GB | 0.2–0.5 | Smaller, multilingual |
| speecht5 | RTX 4060 | float32 | ~2–3 GB | <0.2 | Spectrogram + vocoder |
| musicgen-small | RTX 4090 | float32 | ~8–10 GB | 0.3–0.6 | Music/sfx |

RTF = real-time factor (wall time / audio duration; <1 = faster than real time).
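The RTF figures above can be reproduced by hand from a profile. A sketch of the calculation (`real_time_factor` is a hypothetical helper; the pipeline computes this internally):

```python
def real_time_factor(wall_time_s: float, num_samples: int, sampling_rate: int) -> float:
    """Wall-clock generation time divided by audio duration; < 1 is faster than real time."""
    duration_s = num_samples / sampling_rate
    return wall_time_s / duration_s
```

For example, generating 2 seconds of audio in 1 second of wall time gives an RTF of 0.5.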
## Tuning tips

- Batch inference: pass a list of strings: `pipe.generate(["Text one.", "Text two."])`. Watch VRAM; reduce the batch size or use 4-bit if you hit OOM.
- Longer generation: increase `max_new_tokens` in `generate_kwargs` (e.g. `pipe.generate(text, generate_kwargs={"max_new_tokens": 512})`). CSM/Bark/MusicGen use this; the exact cap is model-specific.
- Emotion / style:
  - CSM: use the chat template with reference audio for voice cloning and tone.
  - Bark: use semantic tokens / history if you customize inputs.
  - Dia (`nari-labs/Dia-1.6B-0626`): speaker tags `[S1]`/`[S2]` and non-verbal cues like `(laughs)`.
- Stability: prefer `torch.bfloat16` on Ampere and newer GPUs; use `use_4bit=True` to stay under 8–10 GB of VRAM.
- CPU fallback: omit `device_map="auto"` and set `device=-1` in the pipeline if you have no GPU; generation will be slow.
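For reference, a typical 4-bit setup in `transformers` looks like the following. This is a sketch under the assumption that `use_4bit=True` wraps a standard `BitsAndBytesConfig`; the exact parameter choices here are illustrative, not the repo's verified settings:

```python
import torch
from transformers import BitsAndBytesConfig

# Illustrative 4-bit config: NF4 quantization with bfloat16 compute,
# which typically keeps a 1B-parameter model well under 10 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# A wrapper would pass this as from_pretrained(..., quantization_config=bnb_config).
```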
## Error handling and memory

- OOM: enable 4-bit (`use_4bit=True`) or switch to a smaller preset (e.g. `bark-small`, `speecht5`).
- Flash Attention 2: if loading fails with an attention-backend error, the code falls back to the model's default attention implementation.
- VRAM check: `generate_with_profile()` returns `vram_peak_mb` when CUDA is available; use it to size batches and the model.
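The Flash Attention 2 fallback described above can be sketched generically. This is a hypothetical helper that takes any loader callable (the repo's actual fallback lives inside the pipeline code):

```python
def load_with_attn_fallback(loader, **kwargs):
    """Try Flash Attention 2 first; on a backend error, retry with default attention."""
    try:
        return loader(attn_implementation="flash_attention_2", **kwargs)
    except (ImportError, ValueError):
        # flash-attn not installed, or unsupported for this GPU/model combination
        return loader(**kwargs)
```

Catching only `ImportError` and `ValueError` keeps genuine failures (OOM, bad checkpoints) loud instead of silently retrying them.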
## Presets
| Preset | Model ID | Description |
|---|---|---|
| csm-1b | sesame/csm-1b | 1B conversational TTS, voice cloning |
| bark-small | suno/bark-small | Multilingual, non-verbal |
| speecht5 | microsoft/speecht5_tts | Multi-speaker (x-vector) |
| musicgen-small | facebook/musicgen-small | Music / SFX |
You can pass any `model_id` supported by the `transformers` text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.
## Code layout

- `src/text_to_audio/pipeline.py` – Pipeline wrapper, GPU options, streaming chunks, memory profiling.
- `src/text_to_audio/__init__.py` – Exposes `build_pipeline`, `TextToAudioPipeline`, `list_presets`.
- `demo.py` – Gradio UI and CLI; `python demo.py` (Gradio) or `python demo.py --cli --text "..." --output out.wav`.
- `tests/test_pipeline.py` – Unit tests (no model download); run after `pip install -r requirements.txt`.