---
title: MusicSampler
emoji: 🎹
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Modular AudioGenerator for DAW-INVADER
---

# TransformerPrime Text-to-Audio Pipeline

**One-sentence justification:** We use the Hugging Face `text-to-audio` pipeline with **sesame/csm-1b** (1B-param conversational TTS, Llama backbone, voice cloning) as the default; it is natively supported in transformers, fits in **<8–10 GB VRAM** with 4-bit quantization, and supports `device_map="auto"`, bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50 and A100/H100/B200.

---

## Pipeline flow (ASCII)

```
┌───────┐    ┌──────────────────┐    ┌─────────────────────────────────────┐
│ Text  │───▶│ HF Processor /   │───▶│ AutoModelForTextToWaveform          │
│ input │    │ Tokenizer        │    │ (CSM/Bark/SpeechT5/MusicGen/…)      │
└───────┘    └──────────────────┘    │ • device_map="auto"                 │
                                     │ • torch.bfloat16                    │
                                     │ • attn_implementation=flash_attn_2  │
                                     │ • optional: 4-bit (bitsandbytes)    │
                                     └─────────────────┬───────────────────┘
                                                       │
              ┌────────────────────────────────────────┴──────────┐
              ▼                                                   │ Waveform models
┌────────────────────────┐    ┌─────────────────────┐             │ (CSM, Bark, MusicGen)
│ Spectrogram models     │───▶│ Vocoder (HiFi-GAN)  │             │
│ (e.g. SpeechT5)        │    │ → waveform          │             │
└────────────────────────┘    └──────────┬──────────┘             │
                                         │                        │
                                         ▼                        │
                                 ┌───────────────┐                │
                                 │ Waveform      │◀───────────────┘
                                 │ (numpy/CPU)   │
                                 └───────┬───────┘
                                         │
                   ┌─────────────────────┼─────────────────────┐
                   ▼                     ▼                     ▼
            stream_chunks()      save WAV (soundfile)     Gradio / CLI
            (fixed-duration      + memory profile         demo
             slices)
```

---

## Install

```bash
cd c:\Users\Keith\transformerprime
pip install -r requirements.txt
```

Optional: `pip install bitsandbytes` for 4-bit/8-bit (keeps VRAM <10 GB on 1B models).

Run tests (from repo root): `PYTHONPATH=. pytest tests/`

---

## Usage

### Python API

```python
from src.text_to_audio import build_pipeline, TextToAudioPipeline
import soundfile as sf

# Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
pipe = build_pipeline(preset="csm-1b")

# Low VRAM: 4-bit quantization
pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)

# Generate and profile (time, RTF, VRAM peak)
out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
sf.write("out.wav", out["audio"], out["sampling_rate"])
print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}

# Streaming-style chunks (post-generation chunking for playback)
for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
    pass  # play or write chunk
```

### CLI

```bash
python demo.py --cli --text "Your sentence here." \
  --output out.wav
python demo.py --cli --model bark-small --quantize --output bark.wav
```

### Gradio

```bash
python demo.py --model csm-1b
# Open http://localhost:7860
```

---

## Expected performance (typical GPU)

| Preset         | GPU (example) | dtype    | VRAM (approx) | RTF (approx) | Note                  |
|----------------|---------------|----------|---------------|--------------|-----------------------|
| csm-1b         | RTX 4090      | bfloat16 | ~8–12 GB      | 0.1–0.3      | 1B, voice cloning     |
| csm-1b         | RTX 4090      | 4-bit    | ~6–8 GB       | 0.15–0.4     | Same, lower VRAM      |
| bark-small     | RTX 4070      | float32  | ~4–6 GB       | 0.2–0.5      | Smaller, multilingual |
| speecht5       | RTX 4060      | float32  | ~2–3 GB       | <0.2         | Spectrogram + vocoder |
| musicgen-small | RTX 4090      | float32  | ~8–10 GB      | 0.3–0.6      | Music/sfx             |

*RTF = real-time factor (wall time / audio duration; <1 = faster than real time).*

---

## Tuning tips

- **Batch inference:** Pass a list of strings: `pipe.generate(["Text one.", "Text two."])`. Watch VRAM; reduce batch size or use 4-bit if OOM.
- **Longer generation:** Increase `max_new_tokens` in `generate_kwargs` (e.g. `pipe.generate(text, generate_kwargs={"max_new_tokens": 512})`). CSM/Bark/MusicGen use this; the exact cap is model-specific.
- **Emotion / style:**
  - **CSM:** Use the chat template with reference audio for voice cloning and tone.
  - **Bark:** Use semantic tokens / history if you customize inputs.
  - **Dia (nari-labs/Dia-1.6B-0626):** Speaker tags `[S1]`/`[S2]` and non-verbal cues like `(laughs)`.
- **Stability:** Prefer `torch.bfloat16` on Ampere+; use `use_4bit=True` to stay under 8–10 GB VRAM.
- **CPU fallback:** Omit `device_map="auto"` and set `device=-1` in the pipeline if you have no GPU; generation will be slow.

---

## Error handling and memory

- **OOM:** Enable 4-bit (`use_4bit=True`) or switch to a smaller preset (e.g. `bark-small`, `speecht5`).
- **Flash Attention 2:** If loading fails with an attention-backend error, the code falls back to the default attention implementation for that model.
- **VRAM check:** `generate_with_profile()` returns `vram_peak_mb` when CUDA is available; use it to size batches and choose a model.

---

## Presets

| Preset         | Model ID                | Description                          |
|----------------|-------------------------|--------------------------------------|
| csm-1b         | sesame/csm-1b           | 1B conversational TTS, voice cloning |
| bark-small     | suno/bark-small         | Multilingual, non-verbal             |
| speecht5       | microsoft/speecht5_tts  | Multi-speaker (x-vector)             |
| musicgen-small | facebook/musicgen-small | Music / SFX                          |

You can pass any `model_id` supported by the `transformers` text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.

---

## Code layout

- `src/text_to_audio/pipeline.py` — Pipeline wrapper, GPU options, streaming chunks, memory profiling.
- `src/text_to_audio/__init__.py` — Exposes `build_pipeline`, `TextToAudioPipeline`, `list_presets`.
- `demo.py` — Gradio UI and CLI; `python demo.py` (Gradio) or `python demo.py --cli --text "..." --output out.wav`.
- `tests/test_pipeline.py` — Unit tests (no model download); run after `pip install -r requirements.txt`.
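The fixed-duration slicing behind `stream_chunks()` described above is model-independent and easy to sketch. Here is a minimal standalone version; `chunk_waveform` is a hypothetical helper for illustration, not the repo's implementation, and it assumes the generated audio is a 1-D numpy array:

```python
import numpy as np

def chunk_waveform(audio: np.ndarray, sampling_rate: int, chunk_duration_s: float = 0.5):
    """Yield fixed-duration slices of a 1-D waveform; the last chunk may be shorter."""
    samples_per_chunk = max(1, int(chunk_duration_s * sampling_rate))
    for start in range(0, len(audio), samples_per_chunk):
        yield audio[start:start + samples_per_chunk], sampling_rate

# Example: a 1.2 s waveform at 24 kHz splits into 0.5 s + 0.5 s + 0.2 s chunks
audio = np.zeros(int(1.2 * 24_000), dtype=np.float32)
chunks = [c for c, sr in chunk_waveform(audio, 24_000, chunk_duration_s=0.5)]
print([len(c) for c in chunks])  # [12000, 12000, 4800]
```

Because the slices are views into one already-generated array, this is playback-side chunking (as the README notes, "post-generation"), not true incremental decoding.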
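For sizing against the performance table above, the real-time factor is just wall time divided by generated audio duration. A minimal sketch of that arithmetic; the function name `real_time_factor` is illustrative, not part of the repo's API:

```python
def real_time_factor(wall_time_s: float, num_samples: int, sampling_rate: int) -> float:
    """RTF = generation wall time / audio duration; < 1.0 means faster than real time."""
    duration_s = num_samples / sampling_rate
    return wall_time_s / duration_s

# Example: 2 s of wall time to generate 10 s of audio at 24 kHz
rtf = real_time_factor(2.0, 10 * 24_000, 24_000)
print(rtf)  # 0.2
```

This matches the `rtf` field that `generate_with_profile()` reports alongside `time_s` and `duration_s`.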