---
title: MusicSampler
emoji: 🎹
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Modular AudioGenerator for DAW-INVADER
---
# TransformerPrime Text-to-Audio Pipeline

**One-sentence justification:** We use the Hugging Face `text-to-audio` pipeline with **sesame/csm-1b** (1B-parameter conversational TTS, Llama backbone, voice cloning) as the default; it is natively supported in transformers, fits in **<8–10 GB VRAM** with 4-bit quantization, and supports `device_map="auto"`, bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50 and A100/H100/B200.

---
## Pipeline flow (ASCII)

```
+------------+     +-----------------+     +--------------------------------------+
| Text input |---->| HF Processor /  |---->| AutoModelForTextToWaveform           |
+------------+     | Tokenizer       |     | (CSM/Bark/SpeechT5/MusicGen/...)     |
                   +-----------------+     |  * device_map="auto"                 |
                                           |  * torch.bfloat16                    |
                                           |  * attn_implementation=flash_attn_2  |
                                           |  * optional: 4-bit (bitsandbytes)    |
                                           +-------------------+------------------+
                                                               |
               +-----------------------------------------------+
               |                                               |
               | (spectrogram models, e.g. SpeechT5)           | (waveform models:
               v                                               |  CSM, Bark, MusicGen)
    +---------------------+                                    |
    | Vocoder (HiFi-GAN)  |                                    |
    |   -> waveform       |                                    |
    +----------+----------+                                    |
               |                                               |
               +-----------------------+-----------------------+
                                       v
                              +----------------+
                              |    Waveform    |
                              |  (numpy/CPU)   |
                              +--------+-------+
                                       |
                    +------------------+-------------------+
                    v                  v                   v
            stream_chunks()    save WAV (soundfile)    Gradio / CLI
         (fixed-duration       + memory profile           demo
              slices)
```

---
## Install

```bash
cd c:\Users\Keith\transformerprime
pip install -r requirements.txt
```

Optional: `pip install bitsandbytes` for 4-bit/8-bit quantization (keeps VRAM under 10 GB on 1B models).

Run tests (from repo root): `PYTHONPATH=. pytest tests/`

---
## Usage

### Python API

```python
from src.text_to_audio import build_pipeline, TextToAudioPipeline
import soundfile as sf

# Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
pipe = build_pipeline(preset="csm-1b")

# Low VRAM: 4-bit quantization
pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)

# Generate and profile (time, RTF, VRAM peak)
out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
sf.write("out.wav", out["audio"], out["sampling_rate"])
print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}

# Streaming-style chunks (post-generation chunking for playback)
for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
    pass  # play or write each chunk
```
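Note that `stream_chunks()` performs post-generation slicing rather than token-level streaming. A minimal pure-Python sketch of that chunking logic (the standalone `chunk_waveform` helper and the list-based audio are illustrative; the real method slices the generated numpy waveform):

```python
def chunk_waveform(samples, sampling_rate, chunk_duration_s=0.5):
    """Yield fixed-duration slices of an already-generated waveform."""
    chunk_len = max(1, int(sampling_rate * chunk_duration_s))
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len], sampling_rate

# 1.2 s of "audio" at 8 kHz -> two 0.5 s chunks plus a 0.2 s remainder
audio = [0.0] * 9600
print([len(c) for c, _ in chunk_waveform(audio, 8000)])  # [4000, 4000, 1600]
```

The final chunk is simply shorter than `chunk_duration_s`, so consumers should not assume a fixed chunk length.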
### CLI

```bash
python demo.py --cli --text "Your sentence here." --output out.wav
python demo.py --cli --model bark-small --quantize --output bark.wav
```

### Gradio

```bash
python demo.py --model csm-1b
# Open http://localhost:7860
```

---
## Expected performance (typical GPU)

| Preset         | GPU (example) | dtype    | VRAM (approx) | RTF (approx) | Note                  |
|----------------|---------------|----------|---------------|--------------|-----------------------|
| csm-1b         | RTX 4090      | bfloat16 | ~8–12 GB      | 0.1–0.3      | 1B, voice cloning     |
| csm-1b         | RTX 4090      | 4-bit    | ~6–8 GB       | 0.15–0.4     | Same, lower VRAM      |
| bark-small     | RTX 4070      | float32  | ~4–6 GB       | 0.2–0.5      | Smaller, multilingual |
| speecht5       | RTX 4060      | float32  | ~2–3 GB       | <0.2         | Spectrogram + vocoder |
| musicgen-small | RTX 4090      | float32  | ~8–10 GB      | 0.3–0.6      | Music/SFX             |

*RTF = real-time factor (wall time / audio duration; <1 = faster than real time).*
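For reference, the RTF reported by `generate_with_profile()` boils down to wall time divided by generated audio duration. A minimal sketch of the measurement (`profile_generation` is a hypothetical helper, not part of the package):

```python
import time

def profile_generation(generate, text, sampling_rate):
    """Time a generation call and report its real-time factor (RTF)."""
    start = time.perf_counter()
    samples = generate(text)  # returns a sequence of audio samples
    wall_s = time.perf_counter() - start
    duration_s = len(samples) / sampling_rate
    return {"time_s": wall_s, "duration_s": duration_s, "rtf": wall_s / duration_s}

# Fake generator producing 2 s of silence at 16 kHz
stats = profile_generation(lambda text: [0.0] * 32000, "hello", 16000)
print(stats["duration_s"])  # 2.0
```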
---
## Tuning tips

- **Batch inference:** Pass a list of strings: `pipe.generate(["Text one.", "Text two."])`. Watch VRAM; reduce batch size or enable 4-bit if you hit OOM.
- **Longer generation:** Increase `max_new_tokens` in `generate_kwargs` (e.g. `pipe.generate(text, generate_kwargs={"max_new_tokens": 512})`). CSM/Bark/MusicGen honor this; the exact cap is model-specific.
- **Emotion / style:**
  - **CSM:** Use the chat template with reference audio for voice cloning and tone.
  - **Bark:** Use semantic tokens / history if you customize inputs.
  - **Dia (nari-labs/Dia-1.6B-0626):** Speaker tags `[S1]`/`[S2]` and non-verbal cues like `(laughs)`.
- **Stability:** Prefer `torch.bfloat16` on Ampere and newer GPUs; use `use_4bit=True` to stay under 8–10 GB VRAM.
- **CPU fallback:** Omit `device_map="auto"` and set `device=-1` in the pipeline if you have no GPU; generation will be slow.
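The batch-inference tip can be made robust with a halve-on-OOM retry loop. A hypothetical sketch (`generate_batch` stands in for a wrapper around `pipe.generate(list_of_texts)`; with torch you would catch `torch.cuda.OutOfMemoryError` rather than `MemoryError`):

```python
def generate_in_batches(generate_batch, texts, batch_size=4):
    """Generate in batches, halving the batch size whenever a batch OOMs."""
    results = []
    i = 0
    while i < len(texts):
        batch = texts[i:i + batch_size]
        try:
            results.extend(generate_batch(batch))
            i += batch_size
        except MemoryError:
            if batch_size == 1:
                raise  # even a single item does not fit; give up
            batch_size = max(1, batch_size // 2)  # halve and retry this batch
    return results
```

Because the loop only advances `i` on success, a failed batch is retried at the smaller size rather than skipped.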
---
## Error handling and memory

- **OOM:** Enable 4-bit (`use_4bit=True`) or switch to a smaller preset (e.g. `bark-small`, `speecht5`).
- **Flash Attention 2:** If loading fails with an attention-backend error, the code falls back to the model's default attention implementation.
- **VRAM check:** `generate_with_profile()` returns `vram_peak_mb` when CUDA is available; use it to size batches and choose a model.
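The Flash Attention 2 fallback described above can be sketched generically (`load_fn` stands in for the actual `from_pretrained` call; the exception types the real code catches may differ):

```python
def load_with_attn_fallback(load_fn, model_id):
    """Try Flash Attention 2 first; fall back to the default implementation."""
    try:
        return load_fn(model_id, attn_implementation="flash_attention_2")
    except (ImportError, ValueError):
        # flash-attn not installed, or unsupported for this model/GPU
        return load_fn(model_id)
```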
---
## Presets

| Preset         | Model ID                 | Description                          |
|----------------|--------------------------|--------------------------------------|
| csm-1b         | sesame/csm-1b            | 1B conversational TTS, voice cloning |
| bark-small     | suno/bark-small          | Multilingual, non-verbal cues        |
| speecht5       | microsoft/speecht5_tts   | Multi-speaker (x-vector)             |
| musicgen-small | facebook/musicgen-small  | Music / SFX                          |

You can pass any `model_id` supported by the `transformers` text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.
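For illustration, preset resolution can be thought of as a simple lookup with an explicit `model_id` taking precedence (the dict mirrors the table above; `resolve_model_id` is a hypothetical helper, not the package API):

```python
PRESETS = {
    "csm-1b": "sesame/csm-1b",
    "bark-small": "suno/bark-small",
    "speecht5": "microsoft/speecht5_tts",
    "musicgen-small": "facebook/musicgen-small",
}

def resolve_model_id(preset=None, model_id=None):
    """An explicit model_id wins; otherwise fall back to the preset table."""
    if model_id is not None:
        return model_id
    return PRESETS[preset]

print(resolve_model_id(preset="speecht5"))     # microsoft/speecht5_tts
print(resolve_model_id(model_id="org/model"))  # org/model
```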
---
## Code layout

- `src/text_to_audio/pipeline.py` – Pipeline wrapper, GPU options, streaming chunks, memory profiling.
- `src/text_to_audio/__init__.py` – Exposes `build_pipeline`, `TextToAudioPipeline`, `list_presets`.
- `demo.py` – Gradio UI and CLI; `python demo.py` (Gradio) or `python demo.py --cli --text "..." --output out.wav`.
- `tests/test_pipeline.py` – Unit tests (no model download); run after `pip install -r requirements.txt`.