---
title: MusicSampler
emoji: 🎹
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Modular AudioGenerator for DAW-INVADER
---
# TransformerPrime Text-to-Audio Pipeline
**One-sentence justification:** We use the Hugging Face `text-to-audio` pipeline with **sesame/csm-1b** (a 1B-parameter conversational TTS model with a Llama backbone and voice cloning) as the default; it is natively supported in `transformers`, fits in under **8–10 GB VRAM** with 4-bit quantization, and supports `device_map="auto"`, bfloat16, and optional Flash Attention 2 for fast inference on RTX 40/50-series and A100/H100/B200 GPUs.
---
## Pipeline flow (ASCII)
```
┌─────────────┐     ┌──────────────────┐     ┌──────────────────────────────────────┐
│    Text     │────▶│  HF Processor /  │────▶│ AutoModelForTextToWaveform           │
│    input    │     │  Tokenizer       │     │ (CSM/Bark/SpeechT5/MusicGen/…)       │
└─────────────┘     └──────────────────┘     │ • device_map="auto"                  │
                                             │ • torch.bfloat16                     │
                                             │ • attn_implementation=flash_attn_2   │
                                             │ • optional: 4-bit (bitsandbytes)     │
                                             └──────────────────┬───────────────────┘
                                                                │
                     ┌──────────────────────────────────────────┘
                     ▼
┌──────────────────────┐     ┌─────────────────────┐
│  Spectrogram models  │────▶│  Vocoder (HiFi-GAN) │
│  (e.g. SpeechT5)     │     │  → waveform         │
└──────────────────────┘     └──────────┬──────────┘
                                        │
   Waveform models (CSM, Bark, MusicGen)│
        │                               │
        │      ┌───────────────┐        │
        └─────▶│   Waveform    │◀───────┘
               │  (numpy/CPU)  │
               └───────┬───────┘
                       │
       ┌───────────────┼────────────────────┐
       ▼               ▼                    ▼
stream_chunks()   save WAV (soundfile)   Gradio / CLI
(fixed-duration   + memory profile       demo
 slices)
```
---
## Install
```bash
cd path/to/transformerprime   # repo root
pip install -r requirements.txt
```
Optional: `pip install bitsandbytes` for 4-bit/8-bit quantization (keeps VRAM under 10 GB on 1B-parameter models).
Run tests (from repo root): `PYTHONPATH=. pytest tests/`
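Before enabling 4-bit loading, you can check that the optional dependency is actually importable. A minimal sketch (the only assumption is the `bitsandbytes` package name itself):

```python
import importlib.util

# True only if bitsandbytes is installed; 4-bit loading will fail without it.
has_bnb = importlib.util.find_spec("bitsandbytes") is not None
print("4-bit quantization available:", has_bnb)
```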
---
## Usage
### Python API
```python
from src.text_to_audio import build_pipeline, TextToAudioPipeline
import soundfile as sf

# Default: CSM-1B, bfloat16, device_map="auto", Flash Attention 2 when supported
pipe = build_pipeline(preset="csm-1b")

# Low VRAM: 4-bit quantization
pipe_low = build_pipeline(preset="csm-1b", use_4bit=True)

# Generate and profile (time, RTF, VRAM peak)
out, profile = pipe.generate_with_profile("Hello from TransformerPrime. This runs on GPU.")
sf.write("out.wav", out["audio"], out["sampling_rate"])
print(profile)  # {"time_s": ..., "rtf": ..., "vram_peak_mb": ..., "duration_s": ...}

# Streaming-style chunks (post-generation chunking for playback)
for chunk, sr in pipe.stream_chunks("Long text here.", chunk_duration_s=0.5):
    pass  # play or write chunk
```
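`stream_chunks()` performs post-generation chunking: it slices the finished waveform into fixed-duration pieces. A standalone sketch of that idea in pure NumPy (`chunk_waveform` is an illustrative name, not the repo's implementation):

```python
import numpy as np

def chunk_waveform(audio: np.ndarray, sr: int, chunk_duration_s: float = 0.5):
    """Yield (chunk, sr) slices of fixed duration; the last chunk may be shorter."""
    step = max(1, int(sr * chunk_duration_s))
    for start in range(0, len(audio), step):
        yield audio[start:start + step], sr

# 2 s of silence at 16 kHz -> four 0.5 s chunks of 8000 samples each
chunks = list(chunk_waveform(np.zeros(32000, dtype=np.float32), 16000, 0.5))
print(len(chunks))  # 4
```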
### CLI
```bash
python demo.py --cli --text "Your sentence here." --output out.wav
python demo.py --cli --model bark-small --quantize --output bark.wav
```
### Gradio
```bash
python demo.py --model csm-1b
# Open http://localhost:7860
```
---
## Expected performance (typical GPU)
| Preset         | GPU (example) | dtype    | VRAM (approx) | RTF (approx) | Note                  |
|----------------|---------------|----------|---------------|--------------|-----------------------|
| csm-1b         | RTX 4090      | bfloat16 | ~8–12 GB      | 0.1–0.3      | 1B, voice cloning     |
| csm-1b         | RTX 4090      | 4-bit    | ~6–8 GB       | 0.15–0.4     | Same, lower VRAM      |
| bark-small     | RTX 4070      | float32  | ~4–6 GB       | 0.2–0.5      | Smaller, multilingual |
| speecht5       | RTX 4060      | float32  | ~2–3 GB       | <0.2         | Spectrogram + vocoder |
| musicgen-small | RTX 4090      | float32  | ~8–10 GB      | 0.3–0.6      | Music/sfx             |
*RTF = real-time factor (wall time / audio duration; <1 = faster than real time).*
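The RTF column follows directly from the definition above; a small helper makes the arithmetic explicit (the function name is illustrative, and the inputs correspond to the `time_s` and `duration_s` fields of the profile dict shown earlier):

```python
def real_time_factor(time_s: float, duration_s: float) -> float:
    """RTF = wall-clock generation time / audio duration; <1 means faster than real time."""
    if duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return time_s / duration_s

# e.g. 2.0 s of wall time to generate 8.0 s of audio -> RTF 0.25 (4x real time)
print(real_time_factor(2.0, 8.0))  # 0.25
```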
---
## Tuning tips
- **Batch inference:** Pass a list of strings: `pipe.generate(["Text one.", "Text two."])`. Watch VRAM; reduce batch size or use 4-bit if OOM.
- **Longer generation:** Increase `max_new_tokens` in `generate_kwargs` (e.g. `pipe.generate(text, generate_kwargs={"max_new_tokens": 512})`). CSM/Bark/MusicGen use this; exact cap is model-specific.
- **Emotion / style:**
- **CSM:** Use chat template with reference audio for voice cloning and tone.
- **Bark:** Use semantic tokens / history if you customize inputs.
- **Dia (nari-labs/Dia-1.6B-0626):** Speaker tags `[S1]`/`[S2]` and non-verbal cues like `(laughs)`.
- **Stability:** Prefer `torch.bfloat16` on Ampere+; use `use_4bit=True` to stay under 8–10 GB VRAM.
- **CPU fallback:** Omit `device_map="auto"` and set `device=-1` in the pipeline if you have no GPU; generation will be slow.
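The fallback order implied by these tips (full batch → smaller batch or 4-bit → smaller preset) can be written as a generic retry helper. This sketch knows nothing about the repo's pipeline; it detects a CUDA OOM by the conventional "out of memory" text in the `RuntimeError` message, and `retry_on_oom` is an illustrative name:

```python
def retry_on_oom(attempts):
    """Run callables in order; on a CUDA out-of-memory RuntimeError, try the next one."""
    if not attempts:
        raise ValueError("no attempts given")
    last_err = None
    for attempt in attempts:
        try:
            return attempt()
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # not an OOM error; surface it unchanged
            last_err = err
    raise last_err

def full_batch():
    raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")

def half_batch():
    return "ok with smaller batch"

print(retry_on_oom([full_batch, half_batch]))  # ok with smaller batch
```

In practice each callable would wrap a `pipe.generate(...)` call with progressively cheaper settings.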
---
## Error handling and memory
- **OOM:** Enable 4-bit (`use_4bit=True`) or switch to a smaller preset (e.g. `bark-small`, `speecht5`).
- **Flash Attention 2:** If loading fails with an attn backend error, the code falls back to the default attention implementation for that model.
- **VRAM check:** `generate_with_profile()` returns `vram_peak_mb` when CUDA is available; use it to size batches and model.
---
## Presets
| Preset         | Model ID                | Description                          |
|----------------|-------------------------|--------------------------------------|
| csm-1b         | sesame/csm-1b           | 1B conversational TTS, voice cloning |
| bark-small     | suno/bark-small         | Multilingual, non-verbal             |
| speecht5       | microsoft/speecht5_tts  | Multi-speaker (x-vector)             |
| musicgen-small | facebook/musicgen-small | Music / SFX                          |
You can pass any `model_id` supported by the `transformers` text-to-audio pipeline via `build_pipeline(model_id="org/model-name")`.
---
## Code layout
- `src/text_to_audio/pipeline.py` β€” Pipeline wrapper, GPU opts, streaming chunks, memory profiling.
- `src/text_to_audio/__init__.py` β€” Exposes `build_pipeline`, `TextToAudioPipeline`, `list_presets`.
- `demo.py` β€” Gradio UI and CLI; `python demo.py` (Gradio) or `python demo.py --cli --text "..." --output out.wav`.
- `tests/test_pipeline.py` β€” Unit tests (no model download); run after `pip install -r requirements.txt`.