---
license: apache-2.0
task_categories:
- audio-to-audio
- text-to-audio
- image-to-text
tags:
- music-generation
- magenta
- magenta-rt
- onnx
- burn
- llama-cpp
- performance-rnn
- melody-rnn
- drums-rnn
- improv-rnn
- polyphony-rnn
- musicvae
- groovae
- piano-genie
- ddsp
- gansynth
- nsynth
- coconet
- music-transformer
- onsets-and-frames
- spectrostream
- musiccoca
- synesthesia
- directml
- vulkan
- wgpu
- audio
- midi
language:
- en
library_name: onnxruntime
base_model:
- unsloth/gemma-3n-E2B-it
- google/magenta-realtime
---
# Synesthesia — AI Music Models
ONNX and GGUF model weights for [Synesthesia](https://github.com/kryptodogg/synesthesia),
a cyber-physical synthesizer, 3D/4D signal workstation, and multi-modal music AI app.
Synesthesia brings together every open-weights model from **Magenta Classic** and
**Magenta RT** under one repo, exportable to ONNX for local inference and continuously
fine-tunable via free Google Colab notebooks.
---
## Inference Runtimes
| Runtime | Models | Backend | Notes |
|---------|--------|---------|-------|
| **Burn wgpu** | DDSP, GANSynth, NSynth, Piano Genie | Vulkan / DX12 | Pure Rust, no ROCm required |
| **ORT + DirectML** | RNN family, MusicVAE, Coconet, Onsets & Frames | DirectML | Fallback while Burn op coverage matures |
| **llama.cpp + Vulkan** | Gemma-3N | Vulkan | Same stack as LM Studio, GGUF format |
| **Magenta RT (JAX)** | Magenta RT LLM, SpectroStream, MusicCoCa | TPU / GPU | Free Colab TPU v2-8 for inference + finetuning |
Vulkan works on AMD without ROCm on Windows 11. All runtimes target the RX 6700 XT.
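When falling back to ONNX Runtime, provider order decides which backend serves a session. A minimal sketch of the selection logic (the helper name and preference list are illustrative, not part of the repo):

```python
# Illustrative helper: keep only the execution providers we prefer, in
# preference order, from whatever ONNX Runtime reports as available.
PREFERRED = ["DmlExecutionProvider", "CPUExecutionProvider"]

def choose_providers(available):
    """Return preferred providers filtered to what is actually installed."""
    chosen = [p for p in PREFERRED if p in available]
    # ONNX Runtime always ships the CPU provider, so never return empty.
    return chosen or ["CPUExecutionProvider"]

# With onnxruntime installed, the result is passed straight to a session:
#   import onnxruntime as ort
#   sess = ort.InferenceSession(
#       "model_fp16.onnx",
#       providers=choose_providers(ort.get_available_providers()),
#   )
```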
---
## Model Inventory
### Magenta RT (Real-Time Audio Generation)
Magenta RT is a pipeline of three components: SpectroStream (audio codec),
MusicCoCa (style embeddings), and an encoder-decoder transformer LLM. Together
they form the only open-weights system that supports real-time, continuous
musical audio generation.
The LLM is an 800-million-parameter autoregressive transformer trained on
~190k hours of stock music — 38% fewer parameters than Stable Audio Open
and 77% fewer than MusicGen Large.
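As a quick sanity check on those ratios, the implied sizes of the comparison models follow from simple arithmetic (the derived figures are back-calculated from the percentages above, not quoted from the other models' cards):

```python
# Back out the implied parameter counts of the comparison models from the
# stated reductions: "X% fewer" means ours = theirs * (1 - X/100).
MAGENTA_RT_PARAMS_M = 800  # million parameters

def implied_size(fewer_by: float) -> int:
    """Implied size (millions) of the model Magenta RT is compared against."""
    return round(MAGENTA_RT_PARAMS_M / (1 - fewer_by))

print(implied_size(0.38))  # 1290 -> a ~1.3B-class model
print(implied_size(0.77))  # 3478 -> a ~3.5B-class model
```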
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MRT-001 | Magenta RT LLM | JAX / ONNX | Real-time stereo audio generation | Continuous live generation engine |
| MRT-002 | SpectroStream Encoder | ONNX | Audio → discrete tokens (48kHz stereo, 25Hz, 64 RVQ) | Audio tokenizer |
| MRT-003 | SpectroStream Decoder | ONNX | Tokens → 48kHz stereo audio | Audio detokenizer |
| MRT-004 | MusicCoCa Text | ONNX | Text → 768-dim music embedding | Text prompt → style control |
| MRT-005 | MusicCoCa Audio | ONNX | Audio → 768-dim music embedding | Audio prompt → style control |
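The SpectroStream figures above pin down the token budget the LLM must sustain; assuming one discrete token per RVQ level per frame, the arithmetic is:

```python
# Token throughput implied by the table above: 25 frames per second,
# 64 RVQ levels per frame, one discrete token per level.
FRAME_RATE_HZ = 25
RVQ_DEPTH = 64

tokens_per_second = FRAME_RATE_HZ * RVQ_DEPTH
print(tokens_per_second)  # 1600 tokens/s to decode in real time
```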
**Finetuning:** Free Colab TPU v2-8 via `Magenta_RT_Finetune.ipynb` lets you
customize the model on your own audio catalog. The official Colab demos support
live generation, finetuning, and live audio injection (mixing user audio with
model output and feeding the result back as context for the next generation
chunk).
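At its core, the audio-injection step is a per-sample mix before re-tokenization. A minimal sketch of the idea (pure Python, fixed mix gain; the function name is illustrative, not part of the Magenta RT API):

```python
def inject(user_chunk, model_chunk, gain=0.5):
    """Mix user audio into the model's last output chunk.

    The mixed chunk is what would be re-encoded by SpectroStream and fed
    back as context for the next generation step.
    """
    assert len(user_chunk) == len(model_chunk)
    return [gain * u + (1.0 - gain) * m
            for u, m in zip(user_chunk, model_chunk)]

print(inject([1.0, 0.0], [0.0, 1.0], gain=0.25))  # [0.25, 0.75]
```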
---
### Magenta Classic — MIDI / Symbolic
MusicRNN implements Magenta's LSTM-based language models:
MelodyRNN, DrumsRNN, ImprovRNN, and PerformanceRNN.
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MC-001 | Performance RNN | ONNX | Expressive MIDI performance generation | AI arpeggiator, live note generation |
| MC-002 | Melody RNN | ONNX | Melody continuation (LSTM) | Melody continuation tool |
| MC-003 | Drums RNN | ONNX | Drum pattern generation (LSTM) | Beat generation |
| MC-004 | Improv RNN | ONNX | Chord-conditioned melody generation | Live improv over chord progressions |
| MC-005 | Polyphony RNN | ONNX | Polyphonic music generation (BachBot) | Harmonic voice generation |
| MC-006 | MusicVAE | ONNX enc+dec | Latent music VAE — melody, drum, trio loops | Latent interpolation, style morphing |
| MC-007 | GrooVAE | ONNX enc+dec | Drum performance humanization | Humanize MIDI drums |
| MC-008 | MidiMe | ONNX | Personalize MusicVAE in-session | User-adaptive latent space |
| MC-009 | Music Transformer | ONNX | Long-form piano generation | Extended composition |
| MC-010 | Coconet | ONNX | Counterpoint by convolution — complete partial scores | Harmony / counterpoint filler |
---
### Magenta Classic — Audio / Timbre
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MA-001 | GANSynth | ONNX | GAN audio synthesis from NSynth timbres | GANHarp-style timbre instrument |
| MA-002 | NSynth | ONNX | WaveNet neural audio synthesis | Sample-level timbre generation |
| MA-003 | DDSP Encoder | ONNX | Audio β†’ harmonic + noise params | Timbre analysis |
| MA-004 | DDSP Decoder | ONNX | Harmonic params β†’ audio | Timbre resynthesis |
| MA-005 | Piano Genie | ONNX | 8-button → 88-key piano VQ-VAE | Accessible piano performance |
| MA-006 | Onsets and Frames | ONNX | Polyphonic piano transcription (audio → MIDI) | Audio → MIDI transcription |
| MA-007 | SPICE | ONNX | Pitch extraction from audio | Monophonic pitch tracking |
---
### LLM / Vision Control
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| LV-001 | Gemma-3N e2b-it | GGUF | Vision + text → structured JSON | Camera → mood/energy/key control |
**Format tiers:**
- `q4_k_m.gguf` — default (recommended, ~1.5GB)
- `q2_k.gguf` — lite tier (fastest, smallest)
- `f16.gguf` — full quality reference
**Runtime:** `llama-cpp-v3` Rust crate with Vulkan backend.
Same stack as LM Studio — no ROCm, no CUDA needed on Windows.
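Tier choice can be automated from the available memory budget. A hedged sketch (only the q4_k_m size is stated in this card; the other sizes are rough placeholder assumptions):

```python
# Illustrative tier selection for the Gemma-3N GGUF files above.
# Sizes are assumptions except q4_k_m (~1.5 GB, stated in the tier list).
TIERS = [  # (filename, approx size in GB), best quality first
    ("f16.gguf", 6.0),
    ("q4_k_m.gguf", 1.5),
    ("q2_k.gguf", 0.9),
]

def pick_tier(budget_gb: float) -> str:
    """Best-quality GGUF that fits the memory budget, else the smallest."""
    for name, size_gb in TIERS:
        if size_gb <= budget_gb:
            return name
    return TIERS[-1][0]

print(pick_tier(2.0))  # q4_k_m.gguf
```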
---
## Repository Structure
```
Ashiedu/Synesthesia/
│
├── manifest.json              ← authoritative model registry
│
├── magenta_rt/
│   ├── llm/                   ← MRT-001: JAX checkpoint + ONNX export
│   ├── spectrostream/
│   │   ├── encoder_fp32.onnx
│   │   ├── encoder_fp16.onnx
│   │   ├── decoder_fp32.onnx
│   │   └── decoder_fp16.onnx
│   └── musiccoca/
│       ├── text_fp32.onnx
│       ├── text_fp16.onnx
│       ├── audio_fp32.onnx
│       └── audio_fp16.onnx
│
├── midi/
│   ├── perfrnn/               ← MC-001: fp32 / fp16 / int8
│   ├── melody_rnn/            ← MC-002
│   ├── drums_rnn/             ← MC-003
│   ├── improv_rnn/            ← MC-004
│   ├── polyphony_rnn/         ← MC-005
│   ├── musicvae/              ← MC-006: encoder + decoder
│   ├── groovae/               ← MC-007
│   ├── midime/                ← MC-008
│   ├── music_transformer/     ← MC-009
│   └── coconet/               ← MC-010
│
├── audio/
│   ├── gansynth/              ← MA-001: fp32 / fp16
│   ├── nsynth/                ← MA-002
│   ├── ddsp/                  ← MA-003+004: encoder + decoder
│   ├── piano_genie/           ← MA-005
│   ├── onsets_and_frames/     ← MA-006
│   └── spice/                 ← MA-007
│
└── llm/
    └── gemma3n_e2b/
        ├── q4_k_m.gguf        ← LV-001: default
        ├── q2_k.gguf
        └── f16.gguf
```
Each subdirectory contains a `README.md` with input/output shapes,
export commands, and Burn compatibility status.
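Because the layout follows a predictable `<family>/<model_dir>/<tier>.onnx` pattern, client code can construct repo paths directly. A convenience sketch (`manifest.json` remains the authoritative registry; the helper name and the int8 restriction logic are illustrative):

```python
# Illustrative path builder matching the tree above, e.g.
# repo_path("midi", "perfrnn", "fp16") -> "midi/perfrnn/fp16.onnx"
MIDI_FAMILY = "midi"  # int8 exports exist only for the MIDI models

def repo_path(family: str, model_dir: str, tier: str = "fp16") -> str:
    if tier == "int8" and family != MIDI_FAMILY:
        raise ValueError("int8 tier is published for MIDI models only")
    return f"{family}/{model_dir}/{tier}.onnx"

print(repo_path("midi", "perfrnn"))  # midi/perfrnn/fp16.onnx
```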
---
## Quality Tiers (ONNX models)
| Tier | Suffix | VRAM est. | Use case |
|------|--------|-----------|----------|
| Full | `_fp32.onnx` | ~2–4× Half | Reference quality, CI validation |
| **Half** | `_fp16.onnx` | Baseline | **Default — recommended for RX 6700 XT** |
| Lite | `_int8.onnx` | ~0.5× Half | Lowest latency (MIDI models only) |
---
## Pulling Models in Rust
```rust
use hf_hub::api::sync::Api;

/// Download one file from the model repo, reusing the local cache
/// (~/.cache/huggingface/hub/) when the file is already present.
pub fn pull(repo_path: &str) -> anyhow::Result<std::path::PathBuf> {
    let api = Api::new()?;
    let repo = api.model("Ashiedu/Synesthesia".to_string());
    Ok(repo.get(repo_path)?)
}

fn main() -> anyhow::Result<()> {
    let path = pull("midi/perfrnn/fp16.onnx")?;
    println!("model at {}", path.display());
    Ok(())
}
```
## Pulling Models in Python
```python
from huggingface_hub import snapshot_download, hf_hub_download

# Pull everything
snapshot_download("Ashiedu/Synesthesia", local_dir="./models")

# Pull one file
hf_hub_download(
    repo_id="Ashiedu/Synesthesia",
    filename="midi/perfrnn/fp16.onnx",
    local_dir="./models",
)
```
---
## Export Workflow (Colab)
All models are exported from Colab and pushed here. The generic workflow:
```python
# 1. Pull existing checkpoint (if updating)
from huggingface_hub import snapshot_download
snapshot_download("Ashiedu/Synesthesia", local_dir="./models", token=HF_TOKEN)

# 2. Clone Magenta source
# !git clone https://github.com/magenta/magenta
# !git clone https://github.com/magenta/magenta-realtime

# 3. Export to ONNX (varies per model — see each model's README)
#    Magenta Classic: tf2onnx
#    Magenta RT: JAX → ONNX via jax2onnx or flax export
#    Gemma-3N: Unsloth → GGUF

# 4. Quantize
from onnxruntime.quantization import quantize_dynamic, QuantType
import onnxconverter_common as occ
import onnx

fp32 = onnx.load("model.onnx")
fp16 = occ.convert_float_to_float16(fp32, keep_io_types=True)
onnx.save(fp16, "model_fp16.onnx")
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

# 5. Push to HF
from huggingface_hub import HfApi
api = HfApi(token=HF_TOKEN)  # set HF_TOKEN in Colab Secrets
api.upload_file(
    path_or_fileobj="model_fp16.onnx",
    path_in_repo="midi/perfrnn/fp16.onnx",
    repo_id="Ashiedu/Synesthesia",
    commit_message="MC-001 Performance RNN fp16",
)
```
**Gemini on Colab:** Point Gemini at this README and the model's subdirectory
README as context. Gemini can execute the export + push workflow without
GitHub integration — it only needs Python and your HF token in Colab Secrets.
---
## Burn Compatibility Tracking
A weekly CI job attempts `burn-onnx ModelGen` on each exported model;
models migrate from the ORT fallback to Burn as op coverage matures.
| Model | Burn target | ORT fallback | Last checked |
|-------|------------|--------------|-------------|
| DDSP enc/dec | ✅ | ❌ | — |
| GANSynth | ✅ | ❌ | — |
| NSynth | ✅ | ❌ | — |
| Piano Genie | ✅ | ❌ | — |
| Performance RNN | 🔄 LSTM | ✅ | — |
| Melody RNN | 🔄 LSTM | ✅ | — |
| Drums RNN | 🔄 LSTM | ✅ | — |
| Improv RNN | 🔄 LSTM | ✅ | — |
| Polyphony RNN | 🔄 LSTM | ✅ | — |
| MusicVAE | 🔄 BiLSTM | ✅ | — |
| Coconet | 🔄 Conv | ✅ | — |
| Music Transformer | 🔄 Attention | ✅ | — |
| Onsets & Frames | 🔄 Conv+LSTM | ✅ | — |
| SpectroStream | 🔄 Conv | ✅ | — |
| MusicCoCa | 🔄 ViT+Transformer | ✅ | — |
| Gemma-3N | N/A — llama.cpp | ❌ | — |
---
## Training Philosophy
**Train after the app works.** The interface ships first. Training data
is determined by what the working app actually receives as input in practice.
Fine-tune on your own audio and MIDI once the signal chain is wired.
Tentative fine-tuning order once the app is functional:
1. Performance RNN — live MIDI from the Track Mixer
2. MusicVAE / GrooVAE — latent interpolation between patches
3. GANSynth — timbre generation from pitch + latent input
4. DDSP — resynthesis of GANSynth outputs
5. Magenta RT — full audio, conditioned on your own catalog
6. Gemma-3N — camera → mood/energy trained on your session recordings
---
## License
- Codebase: Apache 2.0
- Magenta Classic weights: Apache 2.0
- Magenta RT weights: Apache 2.0 with additional [bespoke terms](https://github.com/magenta/magenta-realtime/blob/main/LICENSE)
- Gemma-3N: [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
Individual model directories note any additional upstream license terms.
---
## Links
- **App:** [kryptodogg/synesthesia](https://github.com/kryptodogg/synesthesia)
- **Magenta RT:** [magenta/magenta-realtime](https://github.com/magenta/magenta-realtime)
- **Magenta Classic:** [magenta/magenta](https://github.com/magenta/magenta)
- **HF Model Card:** [google/magenta-realtime](https://huggingface.co/google/magenta-realtime)
- **Roadmap:** GitHub Issues — `lane:ml` label