---
license: apache-2.0
task_categories:
- audio-to-audio
- text-to-audio
- image-to-text
tags:
- music-generation
- magenta
- magenta-rt
- onnx
- burn
- llama-cpp
- performance-rnn
- melody-rnn
- drums-rnn
- improv-rnn
- polyphony-rnn
- musicvae
- groovae
- piano-genie
- ddsp
- gansynth
- nsynth
- coconet
- music-transformer
- onsets-and-frames
- spectrostream
- musiccoca
- synesthesia
- directml
- vulkan
- wgpu
- audio
- midi
language:
- en
library_name: onnxruntime
base_model:
- unsloth/gemma-3n-E2B-it
- google/magenta-realtime
---

# Synesthesia — AI Music Models

ONNX and GGUF model weights for [Synesthesia](https://github.com/kryptodogg/synesthesia), a cyber-physical synthesizer, 3D/4D signal workstation, and multi-modal music AI app.

Synesthesia brings together every open-weights model from **Magenta Classic** and **Magenta RT** under one repo, exportable to ONNX for local inference and continuously fine-tunable via free Google Colab notebooks.

---

## Inference Runtimes

| Runtime | Models | Backend | Notes |
|---------|--------|---------|-------|
| **Burn wgpu** | DDSP, GANSynth, NSynth, Piano Genie | Vulkan / DX12 | Pure Rust, no ROCm required |
| **ORT + DirectML** | RNN family, MusicVAE, Coconet, Onsets & Frames | DirectML | Fallback while Burn op coverage matures |
| **llama.cpp + Vulkan** | Gemma-3N | Vulkan | Same stack as LM Studio, GGUF format |
| **Magenta RT (JAX)** | Magenta RT LLM, SpectroStream, MusicCoCa | TPU / GPU | Free Colab TPU v2-8 for inference + finetuning |

Vulkan works on AMD GPUs on Windows 11 without ROCm. All runtimes target the RX 6700 XT.

---

## Model Inventory

### Magenta RT (Real-Time Audio Generation)

Magenta RT is a three-stage pipeline — SpectroStream (audio codec), MusicCoCa (style embeddings), and an encoder-decoder transformer LLM — and is the only open-weights model supporting real-time continuous musical audio generation.
It is an 800-million-parameter autoregressive transformer trained on ~190k hours of stock music — 38% fewer parameters than Stable Audio Open and 77% fewer than MusicGen Large.

| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MRT-001 | Magenta RT LLM | JAX / ONNX | Real-time stereo audio generation | Continuous live generation engine |
| MRT-002 | SpectroStream Encoder | ONNX | Audio → discrete tokens (48kHz stereo, 25Hz, 64 RVQ) | Audio tokenizer |
| MRT-003 | SpectroStream Decoder | ONNX | Tokens → 48kHz stereo audio | Audio detokenizer |
| MRT-004 | MusicCoCa Text | ONNX | Text → 768-dim music embedding | Text prompt → style control |
| MRT-005 | MusicCoCa Audio | ONNX | Audio → 768-dim music embedding | Audio prompt → style control |

**Finetuning:** Free Colab TPU v2-8 via `Magenta_RT_Finetune.ipynb` — customize to your own audio catalog. Official Colab demos support live generation, finetuning, and live audio injection (audio injection mixes user audio with the model's output and feeds the result back as context for the next generation chunk).

---

### Magenta Classic — MIDI / Symbolic

MusicRNN implements Magenta's LSTM-based language models: MelodyRNN, DrumsRNN, ImprovRNN, and PerformanceRNN.
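These LSTM models are driven one event at a time: feed the previous event into the network, read back a logit vector over the event vocabulary, sample the next event, and repeat. A minimal sketch of the temperature-sampling step is below — pure NumPy, independent of any runtime; the commented driving loop (session feed names, `one_hot` helper) is a hypothetical illustration, not this repo's actual API:

```python
import numpy as np

def sample_event(logits: np.ndarray, temperature: float = 1.0, rng=None) -> int:
    """Sample the next event index from a model's output logits.

    temperature < 1.0 biases toward likely events (more conservative);
    temperature > 1.0 flattens the distribution (more adventurous).
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = logits / max(temperature, 1e-6)
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

# Hypothetical driving loop over an ONNX Runtime session (names are illustrative):
# for _ in range(num_steps):
#     logits, state = session.run(None, {"event": one_hot(prev), "state": state})
#     prev = sample_event(logits[0], temperature=0.9)
```

At very low temperatures this collapses to greedy argmax decoding; the same helper works for any of the event-vocabulary RNNs in the table below.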
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MC-001 | Performance RNN | ONNX | Expressive MIDI performance generation | AI arpeggiator, live note generation |
| MC-002 | Melody RNN | ONNX | Melody continuation (LSTM) | Melody continuation tool |
| MC-003 | Drums RNN | ONNX | Drum pattern generation (LSTM) | Beat generation |
| MC-004 | Improv RNN | ONNX | Chord-conditioned melody generation | Live improv over chord progressions |
| MC-005 | Polyphony RNN | ONNX | Polyphonic music generation (BachBot) | Harmonic voice generation |
| MC-006 | MusicVAE | ONNX enc+dec | Latent music VAE — melody, drum, trio loops | Latent interpolation, style morphing |
| MC-007 | GrooVAE | ONNX enc+dec | Drum performance humanization | Humanize MIDI drums |
| MC-008 | MidiMe | ONNX | Personalize MusicVAE in-session | User-adaptive latent space |
| MC-009 | Music Transformer | ONNX | Long-form piano generation | Extended composition |
| MC-010 | Coconet | ONNX | Counterpoint by convolution — complete partial scores | Harmony / counterpoint filler |

---

### Magenta Classic — Audio / Timbre

| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MA-001 | GANSynth | ONNX | GAN audio synthesis from NSynth timbres | GANHarp-style timbre instrument |
| MA-002 | NSynth | ONNX | WaveNet neural audio synthesis | Sample-level timbre generation |
| MA-003 | DDSP Encoder | ONNX | Audio → harmonic + noise params | Timbre analysis |
| MA-004 | DDSP Decoder | ONNX | Harmonic params → audio | Timbre resynthesis |
| MA-005 | Piano Genie | ONNX | 8-button → 88-key piano VQ-VAE | Accessible piano performance |
| MA-006 | Onsets and Frames | ONNX | Polyphonic piano transcription (audio → MIDI) | Audio → MIDI transcription |
| MA-007 | SPICE | ONNX | Pitch extraction from audio | Monophonic pitch tracking |

---

### LLM / Vision Control

| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| LV-001 | Gemma-3N e2b-it | GGUF | Vision + text → structured JSON | Camera → mood/energy/key control |

**Format tiers:**

- `q4_k_m.gguf` — default (recommended, ~1.5 GB)
- `q2_k.gguf` — lite tier (fastest, smallest)
- `f16.gguf` — full-quality reference

**Runtime:** `llama-cpp-v3` Rust crate with Vulkan backend. Same stack as LM Studio — no ROCm or CUDA needed on Windows.

---

## Repository Structure

```
Ashiedu/Synesthesia/
│
├── manifest.json              ← authoritative model registry
│
├── magenta_rt/
│   ├── llm/                   ← MRT-001: JAX checkpoint + ONNX export
│   ├── spectrostream/
│   │   ├── encoder_fp32.onnx
│   │   ├── encoder_fp16.onnx
│   │   ├── decoder_fp32.onnx
│   │   └── decoder_fp16.onnx
│   └── musiccoca/
│       ├── text_fp32.onnx
│       ├── text_fp16.onnx
│       ├── audio_fp32.onnx
│       └── audio_fp16.onnx
│
├── midi/
│   ├── perfrnn/               ← MC-001: fp32 / fp16 / int8
│   ├── melody_rnn/            ← MC-002
│   ├── drums_rnn/             ← MC-003
│   ├── improv_rnn/            ← MC-004
│   ├── polyphony_rnn/         ← MC-005
│   ├── musicvae/              ← MC-006: encoder + decoder
│   ├── groovae/               ← MC-007
│   ├── midime/                ← MC-008
│   ├── music_transformer/     ← MC-009
│   └── coconet/               ← MC-010
│
├── audio/
│   ├── gansynth/              ← MA-001: fp32 / fp16
│   ├── nsynth/                ← MA-002
│   ├── ddsp/                  ← MA-003+004: encoder + decoder
│   ├── piano_genie/           ← MA-005
│   ├── onsets_and_frames/     ← MA-006
│   └── spice/                 ← MA-007
│
└── llm/
    └── gemma3n_e2b/
        ├── q4_k_m.gguf        ← LV-001: default
        ├── q2_k.gguf
        └── f16.gguf
```

Each subdirectory contains a `README.md` with input/output shapes, export commands, and Burn compatibility status.

---

## Quality Tiers (ONNX models)

| Tier | Suffix | VRAM est. | Use case |
|------|--------|-----------|----------|
| Full | `_fp32.onnx` | ~2–4× Half | Reference quality, CI validation |
| **Half** | `_fp16.onnx` | Baseline | **Default — recommended for RX 6700 XT** |
| Lite | `_int8.onnx` | ~0.5× Half | Lowest latency (MIDI models only) |

---

## Pulling Models in Rust

```rust
use std::path::PathBuf;

use hf_hub::api::sync::Api;

/// Download a single file from the Synesthesia repo.
/// Files are cached under ~/.cache/huggingface/hub/.
pub fn pull(repo_path: &str) -> anyhow::Result<PathBuf> {
    let api = Api::new()?;
    let repo = api.model("Ashiedu/Synesthesia".to_string());
    Ok(repo.get(repo_path)?)
}

// Example
let path = pull("midi/perfrnn/fp16.onnx")?;
```

## Pulling Models in Python

```python
from huggingface_hub import snapshot_download, hf_hub_download

# Pull everything
snapshot_download("Ashiedu/Synesthesia", local_dir="./models")

# Pull one file
hf_hub_download(
    repo_id="Ashiedu/Synesthesia",
    filename="midi/perfrnn/fp16.onnx",
    local_dir="./models",
)
```

---

## Export Workflow (Colab)

All models are exported from Colab and pushed here. The generic workflow:

```python
# 1. Pull the existing checkpoint (if updating)
from huggingface_hub import snapshot_download
snapshot_download("Ashiedu/Synesthesia", local_dir="./models", token=HF_TOKEN)

# 2. Clone Magenta source
# !git clone https://github.com/magenta/magenta
# !git clone https://github.com/magenta/magenta-realtime

# 3. Export to ONNX (varies per model — see each model's README)
#    Magenta Classic: tf2onnx
#    Magenta RT: JAX → ONNX via jax2onnx or flax export
#    Gemma-3N: Unsloth → GGUF

# 4. Quantize
from onnxruntime.quantization import quantize_dynamic, QuantType
import onnxconverter_common as occ
import onnx

fp32 = onnx.load("model.onnx")
fp16 = occ.convert_float_to_float16(fp32, keep_io_types=True)
onnx.save(fp16, "model_fp16.onnx")
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

# 5. Push to HF
from huggingface_hub import HfApi
api = HfApi(token=HF_TOKEN)  # set in Colab Secrets
api.upload_file(
    path_or_fileobj="model_fp16.onnx",
    path_in_repo="midi/perfrnn/fp16.onnx",
    repo_id="Ashiedu/Synesthesia",
    commit_message="MC-001 Performance RNN fp16",
)
```

**Gemini on Colab:** Point Gemini at this README and the model's subdirectory README as context. Gemini can execute the export + push workflow without GitHub integration — it only needs Python and your HF token in Colab Secrets.

---

## Burn Compatibility Tracking

CI attempts `burn-onnx ModelGen` weekly on each exported model. Models migrate from the ORT fallback to Burn as op coverage matures.

| Model | Burn target | ORT fallback | Last checked |
|-------|------------|--------------|-------------|
| DDSP enc/dec | ✅ | ❌ | — |
| GANSynth | ✅ | ❌ | — |
| NSynth | ✅ | ❌ | — |
| Piano Genie | ✅ | ❌ | — |
| Performance RNN | 🔄 LSTM | ✅ | — |
| Melody RNN | 🔄 LSTM | ✅ | — |
| Drums RNN | 🔄 LSTM | ✅ | — |
| Improv RNN | 🔄 LSTM | ✅ | — |
| Polyphony RNN | 🔄 LSTM | ✅ | — |
| MusicVAE | 🔄 BiLSTM | ✅ | — |
| Coconet | 🔄 Conv | ✅ | — |
| Music Transformer | 🔄 Attention | ✅ | — |
| Onsets & Frames | 🔄 Conv+LSTM | ✅ | — |
| SpectroStream | 🔄 Conv | ✅ | — |
| MusicCoCa | 🔄 ViT+Transformer | ✅ | — |
| Gemma-3N | N/A — llama.cpp | ❌ | — |

---

## Training Philosophy

**Train after the app works.** The interface ships first. Training data is determined by what the working app actually receives as input in practice. Fine-tune on your own audio and MIDI once the signal chain is wired.

Tentative fine-tuning order once the app is functional:

1. Performance RNN — live MIDI from the Track Mixer
2. MusicVAE / GrooVAE — latent interpolation between patches
3. GANSynth — timbre generation from pitch + latent input
4. DDSP — resynthesis of GANSynth outputs
5. Magenta RT — full audio, conditioned on your own catalog
6. Gemma-3N — camera → mood/energy trained on your session recordings

---

## License

- Codebase: Apache 2.0
- Magenta Classic weights: Apache 2.0
- Magenta RT weights: Apache 2.0 with additional [bespoke terms](https://github.com/magenta/magenta-realtime/blob/main/LICENSE)
- Gemma-3N: [Gemma Terms of Use](https://ai.google.dev/gemma/terms)

Individual model directories note any additional upstream license terms.

---

## Links

- **App:** [kryptodogg/synesthesia](https://github.com/kryptodogg/synesthesia)
- **Magenta RT:** [magenta/magenta-realtime](https://github.com/magenta/magenta-realtime)
- **Magenta Classic:** [magenta/magenta](https://github.com/magenta/magenta)
- **HF Model Card:** [google/magenta-realtime](https://huggingface.co/google/magenta-realtime)
- **Roadmap:** GitHub Issues — `lane:ml` label