---
license: apache-2.0
task_categories:
- audio-to-audio
- text-to-audio
- image-to-text
tags:
- music-generation
- magenta
- magenta-rt
- onnx
- burn
- llama-cpp
- performance-rnn
- melody-rnn
- drums-rnn
- improv-rnn
- polyphony-rnn
- musicvae
- groovae
- piano-genie
- ddsp
- gansynth
- nsynth
- coconet
- music-transformer
- onsets-and-frames
- spectrostream
- musiccoca
- synesthesia
- directml
- vulkan
- wgpu
- audio
- midi
language:
- en
library_name: onnxruntime
base_model:
- unsloth/gemma-3n-E2B-it
- google/magenta-realtime
---
# Synesthesia — AI Music Models
ONNX and GGUF model weights for [Synesthesia](https://github.com/kryptodogg/synesthesia),
a cyber-physical synthesizer, 3D/4D signal workstation, and multi-modal music AI app.
Synesthesia brings together every open-weights model from **Magenta Classic** and
**Magenta RT** under one repo, exportable to ONNX for local inference and continuously
fine-tunable via free Google Colab notebooks.
---
## Inference Runtimes
| Runtime | Models | Backend | Notes |
|---------|--------|---------|-------|
| **Burn wgpu** | DDSP, GANSynth, NSynth, Piano Genie | Vulkan / DX12 | Pure Rust, no ROCm required |
| **ORT + DirectML** | RNN family, MusicVAE, Coconet, Onsets & Frames | DirectML | Fallback while Burn op coverage matures |
| **llama.cpp + Vulkan** | Gemma-3N | Vulkan | Same stack as LM Studio, GGUF format |
| **Magenta RT (JAX)** | Magenta RT LLM, SpectroStream, MusicCoCa | TPU / GPU | Free Colab TPU v2-8 for inference + finetuning |
Vulkan works on AMD without ROCm on Windows 11. All runtimes target the RX 6700 XT.
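When falling back to ONNX Runtime, provider order decides which backend serves a session. A minimal sketch of the selection logic (the helper name and preference list are illustrative, not part of the repo):

```python
# Illustrative helper: keep only the execution providers we prefer, in
# preference order, from whatever ONNX Runtime reports as available.
PREFERRED = ["DmlExecutionProvider", "CPUExecutionProvider"]

def choose_providers(available):
    """Return preferred providers filtered to what is actually installed."""
    chosen = [p for p in PREFERRED if p in available]
    # ONNX Runtime always ships the CPU provider, so never return empty.
    return chosen or ["CPUExecutionProvider"]

# With onnxruntime installed, the result is passed straight to a session:
#   import onnxruntime as ort
#   sess = ort.InferenceSession(
#       "model_fp16.onnx",
#       providers=choose_providers(ort.get_available_providers()),
#   )
```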
---
## Model Inventory
### Magenta RT (Real-Time Audio Generation)
Magenta RT is a pipeline of three components: SpectroStream (audio codec),
MusicCoCa (style embeddings), and an encoder-decoder transformer LLM. Together
they form the only open-weights system that supports real-time, continuous
musical audio generation.
The LLM is an 800-million-parameter autoregressive transformer trained on
~190k hours of stock music — 38% fewer parameters than Stable Audio Open
and 77% fewer than MusicGen Large.
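As a quick sanity check on those ratios, the implied sizes of the comparison models follow from simple arithmetic (the derived figures are back-calculated from the percentages above, not quoted from the other models' cards):

```python
# Back out the implied parameter counts of the comparison models from the
# stated reductions: "X% fewer" means ours = theirs * (1 - X/100).
MAGENTA_RT_PARAMS_M = 800  # million parameters

def implied_size(fewer_by: float) -> int:
    """Implied size (millions) of the model Magenta RT is compared against."""
    return round(MAGENTA_RT_PARAMS_M / (1 - fewer_by))

print(implied_size(0.38))  # 1290 -> a ~1.3B-class model
print(implied_size(0.77))  # 3478 -> a ~3.5B-class model
```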
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MRT-001 | Magenta RT LLM | JAX / ONNX | Real-time stereo audio generation | Continuous live generation engine |
| MRT-002 | SpectroStream Encoder | ONNX | Audio → discrete tokens (48kHz stereo, 25Hz, 64 RVQ) | Audio tokenizer |
| MRT-003 | SpectroStream Decoder | ONNX | Tokens → 48kHz stereo audio | Audio detokenizer |
| MRT-004 | MusicCoCa Text | ONNX | Text → 768-dim music embedding | Text prompt → style control |
| MRT-005 | MusicCoCa Audio | ONNX | Audio → 768-dim music embedding | Audio prompt → style control |
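The SpectroStream figures above pin down the token budget the LLM must sustain; assuming one discrete token per RVQ level per frame, the arithmetic is:

```python
# Token throughput implied by the table above: 25 frames per second,
# 64 RVQ levels per frame, one discrete token per level.
FRAME_RATE_HZ = 25
RVQ_DEPTH = 64

tokens_per_second = FRAME_RATE_HZ * RVQ_DEPTH
print(tokens_per_second)  # 1600 tokens/s to decode in real time
```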
**Finetuning:** Free Colab TPU v2-8 via `Magenta_RT_Finetune.ipynb` lets you
customize the model on your own audio catalog. The official Colab demos support
live generation, finetuning, and live audio injection (mixing user audio with
model output and feeding the result back as context for the next generation
chunk).
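At its core, the audio-injection step is a per-sample mix before re-tokenization. A minimal sketch of the idea (pure Python, fixed mix gain; the function name is illustrative, not part of the Magenta RT API):

```python
def inject(user_chunk, model_chunk, gain=0.5):
    """Mix user audio into the model's last output chunk.

    The mixed chunk is what would be re-encoded by SpectroStream and fed
    back as context for the next generation step.
    """
    assert len(user_chunk) == len(model_chunk)
    return [gain * u + (1.0 - gain) * m
            for u, m in zip(user_chunk, model_chunk)]

print(inject([1.0, 0.0], [0.0, 1.0], gain=0.25))  # [0.25, 0.75]
```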
---
### Magenta Classic — MIDI / Symbolic
MusicRNN implements Magenta's LSTM-based language models:
MelodyRNN, DrumsRNN, ImprovRNN, and PerformanceRNN.
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MC-001 | Performance RNN | ONNX | Expressive MIDI performance generation | AI arpeggiator, live note generation |
| MC-002 | Melody RNN | ONNX | Melody continuation (LSTM) | Melody continuation tool |
| MC-003 | Drums RNN | ONNX | Drum pattern generation (LSTM) | Beat generation |
| MC-004 | Improv RNN | ONNX | Chord-conditioned melody generation | Live improv over chord progressions |
| MC-005 | Polyphony RNN | ONNX | Polyphonic music generation (BachBot) | Harmonic voice generation |
| MC-006 | MusicVAE | ONNX enc+dec | Latent music VAE — melody, drum, trio loops | Latent interpolation, style morphing |
| MC-007 | GrooVAE | ONNX enc+dec | Drum performance humanization | Humanize MIDI drums |
| MC-008 | MidiMe | ONNX | Personalize MusicVAE in-session | User-adaptive latent space |
| MC-009 | Music Transformer | ONNX | Long-form piano generation | Extended composition |
| MC-010 | Coconet | ONNX | Counterpoint by convolution — complete partial scores | Harmony / counterpoint filler |
---
### Magenta Classic — Audio / Timbre
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| MA-001 | GANSynth | ONNX | GAN audio synthesis from NSynth timbres | GANHarp-style timbre instrument |
| MA-002 | NSynth | ONNX | WaveNet neural audio synthesis | Sample-level timbre generation |
| MA-003 | DDSP Encoder | ONNX | Audio β†’ harmonic + noise params | Timbre analysis |
| MA-004 | DDSP Decoder | ONNX | Harmonic params β†’ audio | Timbre resynthesis |
| MA-005 | Piano Genie | ONNX | 8-button → 88-key piano VQ-VAE | Accessible piano performance |
| MA-006 | Onsets and Frames | ONNX | Polyphonic piano transcription (audio → MIDI) | Audio → MIDI transcription |
| MA-007 | SPICE | ONNX | Pitch extraction from audio | Monophonic pitch tracking |
---
### LLM / Vision Control
| ID | Model | Format | Task | Synesthesia Role |
|----|-------|--------|------|-----------------|
| LV-001 | Gemma-3N e2b-it | GGUF | Vision + text → structured JSON | Camera → mood/energy/key control |
**Format tiers:**
- `q4_k_m.gguf` — default (recommended, ~1.5GB)
- `q2_k.gguf` — lite tier (fastest, smallest)
- `f16.gguf` — full quality reference
**Runtime:** `llama-cpp-v3` Rust crate with Vulkan backend.
Same stack as LM Studio — no ROCm, no CUDA needed on Windows.
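Tier choice can be automated from the available memory budget. A hedged sketch (only the q4_k_m size is stated in this card; the other sizes are rough placeholder assumptions):

```python
# Illustrative tier selection for the Gemma-3N GGUF files above.
# Sizes are assumptions except q4_k_m (~1.5 GB, stated in the tier list).
TIERS = [  # (filename, approx size in GB), best quality first
    ("f16.gguf", 6.0),
    ("q4_k_m.gguf", 1.5),
    ("q2_k.gguf", 0.9),
]

def pick_tier(budget_gb: float) -> str:
    """Best-quality GGUF that fits the memory budget, else the smallest."""
    for name, size_gb in TIERS:
        if size_gb <= budget_gb:
            return name
    return TIERS[-1][0]

print(pick_tier(2.0))  # q4_k_m.gguf
```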
---
## Repository Structure
```
Ashiedu/Synesthesia/
│
├── manifest.json              ← authoritative model registry
│
├── magenta_rt/
│   ├── llm/                   ← MRT-001: JAX checkpoint + ONNX export
│   ├── spectrostream/
│   │   ├── encoder_fp32.onnx
│   │   ├── encoder_fp16.onnx
│   │   ├── decoder_fp32.onnx
│   │   └── decoder_fp16.onnx
│   └── musiccoca/
│       ├── text_fp32.onnx
│       ├── text_fp16.onnx
│       ├── audio_fp32.onnx
│       └── audio_fp16.onnx
│
├── midi/
│   ├── perfrnn/               ← MC-001: fp32 / fp16 / int8
│   ├── melody_rnn/            ← MC-002
│   ├── drums_rnn/             ← MC-003
│   ├── improv_rnn/            ← MC-004
│   ├── polyphony_rnn/         ← MC-005
│   ├── musicvae/              ← MC-006: encoder + decoder
│   ├── groovae/               ← MC-007
│   ├── midime/                ← MC-008
│   ├── music_transformer/     ← MC-009
│   └── coconet/               ← MC-010
│
├── audio/
│   ├── gansynth/              ← MA-001: fp32 / fp16
│   ├── nsynth/                ← MA-002
│   ├── ddsp/                  ← MA-003+004: encoder + decoder
│   ├── piano_genie/           ← MA-005
│   ├── onsets_and_frames/     ← MA-006
│   └── spice/                 ← MA-007
│
└── llm/
    └── gemma3n_e2b/
        ├── q4_k_m.gguf        ← LV-001: default
        ├── q2_k.gguf
        └── f16.gguf
```
Each subdirectory contains a `README.md` with input/output shapes,
export commands, and Burn compatibility status.
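Because the layout follows a predictable `<family>/<model_dir>/<tier>.onnx` pattern, client code can construct repo paths directly. A convenience sketch (`manifest.json` remains the authoritative registry; the helper name and the int8 restriction logic are illustrative):

```python
# Illustrative path builder matching the tree above, e.g.
# repo_path("midi", "perfrnn", "fp16") -> "midi/perfrnn/fp16.onnx"
MIDI_FAMILY = "midi"  # int8 exports exist only for the MIDI models

def repo_path(family: str, model_dir: str, tier: str = "fp16") -> str:
    if tier == "int8" and family != MIDI_FAMILY:
        raise ValueError("int8 tier is published for MIDI models only")
    return f"{family}/{model_dir}/{tier}.onnx"

print(repo_path("midi", "perfrnn"))  # midi/perfrnn/fp16.onnx
```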
---
## Quality Tiers (ONNX models)
| Tier | Suffix | VRAM est. | Use case |
|------|--------|-----------|----------|
| Full | `_fp32.onnx` | ~2–4× Half | Reference quality, CI validation |
| **Half** | `_fp16.onnx` | Baseline | **Default — recommended for RX 6700 XT** |
| Lite | `_int8.onnx` | ~0.5× Half | Lowest latency (MIDI models only) |
---
## Pulling Models in Rust
```rust
use hf_hub::api::sync::Api;

/// Download one file from the model repo, reusing the local cache
/// (~/.cache/huggingface/hub/) when the file is already present.
pub fn pull(repo_path: &str) -> anyhow::Result<std::path::PathBuf> {
    let api = Api::new()?;
    let repo = api.model("Ashiedu/Synesthesia".to_string());
    Ok(repo.get(repo_path)?)
}

fn main() -> anyhow::Result<()> {
    let path = pull("midi/perfrnn/fp16.onnx")?;
    println!("model at {}", path.display());
    Ok(())
}
```
## Pulling Models in Python
```python
from huggingface_hub import snapshot_download, hf_hub_download

# Pull everything
snapshot_download("Ashiedu/Synesthesia", local_dir="./models")

# Pull one file
hf_hub_download(
    repo_id="Ashiedu/Synesthesia",
    filename="midi/perfrnn/fp16.onnx",
    local_dir="./models",
)
```
---
## Export Workflow (Colab)
All models are exported from Colab and pushed here. The generic workflow:
```python
# 1. Pull existing checkpoint (if updating)
from huggingface_hub import snapshot_download
snapshot_download("Ashiedu/Synesthesia", local_dir="./models", token=HF_TOKEN)

# 2. Clone Magenta source
# !git clone https://github.com/magenta/magenta
# !git clone https://github.com/magenta/magenta-realtime

# 3. Export to ONNX (varies per model — see each model's README)
#    Magenta Classic: tf2onnx
#    Magenta RT: JAX → ONNX via jax2onnx or flax export
#    Gemma-3N: Unsloth → GGUF

# 4. Quantize
from onnxruntime.quantization import quantize_dynamic, QuantType
import onnxconverter_common as occ
import onnx

fp32 = onnx.load("model.onnx")
fp16 = occ.convert_float_to_float16(fp32, keep_io_types=True)
onnx.save(fp16, "model_fp16.onnx")
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

# 5. Push to HF
from huggingface_hub import HfApi
api = HfApi(token=HF_TOKEN)  # set HF_TOKEN in Colab Secrets
api.upload_file(
    path_or_fileobj="model_fp16.onnx",
    path_in_repo="midi/perfrnn/fp16.onnx",
    repo_id="Ashiedu/Synesthesia",
    commit_message="MC-001 Performance RNN fp16",
)
```
**Gemini on Colab:** Point Gemini at this README and the model's subdirectory
README as context. Gemini can execute the export + push workflow without
GitHub integration — it only needs Python and your HF token in Colab Secrets.
---
## Burn Compatibility Tracking
A weekly CI job attempts `burn-onnx ModelGen` on each exported model;
models migrate from the ORT fallback to Burn as op coverage matures.
| Model | Burn target | ORT fallback | Last checked |
|-------|------------|--------------|-------------|
| DDSP enc/dec | ✅ | ❌ | — |
| GANSynth | ✅ | ❌ | — |
| NSynth | ✅ | ❌ | — |
| Piano Genie | ✅ | ❌ | — |
| Performance RNN | 🔄 LSTM | ✅ | — |
| Melody RNN | 🔄 LSTM | ✅ | — |
| Drums RNN | 🔄 LSTM | ✅ | — |
| Improv RNN | 🔄 LSTM | ✅ | — |
| Polyphony RNN | 🔄 LSTM | ✅ | — |
| MusicVAE | 🔄 BiLSTM | ✅ | — |
| Coconet | 🔄 Conv | ✅ | — |
| Music Transformer | 🔄 Attention | ✅ | — |
| Onsets & Frames | 🔄 Conv+LSTM | ✅ | — |
| SpectroStream | 🔄 Conv | ✅ | — |
| MusicCoCa | 🔄 ViT+Transformer | ✅ | — |
| Gemma-3N | N/A — llama.cpp | ❌ | — |
---
## Training Philosophy
**Train after the app works.** The interface ships first. Training data
is determined by what the working app actually receives as input in practice.
Fine-tune on your own audio and MIDI once the signal chain is wired.
Tentative fine-tuning order once the app is functional:
1. Performance RNN — live MIDI from the Track Mixer
2. MusicVAE / GrooVAE — latent interpolation between patches
3. GANSynth — timbre generation from pitch + latent input
4. DDSP — resynthesis of GANSynth outputs
5. Magenta RT — full audio, conditioned on your own catalog
6. Gemma-3N — camera → mood/energy trained on your session recordings
---
## License
- Codebase: Apache 2.0
- Magenta Classic weights: Apache 2.0
- Magenta RT weights: Apache 2.0 with additional [bespoke terms](https://github.com/magenta/magenta-realtime/blob/main/LICENSE)
- Gemma-3N: [Gemma Terms of Use](https://ai.google.dev/gemma/terms)
Individual model directories note any additional upstream license terms.
---
## Links
- **App:** [kryptodogg/synesthesia](https://github.com/kryptodogg/synesthesia)
- **Magenta RT:** [magenta/magenta-realtime](https://github.com/magenta/magenta-realtime)
- **Magenta Classic:** [magenta/magenta](https://github.com/magenta/magenta)
- **HF Model Card:** [google/magenta-realtime](https://huggingface.co/google/magenta-realtime)
- **Roadmap:** GitHub Issues — `lane:ml` label