Magenta RT Fine-tune: Reverb (4) × Low-Pass Filter (4)
This project is in its early stages; I built it as a side project.
This repository contains a fine-tuned Magenta RealTime (Magenta RT) checkpoint that adds realtime, chunk-by-chunk controllability for:
- Reverb: 4 levels (dry, light, medium, heavy)
- Low-pass filter (LPF): 4 levels (open, light, medium, heavy)
- Drum stem only
Controls are applied via a single control token in lane 0 placed in the vocab gap (after codec vocab, before style vocab). All other style lanes remain normal MusicCoCa style tokens to avoid entanglement.
What this model does
Magenta RT generates audio in 2-second chunks, conditioned on a 10-second context. This fine-tune keeps the original streaming behavior but allows you to switch effect controls live:
- Toggle reverb and LPF while generating on the drum stem
- Changes take effect on the next chunk(s)
- Instrumentation/timbre should stay relatively stable because style lanes remain MusicCoCa-based
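The chunk-level control flow above can be sketched in plain Python. The function names (`stream`, `get_ui_state`, `generate_chunk`) are illustrative stand-ins, not the repo's actual API; the real streaming loop lives in the patched `system.py` and the Colab notebook:

```python
# Hypothetical sketch of per-chunk control: UI toggles are read right before
# each 2-second chunk, so a change takes effect on the next chunk.

def control_state(reverb_id: int, lpf_id: int) -> int:
    """Combine the two 4-level controls into one of 16 state ids."""
    assert 0 <= reverb_id < 4 and 0 <= lpf_id < 4
    return reverb_id * 4 + lpf_id

def stream(generate_chunk, num_chunks, get_ui_state):
    """Generate chunk-by-chunk; toggle changes apply from the next chunk on."""
    chunks = []
    for _ in range(num_chunks):
        reverb_id, lpf_id = get_ui_state()  # poll the live toggles
        chunks.append(generate_chunk(control_state(reverb_id, lpf_id)))
    return chunks

# Toy usage: the "model" just echoes the state it was conditioned on.
states = iter([(0, 0), (0, 0), (3, 2)])  # user flips toggles mid-stream
out = stream(lambda s: s, 3, lambda: next(states))
# out == [0, 0, 14]: the third chunk picks up the new (heavy, medium) state
```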
Key idea: Lane-0-only control (avoid multi-lane entanglement)
During training and inference:
- Lane 0 is overwritten with a control token
- Lanes 1–5 remain the “real” MusicCoCa style tokens (computed from audio/text style prompts)
This approach was chosen because earlier attempts to encode multiple controls across multiple lanes led to stronger entanglement (style drift, instability, and weaker control separation).
Control encoding (16 states)
The control token encodes one of 16 states (4 reverb × 4 LPF). The token is:
control_token = VOCAB_CONTROL_OFFSET + state_id
VOCAB_CONTROL_OFFSET = vocab_codec_offset + vocab_codec_size
state_id ∈ [0..15]
State mapping
The default mapping is:
state_id = reverb_id * 4 + lpf_id
Where:
- reverb_id: 0 = dry, 1 = light, 2 = medium, 3 = heavy
- lpf_id: 0 = open (no LPF / brightest), 1 = light, 2 = medium, 3 = heavy (darkest)
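The encoding and mapping above can be sketched as follows. The codec offset and vocab size here are placeholders (the real values come from the checkpoint's vocab config), but the arithmetic matches the formulas in this section:

```python
# Placeholder vocab geometry; substitute the checkpoint's actual values.
VOCAB_CODEC_OFFSET = 0     # assumption: codec tokens start the vocab
VOCAB_CODEC_SIZE = 1024    # assumption: codec vocab size
VOCAB_CONTROL_OFFSET = VOCAB_CODEC_OFFSET + VOCAB_CODEC_SIZE  # the vocab gap

REVERB = ["dry", "light", "medium", "heavy"]
LPF = ["open", "light", "medium", "heavy"]

def encode(reverb: str, lpf: str) -> int:
    """Map a (reverb, lpf) setting to its lane-0 control token."""
    state_id = REVERB.index(reverb) * 4 + LPF.index(lpf)
    return VOCAB_CONTROL_OFFSET + state_id

def decode(token: int) -> tuple[str, str]:
    """Invert encode(): recover the (reverb, lpf) labels from a token."""
    state_id = token - VOCAB_CONTROL_OFFSET
    return REVERB[state_id // 4], LPF[state_id % 4]

assert encode("dry", "open") == VOCAB_CONTROL_OFFSET      # state_id 0
assert decode(encode("heavy", "medium")) == ("heavy", "medium")
```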
Streaming demo (Colab)
A Colab notebook is provided (recommended) with:
- streaming UI similar to the official Magenta RT demo
- text/audio style prompts
- live Reverb + LPF toggle controls
- optional live recording of output audio
https://github.com/toofaloof/magenta-realtime-mixing
Required inference patch (lane-0 override)
To match training, lane-0 must be overwritten after style tokens are computed, leaving lanes 1–5 untouched.
You can find the patched system.py in the repo and notebook.
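A minimal stand-in for what the patch does, using plain Python lists (the actual patch is in the repo's `system.py`; lane count and shapes are assumptions, and the token value below is a placeholder):

```python
# Sketch of the lane-0 override, applied AFTER style tokens are computed.
def override_lane0(style_tokens, control_token):
    """style_tokens: list of 6 lanes, each a list of per-position token ids.
    Returns a copy with lane 0 replaced by the control token; lanes 1-5 untouched."""
    seq_len = len(style_tokens[0])
    return [[control_token] * seq_len] + [lane[:] for lane in style_tokens[1:]]

lanes = [[i * 10 + j for j in range(3)] for i in range(6)]  # dummy 6x3 tokens
patched = override_lane0(lanes, 2051)  # 2051: placeholder control token id
assert patched[0] == [2051, 2051, 2051]  # lane 0 fully overwritten
assert patched[1:] == lanes[1:]          # MusicCoCa lanes 1-5 preserved
```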
Dataset synthesis (how the 16 states are produced)
For each dry reference track, I synthesized a Cartesian product of effect settings:
- 4 reverb levels × 4 low-pass levels = 16 states
- each state is rendered offline and stored as RVQ tokens
During training, a preprocessor samples:
- a random time window (10s context + 2s target)
- a source state for context
- a target state for continuation
The control token encodes the target state, forcing the model to obey the setting. Rendering the full Cartesian product per track is not as scalable as I would like.
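The sampling procedure above can be sketched with the standard library. The function and field names are illustrative (not the repo's preprocessor API), and window lengths are in seconds:

```python
import itertools
import random

# The 16-state Cartesian product: (reverb_id, lpf_id) pairs.
STATES = list(itertools.product(range(4), range(4)))
assert len(STATES) == 16

def sample_example(track_seconds: float, rng: random.Random) -> dict:
    """Pick a random 10s-context + 2s-target window and independent
    source/target effect states, as the training preprocessor does."""
    start = rng.uniform(0.0, track_seconds - 12.0)
    src = rng.choice(STATES)        # state the context tokens are rendered in
    tgt = rng.choice(STATES)        # state the continuation must be rendered in
    state_id = tgt[0] * 4 + tgt[1]  # encoded into the lane-0 control token
    return {
        "context": (start, start + 10.0),
        "target": (start + 10.0, start + 12.0),
        "source_state": src,
        "target_state": tgt,
        "state_id": state_id,
    }

ex = sample_example(60.0, random.Random(0))
assert 0 <= ex["state_id"] < 16
```

Sampling the source and target states independently is what teaches the model to *switch* effects at a chunk boundary rather than merely continue the context's setting.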
Training data (high level)
This fine-tune was trained using a mixture of:
- Slakh2100 (audio files + custom renderings based on MIDI)
- MUSDB (stem-separated music)
- additional stem-separated music sources
- SFX / effect augmentations applied with Pedalboard (reverb + low-pass variations)
Prior attempt: “token appending” approach (what I tried and why I didn’t keep it)
Before settling on lane-0 control (where I have some scaling concerns), I also experimented with appending control tokens to the input sequence (e.g., placing extra tokens after the codec tokens and/or after style tokens).
In practice, this was less reliable for this setup:
- control compliance was weaker (especially under streaming / chunk boundary conditions)
- it sometimes interacted poorly with packing/length constraints and preprocessor cropping
- it tended to be more brittle than the lane-0 override strategy I already used successfully for reverb
I ultimately moved to lane-0-only injection in the vocab gap, which gave more consistent controllability without disturbing style lanes.
Known limitations
- Controls may not be perfectly linear in perceived intensity.
- Some instability/artifacts may appear, and some entanglement between controls and style may remain.
- If you generate without a stable style reference, instrumentation can drift more.
- Base Magenta RT coverage is stronger for Western instrumental music than vocals.
Intended use
- Realtime generative music with controllable reverb + low-pass on the drum stem
- Interactive DJ / performance tools
- Prototyping effect-conditioned generation
Not intended for:
- producing exact recreations of copyrighted works
- vocal-focused generation (limited coverage)
License
Magenta RT codebase: Apache 2.0 (upstream). Magenta RT weights: CC BY 4.0 (upstream).
This repo distributes a fine-tuned checkpoint derived from the open weights. Please keep attribution consistent with the upstream model card and licenses.
Citation / attribution
If you use this checkpoint, please cite:
- Magenta RealTime (Magenta RT) upstream project
- the datasets used (Slakh2100, MUSDB)
- this fine-tuned checkpoint repo
Base model: google/magenta-realtime