Magenta RT Fine-tune: Reverb (4) × Low-Pass Filter (4)
This project is in its early stages; I built it as a side project.
This repository contains a fine-tuned Magenta RealTime (Magenta RT) checkpoint that adds realtime, chunk-by-chunk controllability for:
- Reverb: 4 levels (dry, light, medium, heavy)
- Low-pass filter (LPF): 4 levels (open, light, medium, heavy)
- Drum stem only
Controls are applied via a single control token in lane 0 placed in the vocab gap (after codec vocab, before style vocab). All other style lanes remain normal MusicCoCa style tokens to avoid entanglement.
What this model does
Magenta RT generates audio in 2-second chunks, conditioned on a 10-second context. This fine-tune keeps the original streaming behavior but allows you to switch effect controls live:
- Toggle reverb and LPF while generating on the drum stem
- Changes take effect on the next chunk(s)
- Instrumentation/timbre should stay relatively stable because style lanes remain MusicCoCa-based
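The chunk-level control flow above can be sketched in plain Python. The function names (`stream`, `get_ui_state`, `generate_chunk`) are illustrative stand-ins, not the repo's actual API; the real streaming loop lives in the patched `system.py` and the Colab notebook:

```python
# Hypothetical sketch of per-chunk control: UI toggles are read right before
# each 2-second chunk, so a change takes effect on the next chunk.

def control_state(reverb_id: int, lpf_id: int) -> int:
    """Combine the two 4-level controls into one of 16 state ids."""
    assert 0 <= reverb_id < 4 and 0 <= lpf_id < 4
    return reverb_id * 4 + lpf_id

def stream(generate_chunk, num_chunks, get_ui_state):
    """Generate chunk-by-chunk; toggle changes apply from the next chunk on."""
    chunks = []
    for _ in range(num_chunks):
        reverb_id, lpf_id = get_ui_state()  # poll the live toggles
        chunks.append(generate_chunk(control_state(reverb_id, lpf_id)))
    return chunks

# Toy usage: the "model" just echoes the state it was conditioned on.
states = iter([(0, 0), (0, 0), (3, 2)])  # user flips toggles mid-stream
out = stream(lambda s: s, 3, lambda: next(states))
# out == [0, 0, 14]: the third chunk picks up the new (heavy, medium) state
```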
Key idea: Lane-0-only control (avoid multi-lane entanglement)
During training and inference:
- Lane 0 is overwritten with a control token
- Lanes 1–5 remain the “real” MusicCoCa style tokens (computed from audio/text style prompts)
This approach was chosen because earlier attempts to encode multiple controls across multiple lanes led to stronger entanglement (style drift, instability, and weaker control separation).
Control encoding (16 states)
The control token encodes one of 16 states (4 reverb × 4 LPF). The token is:
control_token = VOCAB_CONTROL_OFFSET + state_id
VOCAB_CONTROL_OFFSET = vocab_codec_offset + vocab_codec_size
state_id ∈ [0..15]
State mapping
The default mapping is:
state_id = reverb_id * 4 + lpf_id
Where:
- reverb_id: 0 = dry, 1 = light, 2 = medium, 3 = heavy
- lpf_id: 0 = open (no LPF / brightest), 1 = light, 2 = medium, 3 = heavy (darkest)
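The encoding and mapping above can be sketched as follows. The codec offset and vocab size here are placeholders (the real values come from the checkpoint's vocab config), but the arithmetic matches the formulas in this section:

```python
# Placeholder vocab geometry; substitute the checkpoint's actual values.
VOCAB_CODEC_OFFSET = 0     # assumption: codec tokens start the vocab
VOCAB_CODEC_SIZE = 1024    # assumption: codec vocab size
VOCAB_CONTROL_OFFSET = VOCAB_CODEC_OFFSET + VOCAB_CODEC_SIZE  # the vocab gap

REVERB = ["dry", "light", "medium", "heavy"]
LPF = ["open", "light", "medium", "heavy"]

def encode(reverb: str, lpf: str) -> int:
    """Map a (reverb, lpf) setting to its lane-0 control token."""
    state_id = REVERB.index(reverb) * 4 + LPF.index(lpf)
    return VOCAB_CONTROL_OFFSET + state_id

def decode(token: int) -> tuple[str, str]:
    """Invert encode(): recover the (reverb, lpf) labels from a token."""
    state_id = token - VOCAB_CONTROL_OFFSET
    return REVERB[state_id // 4], LPF[state_id % 4]

assert encode("dry", "open") == VOCAB_CONTROL_OFFSET      # state_id 0
assert decode(encode("heavy", "medium")) == ("heavy", "medium")
```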
Streaming demo (Colab)
A Colab notebook is provided (recommended) with:
- streaming UI similar to the official Magenta RT demo
- text/audio style prompts
- live Reverb + LPF toggle controls
- optional live recording of output audio
https://github.com/toofaloof/magenta-realtime-mixing
Required inference patch (lane-0 override)
To match training, lane-0 must be overwritten after style tokens are computed, leaving lanes 1–5 untouched.
You can find the patched system.py in the repo and notebook.
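A minimal stand-in for what the patch does, using plain Python lists (the actual patch is in the repo's `system.py`; lane count and shapes are assumptions, and the token value below is a placeholder):

```python
# Sketch of the lane-0 override, applied AFTER style tokens are computed.
def override_lane0(style_tokens, control_token):
    """style_tokens: list of 6 lanes, each a list of per-position token ids.
    Returns a copy with lane 0 replaced by the control token; lanes 1-5 untouched."""
    seq_len = len(style_tokens[0])
    return [[control_token] * seq_len] + [lane[:] for lane in style_tokens[1:]]

lanes = [[i * 10 + j for j in range(3)] for i in range(6)]  # dummy 6x3 tokens
patched = override_lane0(lanes, 2051)  # 2051: placeholder control token id
assert patched[0] == [2051, 2051, 2051]  # lane 0 fully overwritten
assert patched[1:] == lanes[1:]          # MusicCoCa lanes 1-5 preserved
```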
Dataset synthesis (how the 16 states are produced)
For each dry reference track, I synthesized a Cartesian product of effect settings:
- 4 reverb levels × 4 low-pass levels = 16 states
- each state is rendered offline and stored as RVQ tokens
During training, a preprocessor samples:
- a random time window (10s context + 2s target)
- a source state for context
- a target state for continuation
The control token encodes the target state, forcing the model to obey the setting. Rendering the full Cartesian product per track is not as scalable as I would like.
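The sampling procedure above can be sketched with the standard library. The function and field names are illustrative (not the repo's preprocessor API), and window lengths are in seconds:

```python
import itertools
import random

# The 16-state Cartesian product: (reverb_id, lpf_id) pairs.
STATES = list(itertools.product(range(4), range(4)))
assert len(STATES) == 16

def sample_example(track_seconds: float, rng: random.Random) -> dict:
    """Pick a random 10s-context + 2s-target window and independent
    source/target effect states, as the training preprocessor does."""
    start = rng.uniform(0.0, track_seconds - 12.0)
    src = rng.choice(STATES)        # state the context tokens are rendered in
    tgt = rng.choice(STATES)        # state the continuation must be rendered in
    state_id = tgt[0] * 4 + tgt[1]  # encoded into the lane-0 control token
    return {
        "context": (start, start + 10.0),
        "target": (start + 10.0, start + 12.0),
        "source_state": src,
        "target_state": tgt,
        "state_id": state_id,
    }

ex = sample_example(60.0, random.Random(0))
assert 0 <= ex["state_id"] < 16
```

Sampling the source and target states independently is what teaches the model to *switch* effects at a chunk boundary rather than merely continue the context's setting.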
Training data (high level)
This fine-tune was trained using a mixture of:
- Slakh2100 (audio files + custom renderings based on MIDI)
- MUSDB (stem-separated music)
- additional stem-separated music sources
- SFX / effect augmentations applied with Pedalboard (reverb + low-pass variations)
Prior attempt: “token appending” approach (what I tried and why I didn’t keep it)
Before settling on lane-0 control (where I have some scaling concerns), I also experimented with appending control tokens to the input sequence (e.g., placing extra tokens after the codec tokens and/or after style tokens).
In practice, this was less reliable for this setup:
- control compliance was weaker (especially under streaming / chunk boundary conditions)
- it sometimes interacted poorly with packing/length constraints and preprocessor cropping
- it tended to be more brittle than the lane-0 override strategy I already used successfully for reverb
I ultimately moved to lane-0-only injection in the vocab gap, which gave more consistent controllability without disturbing style lanes.
Known limitations
- Controls may not be perfectly linear in perceived intensity.
- Some instability/artifacts may appear, and some entanglement between controls and style may remain.
- If you generate without a stable style reference, instrumentation can drift more.
- Base Magenta RT coverage is stronger for Western instrumental music than vocals.
Intended use
- Realtime generative music with controllable reverb + low-pass on the drum stem
- Interactive DJ / performance tools
- Prototyping effect-conditioned generation
Not intended for:
- producing exact recreations of copyrighted works
- vocal-focused generation (limited coverage)
License
Magenta RT codebase: Apache 2.0 (upstream). Magenta RT weights: CC BY 4.0 (upstream).
This repo distributes a fine-tuned checkpoint derived from the open weights. Please keep attribution consistent with the upstream model card and licenses.
Citation / attribution
If you use this checkpoint, please cite:
- Magenta RealTime (Magenta RT) upstream project
- the datasets used (Slakh2100, MUSDB)
- this fine-tuned checkpoint repo
Base model: google/magenta-realtime