Magenta RealTime 2: Local (Windows / consumer GPU)

Run Magenta RealTime 2 (Google's real-time music model) locally on a consumer NVIDIA GPU, Windows included, with the same web UI as the official magenta-community/magenta-rt-jam Space, but without the ZeroGPU / spaces infrastructure.

Tested on an RTX 4070 (12 GB) on Windows 11: mrt2_small streams at ~60% of the real-time budget (≈1.6× faster than real-time), smoothly and with flat VRAM usage.

This is a thin local driver on top of the official PyTorch port. It does not modify any of Google's model code; everything lives in a single app_local.py. Original work © Google LLC, Apache-2.0. See Credits.


Why this exists

The official Space runs on ZeroGPU (datacenter A100/H200, Linux) and ships an MLX path for Apple Silicon. There was no supported way to run it locally on a Windows / consumer NVIDIA GPU. This project fills that gap.

The official Space's app.py assumes:

  • import spaces + @spaces.GPU (ZeroGPU grant): with no cloud backend, the browser hangs on "Obtaining ZeroGPU…";
  • Linux paths (/tmp, /data), and weights that are re-downloaded on every boot;
  • preloading both models (small + base) at once: fine on an 80 GB A100, an instant OOM on 12 GB;
  • AOTI artifacts compiled for A10G (sm_86), which don't load on Ada (sm_89).

app_local.py resolves all of these for local use.

The key finding (the part worth reading)

Out of the box, streaming leaks VRAM (~17 MB/frame → OOM in ~20 s) and stutters. After a long hunt, the root cause turned out to need only a one-line fix:

The model's decode_stream, generate, and stream methods carry @torch.no_grad, but step_f and encode do not. The local stream calls step_f/encode directly (not through generate), so autograd stays on and builds a gradient graph every frame whose activations pile up in VRAM. The official Space never noticed: on ZeroGPU the 80 GB of headroom and the ~55 s grant-reset hide it.

The fix is to disable gradients in the streaming threads:

torch.set_grad_enabled(False)   # main thread (warm-up) AND the worker thread (thread-local)
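
A minimal sketch of why the call has to be repeated in the worker thread: PyTorch's grad mode is thread-local, so disabling it in the main thread does not carry over to a thread started afterwards. The helper name grad_mode_in_new_thread is hypothetical and is not taken from app_local.py.

```python
import threading

import torch


def grad_mode_in_new_thread(disable_first: bool) -> bool:
    """Start a thread, optionally disable grad inside it, and report its grad mode."""
    seen = {}

    def worker():
        if disable_first:
            torch.set_grad_enabled(False)  # affects this thread only
        seen["enabled"] = torch.is_grad_enabled()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return seen["enabled"]


torch.set_grad_enabled(False)                     # main thread only
main_enabled = torch.is_grad_enabled()
worker_default = grad_mode_in_new_thread(False)   # new thread starts from the default
worker_fixed = grad_mode_in_new_thread(True)      # worker disables grad itself
torch.set_grad_enabled(True)                      # restore the default after the demo

print(main_enabled, worker_default, worker_fixed)
```

If the three values differ as described, a streaming worker that never calls torch.set_grad_enabled(False) itself would still record autograd graphs, even though the warm-up in the main thread did not.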

With that single change, on the same RTX 4070:

| Mode | Per-frame | VRAM over 1500 frames | Verdict |
|---|---|---|---|
| eager, autograd on (original) | 70–80 ms, climbing | 1 GB → 26 GB → OOM | unusable |
| eager, no_grad | ~70 ms | flat ~1 GB | stable, not real-time |
| torch.compile + no_grad | 24 ms (60%) | flat ~1 GB | smooth, real-time |
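
As a rough consistency check on the figures above, the arithmetic can be written out. The 40 ms frame budget is not stated anywhere in this README; it is inferred here from "24 ms (60%)" and should be treated as an assumption.

```python
# Figures quoted in this README.
compiled_frame_ms = 24.0      # torch.compile + no_grad
budget_fraction = 0.60        # "60% of the real-time budget"
eager_frame_ms = 70.0         # eager, no_grad
leak_mb_per_frame = 17.0      # leak with autograd on
frames = 1500

# Derived values (assumptions, not measurements).
budget_ms = compiled_frame_ms / budget_fraction   # implied per-frame budget
speedup = 1.0 / budget_fraction                   # how far ahead of real time
eager_load = eager_frame_ms / budget_ms           # > 1 means slower than real time
leaked_gb = leak_mb_per_frame * frames / 1000.0   # total growth over the run

print(f"implied budget: {budget_ms:.0f} ms per frame")
print(f"compiled speed-up: {speedup:.2f}x real time")
print(f"eager load: {eager_load:.2f}x the budget")
print(f"leak over {frames} frames: {leaked_gb:.1f} GB")
```

Under these assumptions the leak adds roughly 25 GB over 1500 frames, which lines up with the reported climb from 1 GB to 26 GB, and eager mode at ~70 ms overshoots the implied budget, which matches "stable, not real-time".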

If you maintain the upstream port: adding @torch.no_grad() to step_f (and encode) would fix this for everyone and likely cut VRAM on ZeroGPU too.
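
The effect of the proposed decorator can be illustrated on a toy module. The functions step_plain and step_no_grad below are hypothetical stand-ins for step_f, not the port's actual code.

```python
import torch

torch.set_grad_enabled(True)      # PyTorch's default state, made explicit
layer = torch.nn.Linear(8, 8)     # parameters require grad by default


def step_plain(x: torch.Tensor) -> torch.Tensor:
    """Undecorated step: autograd records a graph for every call."""
    return layer(x)


@torch.no_grad()
def step_no_grad(x: torch.Tensor) -> torch.Tensor:
    """Decorated step: no graph is recorded, so no activations are retained."""
    return layer(x)


x = torch.randn(1, 8)
y_plain = step_plain(x)
y_safe = step_no_grad(x)

print("plain:   requires_grad =", y_plain.requires_grad, "| grad_fn =", y_plain.grad_fn)
print("no_grad: requires_grad =", y_safe.requires_grad, "| grad_fn =", y_safe.grad_fn)
```

In a streaming loop, any state carried from one frame to the next (a KV cache, for instance) that has a grad_fn keeps the previous frames' graphs reachable, which is one plausible way the per-frame growth described above arises.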

This has been proposed upstream: magenta-rt-jam · discussions/1

What app_local.py changes vs the upstream app.py

  1. No ZeroGPU. A no-op shim replaces spaces.GPU; the stream runs on your local GPU.
  2. torch.set_grad_enabled(False) in the main (warm-up) and worker threads: the fix above.
  3. torch.compile (dynamic) on the per-frame step functions, plus a full warm-up at start (it covers all KV-cache lengths up to temporal_max_past, so there are no recompiles mid-stream). The upstream fast path (AOTI) is unavailable: it's A10G-only, and AOTInductor is hard-blocked on Windows in PyTorch.
  4. One model at a time (MRT_LOAD=small|base|both, default small), because 12 GB can't hold both.
  5. In-memory state instead of JSON-on-disk for the conditioning slot (single process here).
  6. Local paths, no weight re-download.

Requirements

  • Windows 10/11 (or Linux) and an NVIDIA GPU with ≥ 12 GB of VRAM (RTX 3060 12 GB / 4070 and up)
  • Python 3.11
  • A Hugging Face token with access to google/magenta-realtime-2
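
A quick way to check the first two requirements before installing anything else is sketched below. The function check_environment is a hypothetical helper, not part of this repo, and the half-gigabyte tolerance is an assumption to allow for cards that report slightly under their nominal size.

```python
import sys


def check_environment(min_vram_gb: float = 12.0) -> dict:
    """Collect a small report on the Python version, torch, and GPU memory."""
    report = {"python_3_11": sys.version_info[:2] == (3, 11), "torch": None}
    try:
        import torch
    except ImportError:
        return report          # torch not installed yet; nothing more to check

    report["torch"] = torch.__version__
    report["cuda"] = torch.cuda.is_available()
    if report["cuda"]:
        props = torch.cuda.get_device_properties(0)
        vram_gb = props.total_memory / 1024**3
        report["gpu"] = props.name
        report["vram_gb"] = round(vram_gb, 1)
        # Assumed tolerance: a "12 GB" card usually reports a little less.
        report["vram_ok"] = vram_gb >= min_vram_gb - 0.5
    return report


report = check_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```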

Setup

# 1. Get the official PyTorch port (the model code lives here)
git clone https://huggingface.co/spaces/magenta-community/magenta-rt-jam jam
# (set GIT_LFS_SKIP_SMUDGE=1 first if you only want the code, not the LFS demo assets)

# 2. Python env
py -3.11 -m venv .venv
.venv\Scripts\python -m pip install -r requirements-local.txt
.venv\Scripts\python -m pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# 3. Download weights (~1.1 GB small, ~9.2 GB base); needs your HF token
set HF_TOKEN=hf_xxx
.venv\Scripts\python download_weights.py small      # or: base / both

# 4. Drop our driver in next to the model package and run
copy app_local.py jam\app_local.py
cd jam
..\.venv\Scripts\python app_local.py

Open http://127.0.0.1:7860, press play, and play the on-screen keys (or a MIDI keyboard via the browser's Web MIDI; the UI has a MIDI input selector).

Environment variables

| Var | Default | Meaning |
|---|---|---|
| MAGENTA_HOME | ./magenta-home | where weights live (<home>/magenta-rt-v2/checkpoints) |
| MRT_LOAD | small | small, base, or both (on 12 GB, keep small) |
| MRT_COMPILE | 1 | 1 = torch.compile (smooth); 0 = eager (slower, no warm-up wait) |
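
One way these variables could be read is sketched below, using the defaults listed above. LocalConfig and read_config are hypothetical names, and the validation and parsing rules are assumptions rather than a description of what app_local.py does.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LocalConfig:
    home: Path
    load: str
    use_compile: bool

    @property
    def checkpoints(self) -> Path:
        # Layout taken from the table: <home>/magenta-rt-v2/checkpoints
        return self.home / "magenta-rt-v2" / "checkpoints"


def read_config(env=None) -> LocalConfig:
    """Build a config from environment variables, falling back to the defaults."""
    env = os.environ if env is None else env
    load = env.get("MRT_LOAD", "small").strip().lower()
    if load not in {"small", "base", "both"}:
        raise ValueError(f"MRT_LOAD must be small, base, or both (got {load!r})")
    return LocalConfig(
        home=Path(env.get("MAGENTA_HOME", "./magenta-home")),
        load=load,
        use_compile=env.get("MRT_COMPILE", "1").strip() == "1",
    )


config = read_config({"MRT_LOAD": "base", "MRT_COMPILE": "0"})
print(config.load, config.use_compile, config.checkpoints)
```

Passing a plain dict instead of os.environ keeps the parsing easy to exercise without touching the real environment.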

Notes & gotchas

  • First start takes ~30–60 s: torch.compile warms every KV-cache shape before serving.
  • mrt2_base (2.4B) does not fit alongside small on 12 GB; load it alone (MRT_LOAD=base).
  • On torch 2.8 the port OOMs in eager mode and torch.compile needs MSVC, so stay on 2.6.
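
The warm-up behind the first note can be sketched on a toy step function: call the (optionally compiled) step once for every cache length before serving, so that any compilation happens up front. Here toy_step, warm_up, and the tensor shapes are hypothetical; the real step functions and cache layout live in the upstream port.

```python
import torch


def toy_step(frame: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-frame step: attend over a cache of past frames."""
    context = torch.cat([past, frame], dim=1)                  # (1, past_len + 1, dim)
    weights = torch.softmax(frame @ context.transpose(1, 2), dim=-1)
    return weights @ context                                   # (1, 1, dim)


def warm_up(step, max_past: int, dim: int = 16, use_compile: bool = True):
    """Run `step` once per cache length from 1 to max_past, then return it."""
    fn = torch.compile(step, dynamic=True) if use_compile else step
    with torch.no_grad():
        for past_len in range(1, max_past + 1):
            fn(torch.zeros(1, 1, dim), torch.zeros(1, past_len, dim))
    return fn


# use_compile=False mirrors MRT_COMPILE=0; pass True to include compilation.
step = warm_up(toy_step, max_past=8, use_compile=False)
out = step(torch.zeros(1, 1, 16), torch.zeros(1, 5, 16))
print(tuple(out.shape))
```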

Credits

  • Magenta RealTime 2 and the PyTorch port: © Google LLC, licensed under Apache-2.0. The model code and web UI come from the magenta-community/magenta-rt-jam Space and the google/magenta-realtime-2 weights.
  • This repo adds only app_local.py (the local driver) and this documentation. It is an independent, unofficial integration and is not affiliated with or endorsed by Google.

See NOTICE for attribution details.
