Magenta RealTime 2 — Local (Windows / consumer GPU)

Run Magenta RealTime 2 (Google's real-time music model) locally on a consumer NVIDIA GPU — Windows included — with the same web UI as the official magenta-community/magenta-rt-jam Space, but without the ZeroGPU / spaces infrastructure.

Tested on an RTX 4070 (12 GB) on Windows 11: mrt2_small streams at ~60% of the real-time budget (≈1.6× faster than real-time), smooth, with flat VRAM.

This is a thin local driver on top of the official PyTorch port. It does not modify any of Google's model code — everything lives in a single app_local.py. Original work © Google LLC, Apache-2.0. See Credits.

Why this exists

The official Space runs on ZeroGPU (datacenter A100/H200, Linux) and ships an MLX path for Apple Silicon. There was no supported way to run it locally on a Windows / consumer NVIDIA GPU. This project fills that gap.

The official Space's app.py assumes:

import spaces + @spaces.GPU (ZeroGPU grant) — the browser hangs on "Obtaining ZeroGPU…" with no cloud backend;
Linux paths (/tmp, /data), with weights re-downloaded on every boot;
preloading both models (small + base) at once — fine on an 80 GB A100, an instant OOM on 12 GB;
AOTI artifacts compiled for A10G (sm_86) — they don't load on Ada (sm_89).

app_local.py resolves all of these for local use.

The key finding (the part worth reading)

Out of the box, streaming leaks VRAM (~17 MB/frame → OOM in ~20 s) and stutters. After a long hunt, the root cause turned out to be fixable with a one-liner:

The model's decode_stream, generate, and stream methods carry @torch.no_grad, but step_f and encode do not. The local stream calls step_f/encode directly (not through generate), so autograd stays on and builds a gradient graph every frame whose activations pile up in VRAM. The official Space never noticed: on ZeroGPU the 80 GB of headroom and the ~55 s grant-reset hide it.

The fix is to disable gradients in the streaming threads:

torch.set_grad_enabled(False)   # main thread (warm-up) AND the worker thread (thread-local)
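
The call has to be repeated in the worker because PyTorch's grad mode is thread-local: disabling it in the main thread does not carry over to a thread started with Python's threading module. A minimal sketch of that behaviour, assuming only stock PyTorch (the helper name grad_mode_in_new_thread is made up for this illustration and is not part of app_local.py):

```python
import threading

import torch


def grad_mode_in_new_thread(disable_first: bool) -> bool:
    """Return the grad mode observed inside a freshly started thread."""
    seen = {}

    def worker():
        if disable_first:
            # The fix, applied inside the worker thread itself.
            torch.set_grad_enabled(False)
        seen["grad_enabled"] = torch.is_grad_enabled()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return seen["grad_enabled"]


# Disabling gradients here only affects the main thread.
torch.set_grad_enabled(False)

print("main thread:", torch.is_grad_enabled())
print("worker, no call:", grad_mode_in_new_thread(disable_first=False))
print("worker, with call:", grad_mode_in_new_thread(disable_first=True))
```

The worker without its own torch.set_grad_enabled(False) call still reports gradients as enabled, which is the situation that lets the streaming loop build a graph on every frame.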

With that single change, on the same RTX 4070:

Mode	Per-frame	VRAM over 1500 frames	Verdict
eager, autograd on (original)	70–80 ms, climbing	1 GB → 26 GB → OOM	unusable
eager, `no_grad`	~70 ms	flat ~1 GB	stable, not real-time
`torch.compile` + `no_grad`	~24 ms (~60%)	flat ~1 GB	smooth, real-time
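
The figures in the table hang together arithmetically. The sketch below only restates numbers quoted above; the 40 ms frame budget is inferred from "~24 ms" being "~60%" of it and is not stated anywhere in the source, so treat it as an assumption:

```python
# Numbers quoted in this README (all approximate).
compiled_frame_ms = 24.0   # torch.compile + no_grad, per frame
budget_fraction = 0.60     # share of the real-time budget that 24 ms represents
eager_frame_ms = 70.0      # eager + no_grad, per frame
leak_mb_per_frame = 17.0   # VRAM growth with autograd left on
frames = 1500              # length of the measured run

# Inferred, not stated in the source: the per-frame real-time budget.
budget_ms = compiled_frame_ms / budget_fraction

# How much faster than real time the compiled path runs.
speedup = 1.0 / budget_fraction

# Eager mode as a fraction of the budget (above 1.0 means it cannot keep up).
eager_fraction = eager_frame_ms / budget_ms

# VRAM the leak would need over the whole run, in decimal GB.
leaked_gb = leak_mb_per_frame * frames / 1000.0

print(f"inferred budget: {budget_ms:.0f} ms/frame")
print(f"compiled speed-up: {speedup:.2f}x real time")
print(f"eager load: {eager_fraction:.0%} of budget")
print(f"leak over {frames} frames: {leaked_gb:.1f} GB")
```

Under that assumed budget, eager mode at ~70 ms overshoots by a wide margin, which matches the "stable, not real-time" verdict, and the leak total lands close to the 26 GB peak in the first row.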

If you maintain the upstream port: adding @torch.no_grad() to step_f (and encode) would fix this for everyone and likely cut VRAM on ZeroGPU too.
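
To illustrate what that decorator changes, here is a toy module rather than the real model. ToyStreamer, step_leaky and step_fixed are hypothetical names invented for this sketch; only the @torch.no_grad() decorator itself mirrors the proposed upstream change:

```python
import torch
from torch import nn


class ToyStreamer(nn.Module):
    """Stand-in for a streaming model; not the Magenta RealTime code."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(8, 8)

    def step_leaky(self, x):
        # No decorator: with grad mode on, the output carries a graph.
        return self.proj(x)

    @torch.no_grad()
    def step_fixed(self, x):
        # Decorated: no graph is recorded, whatever the caller's grad mode is.
        return self.proj(x)


torch.set_grad_enabled(True)  # simulate a thread where autograd is still on
model = ToyStreamer()
frame = torch.randn(1, 8)

leaky = model.step_leaky(frame)
fixed = model.step_fixed(frame)

print("leaky output tracks a graph:", leaky.grad_fn is not None)
print("fixed output tracks a graph:", fixed.grad_fn is not None)
```

Any output of the undecorated step that is kept across frames (cached state, for example) keeps its graph and the saved activations alive, which is one plausible way the per-frame growth described above can arise.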

This has been proposed upstream: magenta-rt-jam · discussions/1

What `app_local.py` changes vs the upstream `app.py`

No ZeroGPU. A no-op shim replaces spaces.GPU; the stream runs on your local GPU.
torch.set_grad_enabled(False) in the main (warm-up) and worker threads — the fix above.
torch.compile (dynamic) on the per-frame step functions, plus a full warm-up at start that covers all KV-cache lengths up to temporal_max_past, so nothing recompiles mid-stream. The upstream fast path (AOTI) is unavailable: its artifacts are compiled for A10G only, and AOTInductor is hard-blocked on Windows in PyTorch.
One model at a time (MRT_LOAD=small|base|both, default small) — 12 GB can't hold both.
In-memory state instead of JSON-on-disk for the conditioning slot (single process here).
Local paths, no weight re-download.
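
As an illustration of the first item, a no-op replacement for spaces.GPU can be a small stand-in module registered before anything imports spaces. This is a sketch of the general technique, not a copy of what app_local.py does; the _gpu helper and the duration=60 argument are illustrative choices:

```python
import sys
import types


def _gpu(func=None, **_options):
    """Hypothetical no-op stand-in for spaces.GPU."""
    if callable(func):
        return func            # bare form: @spaces.GPU
    return lambda f: f         # called form: @spaces.GPU(duration=60)


# Register the stand-in so that `import spaces` resolves to it.
shim = types.ModuleType("spaces")
shim.GPU = _gpu
sys.modules["spaces"] = shim

import spaces  # noqa: E402  (picks up the shim registered above)


@spaces.GPU
def step(x):
    return x + 1


@spaces.GPU(duration=60)
def step_with_options(x):
    return x * 2


print(step(1), step_with_options(2))
```

Both decorator forms return the original function unchanged, so code written for ZeroGPU runs directly on the local device.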

Requirements

Windows 10/11 (or Linux) and an NVIDIA GPU with ≥ 12 GB of VRAM (RTX 3060 12 GB / 4070 and up)
Python 3.11
A Hugging Face token with access to google/magenta-realtime-2

Setup

# 1. Get the official PyTorch port (the model code lives here)
git clone https://huggingface.co/spaces/magenta-community/magenta-rt-jam jam
# (set GIT_LFS_SKIP_SMUDGE=1 first if you only want the code, not the LFS demo assets)

# 2. Python env
py -3.11 -m venv .venv
.venv\Scripts\python -m pip install -r requirements-local.txt
.venv\Scripts\python -m pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# 3. Download weights (~1.1 GB small, ~9.2 GB base) — needs your HF token
set HF_TOKEN=hf_xxx
.venv\Scripts\python download_weights.py small      # or: base / both

# 4. Drop our driver in next to the model package and run
copy app_local.py jam\app_local.py
cd jam
..\.venv\Scripts\python app_local.py

Open http://127.0.0.1:7860, press play, and play the on-screen keys (or a MIDI keyboard via the browser's Web MIDI — the UI has a MIDI input selector).

Environment variables

Var	Default	Meaning
`MAGENTA_HOME`	`./magenta-home`	where weights live (`<home>/magenta-rt-v2/checkpoints`)
`MRT_LOAD`	`small`	`small` \| `base` \| `both` (12 GB → keep `small`)
`MRT_COMPILE`	`1`	`1` = torch.compile (smooth); `0` = eager (slower, no warm-up wait)

Notes & gotchas

First start takes ~30–60 s: torch.compile warms every KV-cache shape before serving.
mrt2_base (2.4B) does not fit alongside small on 12 GB; load it alone (MRT_LOAD=base).
On torch 2.8 the port OOMs in eager and torch.compile needs MSVC — stay on 2.6.

Credits

Magenta RealTime 2 and the PyTorch port: © Google LLC, licensed under Apache-2.0. The model code and web UI come from the magenta-community/magenta-rt-jam Space and the google/magenta-realtime-2 weights.
This repo adds only app_local.py (the local driver) and this documentation. It is an independent, unofficial integration and is not affiliated with or endorsed by Google.

See NOTICE for attribution details.
