Magenta RealTime 2: Local (Windows / consumer GPU)
Run Magenta RealTime 2 (Google's real-time music model) locally on a consumer
NVIDIA GPU (Windows included), with the same web UI as the official
magenta-community/magenta-rt-jam
Space, but without the ZeroGPU / spaces infrastructure.
Tested on an RTX 4070 (12 GB) on Windows 11: mrt2_small streams at ~60% of the real-time budget (≈1.6× faster than real-time), smooth, with flat VRAM.
This is a thin local driver on top of the official PyTorch port. It does not modify any of Google's model code; everything lives in a single
app_local.py. Original work © Google LLC, Apache-2.0. See Credits.
Why this exists
The official Space runs on ZeroGPU (datacenter A100/H200, Linux) and ships an MLX path for Apple Silicon. There was no supported way to run it locally on a Windows / consumer NVIDIA GPU. This project fills that gap.
The official Space's app.py assumes:
- `import spaces` + `@spaces.GPU` (ZeroGPU grant): the browser hangs on "Obtaining ZeroGPU…" with no cloud backend;
- Linux paths (`/tmp`, `/data`) and re-downloads weights on every boot;
- preloading both models (small + base) at once: fine on an 80 GB A100, an instant OOM on 12 GB;
- AOTI artifacts compiled for A10G (sm_86): they don't load on Ada (sm_89).
app_local.py resolves all of these for local use.
The key finding (the part worth reading)
Out of the box, streaming leaks VRAM (~17 MB/frame → OOM in ~20 s) and stutters. After a long hunt the root cause turned out to be a one-liner:
The model's `decode_stream`, `generate`, and `stream` methods carry `@torch.no_grad`, but `step_f` and `encode` do not. The local stream calls `step_f` / `encode` directly (not through `generate`), so autograd stays on and builds a gradient graph every frame whose activations pile up in VRAM. The official Space never noticed: on ZeroGPU the 80 GB of headroom and the ~55 s grant-reset hide it.
The fix is to disable gradients in the streaming threads:
torch.set_grad_enabled(False) # main thread (warm-up) AND the worker thread (thread-local)
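
The "thread-local" part is easy to miss, so here is a small CPU-only sketch of it. It is not taken from app_local.py; `worker` and `results` are illustrative names. It shows that disabling gradients in the main thread does not carry over to a thread started afterwards, which is why the call has to be repeated inside the worker.

```python
import threading

import torch

results = {}


def worker(name, disable_grad):
    # Grad mode is thread-local: a new Python thread starts with autograd
    # enabled, whatever the main thread has set.
    if disable_grad:
        torch.set_grad_enabled(False)
    w = torch.ones(3, requires_grad=True)
    y = (w * 2).sum()
    # requires_grad on the result means a graph was recorded for this op.
    results[name] = y.requires_grad


torch.set_grad_enabled(False)  # affects the main thread only

for name, flag in [("unpatched", False), ("patched", True)]:
    t = threading.Thread(target=worker, args=(name, flag))
    t.start()
    t.join()

# Same computation in the main thread, where gradients are already off.
results["main"] = (torch.ones(3, requires_grad=True) * 2).sum().requires_grad
print(results)
```

The unpatched worker still records a graph even though the main thread turned gradients off, while the patched worker and the main thread do not.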
With that single change, on the same RTX 4070:
| Mode | Per-frame | VRAM over 1500 frames | Verdict |
|---|---|---|---|
| eager, autograd on (original) | 70–80 ms, climbing | 1 GB → 26 GB → OOM | unusable |
| eager, `no_grad` | ~70 ms | flat ~1 GB | stable, not real-time |
| `torch.compile` + `no_grad` | | flat ~1 GB | smooth, real-time |
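
As a rough sanity check, the first row is consistent with the leak rate quoted earlier. The sketch below only redoes that arithmetic with the approximate figures from the text (17 MB/frame, ~1 GB baseline, 1500 frames); `vram_after` is a hypothetical helper, and decimal units (1 GB = 1000 MB) are assumed.

```python
LEAK_MB_PER_FRAME = 17   # approximate leak rate quoted in the text
BASELINE_GB = 1.0        # approximate steady-state VRAM from the table
FRAMES = 1500            # measurement window used in the table


def vram_after(frames, leak_mb=LEAK_MB_PER_FRAME, baseline_gb=BASELINE_GB):
    """Projected VRAM in GB if every frame retains `leak_mb` of activations."""
    return baseline_gb + frames * leak_mb / 1000.0


projected = vram_after(FRAMES)
print(f"projected VRAM after {FRAMES} frames: {projected:.1f} GB")
```

The projection lands at roughly 26 GB, in line with the "1 GB → 26 GB" growth in the table.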
If you maintain the upstream port: adding `@torch.no_grad()` to `step_f` (and `encode`) would fix this for everyone and likely cut VRAM on ZeroGPU too. This has been proposed upstream: magenta-rt-jam · discussions/1
What app_local.py changes vs the upstream app.py
- No ZeroGPU. A no-op shim replaces `spaces.GPU`; the stream runs on your local GPU.
- `torch.set_grad_enabled(False)` in the main (warm-up) and worker threads: the fix above.
- `torch.compile` (dynamic) on the per-frame step fns + a full warm-up at start (covers all KV-cache lengths up to `temporal_max_past`, so no recompiles mid-stream). The upstream fast path (AOTI) is unavailable: it's A10G-only, and `AOTInductor` is hard-blocked on Windows in PyTorch.
- One model at a time (`MRT_LOAD=small|base|both`, default `small`): 12 GB can't hold both.
- In-memory state instead of JSON-on-disk for the conditioning slot (single process here).
- Local paths, no weight re-download.
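
To illustrate the in-memory conditioning slot, here is a minimal sketch of a lock-protected holder shared between the UI handlers and the stream worker. It is an assumption about the general shape, not the code in app_local.py: `ConditioningSlot` and the `prompt` / `temperature` keys are hypothetical.

```python
import threading
from copy import deepcopy


class ConditioningSlot:
    """Latest-value holder shared between UI handlers and the stream worker."""

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._value = deepcopy(initial) if initial is not None else {}
        self._version = 0

    def update(self, **changes):
        """Merge new conditioning values and bump the version counter."""
        with self._lock:
            self._value.update(changes)
            self._version += 1
            return self._version

    def snapshot(self):
        """Return (version, copy) so the reader never sees a partial update."""
        with self._lock:
            return self._version, deepcopy(self._value)


# Hypothetical keys, for illustration only.
slot = ConditioningSlot({"prompt": "ambient pad"})
slot.update(prompt="funk bass", temperature=1.1)
version, state = slot.snapshot()
```

With a single process there is no need to serialize to disk between the request handler and the GPU worker; the version counter lets the worker skip re-encoding when nothing has changed.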
Requirements
- Windows 10/11 (or Linux), an NVIDIA GPU with ≥ 12 GB (RTX 3060 12 GB / 4070 and up)
- Python 3.11
- A Hugging Face token with access to
google/magenta-realtime-2
Setup
# 1. Get the official PyTorch port (the model code lives here)
git clone https://huggingface.co/spaces/magenta-community/magenta-rt-jam jam
# (set GIT_LFS_SKIP_SMUDGE=1 before cloning if you only want the code, not the LFS demo assets)
# 2. Python env
py -3.11 -m venv .venv
.venv\Scripts\python -m pip install -r requirements-local.txt
.venv\Scripts\python -m pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# 3. Download weights (~1.1 GB small, ~9.2 GB base); needs your HF token
set HF_TOKEN=hf_xxx
.venv\Scripts\python download_weights.py small # or: base / both
# 4. Drop our driver in next to the model package and run
copy app_local.py jam\app_local.py
cd jam
..\.venv\Scripts\python app_local.py
Open http://127.0.0.1:7860, press play, and play the on-screen keys (or a MIDI keyboard via the browser's Web MIDI; the UI has a MIDI input selector).
Environment variables
| Var | Default | Meaning |
|---|---|---|
| `MAGENTA_HOME` | `./magenta-home` | where weights live (`<home>/magenta-rt-v2/checkpoints`) |
| `MRT_LOAD` | `small` | `small` / `base` / `both` (12 GB → keep `small`) |
| `MRT_COMPILE` | `1` | `1` = `torch.compile` (smooth); `0` = eager (slower, no warm-up wait) |
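
The table can be read as the following resolution logic. This is a hedged sketch rather than the parser in app_local.py: `read_config` is a hypothetical helper, and treating only the literal `0` as "compile off" is an assumption.

```python
import os
from pathlib import Path

VALID_LOAD = ("small", "base", "both")


def read_config(env=None):
    """Resolve the three variables from the table with their documented defaults."""
    env = os.environ if env is None else env
    home = Path(env.get("MAGENTA_HOME", "./magenta-home"))
    load = env.get("MRT_LOAD", "small").strip().lower()
    if load not in VALID_LOAD:
        raise ValueError(f"MRT_LOAD must be one of {VALID_LOAD}, got {load!r}")
    return {
        "checkpoints": home / "magenta-rt-v2" / "checkpoints",
        "load": load,
        # Assumption: only the literal "0" turns compilation off.
        "compile": env.get("MRT_COMPILE", "1") != "0",
    }


defaults = read_config({})
custom = read_config({"MRT_LOAD": "base", "MRT_COMPILE": "0"})
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check without touching the real environment.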
Notes & gotchas
- First start takes ~30–60 s: `torch.compile` warms every KV-cache shape before serving.
- `mrt2_base` (2.4B) does not fit alongside `small` on 12 GB; load it alone (`MRT_LOAD=base`).
- On torch 2.8 the port OOMs in eager and `torch.compile` needs MSVC; stay on 2.6.
Credits
- Magenta RealTime 2 and the PyTorch port: © Google LLC, licensed under Apache-2.0. The model code and web UI come from the `magenta-community/magenta-rt-jam` Space and the `google/magenta-realtime-2` weights.
- This repo adds only `app_local.py` (the local driver) and this documentation. It is an independent, unofficial integration and is not affiliated with or endorsed by Google.
See NOTICE for attribution details.