kehang001 committed on Jun 4

Commit

f253902

0 Parent(s):

Duplicate from google/magenta-realtime-2

Co-authored-by: Kehang Han <kehang001@users.noreply.huggingface.co>

Files changed (18)

.gitattributes +41 -0
README.md +215 -0
checkpoints/mrt2_base.safetensors +3 -0
checkpoints/mrt2_small.safetensors +3 -0
models/mrt2_base/mrt2_base.mlxfn +3 -0
models/mrt2_base/mrt2_base_state.safetensors +3 -0
models/mrt2_small/mrt2_small.mlxfn +3 -0
models/mrt2_small/mrt2_small_state.safetensors +3 -0
resources/musiccoca/audio_preprocessor.tflite +3 -0
resources/musiccoca/mapper.tflite +3 -0
resources/musiccoca/music_encoder.tflite +3 -0
resources/musiccoca/pretrained_vector_quantizer.tflite +3 -0
resources/musiccoca/spm.model +3 -0
resources/musiccoca/text_encoder.tflite +3 -0
resources/spectrostream/decoder.safetensors +3 -0
resources/spectrostream/encoder.safetensors +3 -0
resources/spectrostream/quantizer.safetensors +3 -0
resources/spectrostream/spectrostream_encoder.mlxfn +3 -0

.gitattributes ADDED

	@@ -0,0 +1,41 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+resources/soundstream_encoder.mlxfn filter=lfs diff=lfs merge=lfs -text
+resources/spectrostream/soundstream_encoder.mlxfn filter=lfs diff=lfs merge=lfs -text
+resources/spectrostream/spectrostream_encoder.mlxfn filter=lfs diff=lfs merge=lfs -text
+models/v1v5_cfgcond_soup_x3424_14_int8_rvq12_cfgs0/v1v5_cfgcond_soup_x3424_14_int8_rvq12_cfgs0.mlxfn filter=lfs diff=lfs merge=lfs -text
+models/mrt2_base/mrt2_base.mlxfn filter=lfs diff=lfs merge=lfs -text
+models/mrt2_small/mrt2_small.mlxfn filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,215 @@

+---
+license: cc-by-4.0
+library_name: magenta-realtime-2
+pipeline_tag: text-to-audio
+---
+# Model Card for Magenta RealTime 2
+**Authors**: Google DeepMind
+**Resources**:
+-   [Get Started](https://magenta.withgoogle.com/mrt2)
+-   [Blog Post](https://magenta.withgoogle.com/magenta-realtime-2)
+-   [Repository](https://github.com/magenta/magenta-realtime)
+-   [HuggingFace](https://huggingface.co/google/magenta-realtime-2)
+## Terms of Use
+Magenta RealTime 2 is offered under a combination of licenses: the codebase is
+licensed under
+[Apache 2.0](https://github.com/magenta/magenta-realtime/blob/main/LICENSE), and
+the model weights under
+[Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode).
+In addition, we specify the following usage terms:
+Copyright 2026 Google LLC
+Use these materials responsibly and do not generate content, including outputs,
+that infringe or violate the rights of others, including rights in copyrighted
+content.
+Google claims no rights in outputs you generate using Magenta RealTime 2. You
+and your users are solely responsible for outputs and their subsequent uses.
+Unless required by applicable law or agreed to in writing, all software and
+materials distributed here under the Apache 2.0 or CC-BY licenses are
+distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
+either express or implied. See the licenses for the specific language governing
+permissions and limitations under those licenses. You are solely responsible for
+determining the appropriateness of using, reproducing, modifying, performing,
+displaying or distributing the software and materials, and any outputs, and
+assume any and all risks associated with your use or distribution of any of the
+software and materials, and any outputs, and your exercise of rights and
+permissions under the licenses.
+## Model Details
+Magenta RealTime 2 is an open music generation model from Google built for
+on-device streaming generation with low-latency control. It is a
+[live music model](https://arxiv.org/abs/2508.04651) and a follow-up to the
+prior [Magenta RealTime model](https://huggingface.co/google/magenta-realtime)
+and [Lyria RealTime API](http://goo.gle/lyria-realtime), offering on-device
+generation with richer control and lower latency. Magenta RealTime 2 enables the
+continuous generation of musical audio steered by text prompts, audio examples,
+and MIDI.
+### System Components
+Magenta RealTime 2 is composed of three components: SpectroStream, MusicCoCa,
+and an LLM. The structure is similar to that of the original Magenta RealTime,
+detailed [here](https://arxiv.org/abs/2508.04651). The primary difference is
+the LLM, which is now a decoder-only model supporting frame-wise autoregression
+(rather than chunk-wise) and tuned for on-device streaming with frame-level
+control.
+1.  **SpectroStream** ([Li+ 25](https://arxiv.org/abs/2508.05207)) is a
+    discrete audio codec that converts stereo 48 kHz audio into tokens.
+1.  **MusicCoCa** is a contrastive-trained model capable of embedding audio and
+    text into a common embedding space, building on
+    [Yu+ 22](https://arxiv.org/abs/2205.01917) and
+    [Huang+ 22](https://arxiv.org/abs/2208.12415).
+1.  A **decoder-only Transformer LLM** generates audio tokens given context
+    audio tokens, a tokenized MusicCoCa embedding, and MIDI tokens. There are
+    two configurations:
+      1. A `base` configuration with 2.4B parameters
+      1. A `small` configuration with 230M parameters
+### Inputs and outputs
+-   **SpectroStream RVQ codec**: Tokenizes high-fidelity music audio
+    -   **Encoder input / Decoder output**: Music audio waveforms, 48 kHz stereo
+    -   **Encoder output / Decoder input**: Discrete audio tokens, 25 Hz frame
+        rate, RVQ depth 64, 10-bit codes, 16 kbps
+-   **MusicCoCa**: Joint embeddings of text and music audio
+    -   **Input**: Music audio waveforms, 16 kHz mono, or a text description of
+        music style, e.g. "heavy metal"
+    -   **Output**: 768-dimensional embedding, quantized to RVQ depth 12, 10-bit
+        codes
+-   **Decoder Transformer LLM**: Generates audio tokens given context, MIDI,
+    and style. At each timestep (codec frame), the model receives:
+    -   **Input**:
+        - (Context) SpectroStream tokens
+          - `base`: 25-frame (1 s) windowed attention per layer, 20 layers
+          - `small`: 41-frame (~1.6 s) windowed attention per layer, 12 layers
+          - Yields a ~20 s effective receptive field for both models
+        - (Style) 12 MusicCoCa tokens
+        - (MIDI) 128-dim multihot vector representing the state of each MIDI
+          pitch during this frame (0 = Off, 1 = Sustain, 2 = Onset, 3 = Sustain
+          or onset; the model decides)
+    -   **Output**: 1 generated frame, 12 RVQ tokens
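The figures in the list above can be cross-checked with a little arithmetic. The sketch below is not taken from the Magenta RealTime codebase; it only reuses the numbers stated in the model card. The bitrate formula (frame rate × RVQ depth × bits per code) reproduces the stated 16 kbps. The receptive-field formula is an assumption: it treats each layer's attention window as causal and as including the current frame, so every layer adds `window - 1` frames of history. Under that assumption both configurations reach the same 481 frames, about 19.2 s, which is consistent with the "20 s" figure in the card.

```python
# Numbers are copied from the model card; both formulas are illustrative
# assumptions, not code from the Magenta RealTime repository.
FRAME_RATE_HZ = 25   # SpectroStream frames per second
BITS_PER_CODE = 10   # each RVQ code is a 10-bit index


def bitrate_kbps(rvq_depth, frame_rate_hz=FRAME_RATE_HZ, bits_per_code=BITS_PER_CODE):
    """Token bitrate in kbps (1 kbps = 1000 bits per second)."""
    return frame_rate_hz * rvq_depth * bits_per_code / 1000


def receptive_field_frames(window, layers):
    """Frames visible through `layers` stacked attention windows.

    Assumes each window is causal and includes the current frame, so each
    layer extends the history by (window - 1) frames.
    """
    return layers * (window - 1) + 1


codec_kbps = bitrate_kbps(rvq_depth=64)       # full codec stream
generated_kbps = bitrate_kbps(rvq_depth=12)   # the 12 RVQ tokens the LLM emits per frame

for name, window, layers in [("base", 25, 20), ("small", 41, 12)]:
    frames = receptive_field_frames(window, layers)
    print(f"{name}: {frames} frames, {frames / FRAME_RATE_HZ:.2f} s")

print(f"codec: {codec_kbps} kbps, generated: {generated_kbps} kbps")
```

Note that the LLM generates only the first 12 of the codec's 64 RVQ levels per frame, so the generated token stream is 3 kbps under the same formula.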
+## Uses
+Music generation models, in particular ones targeted for continuous real-time
+generation and control, have a wide range of applications across various
+industries and domains. The following list of potential uses is not
+comprehensive. The purpose of this list is to provide contextual information
+about the possible use-cases that the model creators considered as part of model
+training and development.
+-   **Interactive Music Creation**
+    -   Live Performance / Improvisation: These models can be used to generate
+        music in a live performance setting, controlled by performers
+        manipulating style embeddings or the audio context.
+    -   Accessible Music-Making & Music Therapy: People with impediments to
+        using traditional instruments (skill gaps, disabilities, etc.) can
+        participate in communal jam sessions or solo music creation.
+    -   Video Games: Developers can create a custom soundtrack for users in
+        real-time based on their actions and environment.
+-   **Research**
+    -   Transfer learning: Researchers can leverage representations from
+        MusicCoCa and Magenta RT 2 to recognize musical information.
+-   **Personalization**
+    -   Musicians can fine-tune models with their own catalog to customize the
+        model to their style (fine-tuning support coming soon).
+-   **Education**
+    -   Exploring Genres, Instruments, and History: Natural language prompting
+        enables users to quickly learn about and experiment with musical
+        concepts.
+### Out-of-Scope Use
+See our [Terms of Use](#terms-of-use) above for usage we consider out of scope.
+## Bias, Risks, and Limitations
+Magenta RT 2 supports the real-time generation and steering of instrumental
+music. The purpose and intention of this capability is to foster the
+development of new real-time, interactive co-creation workflows that seamlessly
+integrate with human-centered forms of musical creativity.
+Every AI music generation model, including Magenta RT 2, carries a risk of
+impacting the economic and cultural landscape of music. We aim to mitigate these
+risks through the following avenues:
+-   Prioritizing human-AI interaction as fundamental in the design of Magenta
+    RT 2.
+-   Distributing the model under terms of service that prohibit developers
+    from generating outputs that infringe or violate the rights of others,
+    including rights in copyrighted content.
+-   Training on primarily instrumental data. With specific prompting, this model
+    has been observed to generate some vocal sounds and effects, though those
+    vocal sounds and effects tend to be non-lexical.
+### Known limitations
+Magenta RealTime 2 has similar limitations to Magenta RealTime in terms of
+genre coverage and non-lexical vocalizations;
+[refer here for details](https://huggingface.co/google/magenta-realtime#known-limitations).
+### Benefits
+At the time of release, Magenta RealTime 2 represents the only open-weights
+model supporting real-time, continuous musical audio generation with
+low-latency control (~200ms). It is designed specifically to enable live,
+interactive musical creation, bringing new capabilities to musical
+performances, art installations, video games, and many other applications.
+## How to Get Started with the Model
+See our [Get Started Page](https://magenta.withgoogle.com/magenta-realtime-2)
+and [GitHub repository](https://github.com/magenta/magenta-realtime) for usage
+examples.
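For orientation only, here is a minimal sketch of fetching the files for one model size from the Hub. It is not the official usage, which lives in the linked pages. It assumes the third-party `huggingface_hub` package is installed, and that the repository id is `google/magenta-realtime-2` as linked in the Resources list. The helper names `patterns_for` and `download` are hypothetical. The file layout comes from the file list of this commit. Which of these files a given runtime actually needs is not stated in the README, so the sketch simply fetches everything for the chosen size plus the shared `resources/` directory.

```python
def patterns_for(size):
    """Glob patterns covering the files for one model size (hypothetical helper).

    Paths mirror the file list of this commit; `size` is "base" or "small".
    """
    if size not in {"base", "small"}:
        raise ValueError(f"unknown model size: {size!r}")
    return [
        f"checkpoints/mrt2_{size}.safetensors",  # checkpoint weights
        f"models/mrt2_{size}/*",                 # .mlxfn export and its state file
        "resources/*",                           # MusicCoCa and SpectroStream assets
    ]


def download(size="small", repo_id="google/magenta-realtime-2"):
    """Download the selected files and return the local snapshot directory."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, allow_patterns=patterns_for(size))


if __name__ == "__main__":
    # The `base` checkpoint alone is about 9.8 GB, so `small` is the default here.
    print(download("small"))
```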
+## Training Details
+### Training Data
+Magenta RealTime 2 was trained on ~71k hours of stock music from multiple
+sources, mostly instrumental.
+### Hardware
+Magenta RealTime 2 was trained using
+[Tensor Processing Unit (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu)
+hardware.
+### Software
+Training was done using [JAX](https://github.com/jax-ml/jax) and
+[Sequence Layers](https://github.com/google/sequence-layers). JAX allows
+researchers to take advantage of the latest generation of hardware, including
+TPUs, for faster and more efficient training of large models.
+## Evaluation
+Model evaluation metrics and results will be shared in our forthcoming technical
+report.
+## Citation
+A paper about Magenta RealTime 2 is forthcoming. For now, please cite our
+previous technical report:
+**BibTeX:**
+```
+@inproceedings{gdmlyria2025live,
+    title={Live Music Models},
+    author={Caillon, Antoine and McWilliams, Brian and Tarakajian, Cassie and Simon, Ian and Manco, Ilaria and Engel, Jesse and Constant, Noah and Li, Pen and Denk, Timo I. and Lalama, Alberto and Agostinelli, Andrea and Huang, Anna and Manilow, Ethan and Brower, George and Erdogan, Hakan and Lei, Heidi and Rolnick, Itai and Grishchenko, Ivan and Orsini, Manu and Kastelic, Matej and Zuluaga, Mauricio and Verzetti, Mauro and Dooley, Michael and Skopek, Ondrej and Ferrer, Rafael and Borsos, Zal{\'a}n and van den Oord, {\"A}aron and Eck, Douglas and Collins, Eli and Baldridge, Jason and Hume, Tom and Donahue, Chris and Han, Kehang and Roberts, Adam},
+    booktitle={NeurIPS Creative AI},
+    year={2025}
+}
+```

checkpoints/mrt2_base.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:60f3e813d9da4a41a166c734a3074e6d54254c2fc14b0817bad6b8d25cddc044
+size 9836760520

checkpoints/mrt2_small.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5dd1cbc7c606c512c21de0bcb04d4818bf0a3b873d7cbb9d1556d67d3b034de3
+size 1128840272

models/mrt2_base/mrt2_base.mlxfn ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ee2f19f2782182095fcd05c0fc1978f7f3e020b1cc0993e9d8e643e2f7de0bfb
+size 2771414746

models/mrt2_base/mrt2_base_state.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:88b302502aa5b467b74b0591adefd7769cb620211bf18606b7656a9ea57eef5f
+size 16939969

models/mrt2_small/mrt2_small.mlxfn ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1a70b0de30b3e6ad054fe6a61a7765408f01127628e6362c1abc328809a3c422
+size 455654550

models/mrt2_small/mrt2_small_state.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:23f1e05a6beea306fe39970bd61193f2d3e5fbd8f08af93570bda4ca9ec33255
+size 8676998

resources/musiccoca/audio_preprocessor.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:656ca4c358451c2b85932e66efcfd2ba62492f4435953d775bfc1d3c08329a30
+size 8729640

resources/musiccoca/mapper.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2f9743cc8f121a588b69c7f4d79a2a4111ce81864cbde8830054cd5e97f3d717
+size 86166664

resources/musiccoca/music_encoder.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0d4501af799834e383d904c34ee826c61eb53c69682bf15a981a46c1bb32793a
+size 370935584

resources/musiccoca/pretrained_vector_quantizer.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7a8a19e2119ad405818eae84a331a970f1a582b3389d4bfd27814f75b455a444
+size 72422108

resources/musiccoca/spm.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ff325a99b61ba5726cf6437cde6eefbb633dbaa363a684f7a97ed99b55202cca
+size 517448

resources/musiccoca/text_encoder.tflite ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e1222e3418cbe8cc2623939571bae8e9ab6f0d511404b0d83da69f4e6e11b272
+size 418674324

resources/spectrostream/decoder.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0ac6f100a24945fb434783fde6acd7902ceaa8bca492ca317edfc75dd51c42dd
+size 209853216

resources/spectrostream/encoder.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f20c197ddcbb9cd43e1a97f9bee0d07d211f79966cd3862d9830a32885090f72
+size 37013392

resources/spectrostream/quantizer.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0ba89dcb85344bb14f4f34b8f597c0d6adaa560002d4fd88879a2944c98a20f0
+size 67108984

resources/spectrostream/spectrostream_encoder.mlxfn ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:887c25b21aa1714d19907fc96963c6440d5911f11571054cd5acf7306c260905
+size 104319983
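
The three-line entries above are Git LFS pointer files: each records the pointer spec version, the SHA-256 of the real file, and its size in bytes. As a minimal sketch (the function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and only the Python standard library is used), a downloaded file can be checked against its pointer like this. The example pointer is the one shown for `checkpoints/mrt2_base.safetensors`, with the diff's leading `+` removed.

```python
import hashlib

# Pointer text for checkpoints/mrt2_base.safetensors, copied from this commit.
EXAMPLE_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:60f3e813d9da4a41a166c734a3074e6d54254c2fc14b0817bad6b8d25cddc044
size 9836760520
"""


def parse_lfs_pointer(text):
    """Parse the `key value` lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,                  # e.g. "sha256"
        "digest": digest,              # hex digest of the real file
        "size": int(fields["size"]),   # size of the real file in bytes
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    total = 0
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte checkpoints are never held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            total += len(chunk)
    return total == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(EXAMPLE_POINTER)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("checkpoints/mrt2_base.safetensors", pointer)` on a local copy would then confirm that the download is complete and uncorrupted.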