Text-to-Audio
Magenta RT 2
LiteRT
PhysShell kehang001 committed on
Commit f253902 · 0 Parent(s):

Duplicate from google/magenta-realtime-2

Co-authored-by: Kehang Han <kehang001@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,41 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ resources/soundstream_encoder.mlxfn filter=lfs diff=lfs merge=lfs -text
+ resources/spectrostream/soundstream_encoder.mlxfn filter=lfs diff=lfs merge=lfs -text
+ resources/spectrostream/spectrostream_encoder.mlxfn filter=lfs diff=lfs merge=lfs -text
+ models/v1v5_cfgcond_soup_x3424_14_int8_rvq12_cfgs0/v1v5_cfgcond_soup_x3424_14_int8_rvq12_cfgs0.mlxfn filter=lfs diff=lfs merge=lfs -text
+ models/mrt2_base/mrt2_base.mlxfn filter=lfs diff=lfs merge=lfs -text
+ models/mrt2_small/mrt2_small.mlxfn filter=lfs diff=lfs merge=lfs -text
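
The attribute rules above route matching paths through Git LFS. As a rough illustration of which files in this commit they cover, the sketch below parses a few of the rules and tests paths against them. It is a simplified approximation and not Git's actual matcher: the helper names `parse_lfs_patterns` and `is_lfs_tracked` are hypothetical, patterns without a slash are compared against the file name only, and `**` is not given its special gitattributes meaning.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def parse_lfs_patterns(text):
    """Return the patterns whose attribute list contains filter=lfs."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


def is_lfs_tracked(path, patterns):
    """Approximate check: slash-free patterns match the basename only."""
    name = PurePosixPath(path).name
    for pattern in patterns:
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            return True
    return False


# A few rules copied from the .gitattributes diff above.
SAMPLE = """\
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
models/mrt2_base/mrt2_base.mlxfn filter=lfs diff=lfs merge=lfs -text
"""

patterns = parse_lfs_patterns(SAMPLE)
for path in [
    "checkpoints/mrt2_base.safetensors",
    "resources/musiccoca/spm.model",
    "models/mrt2_base/mrt2_base.mlxfn",
    "README.md",
]:
    print(path, is_lfs_tracked(path, patterns))
```

For an authoritative answer inside a real clone, `git check-attr filter -- <path>` reports the attribute Git itself resolves.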
README.md ADDED
@@ -0,0 +1,215 @@
+ ---
+ license: cc-by-4.0
+ library_name: magenta-realtime-2
+ pipeline_tag: text-to-audio
+ ---
+
+ # Model Card for Magenta RealTime 2
+
+ **Authors**: Google DeepMind
+
+ **Resources**:
+
+ - [Get Started](https://magenta.withgoogle.com/mrt2)
+ - [Blog Post](https://magenta.withgoogle.com/magenta-realtime-2)
+ - [Repository](https://github.com/magenta/magenta-realtime)
+ - [HuggingFace](https://huggingface.co/google/magenta-realtime-2)
+
+ ## Terms of Use
+
+ Magenta RealTime 2 is offered under a combination of licenses: the codebase is
+ licensed under
+ [Apache 2.0](https://github.com/magenta/magenta-realtime/blob/main/LICENSE), and
+ the model weights under
+ [Creative Commons Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/legalcode).
+ In addition, we specify the following usage terms:
+
+ Copyright 2026 Google LLC
+
+ Use these materials responsibly and do not generate content, including outputs,
+ that infringe or violate the rights of others, including rights in copyrighted
+ content.
+
+ Google claims no rights in outputs you generate using Magenta RealTime 2. You
+ and your users are solely responsible for outputs and their subsequent uses.
+
+ Unless required by applicable law or agreed to in writing, all software and
+ materials distributed here under the Apache 2.0 or CC-BY licenses are
+ distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
+ either express or implied. See the licenses for the specific language governing
+ permissions and limitations under those licenses. You are solely responsible for
+ determining the appropriateness of using, reproducing, modifying, performing,
+ displaying or distributing the software and materials, and any outputs, and
+ assume any and all risks associated with your use or distribution of any of the
+ software and materials, and any outputs, and your exercise of rights and
+ permissions under the licenses.
+
+ ## Model Details
+
+ Magenta RealTime 2 is an open music generation model from Google built for
+ on-device streaming generation with low-latency control. It is a
+ [live music model](https://arxiv.org/abs/2508.04651) and a follow-up to the
+ prior [Magenta RealTime model](https://huggingface.co/google/magenta-realtime)
+ and [Lyria RealTime API](http://goo.gle/lyria-realtime), offering on-device
+ generation with richer control and lower latency. Magenta RealTime 2 enables the
+ continuous generation of musical audio steered by text prompts, audio examples,
+ and MIDI.
+
+ ### System Components
+
+ Magenta RealTime 2 is composed of three components: SpectroStream, MusicCoCa,
+ and an LLM. The structure is similar to that of the original Magenta RealTime,
+ detailed [here](https://arxiv.org/abs/2508.04651). The primary difference is
+ the LLM, which is now a decoder-only model supporting frame-wise autoregression
+ (rather than chunk-wise) and tuned for on-device streaming with frame-level
+ control.
+
+ 1. **SpectroStream** ([Li+ 25](https://arxiv.org/abs/2508.05207)) is a
+    discrete audio codec that converts stereo 48 kHz audio into tokens.
+ 1. **MusicCoCa** is a contrastively trained model capable of embedding audio and
+    text into a common embedding space, building on
+    [Yu+ 22](https://arxiv.org/abs/2205.01917) and
+    [Huang+ 22](https://arxiv.org/abs/2208.12415).
+ 1. A **decoder-only Transformer LLM** generates audio tokens given context
+    audio tokens, a tokenized MusicCoCa embedding, and MIDI tokens. There are
+    two configurations:
+    1. A `base` configuration with 2.4B parameters
+    1. A `small` configuration with 230M parameters
+
+ ### Inputs and outputs
+
+ - **SpectroStream RVQ codec**: Tokenizes high-fidelity music audio
+   - **Encoder input / Decoder output**: Music audio waveforms, 48 kHz stereo
+   - **Encoder output / Decoder input**: Discrete audio tokens, 25 Hz frame
+     rate, 64 RVQ depth, 10-bit codes, 16 kbps
+ - **MusicCoCa**: Joint embeddings of text and music audio
+   - **Input**: Music audio waveforms, 16 kHz mono, or text representation of
+     music style, e.g. "heavy metal"
+   - **Output**: 768-dimensional embedding, quantized to 12 RVQ depth, 10-bit
+     codes
+ - **Decoder Transformer LLM**: Generates audio tokens given context, MIDI,
+   and style. At each timestep (codec frame), the model receives:
+   - **Input**:
+     - (Context) SpectroStream tokens
+       - `base`: 25 frame (1s) windowed attention per layer, 20 layers
+       - `small`: 41 frame (~1.6s) windowed attention per layer, 12 layers
+       - Yields a ~20s effective receptive field for both models
+     - (Style) 12 MusicCoCa tokens
+     - (MIDI) 128-dim multihot vector representing the state of each MIDI
+       pitch during this frame (0 = Off, 1 = Sustain, 2 = Onset, 3 = Sustain
+       or onset, model decides)
+   - **Output**: 1 generated frame, 12 RVQ tokens
+
+ ## Uses
+
+ Music generation models, in particular ones targeted for continuous real-time
+ generation and control, have a wide range of applications across various
+ industries and domains. The following list of potential uses is not
+ comprehensive. The purpose of this list is to provide contextual information
+ about the possible use-cases that the model creators considered as part of model
+ training and development.
+
+ - **Interactive Music Creation**
+   - Live Performance / Improvisation: These models can be used to generate
+     music in a live performance setting, controlled by performers
+     manipulating style embeddings or the audio context.
+   - Accessible Music-Making & Music Therapy: People with impediments to
+     using traditional instruments (skill gaps, disabilities, etc.) can
+     participate in communal jam sessions or solo music creation.
+   - Video Games: Developers can create a custom soundtrack for users in
+     real-time based on their actions and environment.
+ - **Research**
+   - Transfer learning: Researchers can leverage representations from
+     MusicCoCa and Magenta RT 2 to recognize musical information.
+ - **Personalization**
+   - Musicians can fine-tune models with their own catalog to customize the
+     model to their style (fine-tuning support coming soon).
+ - **Education**
+   - Exploring Genres, Instruments, and History: Natural language prompting
+     enables users to quickly learn about and experiment with musical
+     concepts.
+
+ ### Out-of-Scope Use
+
+ See our [Terms of Use](#terms-of-use) above for usage we consider out of scope.
+
+ ## Bias, Risks, and Limitations
+
+ Magenta RT 2 supports the real-time generation and steering of instrumental
+ music. The purpose and intention of this capability is to foster the
+ development of new real-time, interactive co-creation workflows that seamlessly
+ integrate with human-centered forms of musical creativity.
+
+ Every AI music generation model, including Magenta RT 2, carries a risk of
+ impacting the economic and cultural landscape of music. We aim to mitigate these
+ risks through the following avenues:
+
+ - Prioritizing human-AI interaction as fundamental in the design of Magenta
+   RT 2.
+ - Distributing the model under terms of service that prohibit developers
+   from generating outputs that infringe or violate the rights of others,
+   including rights in copyrighted content.
+ - Training on primarily instrumental data. With specific prompting, this model
+   has been observed to generate some vocal sounds and effects, though those
+   vocal sounds and effects tend to be non-lexical.
+
+ ### Known limitations
+
+ Magenta RealTime 2 has similar limitations to Magenta RealTime in terms of
+ genre coverage and non-lexical vocalizations;
+ [refer here for details](https://huggingface.co/google/magenta-realtime#known-limitations).
+
+ ### Benefits
+
+ At the time of release, Magenta RealTime 2 represents the only open-weights
+ model supporting real-time, continuous musical audio generation with
+ low-latency control (~200ms). It is designed specifically to enable live,
+ interactive musical creation, bringing new capabilities to musical
+ performances, art installations, video games, and many other applications.
+
+ ## How to Get Started with the Model
+
+ See our [Get Started Page](https://magenta.withgoogle.com/magenta-realtime-2)
+ and [GitHub repository](https://github.com/magenta/magenta-realtime) for usage
+ examples.
+
+ ## Training Details
+
+ ### Training Data
+
+ Magenta RealTime 2 was trained on ~71k hours of stock music from multiple
+ sources, mostly instrumental.
+
+ ### Hardware
+
+ Magenta RealTime 2 was trained using
+ [Tensor Processing Unit (TPU)](https://cloud.google.com/tpu/docs/intro-to-tpu)
+ hardware.
+
+ ### Software
+
+ Training was done using [JAX](https://github.com/jax-ml/jax) and
+ [Sequence Layers](https://github.com/google/sequence-layers). JAX allows
+ researchers to take advantage of the latest generation of hardware, including
+ TPUs, for faster and more efficient training of large models.
+
+ ## Evaluation
+
+ Model evaluation metrics and results will be shared in our forthcoming technical
+ report.
+
+ ## Citation
+
+ A paper about Magenta RealTime 2 is forthcoming. For now, please cite our
+ previous technical report:
+
+ **BibTeX:**
+
+ ```
+ @inproceedings{gdmlyria2025live,
+   title={Live Music Models},
+   author={Caillon, Antoine and McWilliams, Brian and Tarakajian, Cassie and Simon, Ian and Manco, Ilaria and Engel, Jesse and Constant, Noah and Li, Pen and Denk, Timo I. and Lalama, Alberto and Agostinelli, Andrea and Huang, Anna and Manilow, Ethan and Brower, George and Erdogan, Hakan and Lei, Heidi and Rolnick, Itai and Grishchenko, Ivan and Orsini, Manu and Kastelic, Matej and Zuluaga, Mauricio and Verzetti, Mauro and Dooley, Michael and Skopek, Ondrej and Ferrer, Rafael and Borsos, Zal{\'a}n and van den Oord, {\"A}aron and Eck, Douglas and Collins, Eli and Baldridge, Jason and Hume, Tom and Donahue, Chris and Han, Kehang and Roberts, Adam},
+   booktitle={NeurIPS Creative AI},
+   year={2025}
+ }
+ ```
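
The codec and attention figures quoted in the README's "Inputs and outputs" section can be cross-checked with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that the README does not state: that the LLM's 12 generated RVQ tokens use the same 10-bit codes as the codec, and that each causal attention window includes the current frame, so every layer reaches `window - 1` frames further back. The function names are hypothetical.

```python
FRAME_RATE_HZ = 25   # codec frames per second (from the README)
BITS_PER_CODE = 10   # bits per RVQ code (from the README)


def bitrate_bps(rvq_depth, frame_rate_hz=FRAME_RATE_HZ, bits_per_code=BITS_PER_CODE):
    """Bits per second for a token stream with `rvq_depth` codes per frame."""
    return frame_rate_hz * rvq_depth * bits_per_code


def receptive_field_frames(window_frames, num_layers):
    """Frames visible through a stack of causal windowed-attention layers.

    Assumption: the window counts the current frame, so each layer adds
    (window_frames - 1) frames of history on top of the current frame.
    """
    return num_layers * (window_frames - 1) + 1


print("full codec:", bitrate_bps(64), "bps")         # 64 RVQ levels
print("generated tokens:", bitrate_bps(12), "bps")   # 12 RVQ levels (assumed 10-bit)

for name, window, layers in [("base", 25, 20), ("small", 41, 12)]:
    frames = receptive_field_frames(window, layers)
    print(name, frames, "frames =", frames / FRAME_RATE_HZ, "s")
```

Under these assumptions the full codec works out to 16,000 bps, matching the stated 16 kbps, and both configurations see the same 481 frames of history (about 19.2 s), which is consistent with the README's rounded "20s" figure for both models.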
checkpoints/mrt2_base.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60f3e813d9da4a41a166c734a3074e6d54254c2fc14b0817bad6b8d25cddc044
+ size 9836760520
checkpoints/mrt2_small.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5dd1cbc7c606c512c21de0bcb04d4818bf0a3b873d7cbb9d1556d67d3b034de3
+ size 1128840272
models/mrt2_base/mrt2_base.mlxfn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ee2f19f2782182095fcd05c0fc1978f7f3e020b1cc0993e9d8e643e2f7de0bfb
+ size 2771414746
models/mrt2_base/mrt2_base_state.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:88b302502aa5b467b74b0591adefd7769cb620211bf18606b7656a9ea57eef5f
+ size 16939969
models/mrt2_small/mrt2_small.mlxfn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1a70b0de30b3e6ad054fe6a61a7765408f01127628e6362c1abc328809a3c422
+ size 455654550
models/mrt2_small/mrt2_small_state.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23f1e05a6beea306fe39970bd61193f2d3e5fbd8f08af93570bda4ca9ec33255
+ size 8676998
resources/musiccoca/audio_preprocessor.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:656ca4c358451c2b85932e66efcfd2ba62492f4435953d775bfc1d3c08329a30
+ size 8729640
resources/musiccoca/mapper.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2f9743cc8f121a588b69c7f4d79a2a4111ce81864cbde8830054cd5e97f3d717
+ size 86166664
resources/musiccoca/music_encoder.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0d4501af799834e383d904c34ee826c61eb53c69682bf15a981a46c1bb32793a
+ size 370935584
resources/musiccoca/pretrained_vector_quantizer.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7a8a19e2119ad405818eae84a331a970f1a582b3389d4bfd27814f75b455a444
+ size 72422108
resources/musiccoca/spm.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ff325a99b61ba5726cf6437cde6eefbb633dbaa363a684f7a97ed99b55202cca
+ size 517448
resources/musiccoca/text_encoder.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e1222e3418cbe8cc2623939571bae8e9ab6f0d511404b0d83da69f4e6e11b272
+ size 418674324
resources/spectrostream/decoder.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0ac6f100a24945fb434783fde6acd7902ceaa8bca492ca317edfc75dd51c42dd
+ size 209853216
resources/spectrostream/encoder.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f20c197ddcbb9cd43e1a97f9bee0d07d211f79966cd3862d9830a32885090f72
+ size 37013392
resources/spectrostream/quantizer.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0ba89dcb85344bb14f4f34b8f597c0d6adaa560002d4fd88879a2944c98a20f0
+ size 67108984
resources/spectrostream/spectrostream_encoder.mlxfn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:887c25b21aa1714d19907fc96963c6440d5911f11571054cd5acf7306c260905
+ size 104319983
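
Each binary file in this commit is stored as a three-line Git LFS pointer (`version`, `oid`, `size`) rather than the file contents. As a minimal sketch of how such a pointer could be read, the snippet below parses the `checkpoints/mrt2_base.safetensors` pointer from this commit. The function name `parse_lfs_pointer` is hypothetical, and the parser assumes exactly the `key value` layout shown here.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    # Each line is "<key> <value>"; split on the first space only.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


# Pointer contents copied from checkpoints/mrt2_base.safetensors in this commit.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:60f3e813d9da4a41a166c734a3074e6d54254c2fc14b0817bad6b8d25cddc044
size 9836760520
"""

info = parse_lfs_pointer(POINTER)
print(info["algo"], info["digest"][:12], f"{info['size'] / 1e9:.2f} GB")
```

After downloading the real file, hashing it with `hashlib.sha256` and comparing the hex digest and byte count against these fields is one way to verify the download.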