Add model card
README.md
ADDED
@@ -0,0 +1,96 @@
---
license: other
license_name: stabilityai-community
license_link: https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md
base_model:
- stabilityai/stable-audio-open-1.0
pipeline_tag: text-to-audio
tags:
- music-generation
- multimodal
- image-to-music
- text-to-music
- video-to-music
- diffusion
- flow-matching
- cross-modal-retrieval
- MuQ
library_name: meric
---

# MERIC: Unified Multimodal Music Generation and Retrieval

MERIC generates music from **images, text, and video**, and retrieves music for the same inputs, within a single framework built around a **music semantic anchor** (the MuQ embedding space) that decouples *multimodal understanding* from *acoustic synthesis*.

- **Stage 1** maps any modality into the anchor space: image/text → Qwen3-VL embedding `[2048]` (primary) or CLIP ViT-H-14 `[1024]` (alternative) → a "Music Head" (RDM diffusion, ~20 steps) → MuQ `[512]`.
- **Stage 2** renders the anchor into 44.1 kHz audio: MuQ `[512]` → DiT + Flow Matching (~50 steps) → Oobleck VAE → audio.
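
To make the "Flow Matching (~50 steps)" part of Stage 2 more concrete, here is a toy sketch of fixed-step Euler integration of a flow-matching ODE. It is not MERIC's decoder. The velocity field is a closed-form stand-in for a trained network, the 64-dimensional vector is an arbitrary placeholder for a latent, and the names `euler_sample` and `toy_velocity` are hypothetical.

```python
import random

def euler_sample(velocity, x0, num_steps=50):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
target = [rng.gauss(0.0, 1.0) for _ in range(64)]  # stand-in for a clean latent
noise = [rng.gauss(0.0, 1.0) for _ in range(64)]   # starting point at t = 0

def toy_velocity(x, t):
    # Velocity that moves the current point along a straight line so that it
    # reaches `target` at t = 1. A real model would predict this with a network
    # conditioned on the anchor embedding (assumption, not taken from the card).
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

sample = euler_sample(toy_velocity, noise, num_steps=50)
max_error = max(abs(s - t) for s, t in zip(sample, target))
```

With this idealized velocity field the 50 Euler steps land on `target` up to floating-point error. A learned field is only approximately straight, which is one reason the step count affects quality in practice.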

Decoupling the two stages has three benefits: Stage 2 can train on large **unpaired** music corpora, sampling in anchor space models the inherent **one-to-many** nature of cross-modal generation, and a single model serves both **generation and retrieval**.
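
One way retrieval could sit on top of the same anchor is nearest-neighbour search: embed the query into the anchor space, then rank a library of precomputed music embeddings by similarity. The sketch below assumes cosine similarity, since this card does not state the retrieval metric. It uses tiny 3-dimensional vectors in place of MuQ `[512]` embeddings, and `retrieve` and the track names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, library, top_k=2):
    """Return the names of the top_k library entries most similar to the query."""
    ranked = sorted(library, key=lambda name: cosine(query, library[name]), reverse=True)
    return ranked[:top_k]

# Toy "anchor space": each track is represented by a small embedding vector.
library = {
    "calm_piano": [0.9, 0.1, 0.0],
    "lofi_beat": [0.1, 0.9, 0.2],
    "orchestral": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.1]  # stand-in for a Stage-1 prediction
results = retrieve(query, library, top_k=2)
```

Because generation and retrieval would share the same predicted anchor under this assumption, the query vector could either be passed to Stage 2 or matched against the library.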

> Code: [github.com/TODO/meric](https://github.com/TODO/meric) · Paper: *MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor* (ECCV 2026)

## Files in this repository

| File | `meric` model name | Description |
|---|---|---|
| `rdm_sft_v3.pth` | `meric-sft-v3` | **Paper model "Meric"**: Stage-1 Music Head, Qwen3-VL backbone, ARIA-finetuned (squaredcos + v-prediction + EMA + min-SNR). |
| `rdm_sft_instrumental.pth` | `meric-instrumental` | Stage-1 Music Head trained on vocal-filtered data for cleaner **instrumental** output. |
| `mericldm.ckpt` | *(shared Stage-2)* | The Stage-2 Flow Decoder (MuQ → audio). Both Music Heads run through it. |
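
The `meric-sft-v3` row lists "squaredcos + v-prediction + min-SNR" without further detail. For readers unfamiliar with these terms, the sketch below writes out their commonly used textbook definitions. It is an assumption that MERIC uses exactly these forms. The constants (offset `s = 0.008`, `gamma = 5.0`, 1000 steps) are conventional defaults, not values taken from this card, and all function names are hypothetical.

```python
import math

def alpha_bar(t, s=0.008):
    """Squared-cosine cumulative signal level for continuous time t in [0, 1]."""
    return math.cos((t + s) / (1.0 + s) * math.pi / 2.0) ** 2

def squaredcos_betas(num_steps=1000, max_beta=0.999):
    """Discrete betas derived from the squared-cosine alpha_bar curve."""
    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        betas.append(min(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas

def v_target(x0, eps, a_bar):
    """v-prediction target: v = sqrt(a_bar) * eps - sqrt(1 - a_bar) * x0."""
    return math.sqrt(a_bar) * eps - math.sqrt(1.0 - a_bar) * x0

def min_snr_weight_v(a_bar, gamma=5.0):
    """Min-SNR loss weight for v-prediction: min(SNR, gamma) / (SNR + 1)."""
    snr = a_bar / (1.0 - a_bar)
    return min(snr, gamma) / (snr + 1.0)

betas = squaredcos_betas(1000)
```

The weight caps the contribution of low-noise timesteps (high SNR), which would otherwise dominate the loss.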

The acoustic components that Stage 2 builds on (Stable Audio Open VAE/DiT, CLAP) are downloaded automatically from their public upstream repos on first use.

## Usage

Install the package, then generate; weights download automatically on first use:

```bash
pip install git+https://github.com/TODO/meric.git
```

```python
from meric import MericPipeline

pipe = MericPipeline.from_pretrained("meric-sft-v3", device="cuda:0")
wavs = pipe.generate(image="photo.jpg", n=3, output_dir="out/")  # list of WAV paths
# also: pipe.generate(text="a calm piano melody with gentle rain")
# cleaner instrumental output: MericPipeline.from_pretrained("meric-instrumental")
```
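
When several text prompts need to be rendered, the `generate` call can be wrapped in a small loop. The helper below is a sketch that uses only keyword arguments that appear in this card (`text`, `n`, `output_dir`) and assumes, as the comment in the snippet states, that `generate` returns a list of WAV paths. The name `generate_for_prompts` and the per-prompt subdirectory layout are hypothetical and not part of the `meric` package.

```python
import os

def generate_for_prompts(pipe, prompts, output_root, n=1):
    """Run pipe.generate once per text prompt and collect the returned WAV paths."""
    results = {}
    for index, prompt in enumerate(prompts):
        # One subdirectory per prompt keeps the outputs of different prompts apart.
        out_dir = os.path.join(output_root, f"prompt_{index:02d}")
        os.makedirs(out_dir, exist_ok=True)
        results[prompt] = pipe.generate(text=prompt, n=n, output_dir=out_dir)
    return results
```

With a loaded pipeline this would be called as `generate_for_prompts(pipe, ["a calm piano melody with gentle rain", "lo-fi hip hop, 90 BPM"], "out/", n=3)`.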

Or from the command line:

```bash
meric generate --image photo.jpg -o out/   # image -> music
meric generate --text "lo-fi hip hop, 90 BPM" -o out/
meric models                               # list available models
```

See the [repository](https://github.com/TODO/meric) and `docs/USAGE.md` for the full guide.

## Intended use & limitations

- **Intended use**: research and creative prototyping of music generation from visual/textual prompts, and cross-modal music retrieval.
- The Qwen3-VL backbone (used by both published heads) requires the external Qwen3-VL embedding environment for image/video inputs; see the repository setup guide.
- Generation is **one-to-many** and stochastic: different seeds yield different plausible pieces of music for the same input. Use `meric-instrumental` when vocal-like artifacts are undesirable.
- Outputs are AI-generated audio and may reflect biases of the training data; they are not intended for use as production music without review.

## Training data

The Stage-1 heads are fine-tuned on **ARIA** (Art-Referenced Instrumental Audio): source images (Unsplash + WikiArt) paired with model-generated captions and AI-generated instrumental music. Stage 2 is trained on large unpaired music corpora. ARIA's source images, captions, and audio carry their own upstream terms.

## License & attribution

- The MERIC **source code** is licensed under **Apache-2.0**.
- **These model weights are NOT Apache-2.0.** The Stage-2 decoder derives from **Stable Audio Open**, so the released weights inherit the **[Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md)**, a non-OSI license with use restrictions (including commercial-use conditions). Review it before any redistribution or commercial use.
- Built on: **MuQ** (music anchor encoder), **Qwen3-VL** (vision-language embeddings), **OpenCLIP ViT-H-14**, **CLAP**, and **Stable Audio Open** (VAE/DiT). Each carries its own upstream license; see the repository `NOTICE`.

## Citation

```bibtex
@inproceedings{meric2026,
  title = {MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor},
  author = {MERIC authors},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2026},
  note = {TODO: update with final author list, pages, and DOI on publication}
}
```