MERIC: Unified Multimodal Music Generation and Retrieval
MERIC generates music from images, text, and video (and retrieves music for the same inputs) within a single framework, built around a music semantic anchor (the MuQ embedding space) that decouples multimodal understanding from acoustic synthesis.
- Stage 1 maps any modality into the anchor space: image/text → Qwen3-VL embedding [2048] (primary) or CLIP ViT-H-14 [1024] (alternative) → a "Music Head" (RDM diffusion, ~20 steps) → MuQ [512].
- Stage 2 renders the anchor into 44.1 kHz audio: MuQ [512] → DiT + Flow Matching (~50 steps) → Oobleck VAE → audio.
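As a rough picture of what a fixed-step flow-matching sampler does in Stage 2, the sketch below integrates a velocity field with plain Euler steps. Everything in it is an assumption made for illustration: the function names, the convention that noise sits at t=0 and data at t=1, and the closed-form toy velocity field that stands in for the conditioned DiT. It is not MERIC's decoder.

```python
import random


def euler_flow_sample(velocity, x0, steps=50):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Velocity of a straight-line path that reaches `target` at t=1.

    Hypothetical stand-in for a learned network, which would predict the
    velocity from (x, t, conditioning) instead of knowing the target.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
target = [0.5, -1.0, 2.0]  # made-up "latent" the path should end on
noise = [random.gauss(0.0, 1.0) for _ in target]
latent = euler_flow_sample(make_toy_velocity(target), noise, steps=50)
```

With this toy field the final Euler step lands on the target, which is why a modest fixed step count can suffice when the learned paths are close to straight.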
Decoupling the two stages lets Stage 2 train on large unpaired music corpora, model the inherent one-to-many nature of cross-modal generation by sampling in anchor space, and serve both generation and retrieval from one model.
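Because generation and retrieval share the anchor space, retrieval can be pictured as a nearest-neighbor search over precomputed music embeddings. The sketch below assumes cosine similarity as the ranking metric and uses short made-up vectors in place of 512-dimensional MuQ embeddings. The source does not describe the actual retrieval procedure, and `rank_by_cosine` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, library, top_k=2):
    """Return the names of the top_k library entries most similar to `query`."""
    ranked = sorted(library, key=lambda name: cosine(query, library[name]), reverse=True)
    return ranked[:top_k]


# Toy 4-d vectors standing in for anchor embeddings of a music library.
library = {
    "calm_piano.wav": [0.9, 0.1, 0.0, 0.1],
    "lofi_beat.wav": [0.1, 0.9, 0.2, 0.0],
    "string_quartet.wav": [0.7, 0.0, 0.6, 0.2],
}
# Stand-in for an anchor that Stage 1 might predict from an image or text prompt.
predicted_anchor = [1.0, 0.0, 0.1, 0.1]
top = rank_by_cosine(predicted_anchor, library, top_k=2)
```

Sampling several anchors for one input and ranking against each would surface different candidates, which is one way to read the one-to-many behavior in a retrieval setting.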
Code: github.com/TODO/meric · Paper: MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor (ECCV 2026)
Files in this repository
| File | meric model name | Description |
|---|---|---|
| rdm_sft_v3.pth | meric-sft-v3 | Paper model "Meric": Stage-1 Music Head, Qwen3-VL backbone, ARIA-finetuned (squaredcos + v-prediction + EMA + min-SNR). |
| rdm_sft_instrumental.pth | meric-instrumental | Stage-1 Music Head trained on vocal-filtered data, for cleaner instrumental output. |
| mericldm.ckpt | (shared Stage-2) | The Stage-2 Flow Decoder (MuQ → audio). Both Music Heads run through it. |
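The training terms listed for the paper model (squaredcos, v-prediction, EMA, min-SNR) are standard diffusion ingredients. As a rough guide to what they usually mean, the sketch below uses their common textbook forms: a squared-cosine cumulative schedule, the v-prediction target `sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0`, a min-SNR loss weight for v-prediction of `min(SNR, gamma) / (SNR + 1)`, and an exponential moving average of weights. The offset `s=0.008`, `gamma=5.0`, and `decay=0.999` are conventional defaults assumed here, not values taken from MERIC, and all function names are illustrative.

```python
import math


def alpha_bar(t, s=0.008):
    """Squared-cosine cumulative signal level for a normalized time t in [0, 1]."""
    f_t = math.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    f_0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f_t / f_0


def v_target(x0, eps, t):
    """v-prediction regression target for clean sample x0 and noise eps."""
    a = math.sqrt(alpha_bar(t))
    sigma = math.sqrt(1.0 - alpha_bar(t))
    return [a * e - sigma * x for x, e in zip(x0, eps)]


def min_snr_weight_v(t, gamma=5.0):
    """Min-SNR loss weight in its v-prediction form; valid for 0 < t <= 1."""
    ab = alpha_bar(t)
    snr = ab / (1.0 - ab)
    return min(snr, gamma) / (snr + 1.0)


def ema_update(ema_params, params, decay=0.999):
    """One exponential-moving-average step over a flat list of weights."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


# Early timesteps have high SNR, so their weight is clipped by gamma.
early_weight = min_snr_weight_v(0.1)
mid_weight = min_snr_weight_v(0.5)
```

The clipping keeps low-noise timesteps from dominating the loss, while the EMA copy of the weights is the one typically used at inference.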
The Stage-2 acoustic components that the decoder builds on (Stable Audio Open VAE/DiT, CLAP) are downloaded automatically from their public upstream repos on first use.
Usage
Install the package, then generate; weights download automatically on first use:
pip install git+https://github.com/TODO/meric.git
from meric import MericPipeline
pipe = MericPipeline.from_pretrained("meric-sft-v3", device="cuda:0")
wavs = pipe.generate(image="photo.jpg", n=3, output_dir="out/") # list of WAV paths
# also: pipe.generate(text="a calm piano melody with gentle rain")
# cleaner instrumental output: MericPipeline.from_pretrained("meric-instrumental")
Or from the command line:
meric generate --image photo.jpg -o out/ # image -> music
meric generate --text "lo-fi hip hop, 90 BPM" -o out/
meric models # list available models
See the repository and docs/USAGE.md for the full guide.
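For generating music over a whole folder of images, a small wrapper around the documented `pipe.generate(image=..., n=..., output_dir=...)` call may be enough. The helper name `generate_for_folder`, the extension list, and the one-subfolder-per-image layout are illustrative choices rather than part of the package, and the sketch assumes `generate` returns a list of WAV paths as in the Python example above.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the pipeline accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def generate_for_folder(pipe, image_dir, out_root, n=3):
    """Call pipe.generate for each image in image_dir.

    Returns a dict mapping each image file name to the value returned by
    pipe.generate (assumed to be a list of WAV paths).
    """
    results = {}
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        out_dir = Path(out_root) / image_path.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        results[image_path.name] = pipe.generate(
            image=str(image_path), n=n, output_dir=str(out_dir)
        )
    return results


# Usage with the real pipeline (requires the meric package and weights):
#   from meric import MericPipeline
#   pipe = MericPipeline.from_pretrained("meric-sft-v3", device="cuda:0")
#   wavs_by_image = generate_for_folder(pipe, "photos/", "out/")
```

Writing each image's outputs to its own subfolder keeps the `n` samples per input together, which is convenient when comparing the different plausible results for one image.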
Intended use & limitations
- Intended use: research and creative prototyping of music generation from visual/textual prompts, and cross-modal music retrieval.
- The Qwen3-VL backbone (used by both published heads) requires the external Qwen3-VL embedding environment for image/video inputs; see the repository setup guide.
- Generation is one-to-many and stochastic: different seeds yield different plausible scores for the same input. Use meric-instrumental when vocal-like artifacts are undesirable.
- Outputs are AI-generated audio and may reflect biases of the training data; not intended for use as production music without review.
Training data
The Stage-1 heads are fine-tuned on ARIA (Art-Referenced Instrumental Audio): source images (Unsplash + WikiArt) paired with model-generated captions and AI-generated instrumental music. Stage 2 is trained on large unpaired music corpora. ARIA's source images, captions, and audio carry their own upstream terms.
License & attribution
- The MERIC source code is licensed under Apache-2.0.
- These model weights are NOT Apache-2.0. The Stage-2 decoder derives from Stable Audio Open, so the released weights inherit the Stability AI Community License, a non-OSI license with use restrictions (including commercial-use conditions). Review it before any redistribution or commercial use.
- Built on: MuQ (music anchor encoder), Qwen3-VL (vision-language embeddings), OpenCLIP ViT-H-14, CLAP, and Stable Audio Open (VAE/DiT). Each carries its own upstream license; see the repository NOTICE.
Citation
@inproceedings{meric2026,
title = {MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor},
author = {MERIC authors},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026},
note = {TODO: update with final author list, pages, and DOI on publication}
}