| --- |
| license: other |
| license_name: stabilityai-community |
| license_link: https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md |
| base_model: |
| - stabilityai/stable-audio-open-1.0 |
| pipeline_tag: text-to-audio |
| tags: |
| - music-generation |
| - multimodal |
| - image-to-music |
| - text-to-music |
| - video-to-music |
| - diffusion |
| - flow-matching |
| - cross-modal-retrieval |
| - MuQ |
| library_name: meric |
| --- |
| |
| # MERIC: Unified Multimodal Music Generation and Retrieval |
|
|
| MERIC generates music from **images, text, and video**, and retrieves music for the same inputs, within a single framework. It is built around a **music semantic anchor** (the MuQ embedding space) that decouples *multimodal understanding* from *acoustic synthesis*. |
|
|
| - **Stage 1** maps any modality into the anchor space: image/text → Qwen3-VL embedding `[2048]` (primary) or CLIP ViT-H-14 `[1024]` (alternative) → a "Music Head" (RDM diffusion, ~20 steps) → MuQ `[512]`. |
| - **Stage 2** renders the anchor into 44.1 kHz audio: MuQ `[512]` → DiT + Flow Matching (~50 steps) → Oobleck VAE → audio. |
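
The "Flow Matching (~50 steps)" part of Stage 2 can be read as integrating a learned velocity field from noise towards a latent. The block below is a minimal sketch of that idea, not the released sampler. It assumes a fixed-step Euler solver and a convention where noise sits at t = 0 and data at t = 1, neither of which this card specifies. `euler_sample` and `make_toy_velocity` are hypothetical names, and the toy velocity field stands in for the MuQ-conditioned DiT.

```python
import random


def euler_sample(velocity, x0, num_steps=50):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the trained network.

    It always points straight at a fixed target, which is the velocity of a
    straight-line path that reaches the target at t=1.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
target = [0.5, -1.0, 2.0, 0.0]                  # toy "latent", not a real VAE latent
noise = [rng.gauss(0.0, 1.0) for _ in target]   # starting point at t=0
sample = euler_sample(make_toy_velocity(target), noise, num_steps=50)
```

In the real model the velocity would come from the DiT conditioned on the MuQ `[512]` anchor, and the resulting latent would be decoded by the Oobleck VAE. The step count trades speed against integration accuracy.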
|
|
| Decoupling the two stages lets Stage 2 train on large **unpaired** music corpora, model the inherent **one-to-many** nature of cross-modal generation by sampling in anchor space, and serve both **generation and retrieval** from one model. |
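
Because Stage 1 predicts a vector in the same MuQ space that real music is encoded into, retrieval can be framed as nearest-neighbour search in that space. The sketch below illustrates this with cosine similarity over a small in-memory library. The similarity measure, the `retrieve` helper, and the 3-dimensional toy vectors are assumptions made for illustration (real anchors are `[512]`); the card does not describe how MERIC ranks candidates.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, library, k=2):
    """Return the k (track_id, score) pairs most similar to the query anchor."""
    scored = [(cosine(query, emb), track_id) for track_id, emb in library.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(track_id, score) for score, track_id in scored[:k]]


# Toy anchor-space embeddings of three music tracks (hypothetical data).
library = {
    "track_a": [1.0, 0.0, 0.0],
    "track_b": [0.0, 1.0, 0.0],
    "track_c": [0.7, 0.7, 0.0],
}
# Stand-in for a Stage-1 prediction from an image or text prompt.
query = [0.9, 0.1, 0.0]
top = retrieve(query, library, k=2)
```

Since Stage 1 is stochastic, one could also sample several anchors for the same input and merge their ranked lists, which is one way to reflect the one-to-many mapping at retrieval time.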
|
|
| > Code: [github.com/TODO/meric](https://github.com/TODO/meric) · Paper: *MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor* (ECCV 2026) |
|
|
| ## Files in this repository |
|
|
| | File | `meric` model name | Description | |
| |---|---|---| |
| | `rdm_sft_v3.pth` | `meric-sft-v3` | **Paper model "Meric"**: Stage-1 Music Head, Qwen3-VL backbone, ARIA-finetuned (squaredcos + v-prediction + EMA + min-SNR). | |
| | `rdm_sft_instrumental.pth` | `meric-instrumental` | Stage-1 Music Head trained on vocal-filtered data, for cleaner **instrumental** output. | |
| | `mericldm.ckpt` | *(shared Stage-2)* | The Stage-2 Flow Decoder (MuQ → audio). Both Music Heads run through it. | |
|
|
| The acoustic components that Stage 2 builds on (Stable Audio Open VAE/DiT, CLAP) are downloaded automatically from their public upstream repositories on first use. |
|
|
| ## Usage |
|
|
| Install the package, then generate; weights download automatically on first use: |
|
|
| ```bash |
| pip install git+https://github.com/TODO/meric.git |
| ``` |
|
|
| ```python |
| from meric import MericPipeline |
| |
| pipe = MericPipeline.from_pretrained("meric-sft-v3", device="cuda:0") |
| wavs = pipe.generate(image="photo.jpg", n=3, output_dir="out/") # list of WAV paths |
| # also: pipe.generate(text="a calm piano melody with gentle rain") |
| # cleaner instrumental output: MericPipeline.from_pretrained("meric-instrumental") |
| ``` |
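
For more than one image, the same call can be wrapped in a small loop. The helper below is a sketch that uses only the `generate(image=..., n=..., output_dir=...)` call shown above. `generate_for_folder` is a hypothetical name, not part of the `meric` package, and the sketch assumes that `generate` accepts any existing directory as `output_dir` and returns a list of WAV paths, as the comment above suggests.

```python
from pathlib import Path


def generate_for_folder(pipe, image_dir, output_dir, n=1,
                        exts=(".jpg", ".jpeg", ".png")):
    """Run pipe.generate on every image in image_dir.

    Each image gets its own sub-folder under output_dir, named after the
    image file stem. Returns {image file name: value returned by generate}.
    """
    results = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() not in exts:
            continue  # skip non-image files
        out = Path(output_dir) / path.stem
        out.mkdir(parents=True, exist_ok=True)
        results[path.name] = pipe.generate(image=str(path), n=n,
                                           output_dir=str(out))
    return results


# Usage with the real pipeline (requires the meric package and a GPU):
# from meric import MericPipeline
# pipe = MericPipeline.from_pretrained("meric-sft-v3", device="cuda:0")
# wavs_by_image = generate_for_folder(pipe, "photos/", "out/", n=3)
```

Passing the pipeline in as an argument keeps model loading out of the loop, so the weights are loaded once for the whole folder.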
|
|
| Or from the command line: |
|
|
| ```bash |
| meric generate --image photo.jpg -o out/ # image -> music |
| meric generate --text "lo-fi hip hop, 90 BPM" -o out/ |
| meric models # list available models |
| ``` |
|
|
| See the [repository](https://github.com/TODO/meric) and `docs/USAGE.md` for the full guide. |
|
|
| ## Intended use & limitations |
|
|
| - **Intended use**: research and creative prototyping of music generation from visual/textual prompts, and cross-modal music retrieval. |
| - The Qwen3-VL backbone (used by both published heads) requires the external Qwen3-VL embedding environment for image/video inputs; see the repository setup guide. |
| - Generation is **one-to-many** and stochastic: different seeds yield different plausible outputs for the same input. Use `meric-instrumental` when vocal-like artifacts are undesirable. |
| - Outputs are AI-generated audio and may reflect biases of the training data; not intended for use as production music without review. |
|
|
| ## Training data |
|
|
| The Stage-1 heads are fine-tuned on **ARIA** (Art-Referenced Instrumental Audio): source images (Unsplash + WikiArt) paired with model-generated captions and AI-generated instrumental music. Stage 2 is trained on large unpaired music corpora. ARIA's source images, captions, and audio carry their own upstream terms. |
|
|
| ## License & attribution |
|
|
| - The MERIC **source code** is licensed under **Apache-2.0**. |
| - **These model weights are NOT Apache-2.0.** The Stage-2 decoder derives from **Stable Audio Open**, so the released weights inherit the **[Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md)**, a non-OSI license with use restrictions (including commercial-use conditions). Review it before any redistribution or commercial use. |
| - Built on: **MuQ** (music anchor encoder), **Qwen3-VL** (vision-language embeddings), **OpenCLIP ViT-H-14**, **CLAP**, and **Stable Audio Open** (VAE/DiT). Each carries its own upstream license; see the repository `NOTICE`. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{meric2026, |
| title = {MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor}, |
| author = {MERIC authors}, |
| booktitle = {European Conference on Computer Vision (ECCV)}, |
| year = {2026}, |
| note = {TODO: update with final author list, pages, and DOI on publication} |
| } |
| ``` |
|
|