	---
	license: other
	license_name: stabilityai-community
	license_link: https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md
	base_model:
	- stabilityai/stable-audio-open-1.0
	pipeline_tag: text-to-audio
	tags:
	- music-generation
	- multimodal
	- image-to-music
	- text-to-music
	- video-to-music
	- diffusion
	- flow-matching
	- cross-modal-retrieval
	- MuQ
	library_name: meric
	---

	# MERIC — Unified Multimodal Music Generation and Retrieval

	MERIC generates music from images, text, and video, and retrieves music for the same inputs, within a single framework. It is built around a music semantic anchor (the MuQ embedding space) that decouples multimodal understanding from acoustic synthesis.

	- Stage 1 maps any modality into the anchor space: image/text → Qwen3-VL embedding `[2048]` (primary) or CLIP ViT-H-14 `[1024]` (alternative) → a "Music Head" (RDM diffusion, ~20 steps) → MuQ `[512]`.
	- Stage 2 renders the anchor into 44.1 kHz audio: MuQ `[512]` → DiT + Flow Matching (~50 steps) → Oobleck VAE → audio.
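To make the "~50 steps" in Stage 2 concrete, here is a minimal sketch of fixed-step Euler sampling for a flow-matching model. It is an illustration under stated assumptions, not MERIC's sampler: the function names, the 8-dimensional toy latent, and the convention that t=0 is noise and t=1 is data are all hypothetical, and the toy velocity field stands in for the DiT conditioned on the MuQ anchor.

```python
import numpy as np

def euler_flow_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy stand-in for the learned model: the exact velocity of a straight-line
# path that ends at `target`. In a real system this role would be played by
# the trained network; nothing here reflects MERIC's actual interface.
rng = np.random.default_rng(0)
target = rng.standard_normal(8)   # hypothetical "clean latent"
noise = rng.standard_normal(8)    # starting point at t=0

def toy_velocity(x, t):
    # t never reaches 1 inside the loop, so the denominator stays positive.
    return (target - x) / (1.0 - t)

sample = euler_flow_sample(toy_velocity, noise, num_steps=50)
```

With an exact straight-line velocity the sampler lands on `target`. A learned velocity field is only approximate, which is one reason a step count is a quality/speed trade-off.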

	Decoupling the two stages has three benefits: Stage 2 can train on large unpaired music corpora, sampling in anchor space models the inherent one-to-many nature of cross-modal generation, and one model serves both generation and retrieval.
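Because Stage 1 outputs a vector in the same space as encoded music, retrieval can in principle be a nearest-neighbor lookup. The sketch below assumes cosine similarity over precomputed 512-dimensional track embeddings. This card does not state how MERIC scores retrieval, so the function name, the random "library", and the similarity measure are illustrative assumptions.

```python
import numpy as np

def retrieve_top_k(query, library, k=3):
    """Return indices and cosine similarities of the k library rows closest to `query`."""
    q = query / np.linalg.norm(query)
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = lib @ q                      # cosine similarity per track
    top = np.argsort(-sims)[:k]         # highest similarity first
    return top, sims[top]

rng = np.random.default_rng(0)
# Hypothetical library: 100 tracks, each already encoded to a 512-d vector.
library = rng.standard_normal((100, 512))
# Stand-in for a Stage-1 output: a slightly perturbed copy of track 42.
query = library[42] + 0.1 * rng.standard_normal(512)

indices, scores = retrieve_top_k(query, library, k=3)
```

The same predicted embedding could then be passed to Stage 2 for generation, which is the sense in which one anchor serves both tasks.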

	> Code: [github.com/TODO/meric](https://github.com/TODO/meric) · Paper: MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor (ECCV 2026)

	## Files in this repository

	| File | `meric` model name | Description |
	|---|---|---|
	| `rdm_sft_v3.pth` | `meric-sft-v3` | Paper model "Meric": Stage-1 Music Head, Qwen3-VL backbone, ARIA-finetuned (squaredcos + v-prediction + EMA + min-SNR). |
	| `rdm_sft_instrumental.pth` | `meric-instrumental` | Stage-1 Music Head trained on vocal-filtered data, for cleaner instrumental output. |
	| `mericldm.ckpt` | (shared Stage-2) | The Stage-2 Flow Decoder (MuQ → audio). Both Music Heads run through it. |

	The upstream components that the Stage-2 decoder builds on (Stable Audio Open VAE/DiT, CLAP) are downloaded automatically from their public repos at first use.
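The description of `rdm_sft_v3.pth` names a squared-cosine noise schedule, v-prediction, and min-SNR loss weighting without giving hyperparameters. As a rough sketch of what those terms usually mean, the code below computes the standard squared-cosine cumulative alphas and the min-SNR weight in its v-prediction form, min(SNR, γ) / (SNR + 1). The step count of 1000, the offset `s=0.008`, and `gamma=5.0` are common defaults assumed here, not values confirmed for MERIC.

```python
import numpy as np

def squaredcos_alphas_cumprod(num_steps, s=0.008, max_beta=0.999):
    """Cumulative alphas of the squared-cosine schedule, with betas capped at max_beta."""
    def alpha_bar(t):
        return np.cos((t + s) / (1 + s) * np.pi / 2) ** 2

    t1 = np.arange(num_steps) / num_steps
    t2 = (np.arange(num_steps) + 1) / num_steps
    betas = np.minimum(1.0 - alpha_bar(t2) / alpha_bar(t1), max_beta)
    return np.cumprod(1.0 - betas)

def min_snr_weights_v_prediction(alphas_cumprod, gamma=5.0):
    """Per-timestep loss weights: min(SNR, gamma) / (SNR + 1)."""
    snr = alphas_cumprod / (1.0 - alphas_cumprod)
    return np.minimum(snr, gamma) / (snr + 1.0)

# Assumed settings for illustration only.
alphas_cumprod = squaredcos_alphas_cumprod(num_steps=1000)
weights = min_snr_weights_v_prediction(alphas_cumprod, gamma=5.0)
```

The clamp at `gamma` keeps low-noise timesteps, where SNR is very large, from dominating the loss.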

	## Usage

	Install the package, then generate — weights download automatically on first use:

	```bash
	pip install git+https://github.com/TODO/meric.git
	```

	```python
	from meric import MericPipeline

	pipe = MericPipeline.from_pretrained("meric-sft-v3", device="cuda:0")
	wavs = pipe.generate(image="photo.jpg", n=3, output_dir="out/") # list of WAV paths
	# also: pipe.generate(text="a calm piano melody with gentle rain")
	# cleaner instrumental output: MericPipeline.from_pretrained("meric-instrumental")
	```

	Or from the command line:

	```bash
	meric generate --image photo.jpg -o out/ # image -> music
	meric generate --text "lo-fi hip hop, 90 BPM" -o out/
	meric models # list available models
	```

	See the [repository](https://github.com/TODO/meric) and `docs/USAGE.md` for the full guide.

	## Intended use & limitations

	- Intended use: research and creative prototyping of music generation from visual/textual prompts, and cross-modal music retrieval.
	- The Qwen3-VL backbone (used by both published heads) requires the external Qwen3-VL embedding environment for image/video inputs; see the repository setup guide.
	- Generation is one-to-many and stochastic: different seeds yield different plausible pieces of music for the same input. Use `meric-instrumental` when vocal-like artifacts are undesirable.
	- Outputs are AI-generated audio and may reflect biases of the training data; not intended for use as production music without review.

	## Training data

	The Stage-1 heads are fine-tuned on ARIA (Art-Referenced Instrumental Audio): source images (Unsplash + WikiArt) paired with model-generated captions and AI-generated instrumental music. Stage 2 is trained on large unpaired music corpora. ARIA's source images, captions, and audio carry their own upstream terms.

	## License & attribution

	- The MERIC source code is licensed under Apache-2.0.
	- These model weights are NOT Apache-2.0. The Stage-2 decoder derives from Stable Audio Open, so the released weights inherit the [Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md) — a non-OSI license with use restrictions (including commercial-use conditions). Review it before any redistribution or commercial use.
	- Built on: MuQ (music anchor encoder), Qwen3-VL (vision-language embeddings), OpenCLIP ViT-H-14, CLAP, and Stable Audio Open (VAE/DiT). Each carries its own upstream license — see the repository `NOTICE`.

	## Citation

	```bibtex
	@inproceedings{meric2026,
	title = {MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor},
	author = {MERIC authors},
	booktitle = {European Conference on Computer Vision (ECCV)},
	year = {2026},
	note = {TODO: update with final author list, pages, and DOI on publication}
	}
	```