Audiofool committed on
Commit d96e2ad · verified · 1 Parent(s): 1561599

Add model card

Files changed (1)
  1. README.md +96 -0
README.md ADDED
@@ -0,0 +1,96 @@
---
license: other
license_name: stabilityai-community
license_link: https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md
base_model:
- stabilityai/stable-audio-open-1.0
pipeline_tag: text-to-audio
tags:
- music-generation
- multimodal
- image-to-music
- text-to-music
- video-to-music
- diffusion
- flow-matching
- cross-modal-retrieval
- MuQ
library_name: meric
---

# MERIC: Unified Multimodal Music Generation and Retrieval

MERIC generates music from **images, text, and video**, and retrieves music for the same inputs, within a single framework. It is built around a **music semantic anchor** (the MuQ embedding space) that decouples *multimodal understanding* from *acoustic synthesis*.

- **Stage 1** maps any modality into the anchor space: image/text → Qwen3-VL embedding `[2048]` (primary) or CLIP ViT-H-14 `[1024]` (alternative) → a "Music Head" (RDM diffusion, ~20 steps) → MuQ `[512]`.
- **Stage 2** renders the anchor into 44.1 kHz audio: MuQ `[512]` → DiT + Flow Matching (~50 steps) → Oobleck VAE → audio.

Decoupling the two stages has three consequences: Stage 2 can train on large **unpaired** music corpora, sampling in anchor space models the inherent **one-to-many** nature of cross-modal generation, and one model serves both **generation and retrieval**.
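
Because both generation and retrieval pass through the same 512-dimensional anchor, retrieval can be pictured as a nearest-neighbour search in that space. The sketch below is only an illustration under that assumption: this card does not state how MERIC ranks candidates, `retrieve_top_k` is a hypothetical helper, and the random vectors stand in for real MuQ embeddings and a Stage-1 prediction.

```python
import math
import random

ANCHOR_DIM = 512  # size of the MuQ anchor quoted above


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, library, k=3):
    """Return the indices of the k library vectors most similar to the query."""
    ranked = sorted(
        range(len(library)),
        key=lambda i: cosine(query, library[i]),
        reverse=True,
    )
    return ranked[:k]


rng = random.Random(0)
# Stand-in for precomputed anchor embeddings of 100 music tracks.
library = [[rng.gauss(0, 1) for _ in range(ANCHOR_DIM)] for _ in range(100)]
# Stand-in for a Stage-1 prediction that lands close to track 42.
query = [x + 0.1 * rng.gauss(0, 1) for x in library[42]]
top = retrieve_top_k(query, library, k=3)
```

Under this assumption the same predicted anchor could either be ranked against a music library, as here, or handed to Stage 2 for synthesis.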

> Code: [github.com/TODO/meric](https://github.com/TODO/meric) · Paper: *MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor* (ECCV 2026)

## Files in this repository

| File | `meric` model name | Description |
|---|---|---|
| `rdm_sft_v3.pth` | `meric-sft-v3` | **Paper model "Meric"**: Stage-1 Music Head, Qwen3-VL backbone, ARIA-finetuned (squaredcos + v-prediction + EMA + min-SNR). |
| `rdm_sft_instrumental.pth` | `meric-instrumental` | Stage-1 Music Head trained on vocal-filtered data, for cleaner **instrumental** output. |
| `mericldm.ckpt` | *(shared Stage-2)* | The Stage-2 Flow Decoder (MuQ → audio). Both Music Heads run through it. |

The upstream components that the Stage-2 decoder builds on (Stable Audio Open VAE/DiT, CLAP) are downloaded automatically from their public upstream repos at first use.
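
The table above implies that each model name resolves to one Stage-1 head plus the shared Stage-2 decoder. A minimal sketch of that mapping follows; `checkpoints_for` and `Checkpoints` are hypothetical names used for illustration and are not claimed to be part of the `meric` package, which may resolve files differently.

```python
from typing import NamedTuple


class Checkpoints(NamedTuple):
    stage1: str  # Music Head checkpoint file
    stage2: str  # shared Flow Decoder checkpoint file


# File names copied from the table above.
STAGE2_DECODER = "mericldm.ckpt"
STAGE1_HEADS = {
    "meric-sft-v3": "rdm_sft_v3.pth",
    "meric-instrumental": "rdm_sft_instrumental.pth",
}


def checkpoints_for(model_name: str) -> Checkpoints:
    """Look up the two checkpoint files a model name would need."""
    try:
        head = STAGE1_HEADS[model_name]
    except KeyError:
        known = ", ".join(sorted(STAGE1_HEADS))
        raise ValueError(
            f"unknown model {model_name!r}; expected one of: {known}"
        ) from None
    return Checkpoints(stage1=head, stage2=STAGE2_DECODER)
```

For example, `checkpoints_for("meric-instrumental")` pairs `rdm_sft_instrumental.pth` with the same `mericldm.ckpt` used by the paper model.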

## Usage

Install the package, then generate; weights download automatically on first use:

```bash
pip install git+https://github.com/TODO/meric.git
```

```python
from meric import MericPipeline

pipe = MericPipeline.from_pretrained("meric-sft-v3", device="cuda:0")
wavs = pipe.generate(image="photo.jpg", n=3, output_dir="out/")  # list of WAV paths
# also: pipe.generate(text="a calm piano melody with gentle rain")
# cleaner instrumental output: MericPipeline.from_pretrained("meric-instrumental")
```

Or from the command line:

```bash
meric generate --image photo.jpg -o out/                # image -> music
meric generate --text "lo-fi hip hop, 90 BPM" -o out/   # text -> music
meric models                                            # list available models
```

See the [repository](https://github.com/TODO/meric) and `docs/USAGE.md` for the full guide.
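
For a folder of images, the Python API shown above can be wrapped in a small loop. This is a sketch rather than a documented feature: `generate_for_folder` is a hypothetical helper, and it assumes only what the example above shows, namely that `pipe.generate(image=..., n=..., output_dir=...)` returns a list of WAV paths.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def generate_for_folder(pipe, image_dir, output_dir, n=1):
    """Run pipe.generate on every image in image_dir.

    Each image gets its own sub-folder under output_dir. Returns a dict
    mapping image file name to whatever pipe.generate returned for it.
    """
    results = {}
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip non-image files
        target = Path(output_dir) / image_path.stem
        target.mkdir(parents=True, exist_ok=True)
        results[image_path.name] = pipe.generate(
            image=str(image_path), n=n, output_dir=str(target)
        )
    return results
```

The helper accepts any object with a compatible `generate` method, so it can be tried with a stub before loading the real pipeline and its weights.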

## Intended use & limitations

- **Intended use**: research and creative prototyping of music generation from visual/textual prompts, and cross-modal music retrieval.
- The Qwen3-VL backbone (used by both published heads) requires the external Qwen3-VL embedding environment for image/video inputs; see the repository setup guide.
- Generation is **one-to-many** and stochastic: different seeds yield different plausible outputs for the same input. Use `meric-instrumental` when vocal-like artifacts are undesirable.
- Outputs are AI-generated audio and may reflect biases of the training data; they are not intended for use as production music without review.
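
The seed dependence noted in the list above can be illustrated with a toy sampler. This is not MERIC's sampler and does not use the `meric` API (this card does not say how, or whether, `generate` exposes a seed); `sample_anchor` is a hypothetical stand-in that starts from seeded Gaussian noise and drifts toward a conditioning vector, which is enough to show that a fixed seed reproduces a result while a new seed gives a different one.

```python
import random


def sample_anchor(condition, seed, steps=20):
    """Toy stochastic sampler: seeded noise pulled toward `condition`."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in condition]  # start from pure noise
    for _ in range(steps):
        # Move 10% of the way toward the condition, then add a little noise.
        x = [
            xi + 0.1 * (ci - xi) + 0.05 * rng.gauss(0, 1)
            for xi, ci in zip(x, condition)
        ]
    return x


condition = [0.5, -1.0, 2.0, 0.0]  # stand-in for an input embedding
same_a = sample_anchor(condition, seed=7)
same_b = sample_anchor(condition, seed=7)
other = sample_anchor(condition, seed=8)
```

Here `same_a` and `same_b` are identical, while `other` differs from both even though the conditioning input is unchanged.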

## Training data

The Stage-1 heads are fine-tuned on **ARIA** (Art-Referenced Instrumental Audio): source images (Unsplash + WikiArt) paired with model-generated captions and AI-generated instrumental music. Stage 2 is trained on large unpaired music corpora. ARIA's source images, captions, and audio carry their own upstream terms.

## License & attribution

- The MERIC **source code** is licensed under **Apache-2.0**.
- **These model weights are NOT Apache-2.0.** The Stage-2 decoder derives from **Stable Audio Open**, so the released weights inherit the **[Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-open-1.0/blob/main/LICENSE.md)**, a non-OSI license with use restrictions (including commercial-use conditions). Review it before any redistribution or commercial use.
- Built on: **MuQ** (music anchor encoder), **Qwen3-VL** (vision-language embeddings), **OpenCLIP ViT-H-14**, **CLAP**, and **Stable Audio Open** (VAE/DiT). Each carries its own upstream license; see the repository `NOTICE`.

## Citation

```bibtex
@inproceedings{meric2026,
  title = {MERIC: Unified Multimodal Music Generation and Retrieval via a Music Semantic Anchor},
  author = {MERIC authors},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2026},
  note = {TODO: update with final author list, pages, and DOI on publication}
}
```