MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
ONNX export of MERT-v1-95M (Music undERstanding model with large-scale self-supervised Training) for music audio feature extraction.
| File | Size | Description |
|---|---|---|
| mert.onnx | 2 MB | ONNX model (FP32) |
| mert.onnx.data | 360 MB | External weight data (FP32) |
| mert_uint8.onnx | 117 MB | Dynamic UINT8 quantized |
```python
from huggingface_hub import hf_hub_download
import onnxruntime as ort
import numpy as np

# Download (use mert_uint8.onnx for a smaller/faster variant)
model_path = hf_hub_download("xycld/music-align-mert", "mert.onnx")
data_path = hf_hub_download("xycld/music-align-mert", "mert.onnx.data")  # must be co-located with mert.onnx

sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Input: mono 16 kHz waveform, normalized (zero-mean, unit-variance)
audio = np.random.randn(1, 16000 * 3).astype(np.float32)  # 3 seconds
out = sess.run(None, {"input_values": audio})
features = out[0]  # shape: [1, num_frames, 768]
```
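The model expects a mono 16 kHz waveform normalized to zero mean and unit variance. A minimal preprocessing sketch (assuming the audio is already loaded as a float array at 16 kHz; resampling with your audio library of choice is left out):

```python
import numpy as np

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """Normalize a mono waveform to zero mean / unit variance and add a batch dim."""
    waveform = waveform.astype(np.float32)
    waveform = (waveform - waveform.mean()) / (waveform.std() + 1e-7)
    return waveform[np.newaxis, :]  # shape: [1, seq_len]

# Example with synthetic audio standing in for a real recording (3 s at 16 kHz)
raw = np.random.randn(16000 * 3) * 0.1 + 0.5
batch = preprocess(raw)
```

The resulting `batch` array can be passed directly as the `input_values` feed of the ONNX session.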
Input: `input_values`, shape `[1, seq_len]`, float32
Output: `last_hidden_state`, shape `[1, num_frames, 768]`, float32

| Variant | Size | Compression |
|---|---|---|
| FP32 (original) | 362 MB | 1.0x |
| UINT8 | 117 MB | 3.1x |
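The frame-level features can be pooled into a fixed-size clip embedding, e.g. by averaging over the time axis. A sketch on dummy features with the documented `[1, num_frames, 768]` output shape (the frame count 149 here is illustrative, not a property of the model):

```python
import numpy as np

def clip_embedding(features: np.ndarray) -> np.ndarray:
    """Mean-pool frame features [1, num_frames, 768] into a clip vector [768]."""
    return features.mean(axis=1)[0]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, e.g. for comparing two clip embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy array standing in for the model's last_hidden_state output
features = np.random.randn(1, 149, 768).astype(np.float32)
emb = clip_embedding(features)
```

Mean pooling is a simple default; task-specific heads or weighted layer combinations may work better depending on the downstream task.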
This is an ONNX format conversion of the MERT-v1-95M model by the Multimodal Art Projection (M-A-P) team.
Original work:
Yizhi Li, Ruibin Yuan, Ge Zhang, et al. "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training." 2023. https://arxiv.org/abs/2306.00107
The original model weights are released under the CC-BY-NC-4.0 license. This ONNX conversion inherits the same license. Non-commercial use only.