MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training
ONNX export of MERT-v1-95M (Music undERstanding model with large-scale self-supervised Training) for music audio feature extraction.
| File | Size | Description |
|---|---|---|
| mert.onnx | 2 MB | ONNX model (FP32) |
| mert.onnx.data | 360 MB | External weight data (FP32) |
| mert_uint8.onnx | 117 MB | Dynamic UINT8 quantized |
```python
from huggingface_hub import hf_hub_download
import onnxruntime as ort
import numpy as np

# Download (use mert_uint8.onnx for a smaller/faster variant)
model_path = hf_hub_download("xycld/music-align-mert", "mert.onnx")
data_path = hf_hub_download("xycld/music-align-mert", "mert.onnx.data")  # must be co-located with mert.onnx

sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Input: mono 16 kHz waveform, normalized (zero-mean, unit-variance)
audio = np.random.randn(1, 16000 * 3).astype(np.float32)  # 3 seconds
out = sess.run(None, {"input_values": audio})
features = out[0]  # shape: [1, num_frames, 768]
```
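The model expects a mono 16 kHz waveform normalized to zero mean and unit variance. A minimal preprocessing sketch (assuming the audio is already loaded as a float array at 16 kHz; resampling with your audio library of choice is left out):

```python
import numpy as np

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """Normalize a mono waveform to zero mean / unit variance and add a batch dim."""
    waveform = waveform.astype(np.float32)
    waveform = (waveform - waveform.mean()) / (waveform.std() + 1e-7)
    return waveform[np.newaxis, :]  # shape: [1, seq_len]

# Example with synthetic audio standing in for a real recording (3 s at 16 kHz)
raw = np.random.randn(16000 * 3) * 0.1 + 0.5
batch = preprocess(raw)
```

The resulting `batch` array can be passed directly as the `input_values` feed of the ONNX session.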
Input: `input_values`, shape `[1, seq_len]`, float32
Output: `last_hidden_state`, shape `[1, num_frames, 768]`, float32

| Variant | Size | Compression |
|---|---|---|
| FP32 (original) | 362 MB | 1.0x |
| UINT8 | 117 MB | 3.1x |
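The frame-level features can be pooled into a fixed-size clip embedding, e.g. by averaging over the time axis. A sketch on dummy features with the documented `[1, num_frames, 768]` output shape (the frame count 149 here is illustrative, not a property of the model):

```python
import numpy as np

def clip_embedding(features: np.ndarray) -> np.ndarray:
    """Mean-pool frame features [1, num_frames, 768] into a clip vector [768]."""
    return features.mean(axis=1)[0]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, e.g. for comparing two clip embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy array standing in for the model's last_hidden_state output
features = np.random.randn(1, 149, 768).astype(np.float32)
emb = clip_embedding(features)
```

Mean pooling is a simple default; task-specific heads or weighted layer combinations may work better depending on the downstream task.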
This is an ONNX format conversion of the MERT-v1-95M model by the Multimodal Art Projection (M-A-P) team.
Original work:
Yizhi Li, Ruibin Yuan, Ge Zhang, et al. "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training." 2023. https://arxiv.org/abs/2306.00107
The original model weights are released under the CC-BY-NC-4.0 license. This ONNX conversion inherits the same license. Non-commercial use only.