MERT-v1-95M – ONNX

ONNX export of MERT-v1-95M (Music undERstanding model with large-scale self-supervised Training) for music audio feature extraction.

Files

File             Size     Description
mert.onnx        2 MB     ONNX graph (FP32); weights stored in the external data file
mert.onnx.data   360 MB   External FP32 weight data (must sit next to mert.onnx)
mert_uint8.onnx  117 MB   Dynamically quantized UINT8 variant

Usage

from huggingface_hub import hf_hub_download
import onnxruntime as ort
import numpy as np

# Download (use mert_uint8.onnx for smaller/faster variant)
model_path = hf_hub_download("xycld/music-align-mert", "mert.onnx")
data_path = hf_hub_download("xycld/music-align-mert", "mert.onnx.data")  # must be co-located
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Input: mono 16 kHz waveform, normalized to zero mean and unit variance
# (np.random.randn is a stand-in here; real audio should be normalized first)
audio = np.random.randn(1, 16000 * 3).astype(np.float32)  # 3 seconds
out = sess.run(None, {"input_values": audio})
features = out[0]  # shape: [1, num_frames, 768]
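The random input above is only a placeholder; real audio should be resampled to 16 kHz mono and normalized before inference. A minimal preprocessing sketch using only NumPy (the helper name `preprocess` is illustrative, not part of the model's API):

```python
import numpy as np

def preprocess(waveform: np.ndarray) -> np.ndarray:
    """Zero-mean/unit-variance normalize a mono 16 kHz waveform
    and add the batch dimension the model expects."""
    waveform = waveform.astype(np.float32)
    waveform = (waveform - waveform.mean()) / (waveform.std() + 1e-7)
    return waveform[None, :]  # shape [1, seq_len]

# 3-second test tone standing in for real decoded audio
audio = preprocess(np.sin(np.linspace(0, 1000, 16000 * 3)))
print(audio.shape)  # (1, 48000)
```

Resampling and channel downmixing are left to an audio library of your choice (e.g. librosa or torchaudio); only the normalization step is shown here.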

Model Details

  • Original model: m-a-p/MERT-v1-95M by Multimodal Art Projection (M-A-P)
  • Architecture: HuBERT-based self-supervised music encoder (95M parameters)
  • Precision: FP32 (original), UINT8 quantized variant available
  • Input: mono 16 kHz waveform β€” shape [1, seq_len] float32
  • Output: last_hidden_state β€” shape [1, num_frames, 768] float32
  • Frame rate: 50 fps (20 ms/frame, 320-sample hop at 16 kHz)
  • ONNX opset: 17

Quantized Variants

Variant           Size     Compression
FP32 (original)   362 MB   1.0x
UINT8             117 MB   3.1x

Attribution

This is an ONNX format conversion of the MERT-v1-95M model by the Multimodal Art Projection (M-A-P) team.

Original work:

Yizhi Li, Ruibin Yuan, Ge Zhang, et al. "MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training." 2023. https://arxiv.org/abs/2306.00107

License

The original model weights are released under the CC-BY-NC-4.0 license. This ONNX conversion inherits the same license. Non-commercial use only.
