MuSViT: A Foundation Vision Model for Sheet Music Representation

Accepted at European Conference on Computer Vision (ECCV'26)

MuSViT-light

MuSViT (Music Score Vision Transformer) is a foundation vision encoder for music score pages. The model is a ViT pre-trained with the Masked Autoencoder (MAE) objective on 9.7M sheet music images from IMSLP. The embeddings produced by MuSViT are task-agnostic, so they can serve as input features for a wide range of downstream tasks.

Model Details

Model Description

Developed by: Pattern Recognition and Artificial Intelligence Group (PRAIG), University of Alicante, Spain
Model type: MAE
Paper: MuSViT: A Foundation Vision Model for Sheet Music Representation
License: CC BY-NC-SA 4.0

How to use

Installation

MuSViT-light does not require cloning any repository. You only need the transformers library installed. The examples below also import torch, torchvision, and Pillow.

Usage on music score pages

import torch
from transformers import ViTModel
from PIL import Image
from torchvision import transforms as T


image_path = 'path/image.png'
image = Image.open(image_path).convert("RGB")
processor = T.Compose([
    T.Resize([1024, 1024]),
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W

model = ViTModel.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)

out = model(images).last_hidden_state

print(out.shape) #shape: B, 4097, 768. Note it has CLS token
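The output keeps the CLS token at index 0, followed by 64 x 64 = 4096 patch tokens (1024 / 16 patches per side). Below is a minimal sketch of separating them and building a single page-level vector by mean pooling. A random tensor stands in for the model output so the snippet runs without downloading weights, and the helper name `split_and_pool` is hypothetical. Mean pooling is one common choice, not one prescribed by this card.

```python
import torch


def split_and_pool(hidden: torch.Tensor, grid: int = 64):
    """Split a ViT output into CLS token, patch grid, and mean-pooled vector.

    hidden: (B, 1 + grid*grid, D) tensor such as `last_hidden_state`.
    """
    cls_token = hidden[:, 0, :]                                      # (B, D)
    patches = hidden[:, 1:, :]                                       # (B, grid*grid, D)
    patch_grid = patches.reshape(hidden.shape[0], grid, grid, -1)    # (B, rows, cols, D)
    pooled = patches.mean(dim=1)                                     # (B, D)
    return cls_token, patch_grid, pooled


# Stand-in for `model(images).last_hidden_state` with B=2
dummy = torch.randn(2, 4097, 768)
cls_token, patch_grid, pooled = split_and_pool(dummy)
print(cls_token.shape, patch_grid.shape, pooled.shape)
```

The `patch_grid` view is handy when a downstream task needs spatial structure (for example, region-level features), while `pooled` or `cls_token` give one vector per page.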

Usage on systems (non-pages)

For system-level images, where resizing to 1024x1024 px would distort the aspect ratio too much, there are two options:

Padding

Pad the image to fit the input size. Recommended for the zero-shot configuration.

import torch
from transformers import ViTModel
from PIL import Image
from torchvision import transforms as T


image_path = 'path/staff_image.png'
image = Image.open(image_path).convert("RGB")
image = image.resize((1024, 64)) # (W, H); resize returns a new image, so reassign it

background = Image.new("RGB", (1024, 1024), color=(255, 255, 255))
background.paste(image, (0, 0))
image = background # You might check image aspect with image.save('img.png')

processor = T.Compose([ # It already has 1024x1024 shape
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W

model = ViTModel.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)

out = model(images).last_hidden_state
out = out[:, 1:, :] # Skip CLS token
out = out.reshape(out.shape[0], 64, 64, -1) # shape: B, Rows, Columns, Dim
out = out[:, :4, :, :] # take 64/16=4 first rows
out = out.flatten(1, 2)

print(out.shape) #shape: B, 256, 768
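The example above hard-codes a 1024x64 staff image. For other system sizes the same idea can be generalized: resize to width 1024 keeping the aspect ratio, pad to 1024x1024, and keep only the token rows that cover real content. The following is a small sketch of that arithmetic, assuming a patch size of 16 (consistent with the 64/16 = 4 rows above). `padded_geometry` is a hypothetical helper, not part of the model's API.

```python
import math

PATCH = 16      # patch size implied by the 64/16 = 4 rows above
TARGET = 1024   # model input side in pixels


def padded_geometry(width: int, height: int):
    """Return (new_w, new_h, rows_to_keep) for a system image of the given size."""
    scale = TARGET / width
    new_h = max(1, round(height * scale))
    if new_h > TARGET:
        raise ValueError("image is taller than wide; treat it as a page instead")
    rows = math.ceil(new_h / PATCH)   # token rows that contain real content
    return TARGET, new_h, rows


print(padded_geometry(2048, 128))   # (1024, 64, 4)
print(padded_geometry(3000, 400))
```

With `rows` in hand, the slice `out[:, :4, :, :]` becomes `out[:, :rows, :, :]`. When the resized height is not a multiple of 16, the last kept row mixes content and white padding.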

Interpolate positional encoding

If you don't want to pad, you can interpolate the positional encoding of the model. In zero-shot use, this configuration degrades embedding quality. For fine-tuning MuSViT, however, it performs well.

import torch
from transformers import ViTModel
from PIL import Image
from torchvision import transforms as T


image_path = 'path/staff_image.png'
image = Image.open(image_path).convert("RGB")
processor = T.Compose([
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W

model = ViTModel.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)

out = model(images, interpolate_pos_encoding=True).last_hidden_state

print(out.shape) #shape: B, Len, Dim. Note it has CLS token
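With interpolation, the sequence length `Len` depends on the input size. Assuming the 16-pixel patch size used elsewhere in this card and a non-overlapping patch embedding (so partial patches at the right and bottom edges are dropped), it can be estimated as below. `expected_seq_len` is a hypothetical helper for illustration.

```python
PATCH = 16  # assumed patch size


def expected_seq_len(height: int, width: int, patch: int = PATCH) -> int:
    """Number of output tokens (patches + CLS) for an H x W input."""
    rows, cols = height // patch, width // patch
    return rows * cols + 1


print(expected_seq_len(1024, 1024))  # 4097
print(expected_seq_len(64, 1024))    # 257
```

Resizing the image so that both sides are multiples of 16 avoids silently losing the pixels of incomplete edge patches.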

Usage of pre-trained MAE model

import torch
from transformers import ViTMAEForPreTraining
from PIL import Image
from torchvision import transforms as T

image_path = 'path/image.png'
image = Image.open(image_path).convert("RGB")
processor = T.Compose([
    T.Resize([1024, 1024]),
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W

model = ViTMAEForPreTraining.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)

out = model(images)

print(out.loss) # Reconstruction loss
print(out.logits.shape) #shape: B, 4096, 768. This 768 comes from 3*16*16 px to reconstruct per patch
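Each 768-dimensional logit vector holds the predicted pixels of one 16x16 RGB patch. Below is a sketch of folding such vectors back into an image tensor. It assumes the per-patch layout is (patch row, patch column, channel), which is our understanding of the Hugging Face ViTMAE convention; check it against your installed version. If the model was trained with normalized pixel targets, the values will not be raw pixel intensities. A patchified random image stands in for `out.logits`, and both function names are local to this example.

```python
import torch


def patchify(images: torch.Tensor, p: int = 16) -> torch.Tensor:
    """(B, C, H, W) -> (B, N, p*p*C), using an assumed (row, col, channel) layout."""
    b, c, h, w = images.shape
    gh, gw = h // p, w // p
    x = images.reshape(b, c, gh, p, gw, p)
    x = torch.einsum("nchpwq->nhwpqc", x)
    return x.reshape(b, gh * gw, p * p * c)


def unpatchify(patches: torch.Tensor, p: int = 16, c: int = 3) -> torch.Tensor:
    """(B, N, p*p*C) -> (B, C, H, W) for a square patch grid."""
    b, n, _ = patches.shape
    g = int(n ** 0.5)                       # patches per side
    x = patches.reshape(b, g, g, p, p, c)
    x = torch.einsum("nhwpqc->nchpwq", x)
    return x.reshape(b, c, g * p, g * p)


images = torch.rand(1, 3, 1024, 1024)
logits_like = patchify(images)      # same shape as out.logits: (1, 4096, 768)
recon = unpatchify(logits_like)     # (1, 3, 1024, 1024)
print(logits_like.shape, recon.shape)
```

The round trip through `patchify` and `unpatchify` reproduces the input exactly, which is a quick way to confirm the two functions agree on the layout.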

⚠️ Warning ⚠️:

Loading with AutoModel instantiates the model as ViTMAEModel. That class returns only the visible patches: 70% of the patches are masked out and the remaining ones are shuffled. If you want all the patches, set the mask ratio to 0. To also avoid shuffling, pass a 'noise' tensor with contiguous positions to forward.

import torch
from transformers import AutoModel
from PIL import Image
from torchvision import transforms as T


image_path = 'path/image.png'
image = Image.open(image_path).convert("RGB")
processor = T.Compose([
    T.Resize([1024, 1024]),
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W

model = AutoModel.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)
model.config.mask_ratio = 0.

noise = torch.arange(4096).expand(images.shape[0], 4096)
out = model(images, noise=noise).last_hidden_state

print(out.shape) #shape: B, 4097, 768. Note it has CLS token
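Why `torch.arange` works as noise: as we understand the ViTMAE masking step, patches are ordered by the `argsort` of the noise and the first ones are kept. Monotonically increasing noise therefore yields the identity permutation, while random noise shuffles. The toy illustration below uses 8 patches and is not the library code itself.

```python
import torch

batch, num_patches = 2, 8

# Contiguous "noise": argsort returns positions in their original order
ordered_noise = torch.arange(num_patches).expand(batch, num_patches)
ids_ordered = torch.argsort(ordered_noise, dim=1)

# Random noise: argsort returns some permutation of the positions
torch.manual_seed(0)
random_noise = torch.rand(batch, num_patches)
ids_random = torch.argsort(random_noise, dim=1)

print(ids_ordered[0].tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(ids_random[0].tolist())   # a permutation of 0..7
```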

To avoid these inconveniences, we recommend loading the model with ViTModel, as in the sections above.

Citation

@inproceedings{penarrubia2026musvit,
  title     = {MuSViT: A Foundation Vision Model for Sheet Music Representation},
  author    = {Penarrubia, Carlos and Rios-Vila, Antonio and Fuentes-Martinez, Eliseo and Martinez-Sevilla, Juan C. and Castellanos, Francisco J. and Alfaro-Contreras, Maria and Calvo-Zaragoza, Jorge},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

Model size: 39.4M params (Safetensors, tensor type F32)
