MuSViT: A Foundation Vision Model for Sheet Music Representation
Accepted at European Conference on Computer Vision (ECCV'26)
MuSViT-light
MuSViT (Music Score Vision Transformer) is a foundation vision encoder for music score pages. The model is a ViT pre-trained with the Masked Autoencoder (MAE) objective on 9.7M sheet music images from IMSLP. The embeddings produced by MuSViT are task-agnostic, so they can be used for any downstream task.
Model Details
Model Description
- Developed by: Pattern Recognition and Artificial Intelligence Group (PRAIG), University of Alicante, Spain
- Model type: MAE
- Paper: MuSViT: A Foundation Vision Model for Sheet Music Representation
- License: CC BY-NC-SA 4.0
How to use
Installation
MuSViT-light does not require cloning any repository; you only need the transformers library installed. The examples below also import torch, torchvision, and PIL (Pillow).
Usage on music score pages
import torch
from transformers import ViTModel
from PIL import Image
from torchvision import transforms as T
image_path = 'path/image.png'
image = Image.open(image_path).convert("RGB")
processor = T.Compose([
    T.Resize([1024, 1024]),
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W
model = ViTModel.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)
out = model(images).last_hidden_state
print(out.shape) #shape: B, 4097, 768. Note it has CLS token
Usage in systems (non-pages)
For system-level images, where resizing to 1024x1024 px would distort the aspect ratio too much, there are two options:
- Padding
Pad the image to fit the input size. Recommended for the zero-shot configuration.
import torch
from transformers import ViTModel
from PIL import Image
from torchvision import transforms as T
image_path = 'path/staff_image.png'
image = Image.open(image_path).convert("RGB")
image = image.resize((1024, 64)) # (W, H); resize returns a new image, so reassign it
background = Image.new("RGB", (1024, 1024), color=(255, 255, 255))
background.paste(image, (0, 0))
image = background # You can inspect the padded result with image.save('img.png')
processor = T.Compose([ # It already has 1024x1024 shape
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W
model = ViTModel.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)
out = model(images).last_hidden_state
out = out[:, 1:, :] # Skip CLS token
out = out.reshape(out.shape[0], 64, 64, -1) # shape: B, Rows, Columns, Dim
out = out[:, :4, :, :] # take 64/16=4 first rows
out = out.flatten(1, 2)
print(out.shape) #shape: B, 256, 768
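The snippet above hard-codes a 1024x64 staff image and keeps the first 4 token rows. For other sizes, the number of token rows and columns to keep could be derived from the pasted image size. The following is a minimal sketch under two assumptions taken from the example: the patch size is 16 px (as in the "64/16=4" comment) and the image is pasted at the top-left corner of the 1024x1024 canvas. The helper name `kept_token_grid` is hypothetical and is not part of transformers or MuSViT.

```python
import math

PATCH = 16     # patch size implied by the "64/16=4" comment above (assumption)
CANVAS = 1024  # side of the padded square canvas


def kept_token_grid(width, height):
    """Return (rows, cols) of patch tokens overlapping an image pasted at (0, 0).

    Hypothetical helper for illustration only.
    """
    if not (0 < width <= CANVAS and 0 < height <= CANVAS):
        raise ValueError("image must fit inside the canvas")
    # A patch is kept as soon as it contains at least one image pixel.
    rows = math.ceil(height / PATCH)
    cols = math.ceil(width / PATCH)
    return rows, cols


rows, cols = kept_token_grid(1024, 64)
print(rows, cols, rows * cols)  # 4 64 256
```

With these values, the crop in the example would read `out[:, :rows, :cols, :]` once the patch tokens have been reshaped to (B, 64, 64, Dim).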
- Interpolate positional encoding
If you don't want to pad, you can interpolate the model's positional encoding. In zero-shot use, this configuration degrades embedding quality. For fine-tuning MuSViT, however, it reports good performance.
import torch
from transformers import ViTModel
from PIL import Image
from torchvision import transforms as T
image_path = 'path/staff_image.png'
image = Image.open(image_path).convert("RGB")
processor = T.Compose([
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W
model = ViTModel.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)
out = model(images, interpolate_pos_encoding=True).last_hidden_state
print(out.shape) #shape: B, Len, Dim. Note it has CLS token
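With interpolated positional encodings, `Len` depends on the input size. As a rough sketch, assuming a 16 px patch size and that the patch embedding drops any remainder pixels (floor division), the expected sequence length can be estimated before running the model. The function name is hypothetical.

```python
def expected_sequence_length(height, width, patch=16, cls_token=True):
    """Estimate the number of tokens for an image of the given size.

    Assumes non-overlapping patches of `patch` px and floor division
    for sizes that are not multiples of the patch size.
    """
    rows = height // patch
    cols = width // patch
    return rows * cols + (1 if cls_token else 0)


print(expected_sequence_length(1024, 1024))  # 4097
print(expected_sequence_length(64, 1024))    # 257
```

Resizing the staff image so that both sides are multiples of 16 avoids relying on the floor-division assumption.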
Usage of pre-trained MAE model
import torch
from transformers import ViTMAEForPreTraining
from PIL import Image
from torchvision import transforms as T
image_path = 'path/image.png'
image = Image.open(image_path).convert("RGB")
processor = T.Compose([
    T.Resize([1024, 1024]),
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W
model = ViTMAEForPreTraining.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)
out = model(images)
print(out.loss) # Reconstruction loss
print(out.logits.shape) #shape: B, 4096, 768. This 768 comes from 3*16*16 px to reconstruct per patch
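To make the 768-dimensional reconstruction vector concrete, the sketch below maps one logit position back to a pixel of the 1024x1024 input. It assumes the channel-last ordering inside each patch (row, column, channel) used by the ViTMAE patchify step in transformers; treat that ordering as an assumption and verify it against your installed version. The function name is hypothetical.

```python
PATCH = 16
CHANNELS = 3
GRID = 1024 // PATCH  # 64 patches per row and per column


def logit_index_to_pixel(patch_idx, feature_idx):
    """Map (patch index, feature index) to (channel, y, x) in the input image."""
    patch_row, patch_col = divmod(patch_idx, GRID)
    # Assumed layout inside a patch: pixels in row-major order, channel last.
    pixel_in_patch, channel = divmod(feature_idx, CHANNELS)
    dy, dx = divmod(pixel_in_patch, PATCH)
    return channel, patch_row * PATCH + dy, patch_col * PATCH + dx


print(PATCH * PATCH * CHANNELS)         # 768
print(logit_index_to_pixel(0, 0))       # (0, 0, 0)
print(logit_index_to_pixel(4095, 767))  # (2, 1023, 1023)
```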
⚠️ Warning ⚠️:
Loading with AutoModel instantiates ViTMAEModel, which returns the patches shuffled and with 70% of them masked out. To get all the patches, set the mask ratio to 0. To keep the patches in their original order, pass a 'noise' tensor with increasing (contiguous) values to the forward call.
import torch
from transformers import AutoModel
from PIL import Image
from torchvision import transforms as T
image_path = 'path/image.png'
image = Image.open(image_path).convert("RGB")
processor = T.Compose([
    T.Resize([1024, 1024]),
    T.ToTensor()
])
images = processor(image).unsqueeze(0) # shape: B, C, H, W
model = AutoModel.from_pretrained('PRAIG/musvit-light', trust_remote_code=True)
model.config.mask_ratio = 0.
noise = torch.arange(4096).expand(images.shape[0], 4096)
out = model(images, noise=noise).last_hidden_state
print(out.shape) #shape: B, 4097, 768. Note it has CLS token
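Why an increasing `noise` tensor restores the original patch order can be seen in a small pure-Python sketch of MAE-style random masking: patches are sorted by their noise value and only the first fraction is kept. This is an illustration of the idea, not the library's code; `keep_order` is a hypothetical name.

```python
def keep_order(noise, mask_ratio):
    """Return the indices of the patches that survive masking, in output order."""
    length = len(noise)
    len_keep = int(length * (1 - mask_ratio))
    # Equivalent of an argsort: patch indices ordered by ascending noise.
    shuffled = sorted(range(length), key=lambda i: noise[i])
    return shuffled[:len_keep]


# Random-looking noise with half of the patches masked: shuffled subset.
print(keep_order([0.9, 0.1, 0.5, 0.3], mask_ratio=0.5))  # [1, 3]
# Increasing noise and no masking: every patch, in its original order.
print(keep_order([0, 1, 2, 3], mask_ratio=0.0))          # [0, 1, 2, 3]
```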
To avoid these inconveniences, we recommend loading the model with ViTModel, as shown in the sections above.
Citation
@inproceedings{penarrubia2026musvit,
title = {MuSViT: A Foundation Vision Model for Sheet Music Representation},
author = {Penarrubia, Carlos and Rios-Vila, Antonio and Fuentes-Martinez, Eliseo and Martinez-Sevilla, Juan C. and Castellanos, Francisco J. and Alfaro-Contreras, Maria and Calvo-Zaragoza, Jorge},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}