arxiv:2606.31811

MuSViT: A Foundation Vision Model for Sheet Music Representation

Published on Jun 30

Submitted by Juan Carlos Martinez Sevilla on Jul 1

Pattern Recognition and Artificial Intelligence Group

Abstract

MuSViT is a vision transformer-based foundation model pre-trained on millions of sheet music pages that demonstrates superior performance in music score recognition and symbol detection tasks through both linear probing and fine-tuning approaches.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation -- a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the IMSLP. To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks -- full-page and staff-level music score recognition, music symbol detection, and score difficulty classification -- under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space -- unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.
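The linear probing scenario mentioned in the abstract (frozen encoder, trainable linear head) can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: `frozen_encoder` is a hypothetical stand-in function rather than MuSViT, the toy binary task merely stands in for a downstream task such as difficulty classification, and the plain gradient-descent logistic head is not claimed to be the probe configuration used in the paper.

```python
import math
import random


def frozen_encoder(x):
    """Hypothetical stand-in for a pre-trained encoder.

    It has no trainable parameters, so it stays fixed during probing.
    """
    return [x[0] + x[1], x[0] - x[1], x[0] * x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression head on frozen features with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for j in range(dim):
                grad_w[j] += err * f[j]
            grad_b += err
        # Only the head (w, b) is updated; the encoder is never touched.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# Toy data (assumption): label is 1 when the two inputs sum to a positive value.
rng = random.Random(0)
inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in inputs]

# Features are computed once, since the encoder is frozen.
features = [frozen_encoder(x) for x in inputs]
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
print(f"probe training accuracy: {accuracy:.2f}")
```

Because the encoder never changes, probe accuracy reflects how linearly accessible the task-relevant information already is in the representation, which is the property the linear probing comparison is meant to measure.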

Community

JuanCarlosMartinezSevilla (paper submitter)

Accepted at European Conference on Computer Vision (ECCV'26)

Overview of MuSViT. MuSViT is pre-trained on diverse sheet music pages using Masked Autoencoders: patches are randomly masked and the model learns to reconstruct the missing content from the remaining visible context. We evaluate the generality of the learned representations by probing the encoder across four diverse downstream tasks: full-page and staff-level music score recognition, music symbol detection, and score difficulty classification.
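The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`patchify`, `random_mask`, `masked_mse`), the 8x8 toy image, the 2x2 patch size, and the 0.75 mask ratio are all hypothetical choices, and the page does not state the values used for MuSViT.

```python
import random


def patchify(image, patch):
    """Split a 2-D list (H x W) into non-overlapping patch x patch tiles, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    tiles = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def random_mask(num_patches, mask_ratio, seed=None):
    """Pick patches to hide uniformly at random; return (visible_ids, masked_ids)."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)


def masked_mse(target_patches, predicted_patches, masked_ids):
    """Mean squared reconstruction error, computed over the masked patches only."""
    total, count = 0.0, 0
    for i in masked_ids:
        for t_row, p_row in zip(target_patches[i], predicted_patches[i]):
            for t, p in zip(t_row, p_row):
                total += (t - p) ** 2
                count += 1
    return total / count if count else 0.0


# Toy 8x8 "page" of alternating pixels, cut into 2x2 patches (16 patches in total).
image = [[(r + c) % 2 for c in range(8)] for r in range(8)]
patches = patchify(image, 2)
visible_ids, masked_ids = random_mask(len(patches), mask_ratio=0.75, seed=0)

# The encoder would only see the visible patches; a decoder would predict the rest.
# Here an all-zero "prediction" stands in for the decoder output.
blank = [[[0, 0], [0, 0]] for _ in patches]
print(len(visible_ids), len(masked_ids), masked_mse(patches, blank, masked_ids))
```

Scoring the loss only on the hidden patches is what forces the model to infer missing content from the visible context instead of copying its input.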
