
LEGATO

LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR. This model performs Optical Music Recognition (OMR) on typeset sheet music images, transcribing them directly to ABC notation.

🔗 Try it: Interactive Demo | Leaderboard

⚠️ Important: This model must be used with the legato codebase. It cannot be loaded with standard Transformers pipelines alone due to the custom LegatoModel architecture.

Model Details

  • Developed by: Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith
  • Model type: Vision-language model for end-to-end OMR
  • Architecture: Based on Llama 3.2 11B Vision (Mllama). Uses a frozen vision encoder and a trained text decoder that outputs ABC notation.
  • License: MIT
  • Paper: LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR

How to Use

Installation

  1. Clone the repository and install dependencies:
git clone https://github.com/guang-yng/legato.git
cd legato
pip install -r requirements.txt

Tested with Python 3.12 and CUDA 12.4.

  2. Access requirements: This model loads the vision encoder from meta-llama/Llama-3.2-11B-Vision. Ensure you have accepted the Llama 3.2 license on Hugging Face.

Inference

import torch
from PIL import Image
from transformers import AutoProcessor, GenerationConfig
from legato.models import LegatoModel

# Load model and processor
model = LegatoModel.from_pretrained("guangyangmusic/legato")
processor = AutoProcessor.from_pretrained("guangyangmusic/legato")

# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and process image
image = Image.open("path/to/sheet_music.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate ABC notation
generation_config = GenerationConfig(
    max_length=2048,
    num_beams=10,
    repetition_penalty=1.1
)

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

# Decode output
abc_notation = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(abc_notation)

Half-Precision Inference (Reduced Memory)

model = LegatoModel.from_pretrained("guangyangmusic/legato")
model = model.to("cuda").half()  # Use FP16

Batch Inference

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(images=images, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

abc_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
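For large image collections, processing in fixed-size chunks bounds GPU memory use. A minimal, generic batching helper (plain Python, not part of the legato API) might look like:

```python
def chunks(items, batch_size):
    """Yield successive fixed-size batches from a sequence."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Hypothetical usage with the processor/model from above:
# for batch_paths in chunks(image_paths, 4):
#     images = [Image.open(p).convert("RGB") for p in batch_paths]
#     ...
```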

Output

The model outputs ABC notation transcriptions. ABC can be converted to MusicXML using the conversion utilities in the legato codebase.
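For reference, ABC notation is a compact text format with a tune header (index, meter, unit note length, key) followed by the music body. An illustrative fragment (not actual model output) looks like:

```
X:1
M:4/4
L:1/8
K:C
CDEF GABc | c8 |]
```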

Intended Use

  • Primary use: Transcribing typeset sheet music images (piano scores, orchestral parts, etc.) to ABC notation
  • Audience: Researchers and practitioners in music information retrieval, digital humanities, and music technology
  • Out-of-scope: Handwritten notation, audio-to-score transcription, real-time inference without GPU

Limitations

  • Trained primarily on synthetic typeset data; performance may degrade on handwritten scores, low-quality scans, or unusual layouts
  • Requires significant GPU memory (~20GB+ for full precision; use --fp16 for lower memory)
  • Depends on access to meta-llama/Llama-3.2-11B-Vision for the vision encoder
  • Maximum generation length: 2048 tokens (default)

Training

Trained on PDMX-Synth with DeepSpeed ZeRO-2. For training and validation instructions, see the legato repository.
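DeepSpeed ZeRO-2 partitions optimizer states and gradients across data-parallel workers to reduce per-GPU memory. As an illustration only (the actual training configuration lives in the legato repository), a minimal ZeRO-2 JSON config has this shape:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```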

Evaluation

The codebase provides evaluation scripts for:

  • TEDn – Tree Edit Distance on MusicXML
  • OMR-NED – Normalized Edit Distance via musicdiff

See the README for evaluation commands.
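OMR-NED is computed with musicdiff over parsed scores, but the underlying idea is a normalized edit distance. A simplified character-level version over ABC strings (an illustration, not the official metric) can be sketched as:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(ref: str, hyp: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not ref and not hyp:
        return 0.0
    return levenshtein(ref, hyp) / max(len(ref), len(hyp))
```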

Citation

@misc{yang2025legatolargescaleendtoendgeneralizable,
      title={LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR}, 
      author={Guang Yang and Victoria Ebert and Nazif Tamer and Brian Siyuan Zheng and Luiza Pozzobon and Noah A. Smith},
      year={2025},
      eprint={2506.19065},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.19065}, 
}

Related Models

| Model | Link |
|---|---|
| legato (this model) | guangyangmusic/legato |
| legato-small | guangyangmusic/legato-small |

Recommended: Use legato (this model). The small variant is mainly for baseline comparisons and is less efficient.
