LEGATO
LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR. This model performs Optical Music Recognition (OMR) on typeset sheet music images, transcribing them directly to ABC notation.
🔗 Try it: Interactive Demo | Leaderboard
⚠️ Important: This model must be used with the legato codebase. It cannot be loaded with standard Transformers pipelines alone due to the custom LegatoModel architecture.
Model Details
- Developed by: Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith
- Model type: Vision-language model for end-to-end OMR
- Architecture: Based on Llama 3.2 11B Vision (Mllama). Uses a frozen vision encoder and a trained text decoder that outputs ABC notation.
- License: MIT
- Paper: LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR
How to Use
Installation
- Clone the repository and install dependencies:
git clone https://github.com/guang-yng/legato.git
cd legato
pip install -r requirements.txt
Tested with Python 3.12 and CUDA 12.4.
- Access requirements: This model loads the vision encoder from meta-llama/Llama-3.2-11B-Vision. Ensure you have accepted the Llama 3.2 license on Hugging Face.
Inference
import torch
from PIL import Image
from transformers import AutoProcessor, GenerationConfig
from legato.models import LegatoModel
# Load model and processor
model = LegatoModel.from_pretrained("guangyangmusic/legato")
processor = AutoProcessor.from_pretrained("guangyangmusic/legato")
# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Load and process image
image = Image.open("path/to/sheet_music.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
# Generate ABC notation
generation_config = GenerationConfig(
    max_length=2048,
    num_beams=10,
    repetition_penalty=1.1
)
with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)
# Decode output
abc_notation = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(abc_notation)
Half-Precision Inference (Reduced Memory)
model = LegatoModel.from_pretrained("guangyangmusic/legato")
model = model.to("cuda").half() # Use FP16
Batch Inference
images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(images=images, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)
abc_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
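When there are more images than fit in GPU memory at once, the list of paths can be split into smaller batches before running the snippet above on each one. The following is a minimal sketch of such a helper. The name chunked and the batch size are illustrative assumptions and are not part of the legato codebase.

```python
from typing import Iterator, List, Sequence


def chunked(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths` with at most `batch_size` items each.

    Hypothetical helper for illustration; not provided by legato.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Example with placeholder file names (no images are opened here).
example_paths = ["page1.png", "page2.png", "page3.png", "page4.png", "page5.png"]
batches = list(chunked(example_paths, batch_size=2))
print(batches)
```

Each yielded batch can then take the place of image_paths in the batch inference snippet. A suitable batch size depends on image resolution and available memory.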
Output
The model outputs ABC notation transcriptions. ABC can be converted to MusicXML using the conversion utilities in the legato codebase.
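Before converting a transcription, it can be useful to inspect its ABC header fields (for example the meter M: and the key K:). The sketch below is a small standalone reader for those fields. It assumes the output starts with standard ABC information fields. The sample string is hand-written for illustration and is not actual model output, and the function name is hypothetical.

```python
import re
from typing import Dict

# An ABC information field is a single letter followed by a colon, e.g. "K:G".
_FIELD = re.compile(r"^([A-Za-z]):(.*)$")


def abc_header_fields(abc_text: str) -> Dict[str, str]:
    """Collect header fields up to and including the first K: (key) field.

    In standard ABC, the K: field ends the tune header.
    """
    fields: Dict[str, str] = {}
    for raw_line in abc_text.splitlines():
        line = raw_line.strip()
        if line == "" or line.startswith("%"):
            continue  # skip blank lines and comments
        match = _FIELD.match(line)
        if match is None:
            break  # first non-field line: the header is over
        key, value = match.group(1), match.group(2).strip()
        fields.setdefault(key, value)  # keep the first occurrence of each field
        if key == "K":
            break
    return fields


# Hand-written sample, not produced by the model.
sample = """X:1
T:Example
M:4/4
L:1/8
K:G
G2 A2 B2 c2 | d4 d4 |]
"""

print(abc_header_fields(sample))
```

This only reads the header and does not validate the music body. For conversion to MusicXML, use the utilities in the legato codebase.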
Intended Use
- Primary use: Transcribing typeset sheet music images (piano scores, orchestral parts, etc.) to ABC notation
- Audience: Researchers and practitioners in music information retrieval, digital humanities, and music technology
- Out-of-scope: Handwritten notation, audio-to-score transcription, real-time inference without GPU
Limitations
- Trained primarily on synthetic typeset data; performance may degrade on handwritten scores, low-quality scans, or unusual layouts
- Requires significant GPU memory (~20GB+ for full precision; use --fp16 for lower memory)
- Depends on access to meta-llama/Llama-3.2-11B-Vision for the vision encoder
- Maximum generation length: 2048 tokens (default)
Training
Trained on PDMX-Synth with DeepSpeed ZeRO-2. For training and validation instructions, see the legato repository.
Evaluation
The codebase provides evaluation scripts for:
- TEDn – Tree Edit Distance on MusicXML
- OMR-NED – Normalized Edit Distance via musicdiff
See the README for evaluation commands.
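For intuition about what a normalized edit distance measures, the sketch below computes a plain Levenshtein distance between two token sequences and divides it by the length of the longer one. This is a generic illustration only. It is not the OMR-NED or musicdiff implementation, and the normalization used here is an assumption.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: Sequence, b: Sequence) -> float:
    """Edit distance divided by the longer length (0.0 when both are empty)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest


# Illustrative token sequences: one of four tokens differs.
reference = ["C", "D", "E", "F"]
prediction = ["C", "D", "G", "F"]
print(normalized_edit_distance(reference, prediction))
```

A value of 0.0 means the sequences are identical, and values closer to 1.0 mean more edits are needed. Reported scores should come from the evaluation scripts in the codebase.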
Citation
@misc{yang2025legatolargescaleendtoendgeneralizable,
title={LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR},
author={Guang Yang and Victoria Ebert and Nazif Tamer and Brian Siyuan Zheng and Luiza Pozzobon and Noah A. Smith},
year={2025},
eprint={2506.19065},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.19065},
}
Related Models
| Model | Link |
|---|---|
| legato (this model) | guangyangmusic/legato |
| legato-small | guangyangmusic/legato-small |
Recommended: Use legato (this model). The small variant is mainly for baseline comparisons and is less efficient.