
LEGATO

LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR. This model performs Optical Music Recognition (OMR) on typeset sheet music images, transcribing them directly to ABC notation.

🔗 Try it: Interactive Demo | Leaderboard

⚠️ Important: This model must be used with the legato codebase. It cannot be loaded with standard Transformers pipelines alone due to the custom LegatoModel architecture.

Model Details

  • Developed by: Guang Yang, Victoria Ebert, Nazif Tamer, Brian Siyuan Zheng, Luiza Pozzobon, Noah A. Smith
  • Model type: Vision-language model for end-to-end OMR
  • Architecture: Based on Llama 3.2 11B Vision (Mllama). Uses a frozen vision encoder and a trained text decoder that outputs ABC notation.
  • License: MIT
  • Paper: LEGATO: Large-Scale End-to-End Generalizable Approach to Typeset OMR

How to Use

Installation

  1. Clone the repository and install dependencies:
git clone https://github.com/guang-yng/legato.git
cd legato
pip install -r requirements.txt

Tested with Python 3.12 and CUDA 12.4.

  2. Access requirements: This model loads the vision encoder from meta-llama/Llama-3.2-11B-Vision. Ensure you have accepted the Llama 3.2 license on Hugging Face.

Inference

import torch
from PIL import Image
from transformers import AutoProcessor, GenerationConfig
from legato.models import LegatoModel

# Load model and processor
model = LegatoModel.from_pretrained("guangyangmusic/legato")
processor = AutoProcessor.from_pretrained("guangyangmusic/legato")

# Move to GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load and process image
image = Image.open("path/to/sheet_music.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate ABC notation
generation_config = GenerationConfig(
    max_length=2048,
    num_beams=10,
    repetition_penalty=1.1
)

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

# Decode output
abc_notation = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(abc_notation)

Half-Precision Inference (Reduced Memory)

model = LegatoModel.from_pretrained("guangyangmusic/legato")
model = model.to("cuda").half()  # Use FP16

Batch Inference

images = [Image.open(p).convert("RGB") for p in image_paths]
inputs = processor(images=images, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, generation_config=generation_config)

abc_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
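For large image collections, processing in fixed-size chunks bounds GPU memory use. A minimal, generic batching helper (plain Python, not part of the legato API) might look like:

```python
def chunks(items, batch_size):
    """Yield successive fixed-size batches from a sequence."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Hypothetical usage with the processor/model from above:
# for batch_paths in chunks(image_paths, 4):
#     images = [Image.open(p).convert("RGB") for p in batch_paths]
#     ...
```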

Output

The model outputs ABC notation transcriptions. ABC can be converted to MusicXML using the conversion utilities in the legato codebase.
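For reference, ABC notation is a compact text format with a tune header (index, meter, unit note length, key) followed by the music body. An illustrative fragment (not actual model output) looks like:

```
X:1
M:4/4
L:1/8
K:C
CDEF GABc | c8 |]
```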

Intended Use

  • Primary use: Transcribing typeset sheet music images (piano scores, orchestral parts, etc.) to ABC notation
  • Audience: Researchers and practitioners in music information retrieval, digital humanities, and music technology
  • Out-of-scope: Handwritten notation, audio-to-score transcription, real-time inference without GPU

Limitations

  • Trained primarily on synthetic typeset data; performance may degrade on handwritten scores, low-quality scans, or unusual layouts
  • Requires significant GPU memory (~20GB+ for full precision; use --fp16 for lower memory)
  • Depends on access to meta-llama/Llama-3.2-11B-Vision for the vision encoder
  • Maximum generation length: 2048 tokens (default)

Training

Trained on PDMX-Synth with DeepSpeed ZeRO-2. For training and validation instructions, see the legato repository.
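DeepSpeed ZeRO-2 partitions optimizer states and gradients across data-parallel workers to reduce per-GPU memory. As an illustration only (the actual training configuration lives in the legato repository), a minimal ZeRO-2 JSON config has this shape:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```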

Evaluation

The codebase provides evaluation scripts for:

  • TEDn – Tree Edit Distance on MusicXML
  • OMR-NED – Normalized Edit Distance via musicdiff

See the README for evaluation commands.
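OMR-NED is computed with musicdiff over parsed scores, but the underlying idea is a normalized edit distance. A simplified character-level version over ABC strings (an illustration, not the official metric) can be sketched as:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(ref: str, hyp: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not ref and not hyp:
        return 0.0
    return levenshtein(ref, hyp) / max(len(ref), len(hyp))
```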

Citation

@misc{yang2025legatolargescaleendtoendgeneralizable,
      title={LEGATO: Large-scale End-to-end Generalizable Approach to Typeset OMR}, 
      author={Guang Yang and Victoria Ebert and Nazif Tamer and Brian Siyuan Zheng and Luiza Pozzobon and Noah A. Smith},
      year={2025},
      eprint={2506.19065},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.19065}, 
}

Related Models

| Model | Link |
|---|---|
| legato (this model) | guangyangmusic/legato |
| legato-small | guangyangmusic/legato-small |

Recommended: Use legato (this model). The small variant is mainly for baseline comparisons and is less efficient.
