# lilyBERT
lilyBERT is a masked language model for LilyPond music notation, built by adapting CodeBERT to the musical domain.
LilyPond is a text-based music engraving language with formal grammar, block structure, and backslash commands, making it structurally similar to a programming language. lilyBERT leverages this by extending CodeBERT's vocabulary with 115 domain-specific tokens (e.g. \trill, \fermata, \mordent, \staccato) and performing MLM pre-training on curated Baroque music scores.
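Extending the vocabulary follows the usual Hugging Face pattern: add the new tokens to the tokenizer, then resize the model's embedding matrix to match. A sketch of that step, using a handful of the 115 music tokens for illustration:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# A few of the 115 LilyPond tokens added to the base CodeBERT vocabulary.
music_tokens = ["\\trill", "\\fermata", "\\mordent", "\\staccato"]

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
num_added = tokenizer.add_tokens(music_tokens)  # returns how many were new

model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")
model.resize_token_embeddings(len(tokenizer))   # grow embeddings for the new ids
```

The new rows of the embedding matrix are randomly initialized, which is why the subsequent MLM pre-training stages are needed to give the music tokens useful representations.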
## Training
This checkpoint was trained in two stages:
- Stage 1 (PDMX pre-training): CodeBERT fine-tuned on the PDMX corpus of automatically converted LilyPond files.
- Stage 2 (BMdataset fine-tuning): further fine-tuned on the BMdataset, a musicologically curated collection of 470 Baroque scores in LilyPond format (90M tokens).
| Hyperparameter | Value |
|---|---|
| Architecture | RobertaForMaskedLM (12 layers, 768 hidden, 12 heads) |
| Vocab size | 50,380 (50,265 base + 115 music tokens) |
| Max sequence length | 512 |
| MLM probability | 0.15 |
| Batch size | 72 × 2 GPUs × 2 grad. accum. = 288 |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 10% |
| Epochs | 10 (early stopping, patience 5) |
| Precision | bf16 |
| Optimizer | AdamW (fused) |
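With an MLM probability of 0.15, training corrupts 15% of input positions using the standard BERT recipe (80% replaced by the mask token, 10% by a random token, 10% left unchanged). A minimal pure-Python sketch of that corruption scheme, with illustrative token ids and a hypothetical `MASK_ID`:

```python
import random

MASK_ID = 50264    # hypothetical mask-token id, for illustration only
VOCAB_SIZE = 50380
MLM_PROB = 0.15

def mask_tokens(token_ids, rng):
    """Return (corrupted_ids, labels); labels are -100 where no loss is taken."""
    corrupted, labels = [], []
    for tok in token_ids:
        if rng.random() < MLM_PROB:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                    # 80%: [MASK]
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tok)                        # 10%: unchanged
        else:
            corrupted.append(tok)
            labels.append(-100)  # ignored by the cross-entropy loss
    return corrupted, labels

rng = random.Random(0)
ids = list(range(100, 120))
corrupted, labels = mask_tokens(ids, rng)
```

In practice this bookkeeping is handled by `DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)` from `transformers`; the sketch only makes the sampling explicit.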
## Results
Linear probing on the out-of-domain Mutopia corpus (layer 6, 5-fold CV):
| Model | Composer Acc. | Style Acc. |
|---|---|---|
| CB + PDMX_full (15B tokens) | 80.8 | 82.6 |
| CB + BMdataset (90M tokens) | 82.9 | 83.7 |
| CB + PDMX_90M (90M tokens) | 81.7 | 82.3 |
| CB + PDMX → BM (this model) | 84.3 | 82.9 |
90M tokens of expertly curated data outperform 15B tokens of automatically converted data. Combining broad pre-training with domain-specific fine-tuning yields the best overall composer accuracy.
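The probing protocol is a frozen-feature linear classifier with 5-fold cross-validation. A scikit-learn sketch on random stand-in features (a real run would substitute layer-6 lilyBERT embeddings and composer labels from Mutopia):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))   # stand-in for pooled layer-6 embeddings
y = rng.integers(0, 5, size=200)  # stand-in composer labels

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="accuracy")
```

Because the encoder is frozen, the probe accuracy measures how linearly separable the composers/styles already are in the representation space.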
## Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csc-unipd/lilybert")
model = AutoModelForMaskedLM.from_pretrained("csc-unipd/lilybert")
```
### Fill-mask example
```python
from transformers import pipeline

filler = pipeline("fill-mask", model="csc-unipd/lilybert")
filler("\\relative c' { c4 d <mask> f | g2 g }")
```
### Feature extraction
```python
import torch

# `tokenizer` and `model` as loaded above
inputs = tokenizer("\\relative c' { c4 d e f | g2 g }", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Layer 6 embeddings (best for linear probing)
embeddings = outputs.hidden_states[6]
```
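For probing-style use, the per-token embeddings are typically pooled into one vector per score. A minimal mean-pooling sketch on a dummy tensor with the model's shapes (batch, sequence length, 768 hidden dims):

```python
import torch

# Dummy stand-in for outputs.hidden_states[6]: (batch, seq_len, hidden)
hidden = torch.randn(1, 12, 768)
attention_mask = torch.ones(1, 12)  # 1 for real tokens, 0 for padding

# Mean-pool over non-padding positions to get one 768-dim vector per input.
mask = attention_mask.unsqueeze(-1)              # (1, 12, 1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

The resulting `pooled` vectors are what a linear probe (as in the Results section) would consume.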
## Citation
```bibtex
@misc{spanio2026bmdataset,
  title  = {BMdataset: A Musicologically Curated LilyPond Dataset},
  author = {Spanio, Matteo and Guler, Ilay and Roda, Antonio},
  year   = {2026},
  note   = {Under review},
}
```
## Links
- Paper: SMC 2026 (to appear)
- Dataset: Zenodo (doi:10.5281/zenodo.18723290)
- Code: GitHub (CSCPadova/lilybert)
## License
Apache-2.0