lilyBERT

lilyBERT is a masked language model for LilyPond music notation, built by adapting CodeBERT to the musical domain.

LilyPond is a text-based music engraving language with a formal grammar, block structure, and backslash commands, making it structurally similar to a programming language. lilyBERT leverages this by extending CodeBERT's vocabulary with 115 domain-specific tokens (e.g. \trill, \fermata, \mordent, \staccato) and performing MLM pre-training on curated Baroque music scores.
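The vocabulary arithmetic behind the extension can be sketched as follows. This is an illustrative snippet, not the actual training code; the token list is a small excerpt of the full 115, and in practice the extension corresponds to the tokenizer's add_tokens followed by resizing the model's embedding matrix.

```python
# Sketch of the vocabulary extension. In transformers this corresponds
# to tokenizer.add_tokens(...) followed by
# model.resize_token_embeddings(len(tokenizer)).
BASE_VOCAB_SIZE = 50_265  # CodeBERT (RoBERTa) base vocabulary

# Excerpt of the 115 LilyPond command tokens added to the vocabulary.
MUSIC_TOKENS = ["\\trill", "\\fermata", "\\mordent", "\\staccato"]
NUM_MUSIC_TOKENS = 115    # full count, per the table below

def extended_vocab_size(base: int, added: int) -> int:
    """Vocabulary size after appending domain-specific tokens."""
    return base + added

print(extended_vocab_size(BASE_VOCAB_SIZE, NUM_MUSIC_TOKENS))  # 50380
```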

Training

This checkpoint was trained in two stages:

  1. Stage 1 โ€” PDMX pre-training: CodeBERT fine-tuned on the PDMX corpus of automatically converted LilyPond files.
  2. Stage 2 โ€” BMdataset fine-tuning: Further fine-tuned on the BMdataset, a musicologically curated collection of 470 Baroque scores in LilyPond format (90M tokens).

| Hyperparameter | Value |
|---|---|
| Architecture | RobertaForMaskedLM (12 layers, 768 hidden, 12 heads) |
| Vocab size | 50,380 (50,265 base + 115 music tokens) |
| Max sequence length | 512 |
| MLM probability | 0.15 |
| Batch size | 72 × 2 GPUs × 2 grad. accum. = 288 |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 10% |
| Epochs | 10 (early stopping, patience 5) |
| Precision | bf16 |
| Optimizer | AdamW (fused) |
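In Hugging Face terms, the table roughly corresponds to a TrainingArguments like the following. This is an illustrative sketch, not the released training script: the output path is an assumption, the 0.15 MLM probability belongs in the data collator rather than these arguments, and early stopping would be added via a trainer callback.

```python
from transformers import TrainingArguments

# Sketch of an MLM training configuration matching the table above.
# Effective batch size: 72 per device × 2 GPUs × 2 accumulation = 288.
training_args = TrainingArguments(
    output_dir="lilybert-mlm",        # assumed path
    per_device_train_batch_size=72,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    num_train_epochs=10,
    bf16=True,
    optim="adamw_torch_fused",
)
```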

Results

Linear probing on the out-of-domain Mutopia corpus (layer 6, 5-fold CV):

| Model | Composer Acc. | Style Acc. |
|---|---|---|
| CB + PDMX_full (15B tokens) | 80.8 | 82.6 |
| CB + BMdataset (90M tokens) | 82.9 | 83.7 |
| CB + PDMX_90M (90M tokens) | 81.7 | 82.3 |
| CB + PDMX → BM (this model) | 84.3 | 82.9 |

90M tokens of expertly curated data outperform 15B tokens of automatically converted data. Combining broad pre-training with domain-specific fine-tuning yields the best overall composer accuracy.
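The probing protocol can be sketched as below: a logistic-regression probe on frozen embeddings, scored with 5-fold cross-validation. The random features stand in for real layer-6 lilyBERT embeddings, so the accuracy it prints is meaningless; only the protocol shape is illustrated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Linear-probing sketch: random features stand in for frozen layer-6
# embeddings (768-dim), labels stand in for composer classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 768))   # 200 scores × 768-dim embeddings
y = rng.integers(0, 4, size=200)      # e.g. 4 composer classes

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```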

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csc-unipd/lilybert")
model = AutoModelForMaskedLM.from_pretrained("csc-unipd/lilybert")

Fill-mask example

from transformers import pipeline

filler = pipeline("fill-mask", model="csc-unipd/lilybert")
filler("\\relative c' { c4 d <mask> f | g2 g }")

Feature extraction

import torch

inputs = tokenizer("\\relative c' { c4 d e f | g2 g }", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Layer 6 embeddings (best for linear probing)
embeddings = outputs.hidden_states[6]
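To get a single vector per score, the token embeddings can be pooled; mask-aware mean pooling is one common choice (an assumption here, not a documented part of the probing setup). The dummy tensors below stand in for the model outputs above so the snippet is self-contained.

```python
import torch

# Mask-aware mean pooling: average token embeddings into one vector
# per score, ignoring padding positions. Dummy tensors stand in for
# `embeddings` and `inputs["attention_mask"]` from the snippet above.
embeddings = torch.randn(2, 12, 768)   # (batch, seq, hidden)
attention_mask = torch.tensor([[1] * 12, [1] * 8 + [0] * 4])

mask = attention_mask.unsqueeze(-1).float()            # (batch, seq, 1)
pooled = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(pooled.shape)  # torch.Size([2, 768])
```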

Citation

@misc{spanio2026bmdataset,
  title     = {BMdataset: A Musicologically Curated LilyPond Dataset},
  author    = {Spanio, Matteo and Guler, Ilay and Roda, Antonio},
  year      = {2026},
  note      = {Under review},
}

License

Apache-2.0
