---
language:
  - en
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
  - music
  - lilypond
  - mlm
  - music-information-retrieval
base_model: microsoft/codebert-base
datasets:
  - custom
model-index:
  - name: lilyBERT
    results:
      - task:
          type: text-classification
          name: Composer Classification (Linear Probe)
        dataset:
          type: custom
          name: Mutopia (out-of-domain)
        metrics:
          - type: accuracy
            value: 84.3
            name: Composer Accuracy
          - type: accuracy
            value: 82.9
            name: Style Accuracy
metrics:
  - accuracy
---

# lilyBERT

lilyBERT is a masked language model for LilyPond music notation, built by adapting CodeBERT to the musical domain.

LilyPond is a text-based music engraving language with a formal grammar, block structure, and backslash commands, which makes it structurally similar to a programming language. lilyBERT exploits this similarity by extending CodeBERT's vocabulary with 115 domain-specific tokens (e.g. `\trill`, `\fermata`, `\mordent`, `\staccato`) and performing MLM pre-training on curated Baroque music scores.
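The mechanics of the vocabulary extension can be sketched schematically: each added token receives the next free id after CodeBERT's 50,265 base entries. The snippet below is a plain-Python illustration with four of the 115 tokens; the constant names are ours, not from the released code.

```python
# Schematic sketch of vocabulary extension: new tokens get ids
# immediately after the base vocabulary.
BASE_VOCAB_SIZE = 50_265  # CodeBERT's original vocabulary size

# Four of the 115 added LilyPond commands (illustrative subset)
new_tokens = ["\\trill", "\\fermata", "\\mordent", "\\staccato"]

token_to_id = {tok: BASE_VOCAB_SIZE + i for i, tok in enumerate(new_tokens)}
print(token_to_id["\\trill"])  # 50265, the first id after the base vocab
```

With the `transformers` library this is done via `tokenizer.add_tokens(new_tokens)` followed by `model.resize_token_embeddings(len(tokenizer))`, which appends randomly initialized rows to the embedding matrix for the new tokens.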

## Training

This checkpoint was trained in two stages:

1. **Stage 1 (PDMX pre-training):** CodeBERT is further pre-trained with MLM on the PDMX corpus of automatically converted LilyPond files.
2. **Stage 2 (BMdataset fine-tuning):** the Stage 1 checkpoint is fine-tuned on BMdataset, a musicologically curated collection of 470 Baroque scores in LilyPond format (90M tokens).
| Hyperparameter | Value |
|---|---|
| Architecture | `RobertaForMaskedLM` (12 layers, 768 hidden, 12 heads) |
| Vocab size | 50,380 (50,265 base + 115 music tokens) |
| Max sequence length | 512 |
| MLM probability | 0.15 |
| Batch size | 72 × 2 GPUs × 2 grad. accum. = 288 |
| Learning rate | 2e-4 (cosine schedule) |
| Warmup | 10% |
| Epochs | 10 (early stopping, patience 5) |
| Precision | bf16 |
| Optimizer | AdamW (fused) |
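The 0.15 MLM probability means that, in each batch, roughly 15% of token positions are selected as prediction targets. A minimal self-contained sketch of the idea (simplified: it always substitutes `<mask>`, whereas BERT-style masking keeps 10% of selected tokens unchanged and replaces 10% with random tokens; in practice `DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)` handles all of this):

```python
import random

random.seed(0)

MASK_ID = 50264          # <mask> id in the RoBERTa/CodeBERT vocabulary
MLM_PROBABILITY = 0.15   # fraction of tokens selected for prediction

def mask_tokens(input_ids):
    """Return (masked_inputs, labels): selected positions are replaced by
    MASK_ID and their label keeps the original id; unselected positions
    get label -100, which the cross-entropy loss ignores."""
    masked, labels = [], []
    for tok in input_ids:
        if random.random() < MLM_PROBABILITY:
            masked.append(MASK_ID)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(-100)
    return masked, labels

masked, labels = mask_tokens(list(range(1000, 1100)))
```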

## Results

Linear probing on the out-of-domain Mutopia corpus (layer 6, 5-fold CV):

| Model | Composer Acc. | Style Acc. |
|---|---|---|
| CB + PDMX_full (15B tokens) | 80.8 | 82.6 |
| CB + BMdataset (90M tokens) | 82.9 | 83.7 |
| CB + PDMX_90M (90M tokens) | 81.7 | 82.3 |
| **CB + PDMX → BM (this model)** | **84.3** | 82.9 |

90M tokens of expertly curated data outperform 15B tokens of automatically converted data. Combining broad pre-training with domain-specific fine-tuning yields the best overall composer accuracy.
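The linear-probing protocol can be sketched as follows: freeze the encoder, extract layer-6 embeddings, and fit a logistic-regression classifier with 5-fold cross-validation. The snippet uses random stand-in features so it runs without the model; real use would replace `X` with pooled lilyBERT hidden states and `y` with composer labels from Mutopia.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_scores, hidden = 200, 768
X = rng.normal(size=(n_scores, hidden))   # stand-in for layer-6 embeddings
y = rng.integers(0, 5, size=n_scores)     # stand-in composer labels

# A linear probe: the encoder stays frozen, only this classifier is fit.
probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5, scoring="accuracy")
print(scores.mean())  # near chance on random features
```

Because the probe is linear, its accuracy measures how linearly separable the composer classes are in the frozen representation, rather than what a task-specific head could learn.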

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csc-unipd/lilybert")
model = AutoModelForMaskedLM.from_pretrained("csc-unipd/lilybert")
```

### Fill-mask example

```python
from transformers import pipeline

filler = pipeline("fill-mask", model="csc-unipd/lilybert")
filler("\\relative c' { c4 d <mask> f | g2 g }")
```

### Feature extraction

```python
import torch

inputs = tokenizer("\\relative c' { c4 d e f | g2 g }", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Layer 6 embeddings (best layer for linear probing)
embeddings = outputs.hidden_states[6]
```
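`hidden_states[6]` is per-token; to get one fixed-size vector per score, a common choice (an assumption on our part, the card does not specify the pooling) is an attention-mask-weighted mean over token positions. Sketch with stand-in tensors so it runs without the model:

```python
import torch

# Stand-ins for outputs.hidden_states[6] and inputs["attention_mask"]
hidden = torch.randn(1, 12, 768)   # (batch, seq_len, hidden)
attention_mask = torch.ones(1, 12)
attention_mask[0, 9:] = 0          # pretend the last 3 positions are padding

# Masked mean pooling: average only over non-padding positions.
mask = attention_mask.unsqueeze(-1)                    # (batch, seq_len, 1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden)
print(pooled.shape)  # torch.Size([1, 768])
```

Masking before averaging matters for batched inputs, where padding tokens would otherwise dilute the embedding.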

## Citation

```bibtex
@misc{spanio2026lilybert,
      title={BMdataset: A Musicologically Curated LilyPond Dataset},
      author={Matteo Spanio and Ilay Guler and Antonio Rodà},
      year={2026},
      eprint={2604.10628},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2604.10628},
}
```


## License

Apache-2.0