learn-abc: Qwen3-VL-32B fine-tune for sheet-music → ABC notation

LoRA adapter that fine-tunes Qwen/Qwen3-VL-32B-Instruct to transcribe phone photos of folk-music sheet music into ABC notation.

This is the larger of two trained variants. The 32B reaches a lower final eval loss than its 8B sibling (0.1286 vs 0.1522) and scores higher on note, pitch, and duration accuracy on uncropped phone photos, but it trails the 8B on tightly cropped photos. It also tends to write a different ABC dialect (slash notation for short durations, V:1 treble voice tags) that diverges stylistically from the training corpus.

Quick use

from PIL import Image
from unsloth import FastVisionModel

# Load the LoRA adapter together with its base model, quantised to 4 bits.
model, processor = FastVisionModel.from_pretrained(
    "folk-abc/learn-abc-qwen3vl32b",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch Unsloth to inference mode

img = Image.open("sheet_music_photo.jpg").convert("RGB")
# One user turn: an image placeholder followed by the instruction text.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text":
            "Transcribe this sheet music into ABC notation. "
            "Output only the ABC, no explanation."},
    ],
}]
# Build the prompt string; the image itself is passed to the processor below.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=img, text=text, add_special_tokens=False,
                   return_tensors="pt").to(model.device)
# Greedy decoding, capped at 800 new tokens.
out = model.generate(**inputs, max_new_tokens=800, do_sample=False)
# Skip the prompt tokens and decode only the generated ABC.
abc = processor.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True)
print(abc)

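For best results, pre-crop the photo to just the sheet of paper before sending. In the evaluation below, uncropped "phone-on-table" photos cost this model about 20 points of pitch accuracy (87% to 67%) and 12 points of key accuracy (88% to 76%). The ~25-point drop in note accuracy applies to the 8B sibling (84% to 57%); this 32B's note accuracy is roughly flat (57% cropped, 61% raw).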
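A minimal sketch of the pre-crop step, assuming the user has marked the four corners of the sheet (for example through a corner-drag UI). The helper name `corners_to_crop_box`, the `margin` parameter, and the sample coordinates are illustrative assumptions, not part of this model's tooling. It computes a simple axis-aligned box and does no perspective correction.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]


def corners_to_crop_box(corners: Iterable[Point], width: int, height: int,
                        margin: int = 0) -> Box:
    """Return (left, upper, right, lower) enclosing the corner points.

    Hypothetical helper: the box is axis-aligned, padded by `margin`
    pixels, and clamped to the image bounds.
    """
    points = list(corners)
    if not points:
        raise ValueError("need at least one corner point")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Pad the bounding box, then clamp it so it stays inside the image.
    left = max(0, int(min(xs)) - margin)
    upper = max(0, int(min(ys)) - margin)
    right = min(width, int(max(xs)) + margin)
    lower = min(height, int(max(ys)) + margin)
    if right <= left or lower <= upper:
        raise ValueError("corners do not enclose any pixels")
    return left, upper, right, lower


# Corners of the sheet as (x, y) pixels in a 1000x1400 photo (made-up values).
box = corners_to_crop_box(
    [(120, 80), (900, 95), (110, 1300), (910, 1290)],
    width=1000, height=1400, margin=20,
)
print(box)
```

With Pillow, the resulting tuple can be passed straight to `img.crop(box)`, and `img.size` supplies `(width, height)`.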

The 4-bit weights need ~20 GB VRAM to load; ~25 GB total during inference. Won't fit on most consumer GPUs.

Training

| Setting | Value |
|---|---|
| Base model | unsloth/Qwen3-VL-32B-Instruct-unsloth-bnb-4bit |
| Method | LoRA (r=16, α=16, dropout=0.05) on vision + language + attention + MLP layers |
| Optimizer | adamw_8bit, lr 5e-5, cosine schedule, warmup_ratio 0.05 |
| Training data | 36,264 augmented phone-photo-style images of 4,533 folk tunes (8 images per tune; Jukedeck Nottingham + Henrik Norbeck collections, plus 21 hand-curated tunes). Each tune's ABC was rendered to a sheet-music image via Playwright + abcjs in two visual styles (compact, sans), then augmented with crop / perspective / rotation / blur / paper-texture / vignette / JPEG-compression effects |
| Targets | Modal-L canonicalised ABC (L: snapped per tune to one of {1, 1/2, 1/4, 1/8, 1/16, 1/32} based on each tune's modal note duration), with R:/S:/Z:/H: metadata thinned to ≤20% prevalence per field |
| Steps | 3,750 of 5,000 (early-stopped on patience=3, threshold=0.005) |
| Hardware | 1× H100 80GB on Runpod, ~3 hr training |
| Final eval loss | 0.1286 (16% below the 8B sibling's 0.1522) |
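The Modal-L target can be illustrated with a short sketch. This is an assumption about how such a snap might work, not the actual preprocessing code: it takes note durations as fractions of a whole note, finds the most frequent one, and snaps it to the nearest allowed unit length by plain absolute difference. The names `modal_unit_length` and `as_multipliers` and the sample durations are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import List, Sequence

# Allowed L: values listed in the Targets row above.
ALLOWED_UNITS = [Fraction(1, d) for d in (1, 2, 4, 8, 16, 32)]


def modal_unit_length(durations: Sequence[Fraction]) -> Fraction:
    """Pick the most common duration and snap it to an allowed L: value."""
    if not durations:
        raise ValueError("tune has no notes")
    # most_common(1) returns the duration with the highest count.
    modal, _count = Counter(durations).most_common(1)[0]
    # Assumption: "nearest" means smallest absolute difference.
    return min(ALLOWED_UNITS, key=lambda unit: abs(unit - modal))


def as_multipliers(durations: Sequence[Fraction], unit: Fraction) -> List[Fraction]:
    """Express each duration as a multiple of the chosen unit length."""
    return [d / unit for d in durations]


# Made-up tune: five eighth notes, two quarter notes, one sixteenth note.
durations = [Fraction(1, 8)] * 5 + [Fraction(1, 4)] * 2 + [Fraction(1, 16)]
unit = modal_unit_length(durations)
multipliers = as_multipliers(durations, unit)
print(f"L:{unit}", multipliers)
```

In this example the tune would be written with L:1/8, so the quarter notes carry a multiplier of 2 and the sixteenth note a multiplier of 1/2.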

Evaluation on 41 hand-labelled phone photos

Note, pitch, and duration accuracy are computed from edit distance over (midi_pitch, quarterLength) sequences parsed with music21, after stripping the under-staff metadata fields (R/S/Z/H/D/B/F) from both prediction and truth.
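The idea behind these metrics can be sketched as follows. This is not the evaluation script: the normalisation (one minus edit distance divided by the length of the truth sequence, floored at zero) is an assumption, and all function names and sample notes are hypothetical. The notes are taken to be already parsed into (midi_pitch, quarterLength) pairs.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Note = Tuple[int, float]  # (midi_pitch, quarterLength)


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def sequence_accuracy(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Assumed normalisation: 1 - distance / len(truth), floored at 0."""
    if not truth:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - edit_distance(pred, truth) / len(truth))


def note_metrics(pred: List[Note], truth: List[Note]) -> Dict[str, float]:
    """Score full notes, pitches alone, and durations alone."""
    return {
        "note": sequence_accuracy(pred, truth),
        "pitch": sequence_accuracy([p for p, _ in pred], [p for p, _ in truth]),
        "duration": sequence_accuracy([d for _, d in pred], [d for _, d in truth]),
    }


# Made-up example: one note has the right pitch but the wrong duration.
truth = [(62, 0.5), (64, 0.5), (66, 1.0), (67, 0.5)]
pred = [(62, 0.5), (64, 0.25), (66, 1.0), (67, 0.5)]
scores = note_metrics(pred, truth)
print(scores)
```

Scoring pitch and duration separately shows whether an error came from reading the note head or the rhythm, which is why the tables report all three figures.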

Cropped photos (paper tightly cropped)

| Metric | This 32B | 8B sibling |
|---|---|---|
| Title exact | 100% | 100% |
| Key exact | 88% | 88% |
| Meter exact | 78% | 95% |
| L: exact | 56% | 76% |
| Mean note accuracy | 57% | 84% |
| Mean pitch accuracy | 87% | 91% |
| Mean duration accuracy | 63% | 87% |

Raw photos (full phone photo, paper on a table)

| Metric | This 32B | 8B sibling |
|---|---|---|
| Title exact | 95% | 83% |
| Key exact | 76% | 83% |
| Meter exact | 85% | 83% |
| L: exact | 83% | 83% |
| Mean note accuracy | 61% | 57% |
| Mean pitch accuracy | 67% | 63% |
| Mean duration accuracy | 71% | 64% |

Interpretation

The 32B converges to a noticeably lower eval loss but generates ABC in a dialect (slash notation for short notes, V:1 treble voice tags) that the smaller 8B does not use. This dialect is musically equivalent but parses to different quarterLength values via music21, which hurts the note-accuracy metric on cropped photos, where the model is most confident. On raw, harder photos the model becomes less verbose and aligns better with the training-data dialect, which gives it an edge over the 8B on pitch accuracy (67% vs 63%) and duration accuracy (71% vs 64%).
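To make "musically equivalent" concrete, here is a small sketch of how an ABC note-length suffix combines with the L: unit length, following the ABC 2.1 note-length rules (`A/` is shorthand for `A/2`, `A//` for `A/4`). It is plain arithmetic and does not reproduce music21's parser, so it does not explain why the parsed values diverged in the evaluation. The function names are hypothetical.

```python
import re
from fractions import Fraction

# Optional numerator, optional slashes, optional denominator (e.g. "3/2", "/", "//").
_LENGTH_RE = re.compile(r"(\d*)(/*)(\d*)")


def length_multiplier(suffix: str) -> Fraction:
    """Convert an ABC note-length suffix into a multiplier of L:."""
    match = _LENGTH_RE.fullmatch(suffix)
    if match is None:
        raise ValueError(f"not a note-length suffix: {suffix!r}")
    numerator, slashes, denominator = match.groups()
    top = int(numerator) if numerator else 1
    if denominator:
        # An explicit denominator is only accepted after a single slash.
        if len(slashes) != 1 or int(denominator) == 0:
            raise ValueError(f"unsupported note-length suffix: {suffix!r}")
        bottom = int(denominator)
    else:
        # Bare slashes halve the length once per slash.
        bottom = 2 ** len(slashes)
    return Fraction(top, bottom)


def quarter_length(unit: Fraction, suffix: str) -> Fraction:
    """Duration in quarter notes, given L: as a fraction of a whole note."""
    return unit * length_multiplier(suffix) * 4


# A sixteenth note written two ways: "A/" under L:1/8, and "A" under L:1/16.
slash_style = quarter_length(Fraction(1, 8), "/")
plain_style = quarter_length(Fraction(1, 16), "")
print(slash_style, plain_style)
```

Both spellings come out to a quarter of a quarter note, which is the sense in which the slash dialect and the training-data dialect describe the same music.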

Limitations

  • Trained only on simple folk-style sheet music (single staff, mostly one voice, common-time / triple-time). Doesn't handle piano-style two-staff or complex polyphony.
  • Tends to add V:1 treble voice declarations that aren't in the training data. These are syntactically valid ABC but stylistically different.
  • The L: field is predicted per-tune at the modal duration. Photos labelled against L:1/8 for everything will look like an L: regression on metrics even though the music itself is correct (compare against modal-L truth).
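  • Pitch accuracy drops ~20 pts (87% to 67%) when the photo isn't tightly cropped, although note accuracy stays roughly flat (57% vs 61%); the ~25-pt note-accuracy drop is the 8B sibling's. A corner-drag pre-crop UI is recommended.
  • Slower than the 8B per token. Expect ~10-30 s per transcription on an H100, longer on smaller GPUs.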
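If the extra V:1 treble declarations are unwanted, one option is to strip them after generation. The sketch below assumes the tags appear either as header lines starting with `V:` or as inline `[V:...]` fields; that is a guess at the output shape, and the helper name and sample tune are hypothetical. It is only appropriate for single-voice tunes, since removing voice fields from a multi-voice tune would merge its voices and drop clef information.

```python
import re

# Inline voice fields such as "[V:1]" inside the tune body.
_INLINE_VOICE_RE = re.compile(r"\[V:[^\]]*\]")


def strip_voice_tags(abc: str) -> str:
    """Remove V: header lines and inline [V:...] fields from ABC text."""
    kept = []
    for line in abc.splitlines():
        if line.startswith("V:"):
            continue  # drop the whole voice declaration line
        kept.append(_INLINE_VOICE_RE.sub("", line))
    return "\n".join(kept)


# Made-up model output containing both forms of voice tag.
raw = "X:1\nT:Example\nM:4/4\nL:1/8\nV:1 treble\nK:D\n[V:1]DEFG A2A2|\n"
cleaned = strip_voice_tags(raw)
print(cleaned)
```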

License

Apache-2.0 (matching the base model). Note that the training corpus is governed by separate licenses:

  • Jukedeck Nottingham Music Database: public-domain folk tunes
  • Henrik Norbeck's ABC Tunes (https://www.norbeck.nu/abc/): non-commercial use only; the per-tune copyright notices remain embedded in training targets.

Acknowledgements

  • Unsloth: 4-bit + LoRA training
  • abcjs: ABC rendering used to generate training data
  • music21: eval-time pitch/duration parsing
  • Henrik Norbeck: bulk of the training tunes
  • Jukedeck: Nottingham Music Database cleaning

Framework versions

  • peft 0.19.1
  • trl 0.24.0
  • transformers 5.5.0
  • unsloth 2026.4.6
  • unsloth_zoo 2026.4.8
  • torch 2.10.0
  • bitsandbytes 0.49.2