learn-abc — Qwen3-VL-32B fine-tune for sheet-music → ABC notation

LoRA adapter that fine-tunes Qwen/Qwen3-VL-32B-Instruct to transcribe phone photos of folk-music sheet music into ABC notation.

This is the larger of two trained variants. The 32B reaches a lower loss than its 8B sibling (final eval loss 0.1286 vs 0.1522) and beats it on most metrics for uncropped phone photos, but it scores lower on tightly cropped photos. It also tends to emit a different ABC dialect (slash notation for short durations, V:1 treble voice tags) that diverges stylistically from the training corpus.

Quick use

from PIL import Image
from unsloth import FastVisionModel

# Load the fine-tuned model with 4-bit quantized weights.
model, processor = FastVisionModel.from_pretrained(
    "folk-abc/learn-abc-qwen3vl32b",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch the model to inference mode

img = Image.open("sheet_music_photo.jpg").convert("RGB")
# {"type": "image"} is a placeholder; the pixels are passed to the processor below.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text":
            "Transcribe this sheet music into ABC notation. "
            "Output only the ABC, no explanation."},
    ],
}]
# Render the chat template to a prompt string, then encode text and image together.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=img, text=text, add_special_tokens=False,
                   return_tensors="pt").to(model.device)
# Greedy decoding, capped at 800 new tokens.
out = model.generate(**inputs, max_new_tokens=800, do_sample=False)
# Skip the prompt tokens so only the generated ABC is decoded.
abc = processor.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True)
print(abc)

For best results, pre-crop the photo to just the sheet of paper before sending. On uncropped "phone-on-table" photos, mean pitch accuracy drops about 20 percentage points (87% to 67% in the evaluation below).
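A minimal sketch of the pre-crop step, assuming the four paper corners are already known (for example from a corner-drag UI). The helper names `paper_bounding_box` and `crop_to_paper` and the `margin` parameter are hypothetical; the only library call used is Pillow's `Image.crop`, and the result is an axis-aligned crop, not a perspective correction.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def paper_bounding_box(corners: Sequence[Point],
                       image_size: Tuple[int, int],
                       margin: float = 0.02) -> Tuple[int, int, int, int]:
    """Return (left, upper, right, lower) around the paper corners.

    `margin` pads the box by a fraction of its width/height; the box is
    clamped to the image bounds.
    """
    width, height = image_size
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    left = max(0, math.floor(min(xs) - pad_x))
    upper = max(0, math.floor(min(ys) - pad_y))
    right = min(width, math.ceil(max(xs) + pad_x))
    lower = min(height, math.ceil(max(ys) + pad_y))
    return left, upper, right, lower


def crop_to_paper(img, corners: Sequence[Point], margin: float = 0.02):
    """Crop a PIL image to the padded bounding box of the paper corners."""
    return img.crop(paper_bounding_box(corners, img.size, margin))
```

With the Quick use snippet, this would be applied right after loading, e.g. `img = crop_to_paper(img, corners)`, where `corners` comes from whatever corner-selection step the application provides.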

The 4-bit weights need ~20 GB VRAM to load; ~25 GB total during inference. Won't fit on most consumer GPUs.
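A quick pre-flight check can avoid an out-of-memory failure partway through loading. The sketch below is an assumption-laden convenience: `has_enough_vram` and `check_gpu` are hypothetical helpers, and the ~25 GB figure above is treated as GiB of free memory on the current CUDA device.

```python
REQUIRED_GB = 25  # total inference footprint quoted above (treated as GiB here)


def has_enough_vram(free_bytes: int, required_gb: float = REQUIRED_GB) -> bool:
    """True if the free memory (in bytes) covers the quoted footprint."""
    return free_bytes >= required_gb * 1024 ** 3


def check_gpu() -> bool:
    """Query the current CUDA device; returns False when no GPU is visible."""
    import torch  # imported lazily so the pure helper works without torch

    if not torch.cuda.is_available():
        return False
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return has_enough_vram(free_bytes)
```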

Training


Base model	`unsloth/Qwen3-VL-32B-Instruct-unsloth-bnb-4bit`
Method	LoRA (r=16, α=16, dropout=0.05) on vision + language + attention + MLP layers
Optimizer	adamw_8bit, lr 5e-5, cosine schedule, warmup_ratio 0.05
Training data	36,264 augmented phone-photo-style images of 4,533 folk tunes (8 per tune on average; Jukedeck Nottingham + Henrik Norbeck collections, plus 21 hand-curated tunes). Each tune's ABC is rendered to a sheet-music image via Playwright + abcjs in two visual styles (`compact`, `sans`), then augmented with crop / perspective / rotation / blur / paper-texture / vignette / JPEG-compression effects
Targets	Modal-L canonicalised ABC (per-tune snap to `{1, 1/2, 1/4, 1/8, 1/16, 1/32}` based on each tune's modal note duration), with `R:`/`S:`/`Z:`/`H:` metadata thinned to ≤20% prevalence per field
Steps	3,750 of 5,000 (early-stopped on patience=3, threshold=0.005)
Hardware	1× H100 80GB on Runpod, ~3 hr training
Final eval loss	0.1286 (16% below the 8B sibling's 0.1522)
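The Modal-L canonicalisation in the Targets row can be pictured roughly as follows. This is an illustrative sketch, not the training pipeline's code: the function names are hypothetical, note durations are assumed to be given as fractions of a whole note, and both tie-breaking rules are arbitrary choices made for the example.

```python
from collections import Counter
from fractions import Fraction
from typing import Iterable

# The unit note lengths listed in the Targets row.
ALLOWED_UNITS = [Fraction(1, d) for d in (1, 2, 4, 8, 16, 32)]


def modal_unit_length(durations: Iterable[Fraction]) -> Fraction:
    """Pick the tune's most common note duration and snap it to an allowed L: value."""
    counts = Counter(Fraction(d) for d in durations)
    # Most frequent duration; ties go to the shorter duration (assumption).
    modal = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))[0]
    # Nearest allowed unit; ties go to the longer unit (list order, assumption).
    return min(ALLOWED_UNITS, key=lambda unit: abs(unit - modal))


def l_header(unit: Fraction) -> str:
    """Format the unit note length as an ABC L: field."""
    return f"L:{unit.numerator}/{unit.denominator}"


def relative_length(duration: Fraction, unit: Fraction) -> Fraction:
    """Length multiplier a note needs once the tune uses `unit` as L:."""
    return Fraction(duration) / unit
```

For a tune made mostly of eighth notes, `modal_unit_length` returns `Fraction(1, 8)`, so the target carries `L:1/8` and a quarter note is written with multiplier 2.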

Evaluation on 41 hand-labelled phone photos

Metrics computed via music21 edit-distance over (midi_pitch, quarterLength) sequences after stripping under-staff metadata fields (R/S/Z/H/D/B/F) from both prediction and truth.
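A rough sketch of how such a metric could be computed is shown below. The card does not specify the normalisation, so dividing the edit distance by the longer sequence length is an assumption, as are the helper names; the metadata regex is a heuristic that skips lines such as `B:|` where the letter is a note followed by a repeat sign. The music21 calls (`converter.parse`, `flatten`, `getElementsByClass`) are imported lazily, and chords are ignored for simplicity.

```python
import re
from typing import List, Sequence, Tuple

# Fields the card lists as stripped before scoring: R/S/Z/H/D/B/F.
_STRIPPED = re.compile(r"^[RSZHDBF]:(?![|:]).*(?:\n|$)", flags=re.MULTILINE)


def strip_metadata(abc: str) -> str:
    """Remove the under-staff metadata lines from an ABC string."""
    return _STRIPPED.sub("", abc)


def note_events(abc: str) -> List[Tuple[int, float]]:
    """Parse ABC with music21 into (midi_pitch, quarterLength) pairs."""
    from music21 import converter, note  # lazy import; only needed here

    score = converter.parse(strip_metadata(abc), format="abc")
    return [(n.pitch.midi, float(n.quarterLength))
            for n in score.flatten().getElementsByClass(note.Note)]


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences of hashable items."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def sequence_accuracy(pred: Sequence, truth: Sequence) -> float:
    """1 - normalised edit distance (normalisation is an assumption)."""
    if not pred and not truth:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))
```

Under this reading, note accuracy would compare the full pairs, while pitch and duration accuracy would compare only the first or second element of each pair, e.g. `sequence_accuracy([p for p, _ in pred], [p for p, _ in truth])`.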

Cropped photos (paper tightly cropped)

metric	this 32B	8B sibling
Title exact	100%	100%
Key exact	88%	88%
Meter exact	78%	95%
L: exact	56%	76%
Mean note accuracy	57%	84%
Mean pitch accuracy	87%	91%
Mean duration accuracy	63%	87%

Raw photos (full phone photo, paper on a table)

metric	this 32B	8B sibling
Title exact	95%	83%
Key exact	76%	83%
Meter exact	85%	83%
L: exact	83%	83%
Mean note accuracy	61%	57%
Mean pitch accuracy	67%	63%
Mean duration accuracy	71%	64%

Interpretation

The 32B converges to a noticeably lower loss than the 8B (final eval loss 0.1286 vs 0.1522), but it generates ABC in a dialect the 8B does not use: slash notation for short notes and V:1 treble voice tags. This dialect is musically equivalent but parses to different quarterLength values via music21, which hurts the note-accuracy metric on cropped photos, where the model is most confident (57% vs 84% for the 8B). On raw, harder photos the model becomes less verbose and aligns better with the training-data dialect, and it comes out ahead on pitch accuracy (67% vs 63%) and duration accuracy (71% vs 64%).
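To make "musically equivalent" concrete: in standard ABC, a note's duration is the `L:` unit multiplied by the note's length suffix, and `A/` is shorthand for `A/2`. The sketch below (hypothetical helper names, handling only simple suffixes such as `2`, `/`, `//`, `/4`, `3/2`) shows that an eighth note can be spelled with slash notation under `L:1/4` or plainly under `L:1/8`. A scorer that resolves durations against the `L:` header would treat both spellings the same.

```python
import re
from fractions import Fraction

_LENGTH = re.compile(r"^(\d*)(/*)(\d*)$")


def length_multiplier(suffix: str) -> Fraction:
    """Convert an ABC note-length suffix to a multiplier of the L: unit."""
    match = _LENGTH.match(suffix)
    if match is None:
        raise ValueError(f"unsupported length suffix: {suffix!r}")
    num, slashes, den = match.groups()
    numerator = int(num) if num else 1
    if den:
        if len(slashes) != 1:
            raise ValueError(f"unsupported length suffix: {suffix!r}")
        denominator = int(den)
    else:
        # Each bare slash halves the note: "/" -> 1/2, "//" -> 1/4.
        denominator = 2 ** len(slashes)
    return Fraction(numerator, denominator)


def quarter_length(unit_note_length: str, suffix: str) -> Fraction:
    """Duration in quarter notes for a note with `suffix` under `L:unit_note_length`."""
    return Fraction(unit_note_length) * length_multiplier(suffix) * 4
```

For example, `quarter_length("1/4", "/")` and `quarter_length("1/8", "")` both evaluate to `Fraction(1, 2)`, i.e. one eighth note.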

Limitations

Trained only on simple folk-style sheet music (single staff, mostly one voice, common-time / triple-time). Doesn't handle piano-style two-staff or complex polyphony.
Tends to add V:1 treble voice declarations that aren't in the training data. These are syntactically valid ABC but stylistically different.
The L: field is predicted per-tune at the modal duration. Photos labelled against L:1/8 for everything will look like an L: regression on metrics even though the music itself is correct (compare against modal-L truth).
Pitch accuracy drops ~20 pts (87% → 67%) when the photo isn't tightly cropped. A corner-drag pre-crop UI is recommended.
Slower than the 8B per token. Expect ~10-30 s per transcription on an H100, longer on smaller GPUs.

License

Apache-2.0 (matching the base model). Note that the training corpus is governed by separate licenses:

Jukedeck Nottingham Music Database — public-domain folk tunes
Henrik Norbeck's ABC Tunes (https://www.norbeck.nu/abc/) — non-commercial use only; the per-tune copyright notices remain embedded in training targets.

Acknowledgements

Unsloth — 4-bit + LoRA training
abcjs — ABC rendering used to generate training data
music21 — eval-time pitch/duration parsing
Henrik Norbeck — bulk of the training tunes
Jukedeck — Nottingham Music Database cleaning

Framework versions

peft 0.19.1
trl 0.24.0
transformers 5.5.0
unsloth 2026.4.6
unsloth_zoo 2026.4.8
torch 2.10.0
bitsandbytes 0.49.2
