learn-abc: Qwen3-VL-32B fine-tune for sheet-music → ABC notation
LoRA adapter that fine-tunes Qwen/Qwen3-VL-32B-Instruct to transcribe phone photos of folk-music sheet music into ABC notation.
This is the larger of two trained variants. The 32B reaches a lower final eval loss than its 8B sibling (0.1286 vs 0.1522) and beats it on note, pitch, and duration accuracy on uncropped phone photos. It also tends to use a different ABC dialect (slash notation for short durations, V:1 treble voice tags) that diverges stylistically from the training corpus.
Quick use
from PIL import Image
from unsloth import FastVisionModel

# Load the LoRA adapter on top of its 4-bit base model.
model, processor = FastVisionModel.from_pretrained(
    "folk-abc/learn-abc-qwen3vl32b",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch the model to inference mode

img = Image.open("sheet_music_photo.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text":
            "Transcribe this sheet music into ABC notation. "
            "Output only the ABC, no explanation."},
    ],
}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=img, text=text, add_special_tokens=False,
                   return_tensors="pt").to(model.device)

# Greedy decoding; decode only the tokens generated after the prompt.
out = model.generate(**inputs, max_new_tokens=800, do_sample=False)
abc = processor.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True)
print(abc)
For best results, pre-crop the photo to just the sheet of paper before sending. In the evaluation below, this model's mean pitch accuracy falls from 87% on tightly cropped photos to 67% on uncropped "phone-on-table" photos, and the 8B sibling's note accuracy falls from 84% to 57%.
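One way to do the pre-crop is to take four paper corners (for example from a corner-drag UI) and crop to their bounding box. The sketch below is a minimal illustration, not part of the released code: the function name `crop_box_from_corners`, the `margin` parameter, and the example coordinates are all hypothetical. It does no perspective correction.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]


def crop_box_from_corners(corners: Sequence[Point],
                          image_size: Tuple[int, int],
                          margin: int = 0) -> Box:
    """Return (left, upper, right, lower) enclosing the corners.

    Hypothetical helper: the box is padded by `margin` pixels and
    clamped to the image bounds given as (width, height).
    """
    width, height = image_size
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    left = max(0, int(min(xs)) - margin)
    upper = max(0, int(min(ys)) - margin)
    right = min(width, int(max(xs)) + margin)
    lower = min(height, int(max(ys)) + margin)
    if right <= left or lower <= upper:
        raise ValueError("corners do not enclose a non-empty region")
    return left, upper, right, lower


# Example corner positions (made up) on a 3024x4032 phone photo.
box = crop_box_from_corners(
    [(410, 620), (2650, 700), (2590, 3900), (380, 3820)],
    image_size=(3024, 4032),
    margin=20,
)
# With Pillow, the box can be passed straight to Image.crop:
#   img = Image.open("sheet_music_photo.jpg").convert("RGB").crop(box)
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the cropped image can replace `img` in the Quick use snippet.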
The 4-bit weights need ~20 GB of VRAM to load and ~25 GB in total during inference, so the model won't fit on most consumer GPUs.
Training
| Setting | Value |
|---|---|
| Base model | unsloth/Qwen3-VL-32B-Instruct-unsloth-bnb-4bit |
| Method | LoRA (r=16, α=16, dropout=0.05) on vision + language + attention + MLP layers |
| Optimizer | adamw_8bit, lr 5e-5, cosine schedule, warmup_ratio 0.05 |
| Training data | 36,264 augmented phone-photo-style images (8 per tune) of 4,533 folk tunes (Jukedeck Nottingham + Henrik Norbeck collections, plus 21 hand-curated tunes), each rendered from ABC to sheet music via Playwright + abcjs in two visual styles (compact, sans), then augmented with crop / perspective / rotation / blur / paper-texture / vignette / JPEG-compression effects |
| Targets | Modal-L canonicalised ABC (per-tune snap to {1, 1/2, 1/4, 1/8, 1/16, 1/32} based on each tune's modal note duration), with R:/S:/Z:/H: metadata thinned to ≤20% prevalence per field |
| Steps | 3,750 of 5,000 (early-stopped on patience=3, threshold=0.005) |
| Hardware | 1× H100 80GB on Runpod, ~3 hr training |
| Final eval loss | 0.1286 (16% below the 8B sibling's 0.1522) |
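The modal-L canonicalisation in the Targets row can be sketched roughly as below. This is an illustration under assumptions, not the training pipeline: it takes note durations as fractions of a whole note, picks the most frequent one, and snaps it to the nearest allowed unit length. The function name `modal_unit_length` and both tie-breaking rules (prefer the shorter duration, prefer the longer allowed value) are choices made for this sketch.

```python
from collections import Counter
from fractions import Fraction
from typing import Iterable

# The allowed L: values listed in the Targets row.
ALLOWED_L = [Fraction(1, d) for d in (1, 2, 4, 8, 16, 32)]


def modal_unit_length(durations: Iterable[Fraction]) -> Fraction:
    """Pick a unit note length from a tune's most common duration.

    `durations` are note lengths as fractions of a whole note.
    Assumption: ties between equally common durations go to the
    shorter one; ties when snapping go to the longer allowed value.
    """
    counts = Counter(Fraction(d) for d in durations)
    if not counts:
        raise ValueError("tune has no notes")
    modal, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    # min() keeps the first of equally close candidates, and ALLOWED_L
    # is ordered from longest to shortest.
    return min(ALLOWED_L, key=lambda unit: abs(unit - modal))


# A tune made mostly of eighth notes.
tune = [Fraction(1, 8)] * 6 + [Fraction(1, 4)] * 2 + [Fraction(1, 16)]
unit = modal_unit_length(tune)
print(f"L:{unit.numerator}/{unit.denominator}")  # prints L:1/8
```

Evaluating against ground truth that was canonicalised the same way avoids counting a correct transcription as an `L:` error.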
Evaluation on 41 hand-labelled phone photos
Metrics computed via music21 edit-distance over (midi_pitch, quarterLength) sequences after stripping under-staff metadata fields (R/S/Z/H/D/B/F) from both prediction and truth.
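The metric can be approximated with a plain Levenshtein distance over note tuples. The sketch below assumes the notes have already been parsed into (midi_pitch, quarterLength) pairs (the card uses music21 for that step) and assumes accuracy is normalised as 1 - distance / max(length); the card does not state the exact normalisation, so treat that formula and the helper names as illustrative.

```python
from typing import Hashable, Sequence

# Under-staff fields stripped from prediction and truth before scoring.
METADATA_FIELDS = ("R:", "S:", "Z:", "H:", "D:", "B:", "F:")


def strip_metadata(abc: str) -> str:
    """Drop whole lines that start with one of the metadata fields."""
    return "\n".join(
        line for line in abc.splitlines()
        if not line.startswith(METADATA_FIELDS)
    )


def edit_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Levenshtein distance with unit insert/delete/substitute costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def sequence_accuracy(pred: Sequence[Hashable],
                      truth: Sequence[Hashable]) -> float:
    """Assumed normalisation: 1 - distance / length of the longer sequence."""
    if not pred and not truth:
        return 1.0
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))


truth = [(62, 0.5), (64, 0.5), (66, 1.0), (67, 1.0)]
pred = [(62, 0.5), (64, 0.25), (66, 1.0), (67, 1.0)]  # one wrong duration

note_acc = sequence_accuracy(pred, truth)
pitch_acc = sequence_accuracy([p for p, _ in pred], [p for p, _ in truth])
duration_acc = sequence_accuracy([d for _, d in pred], [d for _, d in truth])
print(note_acc, pitch_acc, duration_acc)  # 0.75 1.0 0.75
```

Scoring pitch and duration separately, as in the last three lines, shows how a single rhythm error lowers note and duration accuracy while pitch accuracy stays at 100%.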
Cropped photos (paper tightly cropped)
| metric | this 32B | 8B sibling |
|---|---|---|
| Title exact | 100% | 100% |
| Key exact | 88% | 88% |
| Meter exact | 78% | 95% |
| L: exact | 56% | 76% |
| Mean note accuracy | 57% | 84% |
| Mean pitch accuracy | 87% | 91% |
| Mean duration accuracy | 63% | 87% |
Raw photos (full phone photo, paper on a table)
| metric | this 32B | 8B sibling |
|---|---|---|
| Title exact | 95% | 83% |
| Key exact | 76% | 83% |
| Meter exact | 85% | 83% |
| L: exact | 83% | 83% |
| Mean note accuracy | 61% | 57% |
| Mean pitch accuracy | 67% | 63% |
| Mean duration accuracy | 71% | 64% |
Interpretation
The 32B converges to a noticeably lower training loss but generates ABC in a different dialect (slash notation for short notes, V:1 treble voice tags) that the smaller 8B doesn't use. This dialect is musically equivalent but parses to different quarterLength values via music21, which hurts the note-accuracy metric on cropped photos, where the model is most confident. On raw, harder photos the model becomes less verbose and aligns better with the training-data dialect, yielding a clear win on pitch and duration accuracy there.
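To make the dialect difference concrete, the sketch below rewrites the two habits named above: it drops `V:` voice declaration lines and expands ABC's slash shorthand (`A/` means `A/2`, `A//` means `A/4`) into explicit divisors. It is a hypothetical post-processing step, not something the card's evaluation applied. It assumes single-voice tunes and that the explicit `/2` form is the training-corpus style, and it ignores edge cases such as inline `[V:1]` fields.

```python
import re

# A run of slashes not followed by a digit is ABC shorthand:
# "/" halves the note, "//" quarters it, and so on.
_SLASH_RUN = re.compile(r"(?<!/)(/+)(?![/\d])")
_HEADER_LINE = re.compile(r"^[A-Za-z]:")
_QUOTED = re.compile(r'("[^"]*")')  # chord symbols and annotations


def _expand_slashes(segment: str) -> str:
    return _SLASH_RUN.sub(lambda m: "/" + str(2 ** len(m.group(1))), segment)


def normalise_dialect(abc: str) -> str:
    """Hypothetical clean-up for single-voice tunes."""
    out = []
    for line in abc.splitlines():
        if line.startswith("V:"):
            continue  # drop voice declarations such as "V:1 treble"
        if _HEADER_LINE.match(line):
            out.append(line)  # leave header fields (M:6/8, S:http://...) alone
            continue
        # re.split with a capturing group alternates between text
        # outside quotes (even indices) and quoted strings (odd indices).
        parts = _QUOTED.split(line)
        out.append("".join(
            part if i % 2 else _expand_slashes(part)
            for i, part in enumerate(parts)
        ))
    return "\n".join(out)


example = 'X:1\nT:Test\nM:6/8\nL:1/8\nV:1 treble\nK:D\n"D/F#"A/B/ c// d3/2 e/2|'
print(normalise_dialect(example))
```

Whether a step like this would recover the cropped-photo note accuracy is untested here; it only shows that the two dialects differ in surface form.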
Limitations
- Trained only on simple folk-style sheet music (single staff, mostly one voice, common-time / triple-time). Doesn't handle piano-style two-staff or complex polyphony.
- Tends to add `V:1 treble` voice declarations that aren't in the training data. These are syntactically valid ABC but stylistically different.
- The `L:` field is predicted per-tune at the modal duration. Photos labelled against `L:1/8` for everything will look like an `L:` regression on metrics even though the music itself is correct (compare against modal-L truth).
- Quality drops when the photo isn't tightly cropped: this model's pitch accuracy falls about 20 points (87% to 67%), and the 8B sibling's note accuracy falls about 27 points (84% to 57%). A corner-drag pre-crop UI is recommended.
- Slower than the 8B per token. Expect ~10-30 s per transcription on an H100, longer on smaller GPUs.
License
Apache-2.0 (matching the base model). Note that the training corpus is governed by separate licenses:
- Jukedeck Nottingham Music Database: public-domain folk tunes
- Henrik Norbeck's ABC Tunes (https://www.norbeck.nu/abc/): non-commercial use only; the per-tune copyright notices remain embedded in training targets.
Acknowledgements
- Unsloth: 4-bit + LoRA training
- abcjs: ABC rendering used to generate training data
- music21: eval-time pitch/duration parsing
- Henrik Norbeck: bulk of the training tunes
- Jukedeck: Nottingham Music Database cleaning
Framework versions
- peft 0.19.1
- trl 0.24.0
- transformers 5.5.0
- unsloth 2026.4.6
- unsloth_zoo 2026.4.8
- torch 2.10.0
- bitsandbytes 0.49.2