Qwen3.5-9B — Iconclass VLM (SFT on brillfull)

A vision–language model that labels artwork images with ICONCLASS iconographic codes (e.g. 11D35 Crucifixion, 25G41(ROSE) roses, 61B:31D14 portrait of a man). Fine-tuned from unsloth/Qwen3.5-9B-Base (multimodal, arch qwen3_5) on davanstrien/iconclass-vlm-brillfull.

Input: one artwork image. Output: a JSON object {"iconclass-codes": ["...", ...]}.

TL;DR — why this model exists

This 9B model broke the recall ceiling that a 4B version of the same pipeline could not break. On a clean, contamination-free 788-image test set (full human labels):

| system | Hierarchical F1 | code-recall | code-prec | notes |
|---|---|---|---|---|
| this 9B (single-shot) | 53.0 | 32.9 | 29.7 | best single system |
| 4B SFT (…-sft-brillfull) | 45.2 | 25.6 | 23.7 | predecessor; capability-bound at ~25% recall |
| anchored fusion (4B + retrieval + judge) | 48.5 | n/a | n/a | prior best pipeline (no training) |
| 9B + retrieval fusion | 51.7 | 36.3 | 18.4 | fusion now redundant; see below |

  • +7.8 H-F1 / +7.3 code-recall over the 4B — same data, same recipe, only the size changed. The ~25% recall wall that survived reward-tuning, fuller labels, and reasoning-distillation on the 4B is moved by capacity: it was a 4B-size limit, not a fundamental task ceiling.
  • The 9B alone beats the anchored retrieval+judge fusion pipeline (48.5) with zero inference-time machinery. Adding retrieval on top of the 9B (51.7) now hurts H-F1 — the 9B already captures the visually-recoverable recall, so retrieval's imprecise extra codes cost more precision than they add.
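
To make the precision trade-off concrete, the sketch below combines the reported code-recall and code-prec columns into a harmonic mean. This derived number is not one of the reported results, and it assumes the two columns can be combined this way (if they are per-image averages, a properly averaged F1 may differ), so treat it as an illustration only.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (code-recall, code-prec) pairs copied from the results table.
reported = {
    "9B single-shot": (32.9, 29.7),
    "4B SFT": (25.6, 23.7),
    "9B + retrieval fusion": (36.3, 18.4),
}

# Derived, illustrative code-level F1 (NOT a reported metric).
derived = {name: round(f1(prec, rec), 1) for name, (rec, prec) in reported.items()}

for name, score in derived.items():
    print(f"{name}: derived code-level F1 ~ {score}")
```

Under that assumption, fusion gains recall but its lower precision pulls the combined score below the single-shot 9B, which matches the direction of the H-F1 result.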

Intended use & limitations

  • Use: assisted iconographic cataloguing of (mostly European, 15th–19th c.) art images — suggest ICONCLASS codes for a human cataloguer to confirm. Multi-label.
  • Not for: authoritative/automatic cataloguing without review. ICONCLASS is specialized; expect a human in the loop.
  • Limitations:
    • Valid-JSON ≈ 86% at max_new_tokens=384 (~14% of outputs are malformed/truncated) — so 53.0 is a conservative floor; a constrained-decoding / cleanup pass would recover a little more.
    • Structural ceiling on non-visual codes — proverbs, named persons, literary/abstract subjects are not determinable from the image alone (no vision model recovers these; they need external knowledge / retrieval at inference).
    • Trained on the Brill/Arkyves distribution; may transfer less well to very different visual domains.
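
Because roughly one output in seven fails to parse, downstream code may want a tolerant parser. The following is a minimal sketch of a cleanup pass, not the author's tooling: `parse_codes` is a hypothetical helper, and its fallback assumes the usual failure is truncation partway through the code list, so it keeps only the fully quoted strings that follow the `iconclass-codes` key.

```python
import json
import re

def parse_codes(resp: str) -> list[str]:
    """Return Iconclass codes from a model response, tolerating truncation.

    Hypothetical helper: tries strict JSON first, then salvages complete
    quoted strings after the "iconclass-codes" key.
    """
    try:
        data = json.loads(resp)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict):
        codes = data.get("iconclass-codes", [])
        if not isinstance(codes, list):
            return []
        return [c for c in codes if isinstance(c, str)]

    # Fallback: look only at the text after the key, keep closed strings.
    _, found, tail = resp.partition('"iconclass-codes"')
    if not found:
        return []
    seen: set[str] = set()
    salvaged: list[str] = []
    for code in re.findall(r'"([^"\n]+)"', tail):
        if code not in seen:  # drop repeats while keeping order
            seen.add(code)
            salvaged.append(code)
    return salvaged
```

For example, `parse_codes('{"iconclass-codes": ["11D35", "25G3", "31A')` keeps the two complete codes and drops the truncated third one. Whether salvaged codes should count toward evaluation is a separate decision.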

How to use

import json, torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "davanstrien/qwen35-9b-iconclass-sft-brillfull"
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

INSTRUCTION = ("Classify this image using Iconclass codes. "
               "Return a JSON object with key 'iconclass-codes' containing a list of codes.")
image = Image.open("artwork.jpg").convert("RGB")
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": INSTRUCTION}]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False,
                                     enable_thinking=False)  # this model does NOT use <think>
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=384, do_sample=False)
resp = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
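try:
    codes = json.loads(resp)["iconclass-codes"]   # e.g. ["11D35", "25G3", ...]
except (json.JSONDecodeError, KeyError, TypeError):
    codes = []   # ~14% of outputs are malformed or truncated at max_new_tokens=384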

Training

  • Base: unsloth/Qwen3.5-9B-Base (chat template cloned from Qwen/Qwen3.5-4B, <think> stripped).
  • Data: davanstrien/iconclass-vlm-brillfull, config sft (~82k train; full, cleaned labels; the test split is a contamination-safe held-out set of 788 images, used in the evaluation below).
  • Method: LoRA (r=16, all vision+language layers, 51M/9.46B = 0.54% trainable) via Unsloth, 1 epoch, bf16, per-device batch 16 × grad-accum 2, lr 2e-4 cosine. train_sft_brill.py --base-model unsloth/Qwen3.5-9B-Base (a drop-in over the 4B recipe).
  • Hardware/cost: HF Jobs a100-large (1× A100-80GB), 3.6 h ($9). eval_loss 0.428 (vs the 4B's 0.474).
  • Deps: transformers==5.2.0 (only version with qwen3_5), unsloth<2026.6, causal-conv1d (pinned wheel), flash-linear-attention — Qwen3.5's hybrid attention needs these.
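
The figures in the list above imply a few derived quantities that are handy when adapting the recipe. The snippet below is simple arithmetic on those figures, not an excerpt from train_sft_brill.py. The step count is approximate because the training-set size is only given as "~82k", and it assumes the last partial batch is kept.

```python
import math

# Values copied from the training notes; train_examples is approximate.
train_examples = 82_000
per_device_batch = 16
grad_accum = 2
num_gpus = 1                 # 1x A100-80GB
trainable_params = 51e6
total_params = 9.46e9
wall_clock_hours = 3.6
total_cost_usd = 9.0

effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(train_examples / effective_batch)
trainable_pct = 100 * trainable_params / total_params
usd_per_hour = total_cost_usd / wall_clock_hours

print(f"effective batch size : {effective_batch}")
print(f"optimizer steps/epoch: ~{steps_per_epoch}")
print(f"trainable parameters : {trainable_pct:.2f}%")
print(f"approx. cost per hour: ${usd_per_hour:.2f}")
```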

Evaluation

Ruler: davanstrien/iconclass-vlm-brillfull test (788 images, full human labels, contamination-safe split by filename hash), hierarchical-F1 with partial credit (eval_sft.py:_calculate_hierarchical_f1 — ancestor matches earn graded credit). Greedy decoding, enable_thinking=False. See the results table above. Note: raw H-F1 understates true performance because the ground truth is ~20–40% incomplete (many predicted codes are genuinely depicted but unlabeled) — a judge-corrected ruler (eval_corrected.py) credits these.
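
To illustrate what partial credit for ancestor matches can look like, here is a minimal sketch of one common set-based formulation of hierarchical F1. It is not the code in eval_sft.py, whose grading may differ. It assumes every prefix of a notation is an ancestor (so 11D35 has ancestors 1, 11, 11D, 11D3) and ignores bracketed text and ":" combinations. The function names are hypothetical.

```python
def ancestors(code: str) -> set[str]:
    """All prefixes of a notation, including the code itself (simplified)."""
    return {code[:i] for i in range(1, len(code) + 1)}

def hierarchical_f1(pred: list[str], gold: list[str]) -> float:
    """Set-based hierarchical F1 over ancestor-expanded code sets."""
    pred_nodes = set().union(*(ancestors(c) for c in pred))
    gold_nodes = set().union(*(ancestors(c) for c in gold))
    overlap = len(pred_nodes & gold_nodes)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_nodes)
    recall = overlap / len(gold_nodes)
    return 2 * precision * recall / (precision + recall)

# Predicting the parent 11D3 for gold 11D35 earns partial credit;
# an unrelated branch earns none.
print(hierarchical_f1(["11D35"], ["11D35"]))
print(hierarchical_f1(["11D3"], ["11D35"]))
print(hierarchical_f1(["25G3"], ["11D35"]))
```

In this formulation a near miss one level up the tree still scores well, which is why hierarchical F1 sits far above the exact-match code-recall and code-prec figures.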

The research arc (so this can be picked up later)

This model is one step in an investigation (full logs in the model-training/iconclass-qwen35 repo — RESEARCH_LOG.md, WEAK_LABELING.md, CLAUDE.md):

  1. 4B is capability-bound — reward-tuning, fuller labels, and reasoning-distillation all plateau at ~25% recall / ~45 H-F1.
  2. Anchored fusion (4B + retriever + judge-gate, fuse_rank.py) was the prior deployable win (48.5).
  3. Agent + abstain-reviewer = a ~90% precision weak-labeler (weak_label_quality.py, review_set.py, calibrate_review.py) — validated as a data engine for unlabeled images (235B-VL via the HF router, no setup).
  4. This 9B — capacity breaks the recall wall and beats the fusion pipeline → the headline result.

Suggested next steps

  • Format cleanup (constrained decoding / a JSON-repair pass) to recover the ~14% invalid-JSON handicap.
  • Noisy-student self-training: weak-label NEW images (biglam/european_art) with the agent + abstain-reviewer → SFT this 9B on brillfull-GT ∪ high-confidence weak-labels → eval clean-788 vs 53.0. Tests whether clean NEW-image data pushes past capacity, or the 9B is at the learnable-visual ceiling.
  • External-knowledge tools at inference for the non-visual code residue (the structural ceiling).
  • The retriever (davanstrien/iconclass-retriever-bge-ft) + agent stack remain useful for recall-priority cataloguing even though fusion is H-F1-redundant on this 9B.

Related

  • Predecessor: davanstrien/qwen35-4b-iconclass-sft-brillfull (4B, 45.2 H-F1).
  • Dataset: davanstrien/iconclass-vlm-brillfull (full-label, contamination-safe).
  • Retriever: davanstrien/iconclass-retriever-bge-ft.
  • Source lineage: Brill ICONCLASS AI Test Set / Arkyves.