qwen3vl-resume-parser

A QLoRA fine-tune of Qwen/Qwen3-VL-8B-Instruct that reads resume/CV page images and returns a fixed 23-field JSON record. Published as merged full weights (BF16 safetensors, ~9B params), so it loads like any standard Qwen3-VL checkpoint — no adapter to attach.

It started as an internal project at Corporate Solutions Group: the production resume parser ran on Qwen2.5-VL-32B (8-bit, ~50 GB VRAM) behind a vLLM server, and the goal was a smaller model that kept parsing quality while cutting GPU cost. This repo is the public, portfolio version of that work. The training data is a private internal dataset and is not redistributed; all code (data prep, training, eval) is open at github.com/sukhrobnurali/resume-trainer.

TL;DR

  • Base: Qwen/Qwen3-VL-8B-Instruct (QLoRA, then merged → BF16).
  • Task: resume page image(s) → structured JSON (23 fields: identity, contact, skills, experiences, educations, languages, certificates, projects, preferences).
  • Why fine-tune: the 23-field schema and the project's formatting rules are baked into the weights, so a one-line prompt replaces the ~280-line schema prompt the 32B base needed.
  • Measured (full 51-sample held-out split, A100, BF16, greedy): 83.9% weighted score, 88.2% unweighted, 88.2% JSON-valid. See Evaluation for the honest caveats.
  • Footprint: ~23 GB VRAM in BF16 at 16K context (vs. ~50 GB for the 32B it replaces).

Intended use

Extracting structured data from resume/CV documents rendered to images (PDF → PNG per page). The model is tuned for a specific downstream schema (below) used by a recruiting/ATS pipeline, including its enum vocabularies (PascalCase country names, a fixed list of roles/technologies/industries). It is most useful when you want one model call to turn a resume into a database-ready record.

It is not a general document-VQA model and should not be used to make automated decisions about candidates — see Out-of-scope.

Input / output schema

Input: one or more page images of a single resume, plus the short instruction the model was trained with (see How to use).

Output: a single JSON object with 23 top-level fields. Scalar fields are null when absent, list fields default to [], and address defaults to an object with the keys country_name and region_name.

| Field | Type | Notes |
|---|---|---|
| first_name, last_name | string | |
| email, phone | string | |
| date_of_birth | string | YYYY-MM-DD |
| desired_position | string | mapped to a fixed role vocabulary |
| about | string | free-text summary |
| job_experience | number | total years |
| job_expectations, min_salary, max_salary | string / number | |
| ready_to_relocation | bool | |
| work_modes, employment_types, employment_durations | string[] | enum values |
| hobbies | string | |
| address | object | {country_name, region_name} |
| skills | object[] | {skill_name, level} |
| experiences | object[] | {company_name, job, date_from, date_to, description, country_name} |
| educations | object[] | {name, degree, location, programme, date_from, date_to, country_name} |
| languages | object[] | {language_name, level} (level is an int) |
| certificates | object[] | {certificate_name, certificate_programme, issuing_date, expiring_date} |
| projects | object[] | {title, summary, used_technologies[], role, industries[]} |
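
The documented defaults can be sketched as a small helper that builds an empty record and reports which top-level fields a parsed output is missing. This is an illustration, not code from the repo: the names `SCALAR_FIELDS`, `LIST_FIELDS`, `empty_record`, and `missing_fields` are hypothetical, and the null values inside `address` are an assumption.

```python
# Hypothetical helper names; the field names come from the schema table.
SCALAR_FIELDS = [
    "first_name", "last_name", "email", "phone", "date_of_birth",
    "desired_position", "about", "job_experience", "job_expectations",
    "min_salary", "max_salary", "ready_to_relocation", "hobbies",
]
LIST_FIELDS = [
    "work_modes", "employment_types", "employment_durations", "skills",
    "experiences", "educations", "languages", "certificates", "projects",
]


def empty_record() -> dict:
    """Build a record with every field at its documented default."""
    record = {name: None for name in SCALAR_FIELDS}
    record.update({name: [] for name in LIST_FIELDS})
    # Assumption: an absent address is an object with null values.
    record["address"] = {"country_name": None, "region_name": None}
    return record


def missing_fields(parsed: dict) -> list:
    """Return the top-level schema fields absent from a parsed output."""
    return sorted(set(empty_record()) - set(parsed))
```

A downstream loader could call `missing_fields` on each parsed output and either reject the record or fill the gaps from `empty_record()`.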

Dates are normalized to YYYY-MM-DD (year-only ranges expand to Jan 1 / Dec 31; ongoing roles set date_to: null). Classification fields (desired_position, project role / used_technologies / industries, and all country_name fields) are mapped to predefined option lists, falling back to "Other" when nothing matches.
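
The model applies these rules itself, but the same conventions are handy when validating outputs or preparing expected values. The sketch below restates them in code under stated assumptions: `expand_year`, `normalize_range`, and `to_option` are hypothetical names, and the case-insensitive exact match in `to_option` is a simplification, since the real mapping is learned by the model and may be looser.

```python
from typing import Optional, Sequence


def expand_year(year: int, end: bool = False) -> str:
    """Expand a year-only date to Jan 1 (start) or Dec 31 (end)."""
    return f"{year:04d}-12-31" if end else f"{year:04d}-01-01"


def normalize_range(year_from: int, year_to: Optional[int]) -> dict:
    """Normalize a year-only range; an ongoing role keeps date_to as None."""
    return {
        "date_from": expand_year(year_from),
        "date_to": None if year_to is None else expand_year(year_to, end=True),
    }


def to_option(value: str, options: Sequence[str]) -> str:
    """Map a raw value onto a predefined option list, else "Other".

    Assumption: matching is a case-insensitive exact comparison.
    """
    lookup = {option.lower(): option for option in options}
    return lookup.get(value.strip().lower(), "Other")
```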

Real (anonymized) output example:

{
  "first_name": "Jane",
  "last_name": "Doe",
  "date_of_birth": null,
  "email": "jane@example.com",
  "phone": "+1-555-0100",
  "desired_position": "Android Developer",
  "about": null,
  "job_experience": null,
  "job_expectations": null,
  "min_salary": null,
  "max_salary": null,
  "ready_to_relocation": false,
  "work_modes": [],
  "employment_types": [],
  "employment_durations": [],
  "hobbies": null,
  "address": { "country_name": "Uzbekistan", "region_name": "Tashkent" },
  "skills": [
    { "skill_name": "Android Development", "level": null },
    { "skill_name": "Kotlin", "level": null },
    { "skill_name": "Firebase", "level": null }
  ],
  "experiences": [
    {
      "company_name": "Android Development Course",
      "job": "Student / Trainee (Android Development)",
      "date_from": "2021-01-01",
      "date_to": null,
      "description": "Android development course focused on Java/Kotlin/Android.",
      "country_name": null
    }
  ],
  "languages": [
    { "language_name": "Uzbek", "level": 6 },
    { "language_name": "English", "level": 2 },
    { "language_name": "Russian", "level": 0 }
  ],
  "educations": [
    {
      "name": "Tashkent University of Information Technologies",
      "degree": "Bachelor",
      "location": "Tashkent",
      "programme": "E-Commerce",
      "date_from": null,
      "date_to": "2019-01-01",
      "country_name": "Uzbekistan"
    }
  ],
  "certificates": [],
  "projects": [
    {
      "title": "Wallpaper App",
      "summary": "Wallpaper app based on MVVM, Coin, Flow, Retrofit.",
      "used_technologies": ["Kotlin", "Other"],
      "role": "Mobile Developer(IOS/Android)",
      "industries": ["Other"]
    }
  ]
}

Training data

  • 513 human-verified resume samples (private internal dataset). Each sample is a PDF rendered to one or more page PNGs plus a verified ground-truth JSON record.
  • Split: 462 train / 51 held-out eval, 90/10, fixed seed 42. Samples whose estimated token length exceeded ~15.2K (1K below the 16,384 context budget) were dropped from training, so the effective training count is ≤462.
  • Page distribution: 276 single-page, 136 two-page, 101 three-or-more-page (up to 8).
  • Language: predominantly English; some records contain non-English values (e.g. Russian/Uzbek company or language names).

The dataset is not released. Code to rebuild splits and bundles is in the repo (src/data_prep.py, src/export_eval_bundle.py).
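
For readers without access to the data, the split arithmetic can be reproduced on sample IDs alone. This is a minimal sketch, not the logic in src/data_prep.py: `split_samples` and `drop_long` are hypothetical names, the rounding rule (truncating 10% of 513 to 51) is an assumption that happens to match the reported counts, and the 15,200-token cutoff is the approximate figure quoted in the list above.

```python
import random


def split_samples(sample_ids, eval_fraction=0.1, seed=42):
    """Shuffle deterministically, then hold out a fraction for evaluation."""
    ids = sorted(sample_ids)          # stable starting order
    random.Random(seed).shuffle(ids)  # fixed seed makes the split repeatable
    n_eval = int(len(ids) * eval_fraction)
    return ids[n_eval:], ids[:n_eval]


def drop_long(train_ids, token_estimates, max_tokens=15_200):
    """Keep only training samples whose estimated length fits the budget."""
    return [i for i in train_ids if token_estimates[i] <= max_tokens]


train, held_out = split_samples(range(513))
print(len(train), len(held_out))  # 462 51
```

Filtering is applied to the training side only, which is why the effective training count can fall below 462 while the held-out split stays at 51.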

Training procedure

QLoRA via Unsloth (FastVisionModel) + TRL SFTTrainer. The 4-bit base (unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit, nf4) was adapted with LoRA on both the vision and language towers (attention + MLP modules), then the adapter was merged back into the full model and published.

Each training example is a single user turn — the page images followed by the combined system+user instruction — with the ground-truth JSON as the assistant target. There is no separate system role; this is why inference uses the same short prompt.
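
The shape of one training example, as described in the paragraph above, might look like the following. Treat it as a sketch under assumptions: `build_example` is a hypothetical name, the `"image"` content key and the blank-line join between the two instruction strings are guesses, and the repo's data-prep code is the authority on the exact layout.

```python
import json

# Prompt strings as they appear in this card's inference example.
SYSTEM_PROMPT = "You are a resume parser. Extract information from resume images into structured JSON."
USER_PROMPT = "Parse this resume and return the structured JSON."


def build_example(page_images, target: dict) -> dict:
    """One chat sample: a single user turn plus the ground-truth JSON."""
    # Assumption: the two instruction strings are joined with a blank line.
    instruction = f"{SYSTEM_PROMPT}\n\n{USER_PROMPT}"
    user_content = [{"type": "image", "image": page} for page in page_images]
    user_content.append({"type": "text", "text": instruction})
    answer = json.dumps(target, ensure_ascii=False)
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }
```

Because no message carries a system role, the inference prompt can reuse the same strings inside a single user turn.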

Dtype note: the merge used Unsloth's merged_16bit, and the original upload was labeled "float16", but the published config.json and stored tensors are bfloat16. Treat this model as BF16.

Hyperparameters

| Hyperparameter | Value |
|---|---|
| Method | QLoRA (4-bit nf4 base + LoRA, merged after training) |
| LoRA rank / alpha / dropout | 16 / 16 / 0 |
| Target modules | vision + language layers, attention + MLP (bias="none", no rslora) |
| Learning rate | 2e-4 |
| LR scheduler / warmup | cosine / 10 steps |
| Optimizer | adamw_8bit |
| Weight decay | 0.01 |
| Per-device batch / grad-accum | 1 / 4 (effective batch 4) |
| Epochs | 1 |
| Max sequence length | 16,384 |
| Precision | bf16 (fp16 fallback if unsupported) |
| Seed | 3407 |
| Hardware | Google Colab L4 (24 GB) |

Training time and final loss were not captured from the run.
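
Although the run length was not recorded, an upper bound on the number of optimizer steps follows from the table. The sketch below collects the listed values in a hypothetical `TrainConfig` dataclass (not a class from the repo) and assumes a single device and that a final partial accumulation still counts as one step.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from the hyperparameter table."""
    lora_rank: int = 16
    lora_alpha: int = 16
    lora_dropout: float = 0.0
    learning_rate: float = 2e-4
    warmup_steps: int = 10
    weight_decay: float = 0.01
    per_device_batch: int = 1
    grad_accum: int = 4
    epochs: int = 1
    max_seq_length: int = 16_384
    seed: int = 3407

    @property
    def effective_batch(self) -> int:
        return self.per_device_batch * self.grad_accum

    def max_optimizer_steps(self, n_train: int) -> int:
        """Upper bound on optimizer steps for n_train samples."""
        return math.ceil(n_train / self.effective_batch) * self.epochs


cfg = TrainConfig()
print(cfg.effective_batch, cfg.max_optimizer_steps(462))  # 4 116
```

With at most 116 steps in the single epoch, the 10 warmup steps take up on the order of a tenth of the run.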

Evaluation

Measured on 2026-06-05 with notebooks/eval_finetuned.ipynb on the full 51-sample held-out split, scored with the project's field-weighted scorer (src/evaluation.py). Setup: the published BF16 weights on a single A100 with greedy decoding (do_sample=False, max_new_tokens=4096).

| Metric | Result |
|---|---|
| Overall weighted score | 83.9% |
| Overall unweighted score | 88.2% |
| JSON validity | 88.2% (45/51 parsed; 6 failures) |
| Avg. inference | ~92.0 s/resume |
| Peak VRAM | 23.4 GB |

Per-field accuracy (worst → best):

| Field | Acc | Field | Acc |
|---|---|---|---|
| skills | 67.5% | ready_to_relocation | 88.2% |
| phone | 74.5% | certificates | 90.8% |
| desired_position | 79.2% | projects | 91.0% |
| address | 81.2% | job_expectations | 92.7% |
| experiences | 81.7% | hobbies | 96.1% |
| first_name | 82.3% | date_of_birth | 98.0% |
| last_name | 82.3% | work_modes | 98.0% |
| email | 84.3% | employment_types | 98.0% |
| job_experience | 84.3% | employment_durations | 98.0% |
| educations | 84.5% | min_salary | 100.0% |
| languages | 87.2% | max_salary | 100.0% |
| about | 88.2% | | |
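
As a consistency check, the headline figures can be recomputed from the numbers in this section. The snippet assumes the unweighted score is the plain mean of the 23 per-field accuracies, which the published values are consistent with. The weighted score cannot be recomputed this way because the field weights are not listed here.

```python
# Per-field accuracies (%) copied from the per-field table.
FIELD_ACC = {
    "skills": 67.5, "phone": 74.5, "desired_position": 79.2, "address": 81.2,
    "experiences": 81.7, "first_name": 82.3, "last_name": 82.3, "email": 84.3,
    "job_experience": 84.3, "educations": 84.5, "languages": 87.2, "about": 88.2,
    "ready_to_relocation": 88.2, "certificates": 90.8, "projects": 91.0,
    "job_expectations": 92.7, "hobbies": 96.1, "date_of_birth": 98.0,
    "work_modes": 98.0, "employment_types": 98.0, "employment_durations": 98.0,
    "min_salary": 100.0, "max_salary": 100.0,
}

unweighted = sum(FIELD_ACC.values()) / len(FIELD_ACC)
json_valid = 45 / 51 * 100  # parsed outputs / held-out samples

print(f"{unweighted:.1f}% unweighted, {json_valid:.1f}% JSON-valid")
# 88.2% unweighted, 88.2% JSON-valid
```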

Read these numbers with the following caveats:

  • Full held-out split, single run. These are all 51 held-out samples with greedy decoding — a real measurement, but one run on a modest test set, not a large benchmark.
  • Partial-credit metric. The scorer uses fuzzy string ratios, date/numeric tolerances, and greedy best-match over object arrays, with fields weighted by importance (work experience is weighted highest). It is not strict exact-match and is not comparable to other parsers' published numbers — it is an internal quality signal. The weighted score (83.9%) is below the unweighted (88.2%) because the highest-weighted fields — experiences, skills, identity/contact — are also the hardest ones.
  • The top-scoring fields are mostly "correctly empty." min_salary/max_salary (100%) and date_of_birth, work_modes, employment_types, employment_durations (~98%) are almost always absent in this data, so high scores largely reflect correctly returning empty — not hard extraction.
  • 6/51 invalid JSON (~12%). The most likely cause is truncation at the 4096-token generation limit on long multi-page resumes; downstream code must handle unparseable output (retry with a larger max_new_tokens, or repair the JSON).
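
To make the partial-credit caveat concrete, here is a toy scorer in the same spirit: fuzzy string ratios for scalars and greedy best-match over object arrays. It is not the implementation in src/evaluation.py. The function names are hypothetical, `difflib.SequenceMatcher` stands in for whatever ratio the project uses, and field weights and date/numeric tolerances are left out.

```python
from difflib import SequenceMatcher


def fuzzy(pred, truth) -> float:
    """Partial credit for two scalar values, in [0, 1]."""
    if pred is None and truth is None:
        return 1.0  # "correctly empty" counts as a full match
    if pred is None or truth is None:
        return 0.0
    return SequenceMatcher(None, str(pred).lower(), str(truth).lower()).ratio()


def score_object(pred: dict, truth: dict, keys) -> float:
    """Mean fuzzy score over the chosen keys of two objects."""
    return sum(fuzzy(pred.get(k), truth.get(k)) for k in keys) / len(keys)


def score_array(preds: list, truths: list, keys) -> float:
    """Greedy best-match: pair each truth item with its best unused prediction."""
    if not preds and not truths:
        return 1.0
    unused = list(preds)
    total = 0.0
    for truth in truths:
        if not unused:
            break
        best = max(unused, key=lambda p: score_object(p, truth, keys))
        total += score_object(best, truth, keys)
        unused.remove(best)
    # Dividing by the longer list penalizes both missed and extra items.
    return total / max(len(preds), len(truths))
```

In this toy version, `fuzzy(None, None)` returns 1.0, which is the mechanism behind the "correctly empty" fields scoring near 100%.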

For context, the model-selection benchmark that led to Qwen3-VL-8B (base models, ~10 samples, not reproducible from committed outputs) is noted in the repo's SESSION_LOG.md; it is not a fine-tuned result and is excluded here.

How to use

Requires a recent transformers (≥4.57 for Qwen3-VL; latest recommended). The published processor carries the correct chat template, so the modern image-in-messages path works without extra utilities.

# pip install -U transformers accelerate
import json
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "sukhrobnurali/qwen3vl-resume-parser"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, dtype="auto", device_map="auto", attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained(model_id)

# The 23-field schema is baked into the weights, so the short training prompt is all it needs.
SYSTEM_PROMPT = "You are a resume parser. Extract information from resume images into structured JSON."
USER_PROMPT = "Parse this resume and return the structured JSON."

# One entry per page, top to bottom. "url" accepts a local file path or an http(s) URL.
pages = ["resume_page_1.png", "resume_page_2.png"]

messages = [{
    "role": "user",
    "content": (
        [{"type": "text", "text": SYSTEM_PROMPT}]
        + [{"type": "image", "url": p} for p in pages]
        + [{"type": "text", "text": USER_PROMPT}]
    ),
}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)

generated = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
trimmed = generated[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

resume = json.loads(text)  # the 23-field record
print(json.dumps(resume, indent=2, ensure_ascii=False))

Use greedy decoding (do_sample=False) for stable structured output. For long multi-page resumes, raise max_new_tokens if you see truncated JSON.
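
Since a share of outputs will not parse, a tolerant wrapper around `json.loads` is a reasonable first line of defense. The helper below is a sketch with a hypothetical name (`parse_model_output`). It only trims text around the outermost braces and does not attempt to repair truncated JSON.

```python
import json
from typing import Optional


def parse_model_output(text: str) -> Optional[dict]:
    """Return the parsed record, or None if the output is not valid JSON."""
    # Keep only the span from the first "{" to the last "}", which drops
    # stray prose or code fences around the object if any are present.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None  # e.g. output truncated at max_new_tokens
    return parsed if isinstance(parsed, dict) else None
```

A caller that receives `None` can rerun generation with a larger `max_new_tokens` or route the resume to manual review.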

vLLM serving (the original deployment target):

vllm serve sukhrobnurali/qwen3vl-resume-parser \
  --dtype bfloat16 --max-model-len 16384 --trust-remote-code

When calling through the OpenAI-compatible API with the Python client, pass extra_body={"chat_template_kwargs": {"enable_thinking": False}} to keep the model in non-thinking (direct-JSON) mode.
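
A request body for the server's chat completions endpoint could be assembled as follows. This is a sketch: `build_chat_request` is a hypothetical helper, pages are sent inline as base64 data URLs, and the text/image ordering mirrors the transformers example in this card. In a raw HTTP body `chat_template_kwargs` sits at the top level, whereas the OpenAI Python client forwards it through `extra_body`.

```python
import base64

SYSTEM_PROMPT = "You are a resume parser. Extract information from resume images into structured JSON."
USER_PROMPT = "Parse this resume and return the structured JSON."


def build_chat_request(page_pngs, model="sukhrobnurali/qwen3vl-resume-parser"):
    """Build a chat-completions JSON body from a list of PNG byte strings."""
    content = [{"type": "text", "text": SYSTEM_PROMPT}]
    for png in page_pngs:
        data = base64.b64encode(png).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{data}"},
        })
    content.append({"type": "text", "text": USER_PROMPT})
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,  # greedy decoding
        "max_tokens": 4096,
        "chat_template_kwargs": {"enable_thinking": False},
    }
```

The resulting dict can be serialized with `json.dumps` and POSTed to `/v1/chat/completions` on the host and port where `vllm serve` is listening.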

Limitations

  • Domain skew. Training resumes skew toward IT/software roles, and the enum vocabularies (roles, technologies, industries) are IT-centric. Expect degradation on non-technical resumes, unusual layouts, scans/photos, or handwriting.
  • Language. English-dominant; non-English resumes are under-represented.
  • Schema lock-in. The model is tuned to one specific 23-field schema and its enum lists. It will coerce values toward those vocabularies (including "Other"), which may not match a different downstream schema.
  • Invalid JSON happens (~12% on the held-out split). Always parse defensively.
  • Latency. ~90 s/resume on an A100 at 16K context — batch/offline, not real-time.
  • Quantization. BF16 peaks at ~23 GB VRAM; it runs in 4-bit on a 16 GB GPU, but accuracy was only measured in BF16.
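
A back-of-envelope estimate helps put the quantization figures in context. It assumes the rounded "~9B params" count from this card and ignores quantization overhead such as scaling factors and layers kept in higher precision, so the numbers are rough lower bounds on weight storage only.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring overhead."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 9e9  # "~9B params" from this card; the exact count is not stated

print(f"BF16 : {weight_gb(N_PARAMS, 16):.1f} GB")  # BF16 : 18.0 GB
print(f"4-bit: {weight_gb(N_PARAMS, 4):.1f} GB")   # 4-bit: 4.5 GB
```

On this estimate, weights account for roughly 18 GB of the measured 23.4 GB BF16 peak, with the rest going to activations, the KV cache, and runtime overhead.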

Out-of-scope and responsible use

  • No automated candidate decisions. Resume parsing for screening/ranking carries fairness and bias risk. Keep a human in the loop; do not use this model to make or materially influence hiring decisions without review.
  • Not a general VQA / OCR model. It is specialized for this resume schema.
  • PII. Resumes contain personal data. Handle outputs under the applicable privacy law (e.g. GDPR) — secure storage, access control, retention limits, and a lawful basis for processing.
  • Verify before trusting. Outputs are model predictions, not ground truth; validate critical fields (contact info, dates) downstream.

License

Released under Apache-2.0, inherited from the Qwen/Qwen3-VL-8B-Instruct base model.

Citation

@misc{nurali2026qwen3vlresumeparser,
  title        = {qwen3vl-resume-parser: a Qwen3-VL-8B fine-tune for resume-to-JSON extraction},
  author       = {Nurali, Sukhrob},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/sukhrobnurali/qwen3vl-resume-parser}}
}

Built on Qwen3-VL by the Qwen team; see the Qwen3-VL model card and Unsloth for the training stack.

Author

Sukhrob Nurali (sukhrobnurali@gmail.com)
Hugging Face: @sukhrobnurali · GitHub: @sukhrobnurali
