MechVL-4B-RL

The RL checkpoint of MechVL, the domain-specialized multimodal model for understanding mechanical engineering drawings, introduced in:

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding (ICML 2026)

Model description

MechVL-4B-RL is obtained by further optimizing MechVL-4B-SFT with DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) in a two-stage self-play regime:

Stage 1 (full): DAPO over the full MechVQA training split.
Stage 2 (targeted): DAPO over a re-sampled subset with an increased proportion of underperforming subtasks.

Reward = Accuracy (LLM-as-a-Judge, semantic equivalence in [0,1]) + Format (binary, well-formed <think>/<answer>) + Quality (Logic / Professionalism / Conciseness).
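The reward composition above can be sketched as follows. This is a minimal illustration, not the training code: the equal weights, the averaging of the three Quality sub-scores, and the names `format_reward` and `total_reward` are all assumptions, and the Accuracy and Quality scores are taken as inputs because they come from an LLM judge.

```python
import re

# Well-formed means: one <think>...</think> block followed by one <answer>...</answer> block.
_FORMAT_RE = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")


def format_reward(response: str) -> float:
    """Binary format reward: 1.0 if the response is well-formed, else 0.0."""
    well_formed = _FORMAT_RE.fullmatch(response) is not None and all(
        response.count(tag) == 1 for tag in _TAGS
    )
    return 1.0 if well_formed else 0.0


def total_reward(response, accuracy, quality, w_acc=1.0, w_fmt=1.0, w_qual=1.0):
    """Combine the three reward terms.

    accuracy: judge score in [0, 1] (semantic equivalence with the reference).
    quality:  dict of judge sub-scores (logic, professionalism, conciseness),
              averaged here; the averaging and the weights are assumptions.
    """
    quality_score = sum(quality.values()) / len(quality)
    return w_acc * accuracy + w_fmt * format_reward(response) + w_qual * quality_score


if __name__ == "__main__":
    resp = "<think>Add the three segments.</think><answer>120 mm</answer>"
    scores = {"logic": 1.0, "professionalism": 0.5, "conciseness": 0.0}
    print(total_reward(resp, accuracy=1.0, quality=scores))
```

Checking the tag counts in addition to the regex rejects responses that repeat a tag, which a non-greedy pattern alone would accept.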


Base model	Qwen3-VL-4B-Instruct
Architecture	Qwen3VLForConditionalGeneration
Stage	2 / 2 — RL (DAPO self-play, on top of SFT)
MechVQA Total	84.85 (best across open- & closed-source MLLMs)
SFT checkpoint	xiaofengalg/MechVL-4B-SFT

Output format

The model reasons first and then answers, wrapping each part in its own tag:

<think> ...reasoning... </think><answer> ...final answer... </answer>

Parse the <answer>...</answer> span for the final answer. A regex like <answer>(.*?)</answer> works, provided "." is allowed to match newlines (re.DOTALL in Python) so multi-line answers are captured.
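A small helper along these lines can separate the reasoning from the final answer. It is a sketch under the output format described above; `ParsedResponse` and `parse_response` are illustrative names, not part of the released code.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    reasoning: Optional[str]  # text inside <think>...</think>, or None if absent
    answer: Optional[str]     # text inside <answer>...</answer>, or None if absent


# re.DOTALL lets "." span newlines, so multi-line reasoning and answers are captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> ParsedResponse:
    """Extract the first reasoning span and the first answer span from a response."""
    think = _THINK_RE.search(text)
    answer = _ANSWER_RE.search(text)
    return ParsedResponse(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )


if __name__ == "__main__":
    raw = "<think>Sum the\nsegments.</think><answer>\n120 mm\n</answer>"
    print(parse_response(raw))
```

Returning `None` for a missing span makes truncated generations (for example, when the token budget runs out before the closing answer tag) easy to detect and retry.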

Usage (ModelScope)

import re, torch
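from modelscope import AutoModelForImageTextToText, AutoProcessor

# Qwen3-VL takes image + text input, so use the image-text-to-text auto class;
# it should resolve to the Qwen3VLForConditionalGeneration architecture listed above.
model = AutoModelForImageTextToText.from_pretrained(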
    "xiaofengalg/MechVL-4B-RL", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("xiaofengalg/MechVL-4B-RL")

# English: "What is the total length of the part as annotated in the drawing?"
question = "图纸中标注的零件总长度是多少？"
messages = [{"role": "user", "content": [
    {"type": "image", "image": "path/to/drawing.png"},
    {"type": "text", "text": question},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=4096, do_sample=True,
                     temperature=0.6, top_p=0.95, top_k=20)
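# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
text = processor.decode(new_tokens, skip_special_tokens=True)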
m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
print("answer:", m.group(1).strip() if m else text)

Also available on HuggingFace. For batch vLLM inference with the exact training-time format prompt, see scripts/batch_infer.py (MODE=rl) and prompts/mech_r1.jinja.

Results

On the MechVQA benchmark (Total score):

Model	Total
GPT-5	75.44
Gemini-3-Pro-Preview	77.28
GLM-4.6V (best closed-source)	78.91
MechVL-4B-SFT	76.36
MechVL-4B-RL (this model)	84.85

See §6 of the paper for the full table and ablations.
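The margins implied by the Total scores listed in this section can be recomputed with a few lines. This is only arithmetic on the numbers above; the variable names are illustrative.

```python
# Total scores copied from the Results table in this section.
scores = {
    "GPT-5": 75.44,
    "Gemini-3-Pro-Preview": 77.28,
    "GLM-4.6V": 78.91,
    "MechVL-4B-SFT": 76.36,
    "MechVL-4B-RL": 84.85,
}

ours = "MechVL-4B-RL"

# Margin of the RL checkpoint over every other model, in Total-score points.
margins = {
    name: round(scores[ours] - score, 2)
    for name, score in scores.items()
    if name != ours
}

for name, margin in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{ours} vs {name}: +{margin:.2f}")
```

The gap to MechVL-4B-SFT isolates what the DAPO stage adds on top of supervised fine-tuning.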

Citation

@misc{kou2026mechvqabenchmarkingenhancingmultimodal,
      title={MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding},
      author={Qian Kou and Xiaofeng Shi and Yulin Li and Xiaosong Qiu and Xinyang Wang and Hua Zhou and Cao Dongxing},
      year={2026},
      eprint={2605.30794},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.30794}
}

License

Apache-2.0.


Safetensors

Model size	5B params
Tensor type	BF16
