MechVL-4B-RL

The RL checkpoint of MechVL, a domain-specialized multimodal model for mechanical engineering drawing understanding, introduced in:

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding (ICML 2026)

Model description

MechVL-4B-RL is obtained by further optimizing MechVL-4B-SFT with DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) in a two-stage self-play regime:

  • Stage 1 (full): DAPO over the full MechVQA training split.
  • Stage 2 (targeted): DAPO over a re-sampled subset with an increased proportion of underperforming subtasks.
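
The Stage 2 re-sampling can be pictured with a minimal sketch. It assumes each training example carries a `subtask` label and that a per-subtask accuracy from Stage 1 is available. The field name, the `1 - accuracy` weighting, and the `floor` value are illustrative assumptions, not the procedure used in the paper.

```python
import random

def resample_targeted(examples, subtask_accuracy, size, seed=0, floor=0.05):
    """Draw `size` examples (with replacement), favoring weak subtasks.

    Assumption: each example is a dict with a "subtask" key, and
    `subtask_accuracy` maps subtask name -> accuracy in [0, 1].
    """
    rng = random.Random(seed)
    # Weight each example by its subtask's error rate; the floor keeps
    # well-solved subtasks from disappearing from the subset entirely.
    weights = [max(1.0 - subtask_accuracy[ex["subtask"]], floor) for ex in examples]
    return rng.choices(examples, weights=weights, k=size)
```

With accuracies of 0.9 and 0.3 for two equally sized subtasks, the weaker one receives a weight of 0.7 against 0.1, so it dominates the re-sampled subset.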

The reward is the sum of three terms: Accuracy (an LLM-as-a-Judge score for semantic equivalence, in [0,1]), Format (binary: whether the <think>/<answer> structure is well-formed), and Quality (Logic / Professionalism / Conciseness).
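
As a rough illustration of how these terms could combine, the sketch below implements the binary Format check and adds it to externally supplied Accuracy and Quality scores. The unweighted sum, the averaging of the three Quality sub-scores, and the function names are assumptions made for this example; the actual weights and judge prompts are not given here.

```python
import re

# One <think> block followed by one <answer> block, with nothing else around them.
_FORMAT_RE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")

def format_reward(response: str) -> float:
    """Return 1.0 if the response is well-formed, else 0.0."""
    if _FORMAT_RE.fullmatch(response) is None:
        return 0.0
    # Reject repeated or nested tags that the regex alone would tolerate.
    if any(response.count(tag) != 1 for tag in _TAGS):
        return 0.0
    return 1.0

def total_reward(response, accuracy, logic, professionalism, conciseness):
    """Assumed combination: accuracy + format + mean of the quality sub-scores.

    `accuracy` and the three quality scores are expected in [0, 1] and are
    produced elsewhere (for example by a judge model).
    """
    quality = (logic + professionalism + conciseness) / 3.0
    return accuracy + format_reward(response) + quality
```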

Base model: Qwen3-VL-4B-Instruct
Architecture: Qwen3VLForConditionalGeneration
Stage: 2 / 2, RL (DAPO self-play, on top of SFT)
MechVQA Total: 84.85 (best across open- and closed-source MLLMs)
SFT checkpoint: xiaofengalg/MechVL-4B-SFT

Output format

The model reasons first and then answers, wrapping each part in its own tags:

<think> ...reasoning... </think><answer> ...final answer... </answer>

Parse the <answer>...</answer> span for the final answer. A regex such as <answer>(.*?)</answer> works; compile it with re.DOTALL so that multi-line answers are matched.
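
A small helper along these lines can make the parsing explicit. It is a sketch: `ParsedOutput` and `parse_output` are names introduced here, not part of the released code. It returns `None` for a missing span, which lets the caller detect generations that were cut off before the answer was closed.

```python
import re
from typing import NamedTuple, Optional

class ParsedOutput(NamedTuple):
    reasoning: Optional[str]
    answer: Optional[str]

# DOTALL lets "." match newlines, so multi-line spans are captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def parse_output(text: str) -> ParsedOutput:
    """Extract the first <think> and <answer> spans; None if a span is absent."""
    think = _THINK_RE.search(text)
    answer = _ANSWER_RE.search(text)
    return ParsedOutput(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )
```

Checking `parse_output(text).answer is None` is a simple way to flag outputs that hit the token limit mid-reasoning.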

Usage (ModelScope)

import re, torch
from modelscope import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "xiaofengalg/MechVL-4B-RL", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("xiaofengalg/MechVL-4B-RL")

# English: "What is the total length of the part as dimensioned in the drawing?"
question = "图纸中标注的零件总长度是多少?"
messages = [{"role": "user", "content": [
    {"type": "image", "image": "path/to/drawing.png"},
    {"type": "text", "text": question},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=4096, do_sample=True,
                     temperature=0.6, top_p=0.95, top_k=20)
# Decode only the newly generated tokens, not the echoed prompt.
text = processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
print("answer:", m.group(1).strip() if m else text)

Also available on HuggingFace. For batch vLLM inference with the exact training-time format prompt, see scripts/batch_infer.py (MODE=rl) and prompts/mech_r1.jinja.

Results

On the MechVQA benchmark (Total score):

Model                             Total
GPT-5                             75.44
Gemini-3-Pro-Preview              77.28
GLM-4.6V (best closed-source)     78.91
MechVL-4B-SFT                     76.36
MechVL-4B-RL (this model)         84.85

See §6 of the paper for the full table and ablations.
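
To read the gaps off the Total scores above, the following snippet computes the margin of MechVL-4B-RL over each other listed model. The scores are copied from the table; the dictionary layout and variable names are just for this example.

```python
# MechVQA Total scores as listed in the results table.
scores = {
    "GPT-5": 75.44,
    "Gemini-3-Pro-Preview": 77.28,
    "GLM-4.6V": 78.91,
    "MechVL-4B-SFT": 76.36,
    "MechVL-4B-RL": 84.85,
}

ours = scores["MechVL-4B-RL"]
# Margin of the RL checkpoint over every other listed model, in points.
margins = {
    name: round(ours - score, 2)
    for name, score in scores.items()
    if name != "MechVL-4B-RL"
}

for name, gap in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name}: +{gap:.2f}")
```

On these numbers, the RL stage adds 8.49 points over the SFT checkpoint and leads the strongest listed baseline, GLM-4.6V, by 5.94 points.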

Citation

@misc{kou2026mechvqabenchmarkingenhancingmultimodal,
      title={MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding},
      author={Qian Kou and Xiaofeng Shi and Yulin Li and Xiaosong Qiu and Xinyang Wang and Hua Zhou and Cao Dongxing},
      year={2026},
      eprint={2605.30794},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.30794}
}

License

Apache-2.0.
