MechVL-4B-RL
The RL checkpoint of MechVL — the domain-specialized multimodal model for mechanical engineering drawing understanding, introduced in:
MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding (ICML 2026)
Model description
MechVL-4B-RL is obtained by further optimizing MechVL-4B-SFT with DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) in a two-stage self-play regime:
- Stage 1 (full): DAPO over the full MechVQA training split.
- Stage 2 (targeted): DAPO over a re-sampled subset with an increased proportion of underperforming subtasks.
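The targeted re-sampling in Stage 2 can be pictured with a small sketch. This is not the authors' procedure: the weighting rule (sample weight equal to the subtask's error rate, with a floor), the `subtask` field name, and the function `resample_by_weakness` are all assumptions made for illustration.

```python
import random

def resample_by_weakness(examples, subtask_accuracy, subset_size, seed=0, floor=0.05):
    """Draw a subset (with replacement) that over-represents weak subtasks.

    examples: list of dicts with a "subtask" key (hypothetical schema).
    subtask_accuracy: dict mapping subtask name -> accuracy in [0, 1].
    floor: minimum weight, so strong subtasks are never dropped entirely.
    """
    rng = random.Random(seed)
    # Assumed rule: weight each example by its subtask's error rate.
    weights = [max(1.0 - subtask_accuracy[ex["subtask"]], floor) for ex in examples]
    return rng.choices(examples, weights=weights, k=subset_size)

# Toy data: two equally sized subtasks, one much weaker than the other.
pool = [{"subtask": "dimensions"}] * 100 + [{"subtask": "tolerances"}] * 100
accuracy = {"dimensions": 0.9, "tolerances": 0.5}
subset = resample_by_weakness(pool, accuracy, subset_size=2000)
n_weak = sum(ex["subtask"] == "tolerances" for ex in subset)
print(f"tolerances share: {n_weak / len(subset):.2f}")
```

With these toy numbers the weaker subtask is drawn far more often than the stronger one, which is the effect the Stage 2 description refers to.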
Reward = Accuracy (LLM-as-a-Judge, semantic equivalence in [0,1]) + Format (binary, well-formed <think>/<answer>) + Quality (Logic / Professionalism / Conciseness).
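As an illustration of the binary Format term, the following is a minimal sketch of a well-formedness check. The exact criterion used during training is not stated here, so the rule below (exactly one `<think>` block followed by exactly one `<answer>` block, with nothing but whitespace around them) and the name `format_reward` are assumptions.

```python
import re

# One <think> block followed by one <answer> block, optional whitespace around.
_FORMAT_RE = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)
_TAGS = ("<think>", "</think>", "<answer>", "</answer>")

def format_reward(text: str) -> float:
    """Return 1.0 if the output is well-formed, else 0.0 (assumed criterion)."""
    m = _FORMAT_RE.fullmatch(text)
    if m is None:
        return 0.0
    # Reject repeated or nested tags hidden inside either block.
    body = m.group(1) + m.group(2)
    if any(tag in body for tag in _TAGS):
        return 0.0
    return 1.0

print(format_reward("<think> read the dimension line </think><answer> 42 mm </answer>"))
print(format_reward("42 mm"))
```

The Accuracy and Quality terms depend on an LLM judge and are not sketched here.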
| | |
|---|---|
| Base model | Qwen3-VL-4B-Instruct |
| Architecture | Qwen3VLForConditionalGeneration |
| Stage | 2 / 2: RL (DAPO self-play, on top of SFT) |
| MechVQA Total | 84.85 (best across open- & closed-source MLLMs) |
| SFT checkpoint | xiaofengalg/MechVL-4B-SFT |
Output format
The model reasons first and then answers, wrapping each part in its own tags:
<think> ...reasoning... </think><answer> ...final answer... </answer>
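Parse the <answer>...</answer> span for the final answer. A regex like <answer>(.*?)</answer> works, provided the dot is allowed to match newlines (re.DOTALL in Python) so that multi-line answers are captured.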
Usage (ModelScope)
```python
import re, torch
from modelscope import AutoModelForCausalLM, AutoProcessor

# Load the RL checkpoint in bfloat16; device_map="auto" picks the device(s).
model = AutoModelForCausalLM.from_pretrained(
    "xiaofengalg/MechVL-4B-RL", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("xiaofengalg/MechVL-4B-RL")

# English: "What is the total length of the part as dimensioned in the drawing?"
question = "图纸中标注的零件总长度是多少?"
messages = [{"role": "user", "content": [
    {"type": "image", "image": "path/to/drawing.png"},
    {"type": "text", "text": question},
]}]

# Build the chat prompt and tokenize the image and text together.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=4096, do_sample=True,
                     temperature=0.6, top_p=0.95, top_k=20)
text = processor.decode(out[0], skip_special_tokens=True)

# Extract the final answer; fall back to the full decoded text if no tag is found.
m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
print("answer:", m.group(1).strip() if m else text)
```
Also available on Hugging Face as MonteXiaofeng/MechVL-4B-RL. For batch vLLM inference with the exact training-time format prompt, see scripts/batch_infer.py (MODE=rl) and prompts/mech_r1.jinja.
Results
On the MechVQA benchmark (Total score):
| Model | Total |
|---|---|
| GPT-5 | 75.44 |
| Gemini-3-Pro-Preview | 77.28 |
| GLM-4.6V (best closed-source) | 78.91 |
| MechVL-4B-SFT | 76.36 |
| MechVL-4B-RL (this model) | 84.85 |
See §6 of the paper for the full table and ablations.
Citation
@misc{kou2026mechvqabenchmarkingenhancingmultimodal,
title={MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding},
author={Qian Kou and Xiaofeng Shi and Yulin Li and Xiaosong Qiu and Xinyang Wang and Hua Zhou and Cao Dongxing},
year={2026},
eprint={2605.30794},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.30794}
}
License
Apache-2.0.