# Qwen3-VL BBox-DocVQA Fine-tuned Model
This model is a fine-tuned version of Qwen/Qwen3-VL-2B-Instruct on the BBox-DocVQA dataset for document grounding tasks.
## Model Description

The model is fine-tuned to:
- Answer questions about document images
- Predict bounding boxes around answer regions
- Output bounding-box coordinates on a 0-1000 normalized scale
## Training
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Dataset: Yuwh07/BBox_DocVQA_Train
- Method: LoRA/QLoRA fine-tuning with TRL SFTTrainer
## Usage
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained("hrishikeshdk26/bbox-docvqa-qwen3-vl-2b-sft")
processor = AutoProcessor.from_pretrained("hrishikeshdk26/bbox-docvqa-qwen3-vl-2b-sft")

# Use with vLLM for inference
# See bbox_docvqa evaluation pipeline for details
```
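A minimal end-to-end sketch of running the model with plain `transformers` is shown below. The message layout follows the standard Qwen-VL chat format (one image plus one text question per user turn); the image path, the example question, and the `RUN_INFERENCE` guard are illustrative assumptions, not part of this repository.

```python
def build_messages(image, question: str) -> list:
    # One user turn containing an image followed by the question text,
    # in the chat format the Qwen-VL processors expect.
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]


RUN_INFERENCE = False  # flip to True once the checkpoint is available

if RUN_INFERENCE:
    # Imported here so the sketch stays runnable without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "hrishikeshdk26/bbox-docvqa-qwen3-vl-2b-sft"
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # "document.png" and the question are placeholder inputs.
    messages = build_messages("document.png", "What is the invoice total?")
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt.
    print(processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0])
```

The generated text should follow the output format described below.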
## Output Format

```
<answer>Your answer here</answer>
<boxes>
<box page="1">(x1,y1),(x2,y2)</box>
</boxes>
```
Coordinates are normalized to 0-1000 scale.
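To use the predictions downstream, the tagged output can be parsed with a small regex helper and the 0-1000 coordinates mapped back to pixels. This is a sketch assuming the exact tag layout shown above; the function names are illustrative, not part of the model's tooling.

```python
import re


def parse_prediction(text: str):
    """Extract the answer string and per-page boxes from a model output."""
    answer_m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    answer = answer_m.group(1).strip() if answer_m else None

    boxes = []
    # Matches: <box page="1">(x1,y1),(x2,y2)</box>
    box_re = r'<box page="(\d+)">\((\d+),(\d+)\),\((\d+),(\d+)\)</box>'
    for m in re.finditer(box_re, text):
        page, x1, y1, x2, y2 = map(int, m.groups())
        boxes.append({"page": page, "bbox": (x1, y1, x2, y2)})
    return answer, boxes


def to_pixels(bbox, width: int, height: int):
    """Map a 0-1000 normalized box to pixel coordinates for a page size."""
    x1, y1, x2, y2 = bbox
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)
```

For example, a box `(100,200),(300,400)` on a 1000x2000 px page maps to pixel corners (100, 400) and (300, 800).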