# Qwen3-VL BBox-DocVQA Fine-tuned Model
This model is a fine-tuned version of Qwen/Qwen3-VL-2B-Instruct on the BBox-DocVQA dataset for document grounding tasks.
## Model Description

The model is fine-tuned to:
- Answer questions about document images
- Predict bounding boxes around answer regions
- Output bounding-box coordinates on a 0-1000 normalized scale
## Training
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Dataset: Yuwh07/BBox_DocVQA_Train
- Method: LoRA/QLoRA fine-tuning with TRL SFTTrainer
## Usage
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained("hrishikeshdk26/bbox-docvqa-qwen3-vl-2b-sft")
processor = AutoProcessor.from_pretrained("hrishikeshdk26/bbox-docvqa-qwen3-vl-2b-sft")

# Use with vLLM for inference
# See bbox_docvqa evaluation pipeline for details
```
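A minimal end-to-end sketch of running the model with plain `transformers` is shown below. The message layout follows the standard Qwen-VL chat format (one image plus one text question per user turn); the image path, the example question, and the `RUN_INFERENCE` guard are illustrative assumptions, not part of this repository.

```python
def build_messages(image, question: str) -> list:
    # One user turn containing an image followed by the question text,
    # in the chat format the Qwen-VL processors expect.
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]


RUN_INFERENCE = False  # flip to True once the checkpoint is available

if RUN_INFERENCE:
    # Imported here so the sketch stays runnable without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "hrishikeshdk26/bbox-docvqa-qwen3-vl-2b-sft"
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # "document.png" and the question are placeholder inputs.
    messages = build_messages("document.png", "What is the invoice total?")
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens, not the prompt.
    print(processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0])
```

The generated text should follow the output format described below.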
## Output Format

```
<answer>Your answer here</answer>
<boxes>
<box page="1">(x1,y1),(x2,y2)</box>
</boxes>
```
Coordinates are normalized to 0-1000 scale.
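To use the predictions downstream, the tagged output can be parsed with a small regex helper and the 0-1000 coordinates mapped back to pixels. This is a sketch assuming the exact tag layout shown above; the function names are illustrative, not part of the model's tooling.

```python
import re


def parse_prediction(text: str):
    """Extract the answer string and per-page boxes from a model output."""
    answer_m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    answer = answer_m.group(1).strip() if answer_m else None

    boxes = []
    # Matches: <box page="1">(x1,y1),(x2,y2)</box>
    box_re = r'<box page="(\d+)">\((\d+),(\d+)\),\((\d+),(\d+)\)</box>'
    for m in re.finditer(box_re, text):
        page, x1, y1, x2, y2 = map(int, m.groups())
        boxes.append({"page": page, "bbox": (x1, y1, x2, y2)})
    return answer, boxes


def to_pixels(bbox, width: int, height: int):
    """Map a 0-1000 normalized box to pixel coordinates for a page size."""
    x1, y1, x2, y2 = bbox
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)
```

For example, a box `(100,200),(300,400)` on a 1000x2000 px page maps to pixel corners (100, 400) and (300, 800).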