Qwen3-VL BBox-DocVQA Fine-tuned Model

This model is a fine-tuned version of Qwen/Qwen3-VL-2B-Instruct on the BBox-DocVQA dataset for document grounding tasks.

Model Description

The model was fine-tuned to:

  • Answer questions about document images
  • Predict bounding boxes around answer regions
  • Output coordinates in 0-1000 normalized scale
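
Downstream code typically maps these normalized coordinates back to pixel space. A minimal sketch of that conversion (the helper name is mine, not part of the model's API):

```python
def denormalize_box(box, image_width, image_height):
    """Convert a ((x1, y1), (x2, y2)) box on the 0-1000 scale to pixel coordinates."""
    (x1, y1), (x2, y2) = box
    return (
        (x1 * image_width / 1000, y1 * image_height / 1000),
        (x2 * image_width / 1000, y2 * image_height / 1000),
    )
```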

Training

  • Base model: Qwen/Qwen3-VL-2B-Instruct
  • Dataset: Yuwh07/BBox_DocVQA_Train
  • Method: LoRA/QLoRA fine-tuning with TRL SFTTrainer
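
A training setup along these lines can be sketched as follows. This is an illustrative configuration only: the LoRA rank, alpha, target modules, and every SFTConfig hyperparameter below are assumptions, not the values actually used.

```python
# Illustrative LoRA + TRL SFTTrainer setup.
# All hyperparameters here are assumptions, not the actual training config.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("Yuwh07/BBox_DocVQA_Train", split="train")

peft_config = LoraConfig(
    r=16,                                  # LoRA rank (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="bbox-docvqa-qwen3-vl-2b-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-VL-2B-Instruct",     # base model from the card
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=training_args,
)
trainer.train()
```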

Usage

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "hrishikeshdk26/bbox-docvqa-qwen3-vl-2b-sft",
    torch_dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("hrishikeshdk26/bbox-docvqa-qwen3-vl-2b-sft")

# The model can also be served with vLLM for batched inference;
# see the bbox_docvqa evaluation pipeline for details.

Output Format

<answer>Your answer here</answer>
<boxes>
<box page="1">(x1,y1),(x2,y2)</box>
</boxes>

Coordinates are normalized to a 0-1000 scale.
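
The tagged output can be recovered with a small parser. A sketch, assuming the exact tag layout shown above (the function name is mine):

```python
import re

def parse_prediction(text):
    """Extract the answer string and per-page boxes from the model's tagged output."""
    answer_match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    answer = answer_match.group(1).strip() if answer_match else None

    # Each box line looks like: <box page="1">(x1,y1),(x2,y2)</box>
    pattern = r'<box page="(\d+)">\((\d+),(\d+)\),\((\d+),(\d+)\)</box>'
    boxes = [
        {"page": int(page), "box": (int(x1), int(y1), int(x2), int(y2))}
        for page, x1, y1, x2, y2 in re.findall(pattern, text)
    ]
    return answer, boxes
```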
