# Qwen2-VL Nutrition Table Detector
A fine-tuned Qwen2-VL-7B-Instruct model for detecting nutrition table bounding boxes in food packaging images.
This model represents the best-performing of 4 systematically compared fine-tuning strategies, improving Mean IoU by +780% over the base model.
## Model Description
- Model Type: Vision-Language Model (VLM) for object detection
- Base Model: Qwen/Qwen2-VL-7B-Instruct
- Task: Bounding box detection of nutrition tables
- Precision: bfloat16
- Model Size: ~15.5 GB
## Performance
Evaluated on 50 test samples from the OpenFoodFacts nutrition-table-detection dataset:
| Metric | Value |
|---|---|
| Mean IoU | 0.8636 |
| Detection Rate | 100% |
| IoU > 0.5 | 92% |
| IoU > 0.7 | 88% |
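Here, IoU is the standard intersection-over-union between predicted and ground-truth boxes. A minimal sketch of the computation for `(x1, y1, x2, y2)` boxes (illustrative, not necessarily the exact evaluation script):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes don't overlap)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```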
## Recipe Comparison
We systematically evaluated 4 different fine-tuning strategies to identify the optimal approach:
| Recipe | Strategy | Mean IoU | Det Rate | IoU>0.5 | IoU>0.7 |
|---|---|---|---|---|---|
| Base (no fine-tuning) | - | 0.0981 | 18% | 10% | 6% |
| r1-llm-only | LLM LoRA only | 0.8349 | 100% | 86% | 82% |
| r2-vision-only | Vision encoder full fine-tune | 0.8330 | 100% | 88% | 82% |
| r3-two-stage | Vision first, then LLM LoRA | 0.8366 | 100% | 90% | 80% |
| r4-joint (this model) | Joint LoRA (vision + LLM) | 0.8636 | 100% | 92% | 88% |
Key finding: Joint LoRA training on both the vision encoder and the LLM achieves the best results, improving Mean IoU from 0.0981 to 0.8636 (+780% relative to the base model).
## Training Details
### Method
- Fine-tuning Approach: 4-bit NF4 QLoRA (Quantized Low-Rank Adaptation)
- Strategy: Joint LoRA on both vision encoder and LLM
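The exact quantization flags are not published with this card; a minimal sketch of a 4-bit NF4 setup via `bitsandbytes`, with double quantization as an assumption:

```python
import torch
from transformers import BitsAndBytesConfig

# Hedged reconstruction of the 4-bit NF4 QLoRA quantization config;
# double quantization is an assumption, not confirmed by the card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the training precision
    bnb_4bit_use_double_quant=True,         # assumption
)
```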
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.1 |
| Target Modules (LLM) | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Target Modules (Vision) | qkv, fc1, fc2, attn.proj |
| Trainable Parameters | 161M (1.9% of 8.3B total) |
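A minimal `peft` sketch of this configuration (suffix-based module matching; the actual training script may target the vision tower differently):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=[
        # LLM attention and MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        # Vision encoder attention and MLP
        "qkv", "fc1", "fc2", "attn.proj",
    ],
    task_type="CAUSAL_LM",
)
```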
### Training Arguments
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 1 |
| Gradient Accumulation Steps | 8 |
| Learning Rate | 2e-5 |
| Scheduler | Cosine with warmup |
| Warmup Steps | 100 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Max Sequence Length | 2048 |
| Precision | bfloat16 |
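As a rough reconstruction, these map onto `transformers.TrainingArguments` as follows (`output_dir` is a placeholder; the max sequence length of 2048 is enforced at tokenization time, not here):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2vl-nutrition-lora",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size of 8 per device
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    optim="adamw_torch",
    weight_decay=0.01,
    bf16=True,
)
```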
### Dataset
- Source: openfoodfacts/nutrition-table-detection
- Training Samples: ~1,083
- Task: Detect bounding boxes of nutrition tables in food packaging images
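A minimal loading sketch with `datasets` (the split name and column layout are assumptions about the Hub dataset):

```python
from datasets import load_dataset

# Assumes a "train" split; inspect ds.features for the actual
# image and bounding-box columns.
ds = load_dataset("openfoodfacts/nutrition-table-detection", split="train")
print(len(ds))      # ~1,083 training samples
print(ds.features)  # image + annotation schema
```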
### Hardware
- Trained on 2x NVIDIA RTX 6000 Ada (48GB VRAM each)
- Model parallelism via `device_map="balanced"`
## Usage
### Python Inference
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/qwen2vl-nutrition-table-detector",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "DanJZY/qwen2vl-nutrition-table-detector",
    trust_remote_code=True,
)

# Prepare input
image = Image.open("food_packaging.jpg")
messages = [
    {
        "role": "system",
        "content": "You are an expert at detecting nutrition tables in images. When asked to detect a nutrition table, output the bounding box coordinates.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Detect the bounding box of the nutrition table in this image."},
        ],
    },
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens; keep special tokens so the
# <|box_start|>/<|box_end|> markers survive for parsing.
generated = outputs[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
```
### vLLM Serving
For production deployment, we recommend using vLLM for efficient serving:
```bash
# Start vLLM server
vllm serve DanJZY/qwen2vl-nutrition-table-detector \
  --served-model-name qwen2vl-nutrition \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 4096 \
  --limit-mm-per-prompt '{"image":1}' \
  --gpu-memory-utilization 0.9 \
  --port 8000
```
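Once the server is running, it can be queried through vLLM's OpenAI-compatible API. A hedged sketch (the endpoint and image path are placeholders; the model name matches `--served-model-name` above):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send the image as a base64 data URL
with open("food_packaging.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen2vl-nutrition",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text",
             "text": "Detect the bounding box of the nutrition table in this image."},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```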
vLLM Performance (RTX 3090, 24GB):
| Concurrency | Throughput | Avg E2E Latency |
|---|---|---|
| c=1 | 1.09 req/s | 907 ms |
| c=8 | 3.17 req/s | 2,138 ms |
| c=8 (prefix cache) | 11.40 req/s | 534 ms |
## Output Format
The model outputs bounding boxes in the Qwen2-VL format:
```
<|object_ref_start|>nutrition_table<|object_ref_end|><|box_start|>(x1,y1),(x2,y2)<|box_end|>
```
Where coordinates are normalized to [0, 1000).
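A small parsing helper (hypothetical, not shipped with the model) that extracts the box and rescales it to pixel coordinates:

```python
import re

BOX_RE = re.compile(
    r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>"
)

def parse_box(response, width, height):
    """Return (x1, y1, x2, y2) in pixels, or None if no box was emitted."""
    m = BOX_RE.search(response)
    if m is None:
        return None
    x1, y1, x2, y2 = map(int, m.groups())
    # Model coordinates are normalized to [0, 1000); rescale to pixels.
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)
```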
## Related Models
- GPTQ INT4 Version: DanJZY/qwen2vl-nutrition-table-detector-GPTQ-INT4 - 2.4x smaller with negligible accuracy loss
## Limitations
- Single Table Detection: Optimized for detecting one nutrition table per image
- Domain Specific: Best performance on food packaging images similar to the training data
- Image Resolution: Works best with images where nutrition tables are clearly visible
- Language: Primarily trained on English nutrition labels
## Citation
If you use this model, please cite:
```bibtex
@misc{qwen2vl-nutrition-detector,
  author    = {DanJZY},
  title     = {Qwen2-VL Nutrition Table Detector},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DanJZY/qwen2vl-nutrition-table-detector}
}
```
## Acknowledgments
- Qwen Team for the base Qwen2-VL model
- Open Food Facts for the nutrition table detection dataset
- Hugging Face for the transformers and TRL libraries