# Qwen2-VL Nutrition Table Detector
A fine-tuned Qwen2-VL-7B-Instruct model for detecting nutrition table bounding boxes in food packaging images.
This model represents the best-performing of 4 systematically compared fine-tuning strategies, improving Mean IoU by +780% over the base model.
## Model Description
- Model Type: Vision-Language Model (VLM) for object detection
- Base Model: Qwen/Qwen2-VL-7B-Instruct
- Task: Bounding box detection of nutrition tables
- Precision: bfloat16
- Model Size: ~15.5 GB
## Performance
Evaluated on 50 test samples from the OpenFoodFacts nutrition-table-detection dataset:
| Metric | Value |
|---|---|
| Mean IoU | 0.8636 |
| Detection Rate | 100% |
| IoU > 0.5 | 92% |
| IoU > 0.7 | 88% |
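Here, IoU is the standard intersection-over-union between predicted and ground-truth boxes. A minimal sketch of the computation for `(x1, y1, x2, y2)` boxes (illustrative, not necessarily the exact evaluation script):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes don't overlap)
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```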
## Recipe Comparison
We systematically evaluated 4 different fine-tuning strategies to identify the optimal approach:
| Recipe | Strategy | Mean IoU | Det Rate | IoU>0.5 | IoU>0.7 |
|---|---|---|---|---|---|
| Base (no fine-tuning) | - | 0.0981 | 18% | 10% | 6% |
| r1-llm-only | LLM LoRA only | 0.8349 | 100% | 86% | 82% |
| r2-vision-only | Vision encoder full fine-tune | 0.8330 | 100% | 88% | 82% |
| r3-two-stage | Vision first, then LLM LoRA | 0.8366 | 100% | 90% | 80% |
| r4-joint (this model) | Joint LoRA (vision + LLM) | 0.8636 | 100% | 92% | 88% |
Key finding: Joint LoRA training on both the vision encoder and the LLM achieves the best results, improving Mean IoU from 0.0981 to 0.8636 (+780% relative to the base model).
## Training Details
### Method
- Fine-tuning Approach: 4-bit NF4 QLoRA (Quantized Low-Rank Adaptation)
- Strategy: Joint LoRA on both vision encoder and LLM
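The exact quantization flags are not published with this card; a minimal sketch of a 4-bit NF4 setup via `bitsandbytes`, with double quantization as an assumption:

```python
import torch
from transformers import BitsAndBytesConfig

# Hedged reconstruction of the 4-bit NF4 QLoRA quantization config;
# double quantization is an assumption, not confirmed by the card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the training precision
    bnb_4bit_use_double_quant=True,         # assumption
)
```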
### LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.1 |
| Target Modules (LLM) | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Target Modules (Vision) | qkv, fc1, fc2, attn.proj |
| Trainable Parameters | 161M (1.9% of 8.3B total) |
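A minimal `peft` sketch of this configuration (suffix-based module matching; the actual training script may target the vision tower differently):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=[
        # LLM attention and MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        # Vision encoder attention and MLP
        "qkv", "fc1", "fc2", "attn.proj",
    ],
    task_type="CAUSAL_LM",
)
```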
### Training Arguments
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 1 |
| Gradient Accumulation Steps | 8 |
| Learning Rate | 2e-5 |
| Scheduler | Cosine with warmup |
| Warmup Steps | 100 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Max Sequence Length | 2048 |
| Precision | bfloat16 |
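As a rough reconstruction, these map onto `transformers.TrainingArguments` as follows (`output_dir` is a placeholder; the max sequence length of 2048 is enforced at tokenization time, not here):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2vl-nutrition-lora",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size of 8 per device
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    optim="adamw_torch",
    weight_decay=0.01,
    bf16=True,
)
```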
### Dataset
- Source: openfoodfacts/nutrition-table-detection
- Training Samples: ~1,083
- Task: Detect bounding boxes of nutrition tables in food packaging images
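A minimal loading sketch with `datasets` (the split name and column layout are assumptions about the Hub dataset):

```python
from datasets import load_dataset

# Assumes a "train" split; inspect ds.features for the actual
# image and bounding-box columns.
ds = load_dataset("openfoodfacts/nutrition-table-detection", split="train")
print(len(ds))      # ~1,083 training samples
print(ds.features)  # image + annotation schema
```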
### Hardware
- Trained on 2x NVIDIA RTX 6000 Ada (48GB VRAM each)
- Model parallelism via `device_map="balanced"`
## Usage
### Python Inference
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/qwen2vl-nutrition-table-detector",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "DanJZY/qwen2vl-nutrition-table-detector",
    trust_remote_code=True,
)

# Prepare input
image = Image.open("food_packaging.jpg")
messages = [
    {
        "role": "system",
        "content": "You are an expert at detecting nutrition tables in images. When asked to detect a nutrition table, output the bounding box coordinates.",
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Detect the bounding box of the nutrition table in this image."},
        ],
    },
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens; keep special tokens so the
# <|box_start|>/<|box_end|> markers survive for parsing.
generated = outputs[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)
```
### vLLM Serving
For production deployment, we recommend using vLLM for efficient serving:
```bash
# Start vLLM server
vllm serve DanJZY/qwen2vl-nutrition-table-detector \
  --served-model-name qwen2vl-nutrition \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-model-len 4096 \
  --limit-mm-per-prompt '{"image":1}' \
  --gpu-memory-utilization 0.9 \
  --port 8000
```
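Once the server is running, it can be queried through vLLM's OpenAI-compatible API. A hedged sketch (the endpoint and image path are placeholders; the model name matches `--served-model-name` above):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send the image as a base64 data URL
with open("food_packaging.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="qwen2vl-nutrition",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text",
             "text": "Detect the bounding box of the nutrition table in this image."},
        ],
    }],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```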
vLLM Performance (RTX 3090, 24GB):
| Concurrency | Throughput | Avg E2E Latency |
|---|---|---|
| c=1 | 1.09 req/s | 907 ms |
| c=8 | 3.17 req/s | 2,138 ms |
| c=8 (prefix cache) | 11.40 req/s | 534 ms |
## Output Format
The model outputs bounding boxes in the Qwen2-VL format:
```
<|object_ref_start|>nutrition_table<|object_ref_end|><|box_start|>(x1,y1),(x2,y2)<|box_end|>
```
Where coordinates are normalized to [0, 1000).
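A small parsing helper (hypothetical, not shipped with the model) that extracts the box and rescales it to pixel coordinates:

```python
import re

BOX_RE = re.compile(
    r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>"
)

def parse_box(response, width, height):
    """Return (x1, y1, x2, y2) in pixels, or None if no box was emitted."""
    m = BOX_RE.search(response)
    if m is None:
        return None
    x1, y1, x2, y2 = map(int, m.groups())
    # Model coordinates are normalized to [0, 1000); rescale to pixels.
    return (x1 * width / 1000, y1 * height / 1000,
            x2 * width / 1000, y2 * height / 1000)
```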
## Related Models
- GPTQ INT4 Version: DanJZY/qwen2vl-nutrition-table-detector-GPTQ-INT4 - 2.4x smaller with negligible accuracy loss
## Limitations
- Single Table Detection: Optimized for detecting one nutrition table per image
- Domain Specific: Best performance on food packaging images similar to the training data
- Image Resolution: Works best with images where nutrition tables are clearly visible
- Language: Primarily trained on English nutrition labels
## Citation
If you use this model, please cite:
```bibtex
@misc{qwen2vl-nutrition-detector,
  author    = {DanJZY},
  title     = {Qwen2-VL Nutrition Table Detector},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/DanJZY/qwen2vl-nutrition-table-detector}
}
```
## Acknowledgments
- Qwen Team for the base Qwen2-VL model
- Open Food Facts for the nutrition table detection dataset
- Hugging Face for the transformers and TRL libraries