Qwen2-VL Nutrition Table Detector

A fine-tuned Qwen2-VL-7B-Instruct model for detecting nutrition table bounding boxes in food packaging images.

This model is the best-performing of four systematically compared fine-tuning strategies, achieving a +780% improvement in Mean IoU over the base model.

Model Description

  • Model Type: Vision-Language Model (VLM) for object detection
  • Base Model: Qwen/Qwen2-VL-7B-Instruct
  • Task: Bounding box detection of nutrition tables
  • Precision: bfloat16
  • Model Size: ~15.5 GB

Performance

Evaluated on 50 test samples from the OpenFoodFacts nutrition-table-detection dataset:

| Metric | Value |
|---|---|
| Mean IoU | 0.8636 |
| Detection Rate | 100% |
| IoU > 0.5 | 92% |
| IoU > 0.7 | 88% |
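
For reference, Mean IoU is the standard intersection-over-union between predicted and ground-truth boxes. A minimal sketch of the metric, assuming boxes are (x1, y1, x2, y2) tuples (the helper name compute_iou is ours, not from the evaluation code):

def compute_iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0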

Recipe Comparison

We systematically evaluated 4 different fine-tuning strategies to identify the optimal approach:

| Recipe | Strategy | Mean IoU | Detection Rate | IoU>0.5 | IoU>0.7 |
|---|---|---|---|---|---|
| Base (no fine-tuning) | - | 0.0981 | 18% | 10% | 6% |
| r1-llm-only | LLM LoRA only | 0.8349 | 100% | 86% | 82% |
| r2-vision-only | Vision encoder full fine-tune | 0.8330 | 100% | 88% | 82% |
| r3-two-stage | Vision first, then LLM LoRA | 0.8366 | 100% | 90% | 80% |
| r4-joint (this model) | Joint LoRA (vision + LLM) | 0.8636 | 100% | 92% | 88% |

Key Finding: Joint LoRA training on both the vision encoder and the LLM achieves the best results, improving Mean IoU from 0.0981 to 0.8636 (+780%) over the base model.

Training Details

Method

  • Fine-tuning Approach: 4-bit NF4 QLoRA (Quantized Low-Rank Adaptation)
  • Strategy: Joint LoRA on both vision encoder and LLM

LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.1 |
| Target Modules (LLM) | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Target Modules (Vision) | qkv, fc1, fc2, attn.proj |
| Trainable Parameters | 161M (1.9% of 8.3B total) |
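
A hedged sketch of how this configuration could be expressed with bitsandbytes and PEFT (this is not the exact training script; PEFT matches the target module names below as suffixes of the model's module paths):

import torch
from transformers import Qwen2VLForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="balanced",
)
model = prepare_model_for_kbit_training(model)

# Joint LoRA: adapters on both the LLM and the vision-encoder projections
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # LLM attention
        "gate_proj", "up_proj", "down_proj",     # LLM MLP
        "qkv", "fc1", "fc2", "attn.proj",        # vision encoder
    ],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # ~161M trainable (~1.9% of total)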

Training Arguments

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 1 |
| Gradient Accumulation Steps | 8 |
| Learning Rate | 2e-5 |
| Scheduler | Cosine with warmup |
| Warmup Steps | 100 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Max Sequence Length | 2048 |
| Precision | bfloat16 |
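
These hyperparameters map directly onto transformers TrainingArguments; a minimal sketch (the output directory is a hypothetical placeholder, and the 2048-token limit is enforced at tokenization rather than here):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen2vl-nutrition-lora",  # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size of 8 per device
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    optim="adamw_torch",
    weight_decay=0.01,
    bf16=True,
)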

Dataset

Fine-tuned on the OpenFoodFacts nutrition-table-detection dataset; the 50 evaluation samples above are held out from the same dataset.

Hardware

  • Trained on 2x NVIDIA RTX 6000 Ada (48GB VRAM each)
  • Model parallelism via device_map="balanced"
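
With device_map="balanced", accelerate splits the model's layers roughly evenly across the visible GPUs instead of filling the first GPU before spilling onto the next; a minimal loading sketch:

import torch
from transformers import Qwen2VLForConditionalGeneration

# Spread layers evenly across both 48 GB GPUs
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)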

Usage

Python Inference

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "DanJZY/qwen2vl-nutrition-table-detector",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "DanJZY/qwen2vl-nutrition-table-detector",
    trust_remote_code=True
)

# Prepare input
image = Image.open("food_packaging.jpg")
messages = [
    {
        "role": "system",
        "content": "You are an expert at detecting nutrition tables in images. When asked to detect a nutrition table, output the bounding box coordinates."
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Detect the bounding box of the nutrition table in this image."}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens; keep special tokens so the
# <|box_start|>...<|box_end|> coordinates remain visible
generated = outputs[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(generated, skip_special_tokens=False)[0]
print(response)

vLLM Serving

For production deployment, we recommend using vLLM for efficient serving:

# Start vLLM server
vllm serve DanJZY/qwen2vl-nutrition-table-detector \
    --served-model-name qwen2vl-nutrition \
    --dtype bfloat16 \
    --trust-remote-code \
    --max-model-len 4096 \
    --limit-mm-per-prompt '{"image":1}' \
    --gpu-memory-utilization 0.9 \
    --port 8000
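
The server exposes an OpenAI-compatible API, so it can be queried with the standard openai client. A sketch, assuming a local image sent as a base64 data URL and the qwen2vl-nutrition name from the command above:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# vLLM accepts images as base64 data URLs in the image_url field
with open("food_packaging.jpg", "rb") as f:
    image_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen2vl-nutrition",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": "Detect the bounding box of the nutrition table in this image."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)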

vLLM Performance (RTX 3090, 24GB):

| Concurrency | Throughput | Avg E2E Latency |
|---|---|---|
| c=1 | 1.09 req/s | 907 ms |
| c=8 | 3.17 req/s | 2,138 ms |
| c=8 (prefix cache) | 11.40 req/s | 534 ms |

Output Format

The model outputs bounding boxes in the Qwen2-VL format:

<|object_ref_start|>nutrition_table<|object_ref_end|><|box_start|>(x1,y1),(x2,y2)<|box_end|>

Where coordinates are normalized to [0, 1000).
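
A sketch of parsing this output back into pixel coordinates (the regex and the parse_box helper are ours; divide each coordinate by 1000, then scale by the image dimensions):

import re

def parse_box(response: str, img_width: int, img_height: int):
    """Extract (x1, y1, x2, y2) pixel coordinates from the model output."""
    m = re.search(r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>", response)
    if m is None:
        return None  # no box token pair found
    x1, y1, x2, y2 = map(int, m.groups())
    # Coordinates are normalized to [0, 1000); rescale to pixels
    return (x1 / 1000 * img_width, y1 / 1000 * img_height,
            x2 / 1000 * img_width, y2 / 1000 * img_height)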

Related Models

The sibling recipes from the comparison above (r1-llm-only, r2-vision-only, and r3-two-stage) were trained and evaluated under the same protocol.

Limitations

  • Single Table Detection: Optimized for detecting one nutrition table per image
  • Domain Specific: Best performance on food packaging images similar to the training data
  • Image Resolution: Works best with images where nutrition tables are clearly visible
  • Language: Primarily trained on English nutrition labels

Citation

If you use this model, please cite:

@misc{qwen2vl-nutrition-detector,
  author = {DanJZY},
  title = {Qwen2-VL Nutrition Table Detector},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/DanJZY/qwen2vl-nutrition-table-detector}
}

Acknowledgments
