# NutriScan-3B (MedGemma Edition)
NutriScan-3B is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the MedGemma Impact Challenge, it acts as the intelligent "Vision Layer" for AI health pipelines.
It is fine-tuned from Qwen2.5-VL-3B-Instruct, bridging the gap between raw culinary images and medical-grade nutritional analysis.
## Key Features
- Food Recognition: Identifies specific dishes (e.g., "Cheeseburger") rather than generic labels.
- Ingredient Breakdown: Detects visible components (e.g., "lentils, cream, cilantro garnish").
- Structured Output: Generates clean, parsable JSON containing calories, macronutrients, and portion estimates.
- Efficient: Optimized for consumer hardware (runs on a T4 or RTX 3050) using 4-bit quantization.
## Dataset & Transparency
This model was fine-tuned on the Codatta/MM-Food-100K dataset. To ensure high data quality and download reliability during the hackathon, we curated a specific subset:
- Total Training Images: 9,281 high-quality samples.
- Filename Note: Image filenames (e.g., `food_099996.jpg`) preserve their original index from the source dataset.
- Clarification: You may see filenames with high numbers (like 99k) even though the dataset contains only ~9.2k images. This is expected: the number is the image's original global ID in the source dataset, not a missing-file error.
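If your pipeline needs to map a filename back to its source-dataset index, the extraction can be sketched as follows (a minimal sketch; `global_id` is a hypothetical helper, not part of the released code):

```python
import re

def global_id(filename: str) -> int:
    """Extract the original dataset index from a filename like 'food_099996.jpg'."""
    match = re.match(r"food_(\d+)\.jpg$", filename)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename}")
    return int(match.group(1))

print(global_id("food_099996.jpg"))  # → 99996
```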
## Quick Start
Install the latest `transformers` from source (required for Qwen2.5-VL support) along with the adapter and vision utilities:

```shell
pip install git+https://github.com/huggingface/transformers
pip install peft accelerate bitsandbytes qwen-vl-utils
```
### Inference Code
```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# 1. Load Model & Adapter
base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "HackerAditya56/NutriScan-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(
    base_model, min_pixels=256 * 28 * 28, max_pixels=1024 * 28 * 28
)

# 2. Run Analysis
def scan_food(image_path):
    image = Image.open(image_path).convert("RGB")
    # A specific prompt nudges the model toward JSON output
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "You are a nutritionist. Identify this dish, list ingredients, and estimate nutrition in JSON format."},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens so only the model's answer is decoded
    generated_ids = [
        out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Test
print(scan_food("my_lunch.jpg"))
```
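The Key Features and Technical Specs sections mention 4-bit NF4 quantization, while the snippet above loads the base model in fp16. On smaller GPUs, a 4-bit load can be sketched with `bitsandbytes` (a configuration sketch under the NF4 precision listed below, not a confirmed training setup):

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# 4-bit NF4 quantization config, matching the precision listed in Technical Specs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
```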
## Example Output

Input: Image of a pepperoni pizza.

Model Output:
```json
{
  "dish_name": "Pepperoni Pizza",
  "ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
  "nutritional_profile": {
    "calories_per_slice": 280,
    "protein": "12g",
    "fat": "10g",
    "carbs": "35g"
  },
  "health_note": "Contains processed meat and high sodium."
}
```
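Because the output is plain JSON, downstream pipeline code can parse it directly (a minimal sketch; the field names follow the example above, and real model output may occasionally need a fallback for malformed JSON):

```python
import json

# Example model output, copied from the sample above
raw_output = '''{
  "dish_name": "Pepperoni Pizza",
  "ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
  "nutritional_profile": {"calories_per_slice": 280, "protein": "12g", "fat": "10g", "carbs": "35g"},
  "health_note": "Contains processed meat and high sodium."
}'''

try:
    data = json.loads(raw_output)
except json.JSONDecodeError:
    data = None  # in practice: retry generation or extract the JSON substring

print(data["dish_name"])                                  # → Pepperoni Pizza
print(data["nutritional_profile"]["calories_per_slice"])  # → 280
```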
## Technical Specs
- Base Architecture: Qwen2.5-VL (Vision-Language)
- Fine-Tuning Method: QLoRA (Rank 16, Alpha 16)
- Precision: 4-bit NF4 (Normal Float 4)
- Training Hardware: NVIDIA T4 GPUs (Kaggle)
## Disclaimer
Not Medical Advice. This AI estimates nutrition based on visual features. It cannot detect hidden ingredients (sugar, salt, oils) or allergens with 100% accuracy. Use for educational and tracking purposes only.
## Author

Aditya Nandan (HackerAditya56). Developed for the MedGemma Hackathon 2026.