---
library_name: transformers
datasets:
- Codatta/MM-Food-100K
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-to-text
---

# 🔥 NutriScan-3B (MedGemma Edition)

**NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it acts as the intelligent "Vision Layer" for AI health pipelines.

It is fine-tuned from **Qwen2.5-VL-3B-Instruct**, bridging the gap between raw culinary images and medical-grade nutritional analysis.

---

### 🌟 Key Features

* **Food Recognition:** Identifies specific dishes (e.g., "Cheeseburger") rather than generic labels.
* **Ingredient Breakdown:** Detects visible components (e.g., "lentils, cream, cilantro garnish").
* **Structured Output:** Generates clean, parsable **JSON** containing calories, macronutrients, and portion estimates.
* **Efficient:** Optimized for consumer hardware (runs on a T4 or RTX 3050) using 4-bit quantization.
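
Because the model is prompted (not hard-constrained) to emit JSON, the payload can arrive wrapped in prose or a code fence. Below is a minimal sketch of a tolerant parser for downstream use; the helper name `extract_json` is ours, not part of any NutriScan API:

```python
import json
import re

def extract_json(generated_text: str) -> dict:
    """Pull the first JSON object out of free-form model output.

    Handles raw JSON, JSON inside ```json fences, and JSON embedded in prose.
    """
    # Prefer a fenced ```json ... ``` block if one exists
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", generated_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        # Fall back to the first {...} span in the text
        brace = re.search(r"\{.*\}", generated_text, re.DOTALL)
        candidate = brace.group(0) if brace else None
    if candidate is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate)

# Example: model output wrapped in a fence plus chatter
raw = 'Sure! Here is the analysis:\n```json\n{"dish_name": "Cheeseburger", "calories": 550}\n```'
print(extract_json(raw)["dish_name"])  # Cheeseburger
```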

---

### 📊 Dataset & Transparency

This model was fine-tuned on the **Codatta/MM-Food-100K** dataset. To ensure high data quality and download reliability during the hackathon, we curated a specific subset:

* **Total Training Images:** **9,281** high-quality samples.
* **Filename Note:** Image filenames (e.g., `food_099996.jpg`) preserve their **original index** from the source dataset.
  * *Clarification:* You may see filenames with high numbers (around 99k) even though the subset holds ~9.2k images. This is expected: the number is the image's original global ID in the source dataset, not evidence of missing files.
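
If you need the original global ID programmatically, it can be recovered from the filename with a one-liner. A sketch, assuming only the `food_NNNNNN.jpg` pattern described above:

```python
import re

def original_id(filename: str) -> int:
    """Extract the source-dataset global ID from a training filename."""
    match = re.fullmatch(r"food_(\d+)\.jpg", filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    return int(match.group(1))

print(original_id("food_099996.jpg"))  # 99996
```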

---

### 🚀 Quick Start

Install a recent build of `transformers` (required for Qwen2.5-VL support) along with the helper libraries:

```bash
pip install git+https://github.com/huggingface/transformers
pip install peft accelerate bitsandbytes qwen-vl-utils
```
#### **Inference Code**

```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# 1. Load the base model and the NutriScan LoRA adapter
base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "HackerAditya56/NutriScan-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(
    base_model, min_pixels=256 * 28 * 28, max_pixels=1024 * 28 * 28
)

# 2. Run analysis
def scan_food(image_path):
    image = Image.open(image_path).convert("RGB")

    # A targeted prompt nudges the model toward JSON output
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "You are a nutritionist. Identify this dish, list ingredients, and estimate nutrition in JSON format."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Trim the prompt tokens so only the model's answer is decoded
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Test
print(scan_food("my_lunch.jpg"))
```

---
### 🍕 Example Output

**Input:** Image of a pepperoni pizza.

**Model Output:**

```json
{
  "dish_name": "Pepperoni Pizza",
  "ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
  "nutritional_profile": {
    "calories_per_slice": 280,
    "protein": "12g",
    "fat": "10g",
    "carbs": "35g"
  },
  "health_note": "Contains processed meat and high sodium."
}
```
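
A quick sanity check a downstream pipeline might run on such output: the Atwater factors (4 kcal/g for protein and carbohydrate, 9 kcal/g for fat) should roughly reproduce the reported calories. A sketch, assuming the exact `nutritional_profile` schema shown above:

```python
def estimated_calories(profile: dict) -> float:
    """Recompute calories from macros using Atwater factors (4/4/9 kcal per g)."""
    grams = lambda v: float(v.rstrip("g"))  # "12g" -> 12.0
    return (
        4 * grams(profile["protein"])
        + 4 * grams(profile["carbs"])
        + 9 * grams(profile["fat"])
    )

profile = {"calories_per_slice": 280, "protein": "12g", "fat": "10g", "carbs": "35g"}
estimate = estimated_calories(profile)
print(estimate)  # 278.0, within ~1% of the reported 280
assert abs(estimate - profile["calories_per_slice"]) / profile["calories_per_slice"] < 0.05
```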

---

### 🔧 Technical Specs

* **Base Architecture:** Qwen2.5-VL (Vision-Language)
* **Fine-Tuning Method:** QLoRA (rank 16, alpha 16)
* **Precision:** 4-bit NF4 (NormalFloat4)
* **Training Hardware:** NVIDIA T4 GPUs (Kaggle)
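
For context on the QLoRA setup: a rank-*r* adapter on a `d_out x d_in` linear layer adds `r * (d_in + d_out)` trainable parameters (two low-rank factors), scaled at inference by `alpha / r`, which is 1.0 here since alpha = rank = 16. A small illustration with a hypothetical 2048x2048 projection (the dimensions are ours, not the model's actual layer shapes):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA pair (A: rank x d_in, B: d_out x rank)."""
    return rank * d_in + d_out * rank

full = 2048 * 2048                       # dense weight: ~4.2M frozen parameters
added = lora_param_count(2048, 2048, rank=16)
print(added, f"{added / full:.2%}")      # 65536 1.56%
```

This is why QLoRA training fits on a T4: only ~1-2% of each adapted layer is trainable, while the frozen base weights sit in 4-bit NF4.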

---

### ⚠️ Disclaimer

**Not medical advice.** This model estimates nutrition from visual features alone; it cannot reliably detect hidden ingredients (sugar, salt, oils) or allergens. Use it for educational and tracking purposes only.

---

### 👨‍💻 Author

**Aditya Nandan** (HackerAditya56)
*Developed for the MedGemma Hackathon 2026*