---
library_name: transformers
datasets:
- Codatta/MM-Food-100K
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-to-text
---
# NutriScan-3B (MedGemma Edition)
**NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it acts as the intelligent "Vision Layer" for AI health pipelines.
It is fine-tuned on **Qwen2.5-VL-3B-Instruct**, bridging the gap between raw culinary images and medical-grade nutritional analysis.
---
### Key Features
* **Food Recognition:** Identifies specific dishes (e.g., "Cheeseburger") rather than generic labels.
* **Ingredient Breakdown:** Detects visible components (e.g., "lentils, cream, cilantro garnish").
* **Structured Output:** Generates clean, parsable **JSON** containing calories, macronutrients, and portion estimates.
* **Efficient:** Optimized for consumer hardware (runs on a T4 or RTX 3050) using 4-bit quantization.
---
### Dataset & Transparency
This model was fine-tuned on the **Codatta/MM-Food-100K** dataset. To ensure high data quality and download reliability during the hackathon, we curated a specific subset:
* **Total Training Images:** **9,281** high-quality samples.
* **Filename Note:** Image filenames (e.g., `food_099996.jpg`) preserve their **original index** from the source dataset.
* *Clarification:* You may see filenames with high numbers (like 99k) despite the dataset size being ~9.2k. This is normal and represents the original Global ID of the image, not a missing file error.
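If you want the original global ID programmatically, it can be recovered from the filename. A minimal sketch, assuming filenames follow the `food_######.jpg` pattern shown above:

```python
import re

def original_index(filename: str) -> int:
    """Recover the source-dataset global ID from a filename like 'food_099996.jpg'."""
    match = re.match(r"food_(\d+)\.jpg$", filename)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename}")
    return int(match.group(1))

print(original_index("food_099996.jpg"))  # 99996
```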
---
### Quick Start
Qwen2.5-VL support requires a recent `transformers` build, so install it from source along with the supporting packages:
```bash
pip install git+https://github.com/huggingface/transformers
pip install peft accelerate bitsandbytes qwen-vl-utils
```
#### Inference Code
```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# 1. Load the base model and the NutriScan LoRA adapter
base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "HackerAditya56/NutriScan-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(
    base_model, min_pixels=256 * 28 * 28, max_pixels=1024 * 28 * 28
)

# 2. Run analysis
def scan_food(image_path):
    image = Image.open(image_path).convert("RGB")
    # A specific prompt nudges the model toward JSON output
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "You are a nutritionist. Identify this dish, list ingredients, and estimate nutrition in JSON format."},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Trim the prompt tokens so only the model's answer is decoded
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Test
print(scan_food("my_lunch.jpg"))
```
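To reproduce the 4-bit setting mentioned above on a small GPU (T4 / RTX 3050), the base model can be loaded with an NF4 quantization config before attaching the adapter. A sketch using the standard `BitsAndBytesConfig` API; actual memory savings depend on your hardware:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

# NF4 quantization config matching the training precision listed in Technical Specs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "HackerAditya56/NutriScan-3B")
```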
---
### Example Output
**Input:** Image of a pepperoni pizza.
**Model Output:**
```json
{
"dish_name": "Pepperoni Pizza",
"ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
"nutritional_profile": {
"calories_per_slice": 280,
"protein": "12g",
"fat": "10g",
"carbs": "35g"
},
"health_note": "Contains processed meat and high sodium."
}
```
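In practice, decoded model text can wrap the JSON in extra prose or a markdown fence, so downstream code should extract the object defensively. A minimal sketch (not part of the model; the sample reply below is illustrative):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first {...} object out of a model reply, tolerating surrounding prose."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))

reply = 'Sure! Here is the analysis:\n{"dish_name": "Pepperoni Pizza", "nutritional_profile": {"calories_per_slice": 280}}'
parsed = extract_json(reply)
print(parsed["dish_name"])  # Pepperoni Pizza
```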
---
### Technical Specs
* **Base Architecture:** Qwen2.5-VL (Vision-Language)
* **Fine-Tuning Method:** QLoRA (Rank 16, Alpha 16)
* **Precision:** 4-bit NF4 (Normal Float 4)
* **Training Hardware:** NVIDIA T4 GPUs (Kaggle)
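For a sense of scale: a rank-`r` LoRA adapter on a linear layer of shape `(d_out, d_in)` adds `r * (d_in + d_out)` trainable parameters, and alpha = rank gives a scaling factor of 1.0. A back-of-the-envelope sketch (the 2048-dim layer size is illustrative, not a measured value from this model):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)

rank, alpha = 16, 16
scaling = alpha / rank  # 1.0: adapter updates are applied at full strength

print(lora_params(2048, 2048, rank))  # 65536 per 2048x2048 projection
print(scaling)                        # 1.0
```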
---
### Disclaimer
**Not Medical Advice.** This AI estimates nutrition from visual features alone; it cannot reliably detect hidden ingredients (sugar, salt, oils) or allergens. Use for educational and tracking purposes only.
---
### Author
**Aditya Nandan** (HackerAditya56)
*Developed for the MedGemma Hackathon 2026*