---
library_name: transformers
datasets:
- Codatta/MM-Food-100K
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-to-text
---
# πŸ₯— NutriScan-3B (MedGemma Edition)
**NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it acts as the intelligent "Vision Layer" for AI health pipelines.
It is fine-tuned on **Qwen2.5-VL-3B-Instruct**, bridging the gap between raw culinary images and medical-grade nutritional analysis.
---
### πŸš€ Key Features
* **Food Recognition:** Identifies specific dishes (e.g., "Cheeseburger") rather than generic labels.
* **Ingredient Breakdown:** Detects visible components (e.g., "lentils, cream, cilantro garnish").
* **Structured Output:** Generates clean, parsable **JSON** containing calories, macronutrients, and portion estimates.
* **Efficient:** Optimized for consumer hardware; runs on a T4 or RTX 3050 using 4-bit quantization.
---
### πŸ“Š Dataset & Transparency
This model was fine-tuned on the **Codatta/MM-Food-100K** dataset. To ensure high data quality and download reliability during the hackathon, we curated a specific subset:
* **Total Training Images:** **9,281** high-quality samples.
* **Filename Note:** Image filenames (e.g., `food_099996.jpg`) preserve their **original index** from the source dataset.
* *Clarification:* You may see filenames with high numbers (like 99k) despite the dataset size being ~9.2k. This is normal and represents the original Global ID of the image, not a missing file error.
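Since the filenames preserve the source dataset's global IDs, the original index can be recovered directly from the name. The helper below is a hypothetical sketch (it is not shipped with the dataset) assuming the `food_NNNNNN.jpg` pattern shown above:

```python
import re

def source_index(filename):
    """Extract the original MM-Food-100K global ID from a filename
    like 'food_099996.jpg'. Hypothetical helper, not part of the dataset."""
    match = re.match(r"food_(\d+)\.\w+$", filename)
    if match is None:
        raise ValueError(f"Unexpected filename: {filename}")
    return int(match.group(1))

print(source_index("food_099996.jpg"))  # 99996
```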
---
### 🐍 Quick Start
Qwen2.5-VL requires a recent `transformers` build; install it from source along with the supporting libraries.
```bash
pip install git+https://github.com/huggingface/transformers
pip install peft accelerate bitsandbytes qwen-vl-utils
```
#### **Inference Code**
```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# 1. Load base model and attach the NutriScan LoRA adapter
base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "HackerAditya56/NutriScan-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(
    base_model, min_pixels=256 * 28 * 28, max_pixels=1024 * 28 * 28
)

# 2. Run analysis
def scan_food(image_path):
    image = Image.open(image_path).convert("RGB")
    # A targeted prompt nudges the model toward JSON output
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "You are a nutritionist. Identify this dish, list ingredients, and estimate nutrition in JSON format."},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens so only the model's answer is returned
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Test
print(scan_food("my_lunch.jpg"))
```
---
### πŸ“ˆ Example Output
**Input:** Image of a pepperoni pizza.
**Model Output:**
```json
{
"dish_name": "Pepperoni Pizza",
"ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
"nutritional_profile": {
"calories_per_slice": 280,
"protein": "12g",
"fat": "10g",
"carbs": "35g"
},
"health_note": "Contains processed meat and high sodium."
}
```
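The output is decoded text, and models sometimes wrap JSON in markdown fences or surrounding prose. A small hypothetical helper (not part of this repo) can pull out the JSON object before parsing:

```python
import json
import re

def extract_json(raw_text):
    """Pull the first JSON object out of the model's decoded text,
    tolerating ```json fences or prose around it. Hypothetical helper."""
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))

reply = '```json\n{"dish_name": "Pepperoni Pizza", "nutritional_profile": {"calories_per_slice": 280}}\n```'
data = extract_json(reply)
print(data["dish_name"])  # Pepperoni Pizza
```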
---
### πŸ”§ Technical Specs
* **Base Architecture:** Qwen2.5-VL (Vision-Language)
* **Fine-Tuning Method:** QLoRA (Rank 16, Alpha 16)
* **Precision:** 4-bit NF4 (Normal Float 4)
* **Training Hardware:** NVIDIA T4 GPUs (Kaggle)
---
### ⚠️ Disclaimer
**Not Medical Advice.** This AI estimates nutrition based on visual features. It cannot detect hidden ingredients (sugar, salt, oils) or allergens with 100% accuracy. Use for educational and tracking purposes only.
---
### πŸ‘¨β€πŸ’» Author
**Aditya Nandan** (HackerAditya56)
*Developed for the MedGemma Hackathon 2026*