---
library_name: transformers
datasets:
- Codatta/MM-Food-100K
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-to-text
---

# 🔥 NutriScan-3B (MedGemma Edition)

**NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it acts as the intelligent "Vision Layer" for AI health pipelines.

It is fine-tuned from **Qwen2.5-VL-3B-Instruct**, bridging the gap between raw culinary images and medical-grade nutritional analysis.

---

### 🌟 Key Features

* **Food Recognition:** Identifies specific dishes (e.g., "Cheeseburger") rather than generic labels.
* **Ingredient Breakdown:** Detects visible components (e.g., "lentils, cream, cilantro garnish").
* **Structured Output:** Generates clean, parsable **JSON** containing calories, macronutrients, and portion estimates.
* **Efficient:** Optimized for consumer hardware (runs on a T4 or RTX 3050) using 4-bit quantization.
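
Because the model is prompted (not hard-constrained) to emit JSON, the payload can arrive wrapped in prose or a code fence. Below is a minimal sketch of a tolerant parser for downstream use; the helper name `extract_json` is ours, not part of any NutriScan API:

```python
import json
import re

def extract_json(generated_text: str) -> dict:
    """Pull the first JSON object out of free-form model output.

    Handles raw JSON, JSON inside ```json fences, and JSON embedded in prose.
    """
    # Prefer a fenced ```json ... ``` block if one exists
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", generated_text, re.DOTALL)
    candidate = fenced.group(1) if fenced else None
    if candidate is None:
        # Fall back to the first {...} span in the text
        brace = re.search(r"\{.*\}", generated_text, re.DOTALL)
        candidate = brace.group(0) if brace else None
    if candidate is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate)

# Example: model output wrapped in a fence plus chatter
raw = 'Sure! Here is the analysis:\n```json\n{"dish_name": "Cheeseburger", "calories": 550}\n```'
print(extract_json(raw)["dish_name"])  # Cheeseburger
```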

---

### 📊 Dataset & Transparency

This model was fine-tuned on the **Codatta/MM-Food-100K** dataset. To ensure high data quality and download reliability during the hackathon, we curated a specific subset:

* **Total Training Images:** **9,281** high-quality samples.
* **Filename Note:** Image filenames (e.g., `food_099996.jpg`) preserve their **original index** from the source dataset.
  * *Clarification:* You may see filenames with high numbers (around 99k) even though the subset holds ~9.2k images. This is expected: the number is the image's original global ID in the source dataset, not evidence of missing files.
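
If you need the original global ID programmatically, it can be recovered from the filename with a one-liner. A sketch, assuming only the `food_NNNNNN.jpg` pattern described above:

```python
import re

def original_id(filename: str) -> int:
    """Extract the source-dataset global ID from a training filename."""
    match = re.fullmatch(r"food_(\d+)\.jpg", filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    return int(match.group(1))

print(original_id("food_099996.jpg"))  # 99996
```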

---

### 🚀 Quick Start

Install a recent build of `transformers` (required for Qwen2.5-VL support) along with the helper libraries:

```bash
pip install git+https://github.com/huggingface/transformers
pip install peft accelerate bitsandbytes qwen-vl-utils
```
#### **Inference Code**

```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# 1. Load the base model and the NutriScan LoRA adapter
base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "HackerAditya56/NutriScan-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(
    base_model, min_pixels=256 * 28 * 28, max_pixels=1024 * 28 * 28
)

# 2. Run analysis
def scan_food(image_path):
    image = Image.open(image_path).convert("RGB")

    # A targeted prompt nudges the model toward JSON output
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "You are a nutritionist. Identify this dish, list ingredients, and estimate nutrition in JSON format."},
        ],
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt"
    ).to(model.device)

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Trim the prompt tokens so only the model's answer is decoded
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Test
print(scan_food("my_lunch.jpg"))
```

---
### 🍕 Example Output

**Input:** Image of a pepperoni pizza.

**Model Output:**

```json
{
  "dish_name": "Pepperoni Pizza",
  "ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
  "nutritional_profile": {
    "calories_per_slice": 280,
    "protein": "12g",
    "fat": "10g",
    "carbs": "35g"
  },
  "health_note": "Contains processed meat and high sodium."
}
```
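
A quick sanity check a downstream pipeline might run on such output: the Atwater factors (4 kcal/g for protein and carbohydrate, 9 kcal/g for fat) should roughly reproduce the reported calories. A sketch, assuming the exact `nutritional_profile` schema shown above:

```python
def estimated_calories(profile: dict) -> float:
    """Recompute calories from macros using Atwater factors (4/4/9 kcal per g)."""
    grams = lambda v: float(v.rstrip("g"))  # "12g" -> 12.0
    return (
        4 * grams(profile["protein"])
        + 4 * grams(profile["carbs"])
        + 9 * grams(profile["fat"])
    )

profile = {"calories_per_slice": 280, "protein": "12g", "fat": "10g", "carbs": "35g"}
estimate = estimated_calories(profile)
print(estimate)  # 278.0, within ~1% of the reported 280
assert abs(estimate - profile["calories_per_slice"]) / profile["calories_per_slice"] < 0.05
```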

---

### 🔧 Technical Specs

* **Base Architecture:** Qwen2.5-VL (Vision-Language)
* **Fine-Tuning Method:** QLoRA (rank 16, alpha 16)
* **Precision:** 4-bit NF4 (NormalFloat4)
* **Training Hardware:** NVIDIA T4 GPUs (Kaggle)
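
For context on the QLoRA setup: a rank-*r* adapter on a `d_out x d_in` linear layer adds `r * (d_in + d_out)` trainable parameters (two low-rank factors), scaled at inference by `alpha / r`, which is 1.0 here since alpha = rank = 16. A small illustration with a hypothetical 2048x2048 projection (the dimensions are ours, not the model's actual layer shapes):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA pair (A: rank x d_in, B: d_out x rank)."""
    return rank * d_in + d_out * rank

full = 2048 * 2048                       # dense weight: ~4.2M frozen parameters
added = lora_param_count(2048, 2048, rank=16)
print(added, f"{added / full:.2%}")      # 65536 1.56%
```

This is why QLoRA training fits on a T4: only ~1-2% of each adapted layer is trainable, while the frozen base weights sit in 4-bit NF4.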

---

### ⚠️ Disclaimer

**Not medical advice.** This model estimates nutrition from visual features alone; it cannot reliably detect hidden ingredients (sugar, salt, oils) or allergens. Use it for educational and tracking purposes only.

---

### 👨‍💻 Author

**Aditya Nandan** (HackerAditya56)
*Developed for the MedGemma Hackathon 2026*