Update README.md
README.md
CHANGED

# 🥗 NutriScan-3B (MedGemma Edition)

**NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it acts as the intelligent "Vision Layer" for AI health pipelines.

It is fine-tuned on **Qwen2.5-VL-3B-Instruct**, bridging the gap between raw culinary images and medical-grade nutritional analysis.

---

### 🚀 Key Features

* **Food Recognition:** Identifies specific dishes (e.g., "Cheeseburger") rather than generic labels.
* **Ingredient Breakdown:** Detects visible components (e.g., "lentils, cream, cilantro garnish").
* **Structured Output:** Generates clean, parsable **JSON** containing calories, macronutrients, and portion estimates.
* **Efficient:** Optimized for consumer hardware (runs on a T4 or an RTX 3050) using 4-bit quantization.

---

### 📊 Dataset & Transparency

This model was fine-tuned on the **Codatta/MM-Food-100K** dataset. To ensure high data quality and reliable downloads during the hackathon, we curated a specific subset:

* **Total Training Images:** **9,281** high-quality samples.
* **Filename Note:** Image filenames (e.g., `food_099996.jpg`) preserve their **original index** from the source dataset (see the sketch after this list).
    * *Clarification:* You may see filenames with high numbers (around 99k) even though the subset contains only ~9.2k images. This is expected: the number is the image's original global ID in the source dataset, not a sign of missing files.

---

### 🚀 Quick Start

You must install the latest `transformers` build to get Qwen2.5-VL support:

```bash
pip install git+https://github.com/huggingface/transformers
pip install peft accelerate bitsandbytes qwen-vl-utils
```
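
A quick, optional sanity check that your build actually ships the Qwen2.5-VL classes (this snippet is our own addition, not part of the original card; the printed version string will vary):

```python
import transformers

print(transformers.__version__)  # expect a recent dev build here
# Raises ImportError on releases that predate Qwen2.5-VL support:
from transformers import Qwen2_5_VLForConditionalGeneration
```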

#### **Inference Code**

```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# 1. Load Model & Adapter
base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "HackerAditya56/NutriScan-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model, min_pixels=256*28*28, max_pixels=1024*28*28)

# 2. Run Analysis
def scan_food(image_path):
    image = Image.open(image_path).convert("RGB")

    # We use a specific prompt to force JSON output
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "You are a nutritionist. Identify this dish, list ingredients, and estimate nutrition in JSON format."}
        ]
    }]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
    ).to("cuda")

    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Trim the echoed prompt tokens so only the model's answer is decoded
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Test
print(scan_food("my_lunch.jpg"))
```

---

### 📄 Example Output

**Input:** Image of a pepperoni pizza.
**Model Output:**

```json
{
  "dish_name": "Pepperoni Pizza",
  "ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
  "nutritional_profile": {
    "calories_per_slice": 280,
    "protein": "12g",
    "fat": "10g",
    "carbs": "35g"
  },
  "health_note": "Contains processed meat and high sodium."
}
```
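
In practice the decoded text may carry stray tokens or markdown fences around the JSON, so parse defensively. A small sketch reusing `scan_food` from the Quick Start (the `parse_nutriscan` helper is hypothetical, not part of the released code):

```python
import json
import re

def parse_nutriscan(raw_output: str) -> dict:
    """Pull the first JSON object out of the model's decoded text."""
    match = re.search(r"\{.*\}", raw_output, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))

profile = parse_nutriscan(scan_food("my_lunch.jpg"))
print(profile["dish_name"], profile["nutritional_profile"]["carbs"])
```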

---

### 🔧 Technical Specs

* **Base Architecture:** Qwen2.5-VL (Vision-Language)
* **Fine-Tuning Method:** QLoRA (rank 16, alpha 16); a configuration sketch follows this list.
* **Precision:** 4-bit NF4 (NormalFloat4)
* **Training Hardware:** NVIDIA T4 GPUs (Kaggle)
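
The training script itself is not part of this card; the following is a minimal sketch of what the stated specs could look like in `peft`/`bitsandbytes` terms. The target modules and dropout are our assumptions, not values confirmed by the actual run:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization, matching the "Precision" spec above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapter with the stated rank 16 / alpha 16;
# target_modules and lora_dropout are assumptions for illustration
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```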

---

### ⚠️ Disclaimer

**Not Medical Advice.** This AI estimates nutrition from visual features alone; it cannot detect hidden ingredients (sugar, salt, oils) or allergens with complete accuracy. Use it for educational and tracking purposes only.

---

### 👨‍💻 Author

**Aditya Nandan** (HackerAditya56)
*Developed for the MedGemma Hackathon 2026*