---
library_name: transformers
datasets:
- Codatta/MM-Food-100K
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-to-text
---
# NutriScan-3B (MedGemma Edition)
**NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it acts as the intelligent "Vision Layer" for AI health pipelines.
It is fine-tuned on **Qwen2.5-VL-3B-Instruct**, bridging the gap between raw culinary images and medical-grade nutritional analysis.
---
### Key Features
* **Food Recognition:** Identifies specific dishes (e.g., "Cheeseburger") rather than generic labels.
* **Ingredient Breakdown:** Detects visible components (e.g., "lentils, cream, cilantro garnish").
* **Structured Output:** Generates clean, parsable **JSON** containing calories, macronutrients, and portion estimates.
* **Efficient:** Optimized for consumer hardware (runs on a T4 or RTX 3050) using 4-bit quantization.
---
### Dataset & Transparency
This model was fine-tuned on the **Codatta/MM-Food-100K** dataset. To ensure high data quality and download reliability during the hackathon, we curated a specific subset:
* **Total Training Images:** **9,281** high-quality samples.
* **Filename Note:** Image filenames (e.g., `food_099996.jpg`) preserve their **original index** from the source dataset.
* *Clarification:* You may see filenames with high numbers (like 99k) despite the dataset size being ~9.2k. This is normal and represents the original Global ID of the image, not a missing file error.
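If you want the original global ID programmatically, it can be recovered from the filename. A minimal sketch, assuming filenames follow the `food_######.jpg` pattern shown above:

```python
import re

def original_index(filename: str) -> int:
    """Recover the source-dataset global ID from a filename like 'food_099996.jpg'."""
    match = re.match(r"food_(\d+)\.jpg$", filename)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename}")
    return int(match.group(1))

print(original_index("food_099996.jpg"))  # 99996
```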
---
### Quick Start
Qwen2.5-VL support requires a recent `transformers` build, so install it from source along with the supporting packages:
```bash
pip install git+https://github.com/huggingface/transformers
pip install peft accelerate bitsandbytes qwen-vl-utils
```
#### Inference Code
```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# 1. Load the base model and the NutriScan LoRA adapter
base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_model = "HackerAditya56/NutriScan-3B"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(
    base_model, min_pixels=256 * 28 * 28, max_pixels=1024 * 28 * 28
)

# 2. Run analysis
def scan_food(image_path):
    image = Image.open(image_path).convert("RGB")
    # A specific prompt nudges the model toward JSON output
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "You are a nutritionist. Identify this dish, list ingredients, and estimate nutrition in JSON format."},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=512)
    # Trim the prompt tokens so only the model's answer is decoded
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# Test
print(scan_food("my_lunch.jpg"))
```
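To reproduce the 4-bit setting mentioned above on a small GPU (T4 / RTX 3050), the base model can be loaded with an NF4 quantization config before attaching the adapter. A sketch using the standard `BitsAndBytesConfig` API; actual memory savings depend on your hardware:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration
from peft import PeftModel

# NF4 quantization config matching the training precision listed in Technical Specs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "HackerAditya56/NutriScan-3B")
```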
---
### Example Output
**Input:** Image of a pepperoni pizza.
**Model Output:**
```json
{
"dish_name": "Pepperoni Pizza",
"ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
"nutritional_profile": {
"calories_per_slice": 280,
"protein": "12g",
"fat": "10g",
"carbs": "35g"
},
"health_note": "Contains processed meat and high sodium."
}
```
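In practice, decoded model text can wrap the JSON in extra prose or a markdown fence, so downstream code should extract the object defensively. A minimal sketch (not part of the model; the sample reply below is illustrative):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull the first {...} object out of a model reply, tolerating surrounding prose."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))

reply = 'Sure! Here is the analysis:\n{"dish_name": "Pepperoni Pizza", "nutritional_profile": {"calories_per_slice": 280}}'
parsed = extract_json(reply)
print(parsed["dish_name"])  # Pepperoni Pizza
```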
---
### Technical Specs
* **Base Architecture:** Qwen2.5-VL (Vision-Language)
* **Fine-Tuning Method:** QLoRA (Rank 16, Alpha 16)
* **Precision:** 4-bit NF4 (Normal Float 4)
* **Training Hardware:** NVIDIA T4 GPUs (Kaggle)
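For a sense of scale: a rank-`r` LoRA adapter on a linear layer of shape `(d_out, d_in)` adds `r * (d_in + d_out)` trainable parameters, and alpha = rank gives a scaling factor of 1.0. A back-of-the-envelope sketch (the 2048-dim layer size is illustrative, not a measured value from this model):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)

rank, alpha = 16, 16
scaling = alpha / rank  # 1.0: adapter updates are applied at full strength

print(lora_params(2048, 2048, rank))  # 65536 per 2048x2048 projection
print(scaling)                        # 1.0
```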
---
### Disclaimer
**Not Medical Advice.** This AI estimates nutrition from visual features alone; it cannot reliably detect hidden ingredients (sugar, salt, oils) or allergens. Use for educational and tracking purposes only.
---
### Author
**Aditya Nandan** (HackerAditya56)
*Developed for the MedGemma Hackathon 2026*