HackerAditya56 committed on
Commit 2975ce4 · verified · 1 Parent(s): 909aa01

Update README.md

Files changed (1)
  1. README.md +65 -68
README.md CHANGED
@@ -17,24 +17,36 @@ pipeline_tag: image-to-text
 
  # 🥗 NutriScan-3B (MedGemma Edition)
 
- **NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it serves as the "Vision Layer" in an AI health agent pipeline.
 
- It is fine-tuned on **Qwen2.5-VL-3B-Instruct** using the **Codatta/MM-Food-100K** dataset to act as a bridge between raw food images and medical analysis.
 
  ---
 
  ### 🚀 Key Features
 
- * **Food Recognition:** Identifies complex dishes (e.g., "Fried Chicken", "Paneer Butter Masala").
- * **Ingredient Breakdown:** Detects visible ingredients (e.g., "chicken, oil, breading").
- * **Structured Output:** Generates clean **JSON** containing calories, protein, fat, and carbs.
- * **Lightweight:** Runs on consumer GPUs (Colab T4, RTX 3050) using 4-bit quantization.
 
  ---
 
- ### 📦 Installation
 
- To run NutriScan, you need the latest versions of the Hugging Face libraries (as Qwen2.5-VL is very new).
 
  ```bash
  pip install git+https://github.com/huggingface/transformers
@@ -42,11 +54,7 @@ pip install peft accelerate bitsandbytes qwen-vl-utils
 
  ```
 
- ---
-
- ### 🐍 Quick Start (Python)
-
- Here is the easiest way to run the model on your own images.
 
  ```python
  import torch
@@ -55,94 +63,83 @@ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
  from peft import PeftModel
  from qwen_vl_utils import process_vision_info
 
- # 1. Load the Model
- base_model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
- adapter_id = "HackerAditya56/NutriScan-3B"
 
- print("Loading NutriScan...")
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
-     base_model_id,
-     torch_dtype=torch.float16,
-     device_map="auto",
  )
- # Load the Fine-Tuned Adapter
- model = PeftModel.from_pretrained(model, adapter_id)
- processor = AutoProcessor.from_pretrained(base_model_id, min_pixels=256*28*28, max_pixels=1024*28*28)
-
- # 2. Prepare Image
- image_path = "your_food_image.jpg"  # Replace with your image
- image = Image.open(image_path).convert("RGB")
-
- # 3. Run Inference
- messages = [
-     {
          "role": "user",
          "content": [
              {"type": "image", "image": image},
-             {"type": "text", "text": "Analyze this food image. Identify the dish, ingredients, and nutritional profile in JSON."}
          ]
-     }
- ]
-
- text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
- image_inputs, video_inputs = process_vision_info(messages)
- inputs = processor(
-     text=[text],
-     images=image_inputs,
-     videos=video_inputs,
-     padding=True,
-     return_tensors="pt",
- ).to("cuda")
-
- generated_ids = model.generate(**inputs, max_new_tokens=512)
- output_text = processor.batch_decode(
-     generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
- )[0]
-
- print(output_text)
 
  ```
 
  ---
 
- ### 📊 Example Output
 
- When you feed the model an image of a **Cheeseburger**, it outputs structured JSON like this:
 
  ```json
  {
-     "dish_name": "Cheeseburger",
-     "ingredients": ["beef patty", "cheese slice", "lettuce", "tomato", "sesame bun", "sauce"],
      "nutritional_profile": {
-         "calories": 550,
-         "protein_g": 30,
-         "fat_g": 35,
-         "carbohydrate_g": 45
      },
-     "portion_estimate": "1 burger (approx 250g)"
  }
 
  ```
 
  ---
 
- ### 🔧 Training Details
 
- * **Base Model:** Qwen2.5-VL-3B-Instruct
- * **Dataset:** Subset of Codatta/MM-Food-100K (~10,000 high-quality samples)
- * **Hardware:** Trained on NVIDIA T4 x 2 (Kaggle)
- * **Technique:** QLoRA (4-bit quantization) with Rank 16 / Alpha 16.
- * **Objective:** The model was trained to ignore chatty conversation and focus strictly on visual recognition and JSON formatting.
 
  ---
 
  ### ⚠️ Disclaimer
 
- **Not Medical Advice:** This model provides nutritional estimates based on visual data. It cannot "see" hidden ingredients (like sugar or salt content) or exact cooking oils. Please use these values as rough guidelines, not medical facts.
 
  ---
 
  ### 👨‍💻 Author
 
  **Aditya Nandan** (HackerAditya56)
- *Built for the MedGemma Hackathon 2026*
 
  # 🥗 NutriScan-3B (MedGemma Edition)
 
+ **NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it acts as the intelligent "Vision Layer" for AI health pipelines.
 
+ It is fine-tuned on **Qwen2.5-VL-3B-Instruct**, bridging the gap between raw culinary images and medical-grade nutritional analysis.
 
  ---
 
  ### 🚀 Key Features
 
+ * **Food Recognition:** Identifies specific dishes (e.g., "Cheeseburger") rather than generic labels.
+ * **Ingredient Breakdown:** Detects visible components (e.g., "lentils, cream, cilantro garnish").
+ * **Structured Output:** Generates clean, parsable **JSON** containing calories, macronutrients, and portion estimates.
+ * **Efficient:** Optimized for consumer hardware (runs on a T4 or RTX 3050) using 4-bit quantization.
 
  ---
 
+ ### 📊 Dataset & Transparency
 
+ This model was fine-tuned on the **Codatta/MM-Food-100K** dataset. To ensure high data quality and download reliability during the hackathon, we curated a specific subset:
+
+ * **Total Training Images:** **9,281** high-quality samples.
+ * **Filename Note:** Image filenames (e.g., `food_099996.jpg`) preserve their **original index** from the source dataset.
+   * *Clarification:* You may see filenames with high numbers (like 99k) even though the subset holds ~9.2k images. This is expected: the number is the image's original global ID in the source dataset, not a sign of missing files.
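Since filenames keep their source index, the original global ID can be recovered programmatically. A minimal sketch (the helper name `original_global_id` is illustrative, not part of this repo):

```python
import re

def original_global_id(filename):
    """Extract the source-dataset index embedded in a NutriScan filename."""
    match = re.search(r"food_(\d+)\.jpg$", filename)
    if match is None:
        raise ValueError(f"Unexpected filename format: {filename}")
    return int(match.group(1))

print(original_global_id("food_099996.jpg"))  # 99996
```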
+
+ ---
+
+ ### 🐍 Quick Start
+
+ You need the latest Transformers build (installed from GitHub) to support Qwen2.5-VL.
 
  ```bash
  pip install git+https://github.com/huggingface/transformers
 
  ```
 
+ #### **Inference Code**
 
  ```python
  import torch
 
  from peft import PeftModel
  from qwen_vl_utils import process_vision_info
 
+ # 1. Load Model & Adapter
+ base_model = "Qwen/Qwen2.5-VL-3B-Instruct"
+ adapter_model = "HackerAditya56/NutriScan-3B"
 
  model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     base_model, torch_dtype=torch.float16, device_map="auto"
  )
+ model = PeftModel.from_pretrained(model, adapter_model)
+ processor = AutoProcessor.from_pretrained(base_model, min_pixels=256*28*28, max_pixels=1024*28*28)
+
+ # 2. Run Analysis
+ def scan_food(image_path):
+     image = Image.open(image_path).convert("RGB")
+
+     # We use a specific prompt to force JSON output
+     messages = [{
          "role": "user",
          "content": [
              {"type": "image", "image": image},
+             {"type": "text", "text": "You are a nutritionist. Identify this dish, list ingredients, and estimate nutrition in JSON format."}
          ]
+     }]
+
+     text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+     image_inputs, video_inputs = process_vision_info(messages)
+     inputs = processor(
+         text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
+     ).to("cuda")
+
+     generated_ids = model.generate(**inputs, max_new_tokens=512)
+     # Trim the prompt tokens so only the newly generated answer is decoded
+     trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
+     return processor.batch_decode(trimmed, skip_special_tokens=True)[0]
+
+ # Example usage
+ print(scan_food("my_lunch.jpg"))
 
  ```
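Because decoded VLM output can wrap the JSON in extra prose, downstream code should parse defensively. A small sketch of one way to pull the JSON object out of the returned string (the `extract_json` helper is an assumption, not part of this repo):

```python
import json

def extract_json(model_output: str) -> dict:
    """Pull the first JSON object out of raw model text, tolerating surrounding prose."""
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in model output")
    return json.loads(model_output[start:end + 1])

sample = 'Here is the analysis: {"dish_name": "Pepperoni Pizza", "calories": 280}'
print(extract_json(sample)["dish_name"])  # Pepperoni Pizza
```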
 
  ---
 
+ ### 📈 Example Output
 
+ **Input:** Image of a pepperoni pizza.
+ **Model Output:**
 
  ```json
  {
+     "dish_name": "Pepperoni Pizza",
+     "ingredients": ["pizza dough", "tomato sauce", "mozzarella cheese", "pepperoni slices", "oregano"],
      "nutritional_profile": {
+         "calories_per_slice": 280,
+         "protein": "12g",
+         "fat": "10g",
+         "carbs": "35g"
      },
+     "health_note": "Contains processed meat and high sodium."
  }
 
  ```
 
  ---
 
+ ### 🔧 Technical Specs
 
+ * **Base Architecture:** Qwen2.5-VL (Vision-Language)
+ * **Fine-Tuning Method:** QLoRA (Rank 16, Alpha 16)
+ * **Precision:** 4-bit NF4 (Normal Float 4)
+ * **Training Hardware:** NVIDIA T4 GPUs (Kaggle)
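The specs above correspond to a standard `peft` + `transformers` QLoRA setup. A hedged sketch of what such a configuration could look like (the `target_modules` list and dropout value are assumptions; only rank 16, alpha 16, and NF4 come from this card):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization, matching the "Precision" spec above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapter at Rank 16 / Alpha 16; remaining values are assumptions
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,  # assumed, not stated in this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```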
 
 
  ---
 
  ### ⚠️ Disclaimer
 
+ **Not Medical Advice.** This AI estimates nutrition based on visual features. It cannot detect hidden ingredients (sugar, salt, oils) or allergens with 100% accuracy. Use it for educational and tracking purposes only.
 
  ---
 
  ### 👨‍💻 Author
 
  **Aditya Nandan** (HackerAditya56)
+ *Developed for the MedGemma Hackathon 2026*