Update README.md

README.md CHANGED

@@ -7,193 +7,136 @@ tags: []
(Removed: the previous auto-generated 🤗 transformers model card template, consisting only of empty section stubs and "[More Information Needed]" placeholders.)
<!-- Provide a quick summary of what the model is/does. -->

---

# NutriScan-3B (MedGemma Edition)

**NutriScan-3B** is a specialized Vision-Language Model (VLM) designed to analyze food images and output structured nutritional data. Built for the **MedGemma Impact Challenge**, it serves as the "Vision Layer" in an AI health agent pipeline.

It is fine-tuned from **Qwen2.5-VL-3B-Instruct** on the **Codatta/MM-Food-100K** dataset and acts as a bridge between raw food images and medical analysis.

---

### Key Features

* **Food Recognition:** Identifies complex dishes (e.g., "Fried Chicken", "Paneer Butter Masala").
* **Ingredient Breakdown:** Detects visible ingredients (e.g., "chicken, oil, breading").
* **Structured Output:** Generates clean **JSON** containing calories, protein, fat, and carbs.
* **Lightweight:** Runs on consumer GPUs (Colab T4, RTX 3050) using 4-bit quantization.

---

### Installation

To run NutriScan you need recent versions of the Hugging Face libraries; Qwen2.5-VL support is new enough that `transformers` should be installed from source.

```bash
pip install git+https://github.com/huggingface/transformers
pip install peft accelerate bitsandbytes qwen-vl-utils
```

---
### Quick Start (Python)

Here is the easiest way to run the model on your own images.

```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

# 1. Load the base model and attach the NutriScan adapter
base_model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
adapter_id = "HackerAditya56/NutriScan-3B"

print("Loading NutriScan...")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)
processor = AutoProcessor.from_pretrained(
    base_model_id, min_pixels=256 * 28 * 28, max_pixels=1024 * 28 * 28
)

# 2. Prepare the image
image_path = "your_food_image.jpg"  # Replace with your image
image = Image.open(image_path).convert("RGB")

# 3. Run inference
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Analyze this food image. Identify the dish, ingredients, and nutritional profile in JSON."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (skip the prompt echo)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(output_text)
```
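
If you only have a smaller consumer GPU (the Colab T4 or RTX 3050 class mentioned under Key Features), you can load the base model in 4-bit instead of fp16. The sketch below is a minimal, unofficial variant of the loading step using `BitsAndBytesConfig` from `transformers` (it relies on the `bitsandbytes` package installed above); the rest of the Quick Start code is unchanged.

```python
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2_5_VLForConditionalGeneration,
)
from peft import PeftModel

# 4-bit NF4 quantization keeps the 3B base model within a consumer-GPU
# memory budget while computing in float16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "HackerAditya56/NutriScan-3B")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
```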
---

### Example Output

When you feed the model an image of a **Cheeseburger**, it outputs structured JSON like this:

```json
{
  "dish_name": "Cheeseburger",
  "ingredients": ["beef patty", "cheese slice", "lettuce", "tomato", "sesame bun", "sauce"],
  "nutritional_profile": {
    "calories": 550,
    "protein_g": 30,
    "fat_g": 35,
    "carbohydrate_g": 45
  },
  "portion_estimate": "1 burger (approx 250g)"
}
```
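
Since the response is plain JSON, it can be handed directly to the next stage of a health-agent pipeline. Below is a small illustrative helper (the function name is ours, not part of the model) that parses the `output_text` produced by the Quick Start code and tolerates a Markdown code fence around the answer.

```python
import json
import re

def parse_nutriscan_output(output_text: str) -> dict:
    """Extract the first JSON object from NutriScan's raw text output."""
    # The model usually emits bare JSON, but may wrap it in a Markdown
    # code fence; grab the outermost {...} block and parse it.
    match = re.search(r"\{.*\}", output_text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    return json.loads(match.group(0))

# Example:
# meal = parse_nutriscan_output(output_text)
# print(meal["nutritional_profile"]["calories"])
```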
---

### Training Details

* **Base Model:** Qwen2.5-VL-3B-Instruct
* **Dataset:** Subset of Codatta/MM-Food-100K (~10,000 high-quality samples)
* **Hardware:** Trained on 2x NVIDIA T4 (Kaggle)
* **Technique:** QLoRA (4-bit quantization) with LoRA rank 16 / alpha 16; see the configuration sketch below.
* **Objective:** The model was trained to avoid conversational filler and focus strictly on visual recognition and JSON formatting.
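
For reference, the QLoRA adapter settings above correspond roughly to the following PEFT configuration. This is an illustrative sketch only; the dropout value and target modules are assumptions, not the published training script.

```python
from peft import LoraConfig

# Illustrative adapter config matching the card's stated rank/alpha.
# target_modules and dropout are assumptions for a Qwen2.5-VL QLoRA run.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```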
---

### Disclaimer

**Not medical advice:** This model provides nutritional estimates based on visual data. It cannot "see" hidden ingredients (such as added sugar or salt) or the exact cooking oils used. Treat its values as rough guidelines, not medical facts.

---

### Author

**Aditya Nandan** (HackerAditya56)
*Built for the MedGemma Hackathon 2026*