---
license: mit
language:
- en
tags:
- vision
- food
- nutrition
- calorie-estimation
- clip
- image-classification
- health
datasets:
- nutrition5k
metrics:
- mae
pipeline_tag: image-to-text
library_name: open-clip
---

# 🍎 CalorieCLIP: Accurate Food Calorie Estimation

<p align="center">
  <img src="assets/model_comparison.png" width="700" alt="CalorieCLIP vs Other Models">
</p>

**CalorieCLIP** is a fine-tuned CLIP model that estimates calories from food images. On our test set it achieves lower error than every VLM we evaluated (including GPT-4o and Claude 3.5 Sonnet) while running entirely on-device.

## 🎯 Key Results

| Metric | Value |
|--------|-------|
| **Mean Absolute Error** | **51.4 calories** |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50ms on M1 Mac |

<p align="center">
  <img src="assets/accuracy_breakdown.png" width="500" alt="Accuracy Breakdown">
</p>

## 🍽️ Example Predictions

Real predictions from our validation set across multiple datasets:

| Image | Food | Dataset | Actual | Predicted | Error |
|-------|------|---------|--------|-----------|-------|
| ![Example 1](assets/examples/example_1.png) | Hamburger | Food-101 | 558 | 555 | 3 |
| ![Example 2](assets/examples/example_2.png) | Ramen | Food-101 | 431 | 437 | 6 |
| ![Example 3](assets/examples/example_3.png) | Greek Salad | Food-101 | 144 | 143 | 1 |
| ![Example 4](assets/examples/example_4.png) | Sashimi | Food-101 | 156 | 156 | 0 |
| ![Example 5](assets/examples/example_5.png) | Cafeteria Meal | Nutrition5k | 88 | 88 | 0 |
| ![Example 6](assets/examples/example_6.png) | Cafeteria Meal | Nutrition5k | 138 | 138 | 0 |
| ![Example 7](assets/examples/example_7.png) | Cafeteria Meal | Nutrition5k | 330 | 334 | 4 |
| ![Example 8](assets/examples/example_8.png) | Cafeteria Meal | Nutrition5k | 214 | 217 | 3 |

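The aggregate metrics above are straightforward to reproduce from such pairs. As a sketch, here are MAE and within-N-calorie accuracy computed over the eight example pairs from the table (inline data copied from the table, not the released validation set):

```python
# (actual, predicted) calorie pairs from the example table above
pairs = [(558, 555), (431, 437), (144, 143), (156, 156),
         (88, 88), (138, 138), (330, 334), (214, 217)]

errors = [abs(actual - pred) for actual, pred in pairs]
mae = sum(errors) / len(errors)                         # mean absolute error
within_50 = sum(e <= 50 for e in errors) / len(errors)  # fraction within 50 kcal

print(f"MAE: {mae:.1f} kcal; within 50 kcal: {within_50:.0%}")
```

On the full validation set the same computation yields the 51.4-kcal MAE reported in Key Results; the cherry-picked examples above naturally score much better.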
## 🚀 Quick Start

### Installation

```bash
pip install open-clip-torch torch pillow
```

### Python Usage

```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP

# Load model from local directory
model = CalorieCLIP.from_pretrained(".")

# Predict calories
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")

# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```

### Direct Usage (no wrapper)

```python
import torch
import open_clip
from PIL import Image

# Load CLIP
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
clip.load_state_dict(checkpoint['clip_state'], strict=False)

# Load regression head
import torch.nn as nn
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
        )
    def forward(self, x): return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])
clip.eval(); head.eval()

# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)
with torch.no_grad():
    features = clip.encode_image(img)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```

### Command Line

```bash
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```

## 📊 Training Progress

<p align="center">
  <img src="assets/training_progress.png" width="800" alt="Training Progress">
</p>

The model was trained for 30 epochs on the combined Nutrition5k + Food-101 dataset (see Training Data below) with:
- **Huber Loss** for robustness to outliers
- **Strong augmentation** (rotation, color jitter, flips)
- **Fine-tuning last 2 CLIP transformer blocks** (9.4% of parameters)
- **Differential learning rates** (1e-5 for CLIP, 1e-3 for regression head)
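
Huber loss is what provides the robustness to outliers: it is quadratic for small residuals (like MSE) and linear beyond a threshold `delta` (like MAE), so a single mislabeled calorie count cannot dominate the gradient. A minimal pure-Python sketch (`delta=1.0` is an illustrative default; the value actually used in training is not stated here):

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic inside +/-delta, linear outside."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r               # behaves like MSE near zero
    return delta * (r - 0.5 * delta)     # behaves like MAE for outliers

# Small residuals are penalized gently; large outliers grow only linearly.
print(huber(0.5), huber(3.0))
```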

## 🔬 Technical Details

### Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐
│   Food Image    │────▶│  CLIP ViT-B  │────▶│  Regression │────▶ Calories
│   (224×224)     │     │   Encoder    │     │    Head     │
└─────────────────┘     │ (fine-tuned) │     │  (4 layers) │
                        └──────────────┘     └─────────────┘
                              │
                              ▼
                        512-dim features
```

### Model Specs

- **Base Model**: OpenAI CLIP ViT-B/32
- **Fine-tuned Layers**: Last 2 transformer blocks + regression head
- **Trainable Parameters**: 9.4% (8.5M of 90M)
- **Input Size**: 224×224 RGB
- **Output**: Single float (calories)
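
The trainable-parameter percentage follows directly from the counts above:

```python
# 8.5M trainable parameters out of 90M total (counts from the specs above)
trainable, total = 8.5e6, 90e6
pct = 100 * trainable / total
print(f"{pct:.1f}% of parameters are trainable")
```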

### Comparison to VLMs

We tested multiple Vision-Language Models on the same test set:

<p align="center">
  <img src="assets/error_distribution.png" width="600" alt="Error Distribution">
</p>

| Model | MAE | Notes |
|-------|-----|-------|
| **CalorieCLIP (Ours)** | **51.4** | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |

**Key Finding**: All tested local VLMs (Qwen, Pixtral) suffered from mode collapse, outputting the same calorie value for all images. CalorieCLIP's regression approach avoids this entirely.
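
A collapsed model is easy to detect even without ground truth: its predictions have near-zero spread across a varied test set. A minimal check (the 5-calorie threshold is an arbitrary illustrative choice, and the prediction lists are hypothetical):

```python
from statistics import pstdev

def is_collapsed(predictions, min_spread=5.0):
    """Flag a model whose calorie predictions barely vary across inputs."""
    return pstdev(predictions) < min_spread

healthy = [555, 437, 143, 156, 88, 138, 334, 217]      # varied outputs
collapsed = [250, 250, 251, 250, 249, 250, 250, 250]   # near-constant outputs

print(is_collapsed(healthy), is_collapsed(collapsed))
```

Because CalorieCLIP regresses a scalar directly from image features rather than sampling tokens, its outputs cannot degenerate to a single repeated value this way.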

## πŸ“ Files

```
CalorieCLIP/
├── config.json           # Model configuration
├── calorie_clip.pt       # Model weights (PyTorch)
├── calorie_clip.py       # Inference code
├── requirements.txt      # Dependencies
└── assets/
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```

## 📋 Training Data

Trained on a combined dataset of:
- **[Nutrition5k](https://github.com/google-research-datasets/nutrition5k)**: 5,006 real cafeteria food images with professional calorie measurements
- **Food-101 subset**: 8,000+ food images with estimated calories
- **Total: 13,004 samples** (11,053 train / 1,951 validation)
- **Diverse foods**: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, greek salads, sashimi, and more

## ⚠️ Limitations

- Trained on cafeteria food; may be less accurate for restaurant/home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice

## πŸ™ Citation

```bibtex
@software{calorieclip2024,
  author = {Haplo LLC},
  title = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
  year = {2024},
  url = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```

## 📄 License

MIT License - free for commercial and personal use.

---

<p align="center">
  Made with ❤️ by <a href="https://haploapp.com">Haplo LLC</a>
</p>