---
license: mit
language:
- en
tags:
- vision
- food
- nutrition
- calorie-estimation
- clip
- image-classification
- health
datasets:
- nutrition5k
metrics:
- mae
pipeline_tag: image-to-text
library_name: open-clip
---
# CalorieCLIP: Accurate Food Calorie Estimation
**CalorieCLIP** is a fine-tuned CLIP model that estimates calories from food images. It outperforms every vision-language model we tested, including GPT-4o and Claude 3.5 Sonnet, while running entirely on-device.
## Key Results
| Metric | Value |
|--------|-------|
| **Mean Absolute Error** | **51.4 calories** |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50ms on M1 Mac |
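The accuracy metrics above can be reproduced from paired predictions and ground-truth values. A minimal sketch (the function name `accuracy_metrics` is illustrative, not part of the released code):

```python
import numpy as np

def accuracy_metrics(pred, actual):
    """Compute MAE and within-N-calorie hit rates from paired arrays."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    err = np.abs(pred - actual)
    return {
        "mae": err.mean(),                 # mean absolute error in calories
        "within_50": (err <= 50).mean(),   # fraction within 50 calories
        "within_100": (err <= 100).mean(), # fraction within 100 calories
    }

# Using the four Food-101 examples from the table below:
m = accuracy_metrics([555, 437, 143, 156], [558, 431, 144, 156])
print(m)
```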
## Example Predictions
Real predictions from our validation set across multiple datasets:
| Food | Dataset | Actual (kcal) | Predicted (kcal) | Error (kcal) |
|------|---------|---------------|------------------|--------------|
| Hamburger | Food-101 | 558 | 555 | 3 |
| Ramen | Food-101 | 431 | 437 | 6 |
| Greek Salad | Food-101 | 144 | 143 | 1 |
| Sashimi | Food-101 | 156 | 156 | 0 |
| Cafeteria Meal | Nutrition5k | 88 | 88 | 0 |
| Cafeteria Meal | Nutrition5k | 138 | 138 | 0 |
| Cafeteria Meal | Nutrition5k | 330 | 334 | 4 |
| Cafeteria Meal | Nutrition5k | 214 | 217 | 3 |
## Quick Start
### Installation
```bash
pip install open-clip-torch torch pillow
```
### Python Usage
```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP
# Load model from local directory
model = CalorieCLIP.from_pretrained(".")
# Predict calories
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")
# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```
### Direct Usage (no wrapper)
```python
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load the CLIP backbone and the fine-tuned weights
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
clip.load_state_dict(checkpoint['clip_state'], strict=False)

# Regression head: 512-dim CLIP image features -> single calorie value
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])
clip.eval()
head.eval()

# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)
with torch.no_grad():
    features = clip.encode_image(img)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```
### Command Line
```bash
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```
## Training Progress
The model was trained for 30 epochs on the Nutrition5k dataset with:
- **Huber Loss** for robustness to outliers
- **Strong augmentation** (rotation, color jitter, flips)
- **Fine-tuning last 2 CLIP transformer blocks** (9.4% of parameters)
- **Differential learning rates** (1e-5 for CLIP, 1e-3 for regression head)
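The loss and differential-learning-rate setup described above can be sketched as follows. `clip_params` and `head_params` are placeholder tensors standing in for the fine-tuned CLIP blocks and the regression head; only the optimizer and loss configuration reflect the card:

```python
import torch
import torch.nn as nn

# Placeholders for the two parameter groups (real code would pass
# the unfrozen CLIP block parameters and the regression head parameters).
clip_params = [nn.Parameter(torch.zeros(4))]
head_params = [nn.Parameter(torch.zeros(4))]

# Differential learning rates: small for the pretrained backbone,
# larger for the freshly initialized head.
optimizer = torch.optim.AdamW([
    {"params": clip_params, "lr": 1e-5},  # fine-tuned CLIP blocks
    {"params": head_params, "lr": 1e-3},  # regression head
])

# Huber loss: quadratic near zero, linear for large errors,
# so a few wildly mislabeled calorie values don't dominate training.
criterion = nn.HuberLoss(delta=1.0)
loss = criterion(torch.tensor([100.0]), torch.tensor([103.0]))
print(f"{loss.item():.2f}")  # 1.0 * (3 - 0.5) = 2.50
```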
## Technical Details
### Architecture
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Food Image  │────▶│  CLIP ViT-B  │────▶│  Regression  │────▶ Calories
│  (224×224)   │     │   Encoder    │     │     Head     │
└──────────────┘     │ (fine-tuned) │     │  (4 layers)  │
                     └──────────────┘     └──────────────┘
                            │
                            ▼
                     512-dim features
```
### Model Specs
- **Base Model**: OpenAI CLIP ViT-B/32
- **Fine-tuned Layers**: Last 2 transformer blocks + regression head
- **Trainable Parameters**: 9.4% (8.5M of 90M)
- **Input Size**: 224×224 RGB
- **Output**: Single float (calories)
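Freezing all but the last two transformer blocks can be sketched as below. This assumes open_clip's ViT layout (`model.visual.transformer.resblocks`); the helper names are illustrative, not part of the released code:

```python
import torch.nn as nn

def freeze_all_but_last_blocks(model: nn.Module, last_n: int = 2):
    """Freeze everything, then unfreeze the last `last_n` visual transformer
    blocks. Assumes an open_clip-style `visual.transformer.resblocks` layout."""
    for p in model.parameters():
        p.requires_grad = False
    for block in model.visual.transformer.resblocks[-last_n:]:
        for p in block.parameters():
            p.requires_grad = True

def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters left trainable (the card reports ~9.4%)."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total
```

Because the regression head is trained from scratch, its parameters stay trainable; only the backbone is selectively frozen.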
### Comparison to VLMs
We tested multiple Vision-Language Models on the same test set:
| Model | MAE | Notes |
|-------|-----|-------|
| **CalorieCLIP (Ours)** | **51.4** | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |
**Key Finding**: All tested local VLMs (Qwen, Pixtral) suffered from mode collapse, outputting the same calorie value for all images. CalorieCLIP's regression approach avoids this entirely.
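Mode collapse of the kind described above is easy to detect: check whether a model's outputs vary across a batch of distinct images. A minimal heuristic sketch (thresholds are illustrative):

```python
import statistics

def looks_collapsed(predictions, min_unique=3, min_stdev=10.0):
    """Flag a model whose calorie outputs barely vary across distinct images."""
    unique = len(set(round(p) for p in predictions))
    spread = statistics.pstdev(predictions) if len(predictions) > 1 else 0.0
    return unique < min_unique or spread < min_stdev

print(looks_collapsed([350.0, 350.0, 350.0, 351.0]))  # True: one value for everything
print(looks_collapsed([88.0, 144.0, 431.0, 558.0]))   # False: healthy spread
```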
## Files
```
CalorieCLIP/
├── config.json          # Model configuration
├── calorie_clip.pt      # Model weights (PyTorch)
├── calorie_clip.py      # Inference code
├── requirements.txt     # Dependencies
└── assets/
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```
## Training Data
Trained on a combined dataset of:
- **[Nutrition5k](https://github.com/google-research-datasets/nutrition5k)**: 5,006 real cafeteria food images with professional calorie measurements
- **Food-101 subset**: 8,000+ food images with estimated calories
- **Total: 13,004 samples** (11,053 train / 1,951 validation)
- **Diverse foods**: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, greek salads, sashimi, and more
## Limitations
- Trained on cafeteria food; may be less accurate for restaurant/home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice
## Citation
```bibtex
@software{calorieclip2024,
  author = {Haplo LLC},
  title  = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
  year   = {2024},
  url    = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```
## License
MIT License - free for commercial and personal use.
---
Made with ❤️ by Haplo LLC