---
license: mit
language:
- en
tags:
- vision
- food
- nutrition
- calorie-estimation
- clip
- image-classification
- health
datasets:
- nutrition5k
metrics:
- mae
pipeline_tag: image-to-text
library_name: open-clip
---
# 🍎 CalorieCLIP: Accurate Food Calorie Estimation
<p align="center">
<img src="assets/model_comparison.png" width="700" alt="CalorieCLIP vs Other Models">
</p>
**CalorieCLIP** is a fine-tuned CLIP model that estimates calories from food images. On our benchmark it outperformed every VLM we tested (including GPT-4o and Claude 3.5 Sonnet) while running entirely on-device.
## 🎯 Key Results
| Metric | Value |
|--------|-------|
| **Mean Absolute Error** | **51.4 calories** |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50ms on M1 Mac |
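The aggregate metrics above are straightforward to recompute from raw predictions. A minimal sketch using a handful of illustrative values (not the actual validation set):

```python
import numpy as np

# Illustrative ground-truth / predicted calories (not the real validation data)
y_true = np.array([558, 431, 144, 156, 88])
y_pred = np.array([555, 437, 143, 156, 88])

abs_err = np.abs(y_pred - y_true)
mae = abs_err.mean()                       # mean absolute error in calories
within_50 = (abs_err <= 50).mean() * 100   # % of predictions within 50 cal
within_100 = (abs_err <= 100).mean() * 100 # % of predictions within 100 cal
print(f"MAE: {mae:.1f} cal | within 50: {within_50:.1f}% | within 100: {within_100:.1f}%")
```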
<p align="center">
<img src="assets/accuracy_breakdown.png" width="500" alt="Accuracy Breakdown">
</p>
## 🍽️ Example Predictions
Real predictions from our validation set across multiple datasets:
| Image | Food | Dataset | Actual (cal) | Predicted (cal) | Error (cal) |
|-------|------|---------|--------|-----------|-------|
| ![Example 1](assets/examples/example_1.png) | Hamburger | Food-101 | 558 | 555 | 3 |
| ![Example 2](assets/examples/example_2.png) | Ramen | Food-101 | 431 | 437 | 6 |
| ![Example 3](assets/examples/example_3.png) | Greek Salad | Food-101 | 144 | 143 | 1 |
| ![Example 4](assets/examples/example_4.png) | Sashimi | Food-101 | 156 | 156 | 0 |
| ![Example 5](assets/examples/example_5.png) | Cafeteria Meal | Nutrition5k | 88 | 88 | 0 |
| ![Example 6](assets/examples/example_6.png) | Cafeteria Meal | Nutrition5k | 138 | 138 | 0 |
| ![Example 7](assets/examples/example_7.png) | Cafeteria Meal | Nutrition5k | 330 | 334 | 4 |
| ![Example 8](assets/examples/example_8.png) | Cafeteria Meal | Nutrition5k | 214 | 217 | 4 |
## 🚀 Quick Start
### Installation
```bash
pip install open-clip-torch torch pillow
```
### Python Usage
```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP
# Load model from local directory
model = CalorieCLIP.from_pretrained(".")
# Predict calories
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")
# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```
### Direct Usage (no wrapper)
```python
import torch
import open_clip
from PIL import Image
# Load CLIP
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
clip.load_state_dict(checkpoint['clip_state'], strict=False)
# Load regression head
import torch.nn as nn
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])
clip.eval()
head.eval()
# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)
with torch.no_grad():
    features = clip.encode_image(img)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```
### Command Line
```bash
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```
## 📊 Training Progress
<p align="center">
<img src="assets/training_progress.png" width="800" alt="Training Progress">
</p>
The model was trained for 30 epochs on the combined Nutrition5k + Food-101 dataset (see Training Data below) with:
- **Huber loss** for robustness to label outliers
- **Strong augmentation** (rotation, color jitter, flips)
- **Fine-tuning of the last 2 CLIP transformer blocks** (9.4% of parameters)
- **Differential learning rates** (1e-5 for CLIP, 1e-3 for the regression head)
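The freezing and optimizer setup above can be sketched as follows. This is an illustrative helper, not code from this repo; the `clip.visual.transformer.resblocks` path assumes open_clip's ViT implementation:

```python
import torch
import torch.nn as nn

def build_optimizer(clip, head):
    """Illustrative sketch of the training setup described above.

    `clip` is assumed to be an open_clip ViT model and `head` the
    regression head; this helper is not part of the repo itself.
    """
    # Freeze the whole model, then unfreeze the last 2 transformer blocks
    for p in clip.parameters():
        p.requires_grad = False
    for block in clip.visual.transformer.resblocks[-2:]:
        for p in block.parameters():
            p.requires_grad = True
    # Differential learning rates: small LR for pretrained CLIP weights,
    # larger LR for the randomly initialized regression head
    return torch.optim.AdamW([
        {"params": [p for p in clip.parameters() if p.requires_grad], "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-3},
    ])

# Huber loss is quadratic for small errors and linear for large ones,
# limiting the influence of mislabeled calorie outliers
criterion = nn.HuberLoss()
```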
## 🔬 Technical Details
### Architecture
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Food Image  │────▶│  CLIP ViT-B  │────▶│  Regression  │────▶ Calories
│  (224×224)   │     │   Encoder    │     │     Head     │
└──────────────┘     │ (fine-tuned) │     │  (4 layers)  │
                     └──────────────┘     └──────────────┘
                            │
                            ▼
                    512-dim features
```
### Model Specs
- **Base Model**: OpenAI CLIP ViT-B/32
- **Fine-tuned Layers**: Last 2 transformer blocks + regression head
- **Trainable Parameters**: 9.4% (8.5M of 90M)
- **Input Size**: 224×224 RGB
- **Output**: Single float (calories)
### Comparison to VLMs
We tested multiple Vision-Language Models on the same test set:
<p align="center">
<img src="assets/error_distribution.png" width="600" alt="Error Distribution">
</p>
| Model | MAE | Notes |
|-------|-----|-------|
| **CalorieCLIP (Ours)** | **51.4** | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |
**Key Finding**: All tested local VLMs (Qwen2-VL, Pixtral) suffered from mode collapse, outputting nearly the same calorie value for every image. CalorieCLIP's direct regression approach avoids this failure mode entirely.
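Mode collapse of this kind can be detected with a simple variance check over predictions on a batch of diverse images. `looks_collapsed` and its 10-calorie threshold are illustrative, not part of the evaluation code:

```python
import numpy as np

def looks_collapsed(predictions, min_std=10.0):
    """Heuristic: a model whose predictions barely vary across diverse
    food images has likely collapsed to a single output value.
    `min_std` (in calories) is an illustrative threshold."""
    return float(np.std(predictions)) < min_std

# A collapsed model returns (nearly) the same value for every image
assert looks_collapsed([342, 342, 343, 342])
assert not looks_collapsed([88, 144, 330, 558])
```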
## 📁 Files
```
CalorieCLIP/
├── config.json          # Model configuration
├── calorie_clip.pt      # Model weights (PyTorch)
├── calorie_clip.py      # Inference code
├── requirements.txt     # Dependencies
└── assets/
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```
## 📋 Training Data
Trained on a combined dataset of:
- **[Nutrition5k](https://github.com/google-research-datasets/nutrition5k)**: 5,006 real cafeteria food images with professional calorie measurements
- **Food-101 subset**: 8,000+ food images with estimated calories
- **Total: 13,004 samples** (11,053 train / 1,951 validation)
- **Diverse foods**: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, Greek salads, sashimi, and more
## ⚠️ Limitations
- Trained on cafeteria food; may be less accurate for restaurant/home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice
## 🙏 Citation
```bibtex
@software{calorieclip2024,
author = {Haplo LLC},
title = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
year = {2024},
url = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```
## 📄 License
MIT License - free for commercial and personal use.
---
<p align="center">
Made with ❤️ by <a href="https://haploapp.com">Haplo LLC</a>
</p>