---
license: mit
language:
- en
tags:
- vision
- food
- nutrition
- calorie-estimation
- clip
- image-classification
- health
datasets:
- nutrition5k
metrics:
- mae
pipeline_tag: image-to-text
library_name: open-clip
---

# 🍎 CalorieCLIP: Accurate Food Calorie Estimation

<p align="center">
  <img src="assets/model_comparison.png" width="700" alt="CalorieCLIP vs Other Models">
</p>

**CalorieCLIP** is a fine-tuned CLIP model that estimates calories from food images. On our test set it reaches a mean absolute error of 51.4 calories, lower than every VLM we tested (including GPT-4o and Claude 3.5 Sonnet), while running entirely on-device.

## 🎯 Key Results

| Metric | Value |
|--------|-------|
| **Mean Absolute Error** | **51.4 calories** |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50ms on M1 Mac |

<p align="center">
  <img src="assets/accuracy_breakdown.png" width="500" alt="Accuracy Breakdown">
</p>
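
The three accuracy figures above can be reproduced from paired actual/predicted values. The sketch below is a minimal illustration, not the project's evaluation script: `calorie_metrics` is a hypothetical helper, and treating "within 50" as an absolute error of at most 50 is an assumption. The sample inputs are the eight values as displayed in the Example Predictions table, which are hand-picked and not representative of the full validation set.

```python
def calorie_metrics(actual, predicted):
    """Return MAE and the share of predictions within 50 and 100 calories.

    Hypothetical helper: "within N" is assumed to mean |error| <= N.
    """
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    n = len(errors)
    return {
        "mae": sum(errors) / n,
        "within_50": sum(e <= 50 for e in errors) / n,
        "within_100": sum(e <= 100 for e in errors) / n,
    }


# Values as displayed in the Example Predictions table
actual = [558, 431, 144, 156, 88, 138, 330, 214]
predicted = [555, 437, 143, 156, 88, 138, 334, 217]

metrics = calorie_metrics(actual, predicted)
print(f"MAE: {metrics['mae']:.1f} calories")
print(f"Within 50: {metrics['within_50']:.1%}")
print(f"Within 100: {metrics['within_100']:.1%}")
```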

## 🍽️ Example Predictions

Predictions on images from our validation set, drawn from both Food-101 and Nutrition5k (all values in calories):

| Image | Food | Dataset | Actual | Predicted | Error |
|-------|------|---------|--------|-----------|-------|
| ![Example 1](assets/examples/example_1.png) | Hamburger | Food-101 | 558 | 555 | 3 |
| ![Example 2](assets/examples/example_2.png) | Ramen | Food-101 | 431 | 437 | 6 |
| ![Example 3](assets/examples/example_3.png) | Greek Salad | Food-101 | 144 | 143 | 1 |
| ![Example 4](assets/examples/example_4.png) | Sashimi | Food-101 | 156 | 156 | 0 |
| ![Example 5](assets/examples/example_5.png) | Cafeteria Meal | Nutrition5k | 88 | 88 | 0 |
| ![Example 6](assets/examples/example_6.png) | Cafeteria Meal | Nutrition5k | 138 | 138 | 0 |
| ![Example 7](assets/examples/example_7.png) | Cafeteria Meal | Nutrition5k | 330 | 334 | 3 |
| ![Example 8](assets/examples/example_8.png) | Cafeteria Meal | Nutrition5k | 214 | 217 | 4 |

## 🚀 Quick Start

### Installation

```bash
pip install open-clip-torch torch pillow
```

### Python Usage

```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP

# Load model from local directory
model = CalorieCLIP.from_pretrained(".")

# Predict calories
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")

# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```

### Direct Usage (no wrapper)

```python
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load the base CLIP model, then overwrite it with the fine-tuned weights
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
# strict=False tolerates keys that are missing from, or extra in, the checkpoint
clip.load_state_dict(checkpoint['clip_state'], strict=False)

# Regression head: maps 512-dim CLIP image features to a single calorie value
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, x):
        return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])

# Inference mode: disables dropout and uses BatchNorm running statistics
clip.eval()
head.eval()

# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)  # shape: (1, 3, 224, 224)
with torch.no_grad():
    features = clip.encode_image(img)  # shape: (1, 512)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```

### Command Line

```bash
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```

## 📊 Training Progress

<p align="center">
  <img src="assets/training_progress.png" width="800" alt="Training Progress">
</p>

The model was trained for 30 epochs on Nutrition5k and a Food-101 subset (see Training Data) with:
- **Huber Loss** for robustness to outliers
- **Strong augmentation** (rotation, color jitter, flips)
- **Fine-tuning last 2 CLIP transformer blocks** (9.4% of parameters)
- **Differential learning rates** (1e-5 for CLIP, 1e-3 for regression head)
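
The recipe above (freeze most of the encoder, unfreeze the last two blocks, use separate learning rates, train with Huber loss) can be sketched in plain PyTorch. This is an illustration under stated assumptions, not the actual training script: `TinyEncoder` is a hypothetical stand-in for the CLIP image encoder so the snippet runs without downloading weights, the AdamW optimizer is an assumption, and `nn.HuberLoss` is used with its default `delta` because the threshold used in training is not stated here.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the CLIP image encoder: a stack of blocks."""

    def __init__(self, dim=32, depth=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


encoder = TinyEncoder()
head = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))

# Freeze the whole encoder, then unfreeze only its last two blocks
for param in encoder.parameters():
    param.requires_grad = False
for block in encoder.blocks[-2:]:
    for param in block.parameters():
        param.requires_grad = True

# Differential learning rates: small for the encoder, larger for the new head
optimizer = torch.optim.AdamW(  # optimizer choice is an assumption
    [
        {"params": [p for p in encoder.parameters() if p.requires_grad], "lr": 1e-5},
        {"params": list(head.parameters()), "lr": 1e-3},
    ]
)
loss_fn = nn.HuberLoss()  # default delta=1.0; the training delta is not documented

# One optimization step on random data, just to show the pieces fit together
features = torch.randn(8, 32)
targets = torch.rand(8, 1) * 500
loss = loss_fn(head(encoder(features)), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With a real open_clip ViT model, the analogous blocks would typically be reached through the vision transformer's block list rather than `encoder.blocks`.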

## 🔬 Technical Details

### Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐
│   Food Image    │────▶│  CLIP ViT-B  │────▶│  Regression │────▶ Calories
│   (224×224)     │     │   Encoder    │     │    Head     │
└─────────────────┘     │  (fine-tuned)│     │  (4 layers) │
                        └──────────────┘     └─────────────┘
                              │
                              ▼
                        512-dim features
```

### Model Specs

- **Base Model**: OpenAI CLIP ViT-B/32
- **Fine-tuned Layers**: Last 2 transformer blocks + regression head
- **Trainable Parameters**: 9.4% (8.5M of 90M)
- **Input Size**: 224×224 RGB
- **Output**: Single float (calories)
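
As a rough check on the head's size, the arithmetic below counts the learnable parameters of the regression head exactly as it is defined in the Direct Usage section, and recomputes the quoted trainable fraction. The helper names are hypothetical; the layer sizes and the 8.5M / 90M figures are taken from this README.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(n_features):
    """Learnable scale and shift of BatchNorm1d (running stats are buffers)."""
    return 2 * n_features


head_params = (
    linear_params(512, 512) + batchnorm_params(512)
    + linear_params(512, 256) + batchnorm_params(256)
    + linear_params(256, 64)
    + linear_params(64, 1)
)
trainable_fraction = 8.5e6 / 90e6  # figures quoted under Model Specs

print(f"Regression head: {head_params:,} parameters")
print(f"Trainable fraction: {trainable_fraction:.1%}")
```

The head accounts for about 0.41M parameters; the rest of the trainable budget sits in the unfrozen CLIP blocks.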

### Comparison to VLMs

We tested multiple Vision-Language Models on the same test set:

<p align="center">
  <img src="assets/error_distribution.png" width="600" alt="Error Distribution">
</p>

| Model | MAE (calories) | Notes |
|-------|-----|-------|
| **CalorieCLIP (Ours)** | **51.4** | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |

**Key Finding**: All tested local VLMs (Qwen, Pixtral) suffered from mode collapse, outputting the same calorie value for all images. CalorieCLIP's regression approach avoids this entirely.
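
One simple way to spot the mode collapse described above is to measure how much of a model's output is a single repeated value. The sketch below is illustrative only: `mode_share` is a hypothetical helper, the 0.9 threshold is an arbitrary assumption, and the numbers are made up for demonstration rather than taken from our evaluation.

```python
from collections import Counter


def mode_share(predictions, decimals=0):
    """Return the most common (rounded) prediction and the share of outputs equal to it."""
    rounded = [round(p, decimals) for p in predictions]
    value, count = Counter(rounded).most_common(1)[0]
    return value, count / len(rounded)


def is_collapsed(predictions, threshold=0.9):
    """Flag a model whose outputs are dominated by one value (threshold is an assumption)."""
    _, share = mode_share(predictions)
    return share >= threshold


# Illustrative outputs, not measured results
collapsed_model = [350.0, 350.0, 350.0, 350.0, 350.0, 350.0]
varied_model = [92.4, 141.0, 220.3, 325.8, 436.1, 549.9]

print(mode_share(collapsed_model), is_collapsed(collapsed_model))
print(mode_share(varied_model), is_collapsed(varied_model))
```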

## 📁 Files

```
CalorieCLIP/
├── config.json           # Model configuration
├── calorie_clip.pt       # Model weights (PyTorch)
├── calorie_clip.py       # Inference code
├── requirements.txt      # Dependencies
└── assets/
    ├── examples/         # Images used in Example Predictions
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```

## 📋 Training Data

Trained on a combined dataset of:
- **[Nutrition5k](https://github.com/google-research-datasets/nutrition5k)**: 5,006 real cafeteria food images with professional calorie measurements
- **Food-101 subset**: roughly 8,000 food images with estimated calories
- **Total: 13,004 samples** (11,053 train / 1,951 validation)
- **Diverse foods**: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, greek salads, sashimi, and more
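
The train/validation counts above correspond to roughly an 85/15 split. How the split was actually drawn is not documented here, so the following is only a sketch of one reproducible way to obtain the same sizes: `split_indices`, the 0.15 fraction, and the seed are all assumptions.

```python
import random


def split_indices(n_samples, val_fraction=0.15, seed=0):
    """Shuffle sample indices with a fixed seed and split off a validation set.

    Hypothetical helper: the fraction and seed are illustrative assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(13_004)
print(f"{len(train_idx):,} train / {len(val_idx):,} validation")
```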

## ⚠️ Limitations

- Measured calorie labels come only from cafeteria food (Nutrition5k), and the Food-101 labels are estimates; accuracy may be lower for restaurant and home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice

## 🙏 Citation

```bibtex
@software{calorieclip2024,
  author = {Haplo LLC},
  title = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
  year = {2024},
  url = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```

## 📄 License

MIT License - free for commercial and personal use.

---

<p align="center">
  Made with ❤️ by <a href="https://haploapp.com">Haplo LLC</a>
</p>