---
license: mit
language:
- en
tags:
- vision
- food
- nutrition
- calorie-estimation
- clip
- image-classification
- health
datasets:
- nutrition5k
metrics:
- mae
pipeline_tag: image-to-text
library_name: open-clip
---

# 🍎 CalorieCLIP: Accurate Food Calorie Estimation

![CalorieCLIP vs Other Models](assets/model_comparison.png)

**CalorieCLIP** is a fine-tuned CLIP model that estimates calories from food images with state-of-the-art accuracy. It outperforms all tested VLMs (including GPT-4o and Claude) while running entirely on-device.

## 🎯 Key Results

| Metric | Value |
|--------|-------|
| **Mean Absolute Error** | **51.4 calories** |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50ms on M1 Mac |

![Accuracy Breakdown](assets/accuracy_breakdown.png)
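The bucketed accuracy numbers above are straightforward to recompute from per-image errors. A minimal sketch, using a handful of example predictions from this card as stand-in data rather than the full 1,951-image validation set:

```python
# Stand-in (predicted, actual) calorie pairs; illustrative only.
preds   = [555, 437, 143, 156, 88]
actuals = [558, 431, 144, 156, 88]

errors = [abs(p - a) for p, a in zip(preds, actuals)]
mae = sum(errors) / len(errors)
within_50 = 100 * sum(e <= 50 for e in errors) / len(errors)
within_100 = 100 * sum(e <= 100 for e in errors) / len(errors)

print(f"MAE: {mae:.1f} cal | within 50: {within_50:.1f}% | within 100: {within_100:.1f}%")
# → MAE: 2.0 cal | within 50: 100.0% | within 100: 100.0%
```

On the real validation set, the same bucketing yields the 67.6% / 90.5% figures reported in the table.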

## 🍽️ Example Predictions

Real predictions from our validation set across multiple datasets:

| Image | Food | Dataset | Actual | Predicted | Error |
|-------|------|---------|--------|-----------|-------|
| ![Example 1](assets/examples/example_1.png) | Hamburger | Food-101 | 558 | 555 | 3 |
| ![Example 2](assets/examples/example_2.png) | Ramen | Food-101 | 431 | 437 | 6 |
| ![Example 3](assets/examples/example_3.png) | Greek Salad | Food-101 | 144 | 143 | 1 |
| ![Example 4](assets/examples/example_4.png) | Sashimi | Food-101 | 156 | 156 | 0 |
| ![Example 5](assets/examples/example_5.png) | Cafeteria Meal | Nutrition5k | 88 | 88 | 0 |
| ![Example 6](assets/examples/example_6.png) | Cafeteria Meal | Nutrition5k | 138 | 138 | 0 |
| ![Example 7](assets/examples/example_7.png) | Cafeteria Meal | Nutrition5k | 330 | 334 | 4 |
| ![Example 8](assets/examples/example_8.png) | Cafeteria Meal | Nutrition5k | 214 | 217 | 3 |

## 🚀 Quick Start

### Installation

```bash
pip install open-clip-torch torch pillow
```

### Python Usage

```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP

# Load model from local directory
model = CalorieCLIP.from_pretrained(".")

# Predict calories
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")

# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```

### Direct Usage (no wrapper)

```python
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load the fine-tuned CLIP backbone
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
clip.load_state_dict(checkpoint['clip_state'], strict=False)

# Regression head: 512 -> 512 -> 256 -> 64 -> 1
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])
clip.eval()
head.eval()

# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)
with torch.no_grad():
    features = clip.encode_image(img)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```

### Command Line

```bash
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```

## 📊 Training Progress
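The training recipe summarized in this section maps onto standard PyTorch pieces: a Huber loss plus per-parameter-group learning rates. The sketch below uses small placeholder modules in place of the real CLIP encoder and regression head; only the loss choice and the learning-rate split mirror the actual setup.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the trainable pieces; in the real setup the
# backbone is the last 2 CLIP transformer blocks and the head is the
# 4-layer regression head shown under "Direct Usage".
backbone = nn.Linear(512, 512)
head = nn.Linear(512, 1)

# Differential learning rates: a small LR for the pretrained backbone,
# a larger one for the freshly initialized regression head.
optimizer = torch.optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

# Huber loss is quadratic near zero and linear for large errors, so a
# few mislabeled high-calorie dishes don't dominate the gradient.
loss_fn = nn.HuberLoss(delta=1.0)

features = torch.randn(4, 512)                       # stand-in image features
target = torch.tensor([558.0, 431.0, 144.0, 156.0])  # calorie labels
loss = loss_fn(head(backbone(features)).squeeze(-1), target)
loss.backward()
optimizer.step()
```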

![Training Progress](assets/training_progress.png)

The model was trained for 30 epochs on the combined Nutrition5k + Food-101 dataset (see Training Data below) with:

- **Huber Loss** for robustness to outliers
- **Strong augmentation** (rotation, color jitter, flips)
- **Fine-tuning last 2 CLIP transformer blocks** (9.4% of parameters)
- **Differential learning rates** (1e-5 for CLIP, 1e-3 for regression head)

## 🔬 Technical Details

### Architecture

```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ Food Image  │────▶│  CLIP ViT-B  │────▶│ Regression  │────▶ Calories
│  (224×224)  │     │   Encoder    │     │    Head     │
└─────────────┘     │ (fine-tuned) │     │ (4 layers)  │
                    └──────────────┘     └─────────────┘
                            │
                            ▼
                    512-dim features
```

### Model Specs

- **Base Model**: OpenAI CLIP ViT-B/32
- **Fine-tuned Layers**: Last 2 transformer blocks + regression head
- **Trainable Parameters**: 9.4% (8.5M of 90M)
- **Input Size**: 224×224 RGB
- **Output**: Single float (calories)

### Comparison to VLMs

We tested multiple Vision-Language Models on the same test set:

![Error Distribution](assets/error_distribution.png)

| Model | MAE | Notes |
|-------|-----|-------|
| **CalorieCLIP (Ours)** | **51.4** | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |

**Key Finding**: All tested local VLMs (Qwen, Pixtral) suffered from mode collapse, outputting the same calorie value for all images. CalorieCLIP's regression approach avoids this entirely.

## 📁 Files

```
CalorieCLIP/
├── config.json          # Model configuration
├── calorie_clip.pt      # Model weights (PyTorch)
├── calorie_clip.py      # Inference code
├── requirements.txt     # Dependencies
└── assets/
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```

## 📋 Training Data

Trained on a combined dataset of:

- **[Nutrition5k](https://github.com/google-research-datasets/nutrition5k)**: 5,006 real cafeteria food images with professional calorie measurements
- **Food-101 subset**: 8,000+ food images with estimated calories
- **Total: 13,004 samples** (11,053 train / 1,951 validation)
- **Diverse foods**: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, greek salads, sashimi, and more

## ⚠️ Limitations

- Trained on cafeteria food; may be less accurate for restaurant/home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice

## 🙏 Citation

```bibtex
@software{calorieclip2024,
  author = {Haplo LLC},
  title = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
  year = {2024},
  url = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```

## 📄 License

MIT License - free for commercial and personal use.

---

Made with ❤️ by Haplo LLC