---
license: mit
language:
- en
tags:
- vision
- food
- nutrition
- calorie-estimation
- clip
- image-classification
- health
datasets:
- nutrition5k
metrics:
- mae
pipeline_tag: image-to-text
library_name: open-clip
---

# CalorieCLIP: Accurate Food Calorie Estimation
|
|
<p align="center">
  <img src="assets/model_comparison.png" width="700" alt="CalorieCLIP vs Other Models">
</p>
|
|
**CalorieCLIP** is a fine-tuned CLIP model that estimates calories from food images with state-of-the-art accuracy. It outperforms every vision-language model we tested (including GPT-4o and Claude 3.5 Sonnet) while running entirely on-device.
|
|
## Key Results
|
|
| Metric | Value |
|--------|-------|
| **Mean Absolute Error** | **51.4 calories** |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50 ms on M1 Mac |
|
|
<p align="center">
  <img src="assets/accuracy_breakdown.png" width="500" alt="Accuracy Breakdown">
</p>
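The MAE and within-K figures above come from comparing predicted and ground-truth calories over the validation set. A minimal sketch of that computation (the function name and the sample values are illustrative, not taken from the repository):

```python
import numpy as np

def calorie_metrics(predicted, actual):
    """MAE and within-K accuracy for calorie predictions."""
    errors = np.abs(np.asarray(predicted, float) - np.asarray(actual, float))
    return {
        "mae": float(errors.mean()),
        "within_50": float((errors <= 50).mean()),
        "within_100": float((errors <= 100).mean()),
    }

# Illustrative values: errors of 3, 6, and 1 calorie
print(calorie_metrics([555, 437, 143], [558, 431, 144]))
```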
|
|
## Example Predictions
|
|
Real predictions from our validation set across multiple datasets:
|
|
| Image | Food | Dataset | Actual (cal) | Predicted (cal) | Error (cal) |
|-------|------|---------|--------------|-----------------|-------------|
|  | Hamburger | Food-101 | 558 | 555 | 3 |
|  | Ramen | Food-101 | 431 | 437 | 6 |
|  | Greek Salad | Food-101 | 144 | 143 | 1 |
|  | Sashimi | Food-101 | 156 | 156 | 0 |
|  | Cafeteria Meal | Nutrition5k | 88 | 88 | 0 |
|  | Cafeteria Meal | Nutrition5k | 138 | 138 | 0 |
|  | Cafeteria Meal | Nutrition5k | 330 | 334 | 4 |
|  | Cafeteria Meal | Nutrition5k | 214 | 217 | 3 |
|
|
## Quick Start

### Installation

```bash
pip install open-clip-torch torch pillow
```
|
|
### Python Usage

```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP

# Load model from local directory
model = CalorieCLIP.from_pretrained(".")

# Predict calories for a single image
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")

# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```
|
|
### Direct Usage (no wrapper)

```python
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load the CLIP backbone and the fine-tuned weights
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
clip.load_state_dict(checkpoint['clip_state'], strict=False)

# Regression head: maps 512-dim CLIP features to a calorie value
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])
clip.eval()
head.eval()

# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)
with torch.no_grad():
    features = clip.encode_image(img)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```
|
|
### Command Line

```bash
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```
|
|
## Training Progress

<p align="center">
  <img src="assets/training_progress.png" width="800" alt="Training Progress">
</p>

The model was trained for 30 epochs on the combined Nutrition5k and Food-101 data (see Training Data below) with:
- **Huber Loss** for robustness to outliers
- **Strong augmentation** (rotation, color jitter, flips)
- **Fine-tuning of the last 2 CLIP transformer blocks** (9.4% of parameters)
- **Differential learning rates** (1e-5 for CLIP, 1e-3 for the regression head)
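The loss and differential learning rates listed above can be wired up in PyTorch as sketched below. The placeholder modules stand in for the fine-tuned CLIP blocks and the regression head; this is an illustration of the setup, not the repository's actual training code:

```python
import torch
import torch.nn as nn

# Placeholders: stand-ins for the last-2 CLIP blocks and the regression head
clip_blocks = nn.Linear(512, 512)
head = nn.Linear(512, 1)

# Differential learning rates: small for the backbone, larger for the head
optimizer = torch.optim.AdamW([
    {"params": clip_blocks.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
criterion = nn.HuberLoss()  # less sensitive to outlier calorie labels than MSE

# One dummy optimization step on random features and calorie targets
features = torch.randn(8, 512)
targets = torch.rand(8, 1) * 800
loss = criterion(head(clip_blocks(features)), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```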
|
|
## Technical Details

### Architecture

```
┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  Food Image  │────▶│  CLIP ViT-B  │────▶│ Regression  │────▶ Calories
│  (224×224)   │     │   Encoder    │     │    Head     │
└──────────────┘     │ (fine-tuned) │     │ (4 layers)  │
                     └──────────────┘     └─────────────┘
                             │
                             ▼
                      512-dim features
```
|
|
### Model Specs

- **Base Model**: OpenAI CLIP ViT-B/32
- **Fine-tuned Layers**: Last 2 transformer blocks + regression head
- **Trainable Parameters**: 9.4% (8.5M of 90M)
- **Input Size**: 224×224 RGB
- **Output**: Single float (calories)
|
|
### Comparison to VLMs

We tested multiple Vision-Language Models on the same test set:

<p align="center">
  <img src="assets/error_distribution.png" width="600" alt="Error Distribution">
</p>

| Model | MAE (cal) | Notes |
|-------|-----------|-------|
| **CalorieCLIP (Ours)** | **51.4** | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |
|
|
**Key Finding**: All tested local VLMs (Qwen, Pixtral) suffered from mode collapse, outputting the same calorie value for all images. CalorieCLIP's regression approach avoids this entirely.
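Mode collapse of this kind is easy to detect: the spread of a model's predictions across varied images shrinks toward zero. A tiny illustration (the sample values are made up, not measured outputs):

```python
import statistics

def prediction_spread(preds):
    """Population std-dev of predictions; near zero across varied images suggests collapse."""
    return statistics.pstdev(preds)

collapsed = [500.0, 500.0, 500.0, 500.0]  # a VLM repeating one value
varied = [88.0, 138.0, 330.0, 558.0]      # a model that tracks the input
print(prediction_spread(collapsed), prediction_spread(varied))
```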
|
|
## Files

```
CalorieCLIP/
├── config.json          # Model configuration
├── calorie_clip.pt      # Model weights (PyTorch)
├── calorie_clip.py      # Inference code
├── requirements.txt     # Dependencies
└── assets/
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```
|
|
## Training Data

Trained on a combined dataset of:
- **[Nutrition5k](https://github.com/google-research-datasets/nutrition5k)**: 5,006 real cafeteria food images with professional calorie measurements
- **Food-101 subset**: ~8,000 food images with estimated calories
- **Total: 13,004 samples** (11,053 train / 1,951 validation)
- **Diverse foods**: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, Greek salads, sashimi, and more
|
|
## ⚠️ Limitations

- Trained on cafeteria food; may be less accurate for restaurant/home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice
|
|
## Citation

```bibtex
@software{calorieclip2024,
  author = {Haplo LLC},
  title = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
  year = {2024},
  url = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```
|
|
## License

MIT License - free for commercial and personal use.
|
|
---
|
|
<p align="center">
  Made with ❤️ by <a href="https://haploapp.com">Haplo LLC</a>
</p>
|
|