---
license: mit
language:
- en
tags:
- vision
- food
- nutrition
- calorie-estimation
- clip
- image-classification
- health
datasets:
- nutrition5k
metrics:
- mae
pipeline_tag: image-to-text
library_name: open-clip
---

# CalorieCLIP: Accurate Food Calorie Estimation
|
|
<p align="center">
  <img src="assets/model_comparison.png" width="700" alt="CalorieCLIP vs Other Models">
</p>
|
|
**CalorieCLIP** is a fine-tuned CLIP model that estimates calories from food images with state-of-the-art accuracy. It outperforms every vision-language model we tested (including GPT-4o and Claude 3.5 Sonnet) while running entirely on-device.
|
|
## Key Results
|
|
| Metric | Value |
|--------|-------|
| **Mean Absolute Error** | **51.4 calories** |
| Within 50 calories | 67.6% |
| Within 100 calories | 90.5% |
| Inference Speed | <50 ms on M1 Mac |
|
|
<p align="center">
  <img src="assets/accuracy_breakdown.png" width="500" alt="Accuracy Breakdown">
</p>
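The MAE and within-K figures above come from comparing predicted and ground-truth calories over the validation set. A minimal sketch of that computation (the function name and the sample values are illustrative, not taken from the repository):

```python
import numpy as np

def calorie_metrics(predicted, actual):
    """MAE and within-K accuracy for calorie predictions."""
    errors = np.abs(np.asarray(predicted, float) - np.asarray(actual, float))
    return {
        "mae": float(errors.mean()),
        "within_50": float((errors <= 50).mean()),
        "within_100": float((errors <= 100).mean()),
    }

# Illustrative values: errors of 3, 6, and 1 calorie
print(calorie_metrics([555, 437, 143], [558, 431, 144]))
```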
|
|
## Example Predictions
|
|
Real predictions from our validation set across multiple datasets:
|
|
| Image | Food | Dataset | Actual (cal) | Predicted (cal) | Error (cal) |
|-------|------|---------|--------------|-----------------|-------------|
|  | Hamburger | Food-101 | 558 | 555 | 3 |
|  | Ramen | Food-101 | 431 | 437 | 6 |
|  | Greek Salad | Food-101 | 144 | 143 | 1 |
|  | Sashimi | Food-101 | 156 | 156 | 0 |
|  | Cafeteria Meal | Nutrition5k | 88 | 88 | 0 |
|  | Cafeteria Meal | Nutrition5k | 138 | 138 | 0 |
|  | Cafeteria Meal | Nutrition5k | 330 | 334 | 4 |
|  | Cafeteria Meal | Nutrition5k | 214 | 217 | 3 |
|
|
## Quick Start

### Installation

```bash
pip install open-clip-torch torch pillow
```
|
|
### Python Usage

```python
# Clone or download this repo first, then:
from calorie_clip import CalorieCLIP

# Load model from local directory
model = CalorieCLIP.from_pretrained(".")

# Predict calories for a single image
calories = model.predict("food_photo.jpg")
print(f"Estimated: {calories:.0f} calories")

# Batch prediction
images = ["breakfast.jpg", "lunch.jpg", "dinner.jpg"]
results = model.predict_batch(images)
```
|
|
### Direct Usage (no wrapper)

```python
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load the CLIP backbone and the fine-tuned weights
clip, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')
checkpoint = torch.load('calorie_clip.pt', map_location='cpu', weights_only=False)
clip.load_state_dict(checkpoint['clip_state'], strict=False)

# Regression head: maps 512-dim CLIP features to a calorie value
class RegressionHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

head = RegressionHead()
head.load_state_dict(checkpoint['regressor_state'])
clip.eval()
head.eval()

# Predict
img = preprocess(Image.open('food.jpg')).unsqueeze(0)
with torch.no_grad():
    features = clip.encode_image(img)
    calories = head(features).item()
print(f"{calories:.0f} calories")
```
|
|
### Command Line

```bash
python calorie_clip.py my_food_image.jpg
# Output: my_food_image.jpg: 342 calories
```
|
|
## Training Progress

<p align="center">
  <img src="assets/training_progress.png" width="800" alt="Training Progress">
</p>

The model was trained for 30 epochs on the combined Nutrition5k and Food-101 data (see Training Data below) with:
- **Huber Loss** for robustness to outliers
- **Strong augmentation** (rotation, color jitter, flips)
- **Fine-tuning of the last 2 CLIP transformer blocks** (9.4% of parameters)
- **Differential learning rates** (1e-5 for CLIP, 1e-3 for the regression head)
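The loss and differential learning rates listed above can be wired up in PyTorch as sketched below. The placeholder modules stand in for the fine-tuned CLIP blocks and the regression head; this is an illustration of the setup, not the repository's actual training code:

```python
import torch
import torch.nn as nn

# Placeholders: stand-ins for the last-2 CLIP blocks and the regression head
clip_blocks = nn.Linear(512, 512)
head = nn.Linear(512, 1)

# Differential learning rates: small for the backbone, larger for the head
optimizer = torch.optim.AdamW([
    {"params": clip_blocks.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])
criterion = nn.HuberLoss()  # less sensitive to outlier calorie labels than MSE

# One dummy optimization step on random features and calorie targets
features = torch.randn(8, 512)
targets = torch.rand(8, 1) * 800
loss = criterion(head(clip_blocks(features)), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```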
|
|
## Technical Details

### Architecture

```
┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  Food Image  │────▶│  CLIP ViT-B  │────▶│ Regression  │────▶ Calories
│  (224×224)   │     │   Encoder    │     │    Head     │
└──────────────┘     │ (fine-tuned) │     │ (4 layers)  │
                     └──────────────┘     └─────────────┘
                             │
                             ▼
                      512-dim features
```
|
|
### Model Specs

- **Base Model**: OpenAI CLIP ViT-B/32
- **Fine-tuned Layers**: Last 2 transformer blocks + regression head
- **Trainable Parameters**: 9.4% (8.5M of 90M)
- **Input Size**: 224×224 RGB
- **Output**: Single float (calories)
|
|
### Comparison to VLMs

We tested multiple Vision-Language Models on the same test set:

<p align="center">
  <img src="assets/error_distribution.png" width="600" alt="Error Distribution">
</p>

| Model | MAE (cal) | Notes |
|-------|-----------|-------|
| **CalorieCLIP (Ours)** | **51.4** | Local, fast, accurate |
| Claude 3.5 Sonnet | 71.7 | API required |
| GPT-4o | 80.2 | API required |
| Gemini 1.5 Pro | 86.7 | API required |
| GPT-4o-mini | 88.7 | API required |
| Qwen2-VL-7B (Local) | 160.7 | Mode collapse issues |
|
|
**Key Finding**: All tested local VLMs (Qwen, Pixtral) suffered from mode collapse, outputting the same calorie value for all images. CalorieCLIP's regression approach avoids this entirely.
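Mode collapse of this kind is easy to detect: the spread of a model's predictions across varied images shrinks toward zero. A tiny illustration (the sample values are made up, not measured outputs):

```python
import statistics

def prediction_spread(preds):
    """Population std-dev of predictions; near zero across varied images suggests collapse."""
    return statistics.pstdev(preds)

collapsed = [500.0, 500.0, 500.0, 500.0]  # a VLM repeating one value
varied = [88.0, 138.0, 330.0, 558.0]      # a model that tracks the input
print(prediction_spread(collapsed), prediction_spread(varied))
```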
|
|
## Files

```
CalorieCLIP/
├── config.json          # Model configuration
├── calorie_clip.pt      # Model weights (PyTorch)
├── calorie_clip.py      # Inference code
├── requirements.txt     # Dependencies
└── assets/
    ├── training_progress.png
    ├── model_comparison.png
    ├── accuracy_breakdown.png
    └── error_distribution.png
```
|
|
## Training Data

Trained on a combined dataset of:
- **[Nutrition5k](https://github.com/google-research-datasets/nutrition5k)**: 5,006 real cafeteria food images with professional calorie measurements
- **Food-101 subset**: ~8,000 food images with estimated calories
- **Total: 13,004 samples** (11,053 train / 1,951 validation)
- **Diverse foods**: beignets, prime rib, ramen, hamburgers, bruschetta, chicken wings, pork chops, Greek salads, sashimi, and more
|
|
## ⚠️ Limitations

- Trained on cafeteria food; may be less accurate for restaurant/home-cooked meals
- Single-dish focused; complex multi-item plates may have higher error
- Portion size estimation is inherently challenging from 2D images
- Not a replacement for professional nutrition advice
|
|
## Citation

```bibtex
@software{calorieclip2024,
  author = {Haplo LLC},
  title = {CalorieCLIP: Accurate Food Calorie Estimation from Images},
  year = {2024},
  url = {https://huggingface.co/jc-builds/CalorieCLIP}
}
```
|
|
## License

MIT License - free for commercial and personal use.
|
|
---
|
|
<p align="center">
  Made with ❤️ by <a href="https://haploapp.com">Haplo LLC</a>
</p>
|
|