---
title: Food Matcher AI (SigLIP Edition)
emoji: 🍕
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.0.0
app_file: app.py
pinned: false
---
# 🍕 Visual Dish Matcher AI
A computer vision app that suggests recipes and dishes based on visual similarity using Google's SigLIP model.
## 🎯 Project Overview
This project builds a Visual Search Engine for food. Instead of relying on text labels (which can be inaccurate or missing), we use Vector Embeddings to find dishes that look similar.
### Dataset: Food-101
We use the Food-101 dataset, a popular benchmark for fine-grained image classification. Unlike "clean" studio datasets, Food-101 contains real-world images with varied lighting, angles, and noise, making it highly representative of the photos users typically upload to social media or food apps.
Key Features:
- 101 Categories: Covers a wide range of international dishes, including Sushi, Pizza, Hamburger, Pad Thai, Baklava, and Chocolate Mousse.
- "In the Wild" Data: Images are not perfectly centered or lit; they contain background noise (plates, cutlery, restaurant tables), challenging the model to focus on the food itself.
- Project Subset: To ensure computational efficiency for this assignment, a randomized stratified subset of 5,000 images was selected from the training split.
Data Structure:
- Input: RGB Images (various aspect ratios, resized during processing).
- Labels: 101 unique Integer IDs mapped to human-readable Class Names.
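As a hedged sketch, the subset described above could be drawn with the `datasets` library roughly like this (the sampling helper, seed, and exact per-class count are illustrative, not the project's actual code):

```python
import random
from collections import defaultdict

from datasets import load_dataset

ds = load_dataset("food101", split="train")

# Group example indices by class, then draw ~50 per class (101 * 50 ≈ 5,000).
per_class = defaultdict(list)
for idx, label in enumerate(ds["label"]):
    per_class[label].append(idx)

random.seed(42)
subset_indices = [i for idxs in per_class.values() for i in random.sample(idxs, 50)]
subset = ds.select(subset_indices)

# Map integer label IDs to human-readable class names.
id2name = ds.features["label"].names
print(id2name[subset[0]["label"]])
```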
Project Highlights:
- Multimodal Search: Find food using an image or a text description.
- Advanced Data Cleaning: Automated detection of blurry or low-quality images.
- Model Comparison: A scientific comparison between OpenAI CLIP and Google SigLIP to choose the best engine.
Live Demo: [Click "App" tab above to view]
## 🛠️ Tech Stack
- Model: Google SigLIP (`google/siglip-base-patch16-224`)
- Frameworks: PyTorch, Transformers, Gradio, Datasets
- Data Engineering: OpenCV (Feature Extraction), NumPy
- Data Storage: Parquet (via Git LFS)
- Visualization: Matplotlib, Seaborn, Scikit-Learn (t-SNE/PCA)
## 📊 Part 1: Data Analysis & Cleaning
Dataset: Food-101 (ETH Zurich), subset of 5,000 images.
### 1. Exploratory Data Analysis (EDA)
Before any modeling, we analyzed the raw data to ensure quality and balance.
- Class Balance Check: We verified that our random subset of 5,000 images maintained a healthy distribution across the 101 food categories (approx. 50 images per class).
- Image Dimensions: We visualized the width and height distribution to identify unusually small or large images.
- Outlier Detection: We plotted the distribution of Aspect Ratios and Brightness Levels.
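A minimal sketch of the per-image statistics behind these plots, assuming the `subset` from the loading sketch above (the helper and column names are illustrative):

```python
import numpy as np
import pandas as pd

def image_stats(pil_image):
    gray = np.asarray(pil_image.convert("L"), dtype=np.float32)
    h, w = gray.shape
    return {
        "width": w,
        "height": h,
        "aspect_ratio": max(w, h) / min(w, h),  # >= 1.0 by construction
        "brightness": gray.mean(),              # average pixel intensity, 0-255
    }

stats = pd.DataFrame(image_stats(ex["image"]) for ex in subset)
stats[["brightness", "aspect_ratio"]].hist(bins=50)  # distributions for outlier checks
```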
### 2. Data Cleaning
Based on the plots above, we deleted "bad" images that were:
- Too Dark (Avg Pixel Intensity < 20)
- Too Bright/Washed out (Avg Pixel Intensity > 245)
- Extreme Aspect Ratios (Too stretched or squashed, AR > 3.0)
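Continuing the sketch above, applying those thresholds to the `stats` table might look like this (the threshold values are the ones stated in this section; the rest is illustrative):

```python
import numpy as np

keep = (
    (stats["brightness"] >= 20)       # not too dark
    & (stats["brightness"] <= 245)    # not washed out
    & (stats["aspect_ratio"] <= 3.0)  # not extremely stretched or squashed
)
clean_subset = subset.select(np.flatnonzero(keep.to_numpy()))
print(f"Kept {int(keep.sum())} of {len(keep)} images")
```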
## ⚖️ Part 2: Model Comparison (CLIP vs. SigLIP vs. MetaCLIP)
To ensure the best search results, we ran a "Challenger" test between three leading multimodal models.
The Contestants:
- Baseline: OpenAI CLIP (`clip-vit-base-patch32`)
- Challenger: Google SigLIP (`siglip-base-patch16-224`)
- Challenger: Facebook MetaCLIP (`facebook/metaclip-b32-400m`)
The Evaluation:
We compared them using Silhouette Scores (measuring how distinct the food clusters are) and a visual "Taste Test" (checking nearest neighbors for specific dishes).
- Metric: Silhouette Score
- Winner: Google SigLIP (Produced cleaner, more distinct clusters and better visual matches).
Visual Comparison: We queried the models with the same image to see which returned more accurate similar foods.
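The silhouette comparison can be sketched as follows; `clip_embs`, `siglip_embs`, `metaclip_embs`, and `labels` are assumed to be precomputed NumPy arrays of image embeddings and class IDs, not names from the actual codebase:

```python
from sklearn.metrics import silhouette_score

contestants = {
    "openai/clip-vit-base-patch32": clip_embs,
    "google/siglip-base-patch16-224": siglip_embs,
    "facebook/metaclip-b32-400m": metaclip_embs,
}

for name, embs in contestants.items():
    # Cosine distance suits normalized multimodal embedding spaces.
    score = silhouette_score(embs, labels, metric="cosine")
    print(f"{name}: silhouette = {score:.3f}")
```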
## 🧠 Part 3: Embeddings & Clustering
Using the winning model (SigLIP), we applied dimensionality reduction to visualize how the AI groups food concepts.
- Algorithm: K-Means Clustering (k=101 categories).
- Visualization:
- PCA: To see the global variance.
- t-SNE: To see local groupings (e.g., "Sushi" clusters separately from "Burgers").
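A sketch of the clustering and projection step, assuming `embs` is the (N, D) SigLIP embedding matrix from the comparison above:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# K-Means with one cluster per food category.
kmeans = KMeans(n_clusters=101, n_init=10, random_state=42).fit(embs)

pca_2d = PCA(n_components=2).fit_transform(embs)                     # global variance
tsne_2d = TSNE(n_components=2, random_state=42).fit_transform(embs)  # local structure

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, pts, title in [(axes[0], pca_2d, "PCA"), (axes[1], tsne_2d, "t-SNE")]:
    ax.scatter(pts[:, 0], pts[:, 1], c=kmeans.labels_, s=4, cmap="tab20")
    ax.set_title(title)
plt.show()
```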
## 🚀 Part 4: The Application
The final product is a Gradio web application hosted on Hugging Face Spaces.
- Image-to-Image: Upload a photo (e.g., a burger) -> The app embeds it using SigLIP -> Finds the nearest 3 visual matches.
- Text-to-Image: Type "Spicy Tacos" -> The app finds images matching that description.
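Both modes boil down to embedding the query and ranking by cosine similarity. A hedged sketch, shown here with SigLIP as described above (the deployed Space swaps in CLIP, per the note below); `db` is assumed to be the normalized (N, D) matrix loaded from `food_embeddings.parquet`:

```python
import numpy as np
import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(MODEL_ID)
processor = AutoProcessor.from_pretrained(MODEL_ID)

def embed_image(pil_image):
    inputs = processor(images=pil_image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)[0].numpy()

def embed_text(text):
    # SigLIP's text tower expects max-length padding.
    inputs = processor(text=[text], padding="max_length", return_tensors="pt")
    with torch.no_grad():
        return model.get_text_features(**inputs)[0].numpy()

def top_matches(query_vec, db, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    return np.argsort(db @ q)[::-1][:k]  # indices of the k nearest images
```

For example, `top_matches(embed_text("Spicy Tacos"), db)` would return the indices of the three closest images in the database.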
Note: The deployed application runs the CLIP model even though SigLIP won the comparison; SigLIP was too large to run on the Hugging Face Spaces free tier.
## How to Run Locally
- Clone the repository: `git clone https://huggingface.co/spaces/YOUR_USERNAME/Food-Match` and `cd Food-Match`
- Install dependencies: `pip install -r requirements.txt`
- Run the app: `python app.py`
## 📂 Repository Structure
- `app.py`: Main application logic (Gradio + SigLIP).
- `food_embeddings.parquet`: Pre-computed vector database.
- `requirements.txt`: Python dependencies (includes `sentencepiece`, `protobuf`).
- `README.md`: Project documentation.
## ✍️ Authors
Matan Kriel, Odeya Shmuel
Assignment #3: Embeddings, RecSys, and Spaces




