Spaces: Runtime error
Matan Kriel committed
Commit · b6d47fe
Parent(s): 13f1929
added files
Browse files:
- README.md +112 -6
- app.py +113 -0
- food_embeddings.parquet +3 -0
- requirements.txt +8 -0
README.md
CHANGED
@@ -1,12 +1,118 @@
 ---
-title: Food
-emoji:
-colorFrom:
-colorTo:
+title: Food Match
+emoji: 🌍
+colorFrom: pink
+colorTo: green
 sdk: gradio
-sdk_version: 6.
+sdk_version: 6.1.0
 app_file: app.py
 pinned: false
----
-
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+license: mit
+short_description: Trained model to detect and recommend similar foods.
+---
+
+# 🍔 Visual Dish Matcher AI
+
+**A computer vision app that suggests dishes based on visual and text similarity.**
+
+## 🎯 Project Overview
+This project explores the power of **vector embeddings** for building recommendation systems. Unlike traditional filters (e.g., "Show me Italian food"), the app uses **OpenAI's CLIP model** to "see" the food: it converts images into mathematical vectors and finds matches based on visual content (texture, color, shape, and ingredients).
+
+**Live Demo:** [Click the "App" tab above to view]
+
+---
+
+## 🛠️ Tech Stack
+* **Model:** OpenAI CLIP (`clip-vit-base-patch32`)
+* **Frameworks:** PyTorch, Transformers, Datasets (Hugging Face)
+* **Interface:** Gradio
+* **Data Storage:** Parquet (via Git LFS)
+* **Visualization:** Matplotlib, Seaborn, scikit-learn (t-SNE/PCA)
+
+---
+
+## 📊 Dataset: Food-101
+
+The [Food-101 dataset (ETH Zurich)](https://huggingface.co/datasets/ethz/food101) is a popular benchmark for fine-grained image classification. Unlike "clean" studio datasets, Food-101 contains real-world images taken under varied lighting, angles, and noise levels, making it highly representative of the photos users typically upload to social media or food apps.
+
+Key features:
+
+* **101 categories:** Covers a wide range of international dishes, including sushi, pizza, hamburger, pad thai, baklava, and chocolate mousse.
+* **"In the wild" data:** Images are not perfectly centered or lit; they contain background noise (plates, cutlery, restaurant tables), challenging the model to focus on the food itself.
+* **Project subset:** To keep the assignment computationally efficient, a randomized stratified subset of 5,000 images was selected from the training split.
+
+Data structure:
+
+* **Input:** RGB images (various aspect ratios, resized during processing).
+* **Labels:** 101 unique integer IDs mapped to human-readable class names.
+
+---
+
+## 📊 Part 1: Data Exploration (EDA)
+
+### 1. Data Cleaning
+Before embedding, the dataset underwent cleaning:
+* **Format correction:** Converted grayscale images to RGB to ensure compatibility with the CLIP model.
+* **Outlier detection:** Analyzed image brightness and aspect ratios to identify and flag low-quality or distorted images (e.g., pitch-black photos or extreme panoramas).
+
+### 2. Image Distribution
+We verified the class balance to ensure the model wasn't biased toward specific categories.
+
+
+
+### 3. Dimensionality Analysis
+We analyzed width vs. height across the dataset to verify that most images were standard sizes suitable for resizing.
+
+
+
+### 4. Outlier Detection
+
+
+
+---
+
+## 🧠 Part 2: Embeddings & Clustering
+The core of the visual matcher is the embedding space: we generated a 512-dimensional vector for every image in the training subset.
+
+### Clustering Analysis
+Using **K-Means**, we grouped these vectors to see whether the model could discover food categories on its own, without ever seeing the labels.
+* **Algorithm:** K-Means (k=50)
+* **Dimensionality reduction:** t-SNE (to visualize the 512-D vectors in 2D)
+
+**Key insight:** The model successfully grouped foods by visual properties. For example, "red/orange" foods (pizza, lasagna) formed distinct clusters separate from "green" foods (salads, guacamole).
+
+
+
+---
+
+## 🚀 Part 3: The Application
+The final product is a **Gradio** web application hosted on Hugging Face Spaces. It supports two modes of interaction:
+
+1. **Image-to-Image:** The user uploads a photo (e.g., a burger). The app embeds the upload and computes **cosine similarity** against the database to find the nearest visual neighbors.
+2. **Text-to-Image:** The user types a description (e.g., "Spicy Tacos"). The app uses CLIP's text encoder to find images that match the semantic meaning of the text.
+
+---
+
+## 📂 Repository Structure
+* `app.py`: Main application logic and Gradio interface.
+* `food_embeddings.parquet`: Pre-computed vector database (stored via Git LFS).
+* `requirements.txt`: Python dependencies.
+* `README.md`: Project documentation.
+
+---
+
+## ✍️ Author
+**Matan Kriel**
+*Assignment #3: Embeddings, RecSys, and Spaces*
+
+---
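The Part 2 clustering described above (K-Means with k=50 over 512-dimensional CLIP vectors, t-SNE to project them into 2D) can be sketched as follows. This is a minimal illustration, not the project's notebook: the random unit vectors stand in for the real CLIP embeddings, and the names `cluster_ids` and `coords_2d` are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in for the real 5000x512 CLIP embedding matrix
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(300, 512)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Group the vectors without ever showing the model a label
kmeans = KMeans(n_clusters=50, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(embeddings)

# Project 512-D -> 2-D so the clusters can be plotted
coords_2d = TSNE(n_components=2, random_state=42, perplexity=30).fit_transform(embeddings)
```

Coloring the `coords_2d` scatter plot by `cluster_ids` (or by the true labels) is what produces a figure like the t-SNE plot referenced above.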
app.py
ADDED
@@ -0,0 +1,113 @@
+import gradio as gr
+import torch
+import pandas as pd
+import numpy as np
+from PIL import Image
+from transformers import CLIPProcessor, CLIPModel
+from datasets import load_dataset
+from torch.nn import functional as F
+
+# --- 1. SETUP & CONFIG ---
+MODEL_ID = "openai/clip-vit-base-patch32"
+DATA_FILE = "food_embeddings.parquet"
+
+print("⏳ Starting App... Loading Model...")
+# Load model (CPU is fine for inference on single images)
+model = CLIPModel.from_pretrained(MODEL_ID)
+processor = CLIPProcessor.from_pretrained(MODEL_ID)
+
+# --- 2. LOAD DATA (must match the Colab logic EXACTLY) ---
+print("⏳ Loading Dataset (this takes a moment)...")
+# Load the same 5,000 images with the same seed so indices match the parquet file
+dataset = load_dataset("ethz/food101", split="train").shuffle(seed=42).select(range(5000))
+
+# --- 3. LOAD EMBEDDINGS ---
+print("⏳ Loading Pre-computed Embeddings...")
+df = pd.read_parquet(DATA_FILE)
+# Convert the per-row lists in the parquet back to a single torch tensor
+db_features = torch.tensor(np.stack(df['embedding'].to_numpy()))
+# Normalize once so later dot products are cosine similarities
+db_features = F.normalize(db_features, p=2, dim=1)
+
+print("✅ App Ready!")
+
+# --- 4. CORE SEARCH LOGIC ---
+def find_best_matches(query_features, top_k=3):
+    # Normalize the query
+    query_features = F.normalize(query_features, p=2, dim=1)
+
+    # Similarity via dot product:
+    # query (1x512) @ db.T (512x5000) = scores (1x5000)
+    similarity = torch.mm(query_features, db_features.T)
+
+    # Get top k
+    scores, indices = torch.topk(similarity, k=top_k)
+
+    results = []
+    for idx, score in zip(indices[0], scores[0]):
+        idx = idx.item()
+        # Grab the image from the dataset and the label from our dataframe
+        img = dataset[idx]['image']
+        label = df.iloc[idx]['label_name']
+        # Format output for the gallery
+        results.append((img, f"{label} ({score:.2f})"))
+    return results
+
+# --- 5. GRADIO FUNCTIONS ---
+def search_by_image(input_image):
+    if input_image is None:
+        return []
+    inputs = processor(images=input_image, return_tensors="pt")
+    with torch.no_grad():
+        features = model.get_image_features(**inputs)
+    return find_best_matches(features)
+
+def search_by_text(input_text):
+    if not input_text:
+        return []
+    inputs = processor(text=[input_text], return_tensors="pt", padding=True)
+    with torch.no_grad():
+        features = model.get_text_features(**inputs)
+    return find_best_matches(features)
+
+# --- 6. BUILD UI ---
+with gr.Blocks(title="Food Matcher AI") as demo:
+    gr.Markdown("# 🍔 Visual Dish Matcher")
+    gr.Markdown("Upload a photo of food (or describe it) to find similar dishes in our database.")
+
+    # --- VIDEO SECTION ---
+    # Accordion keeps the UI uncluttered; open=False means it starts collapsed.
+    with gr.Accordion("📺 Watch Project Demo", open=False):
+        # Note: iframes need the /embed/ URL form; a /watch URL will not render
+        gr.HTML("""
+        <div style="display: flex; justify-content: center;">
+            <iframe width="560" height="315"
+                src="https://www.youtube.com/embed/Al665qltkDg?start=4"
+                title="YouTube video player"
+                frameborder="0"
+                allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
+                allowfullscreen>
+            </iframe>
+        </div>
+        """)
+
+    with gr.Tab("Image Search"):
+        with gr.Row():
+            img_input = gr.Image(type="pil", label="Upload Food Image")
+            img_gallery = gr.Gallery(label="Top Matches")
+        btn_img = gr.Button("Find Similar Dishes")
+        btn_img.click(search_by_image, inputs=img_input, outputs=img_gallery)
+
+    with gr.Tab("Text Search"):
+        with gr.Row():
+            txt_input = gr.Textbox(label="Describe the food (e.g., 'Spicy Tacos')")
+            txt_gallery = gr.Gallery(label="Top Matches")
+        btn_txt = gr.Button("Search by Description")
+        btn_txt.click(search_by_text, inputs=txt_input, outputs=txt_gallery)
+
+# Launch (disable SSR for stability)
+demo.launch(ssr_mode=False)
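The dot product in `find_best_matches` only yields cosine similarity because both the query and the database rows are L2-normalized first. A small NumPy sketch of that equivalence (the vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Classic definition: dot product over the product of magnitudes
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.3, 2.0, 1.0])
b = np.array([1.5, 0.4, 2.2])

# Normalize once up front, then a plain dot product gives the same score
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
assert np.isclose(a_unit @ b_unit, cosine_similarity(a, b))
```

This is why the app normalizes `db_features` once at startup: each search then reduces to a single matrix multiply instead of recomputing norms per query.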
food_embeddings.parquet
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:87bbf48a556dd11a473e2e168ef6f94cd9bc150ccabbcea9dda084b2cb9ca3b9
+size 8828792
requirements.txt
ADDED
@@ -0,0 +1,8 @@
+gradio
+torch
+transformers
+pandas
+numpy
+datasets
+pyarrow
+scikit-learn