---
title: CookBook AI
emoji: 🥘
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
---
# Final Project: Data Science Course
### Project Overview
The goal of this project is to develop an app that takes a scanned image of a handwritten recipe as input, extracts its text using a VLM, and, based on the extracted text, suggests 3 similar recipes from a 10K dataset of synthetic recipes.
Our app bridges the gap between analog culinary heritage and digital discovery.
## Part 1: Synthetic Data Generation
### 1. The Strategy: "One-Shot" Prompting with Random Menus
Instead of asking the AI to "invent a recipe" from scratch (which risks repetition), we created a structured system:
* **The "Menu" Generator:** We defined four cuisine profiles (Italian, Mediterranean, Asian Fusion, Dessert) with lists of adjectives, main dishes, and extras.
* **Randomized Titles:** The script randomly combines these words (e.g., *"Spicy Italian Lasagna with Mushrooms"* or *"Golden Dessert Cake Supreme"*) to create 10,000 distinct prompts.
* **One-Shot Example:** Every prompt includes a single "perfect example" (Grandma's Meatballs) to teach the model exactly how to format the output (Title, Ingredients, Instructions).
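The title randomizer and one-shot prompt builder can be sketched roughly like this. The word lists and the example recipe below are illustrative placeholders, not the actual data from our script:

```python
import random

# Hypothetical word lists -- the real script defined four cuisine profiles
# (Italian, Mediterranean, Asian Fusion, Dessert), each with its own menus.
MENUS = {
    "Italian": {
        "adjectives": ["Spicy", "Creamy", "Rustic"],
        "dishes": ["Lasagna", "Risotto", "Gnocchi"],
        "extras": ["with Mushrooms", "with Basil", "Supreme"],
    },
    "Dessert": {
        "adjectives": ["Golden", "Velvety", "Classic"],
        "dishes": ["Cake", "Tart", "Pudding"],
        "extras": ["Supreme", "with Berries", "with Caramel"],
    },
}

# The single "perfect example" shown in every prompt (abridged placeholder).
ONE_SHOT_EXAMPLE = (
    "Title: Grandma's Meatballs\n"
    "Ingredients:\n- 500g ground beef\n- 1 egg\n"
    "Instructions:\n1. Mix everything.\n2. Roll into balls.\n3. Fry.\n"
    "<END_RECIPE>"
)

def random_title(rng: random.Random) -> str:
    """Combine a random adjective, cuisine, dish, and extra into one title."""
    cuisine = rng.choice(sorted(MENUS))
    menu = MENUS[cuisine]
    return (f"{rng.choice(menu['adjectives'])} {cuisine} "
            f"{rng.choice(menu['dishes'])} {rng.choice(menu['extras'])}")

def build_prompt(title: str) -> str:
    """One-shot prompt: show the example, then request the new title."""
    return (f"Here is an example recipe:\n{ONE_SHOT_EXAMPLE}\n\n"
            f"Now write a complete recipe titled '{title}' "
            f"in exactly the same format.")
```

Because the title is fixed before generation, every prompt is guaranteed distinct even when the model's outputs overlap stylistically.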
### 2. The Engine: Model Configuration
* **Model:** We used **Qwen 2.5 3B Instruct**, a model chosen for its balance of speed and logic.
* **Optimization:**
  * **Batch Processing:** We generated recipes in batches of 64 at a time (instead of one by one) to maximize the speed of the A100 GPU.
  * **The "Left-Padding" Fix:** We explicitly set `tokenizer.padding_side = "left"`. This is a critical technical fix for decoder-only models (like GPT/Qwen) to ensure they don't generate gibberish when processing multiple prompts simultaneously.
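A toy illustration of why the padding side matters (pure Python; in the actual pipeline the fix is just the `tokenizer.padding_side = "left"` line):

```python
def pad_batch(token_batches, pad_id, side):
    """Pad variable-length token-id lists to a rectangular batch."""
    width = max(len(toks) for toks in token_batches)
    padded = []
    for toks in token_batches:
        padding = [pad_id] * (width - len(toks))
        padded.append(padding + toks if side == "left" else toks + padding)
    return padded

batch = [[101, 7, 8], [101, 7, 8, 9, 10]]

# Left padding: every row ENDS with real text, so a decoder-only model's
# next-token prediction continues from the prompt for all rows.
left = pad_batch(batch, pad_id=0, side="left")

# Right padding: the short row ends in pad tokens, so generation would
# continue from padding -- the source of the "gibberish" outputs.
right = pad_batch(batch, pad_id=0, side="right")
```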
### 3. Robust Workflow
To manage the heavy computational load, the script features a **resume function** that detects existing progress (`RecipeData_10K.jsonl`) to avoid restarting if interrupted. It also includes an automated **OOM (Out of Memory) handler** that dynamically reduces the batch size (from 64 down to 8) if the GPU overloads, preventing crashes.
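The resume and OOM logic can be sketched as two small helpers (function names are illustrative; the real handler wraps the generation call and catches the GPU out-of-memory error):

```python
import os

def completed_count(path: str) -> int:
    """Resume support: count recipes already saved in the .jsonl file."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

def next_batch_size(current: int, floor: int = 8) -> int:
    """OOM handler: halve the batch size, never dropping below the floor."""
    return max(floor, current // 2)

# Sketch of the generation loop's recovery path:
#   try:
#       generate(batch)                            # the heavy GPU call
#   except torch.cuda.OutOfMemoryError:
#       torch.cuda.empty_cache()
#       batch_size = next_batch_size(batch_size)   # 64 -> 32 -> 16 -> 8
```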
### 4. Regex Parsing
We implemented a custom Regex-based parser to instantly structure the raw AI output. By targeting specific delimiters (`Title:`, `Ingredients:`, `<END_RECIPE>`), the parser splits the unstructured text block into three clean columns: **Title**, **Ingredients**, and **Instructions**.
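A regex along these lines does the splitting. The exact pattern in our script may differ; this is a working sketch against the delimiters named above:

```python
import re

# One recipe pattern keyed on the generation delimiters.
PATTERN = re.compile(
    r"Title:\s*(?P<Title>.*?)\s*"
    r"Ingredients:\s*(?P<Ingredients>.*?)\s*"
    r"Instructions:\s*(?P<Instructions>.*?)\s*<END_RECIPE>",
    re.DOTALL,
)

def parse_recipe(raw: str):
    """Split one raw generation into the three columns; None if malformed."""
    match = PATTERN.search(raw)
    return match.groupdict() if match else None

# Hypothetical raw output in the one-shot format:
raw = ("Title: Spicy Italian Lasagna with Mushrooms\n"
       "Ingredients:\n- 200g pasta sheets\n- 300g mushrooms\n"
       "Instructions:\n1. Layer.\n2. Bake.\n<END_RECIPE>")
row = parse_recipe(raw)
```

Returning `None` for malformed outputs lets the pipeline drop bad generations instead of crashing mid-run.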
### 5. "Crash-Proof" Saving
The pipeline uses a **streaming save strategy**: every batch is immediately appended to a `.jsonl` file to prevent data loss during long runs. Once the target of 10,000 recipes is reached, this file is converted into a CSV format for final analysis and cleaning.
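The streaming save and final conversion can be sketched as follows (function names are illustrative; the column names match our final CSV):

```python
import csv
import json

def append_batch(jsonl_path, rows):
    """Append each parsed recipe immediately; a crash loses at most one batch."""
    with open(jsonl_path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def jsonl_to_csv(jsonl_path, csv_path,
                 fields=("Title", "Ingredients", "Instructions")):
    """Once the 10K target is hit, convert the .jsonl into a clean CSV."""
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(fields),
                                extrasaction="ignore")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))
```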
Eventually we generated 10K recipes and obtained a CSV containing the columns Title, Ingredients, Instructions, and Raw Data.

---
## Part 2: Exploratory Data Analysis (EDA)
The dataset and EDA are available at: https://huggingface.co/datasets/Liori25/10k_recipes
---
# Part 3: Embeddings
We selected three distinct Transformer models to evaluate the trade-off between semantic understanding and computational efficiency for our recipe recommendation engine:
* **sentence-transformers/all-MiniLM-L6-v2** (The Baseline): Chosen for its extreme speed and compact size (80MB). It represents the industry standard for lightweight CPU-based inference, serving as our baseline for "maximum efficiency."
* **sentence-transformers/all-mpnet-base-v2** (The Quality Benchmark): Chosen as the high-accuracy anchor. While significantly larger (420MB) and slower, it consistently ranks highest on semantic search benchmarks, allowing us to measure how much quality we might sacrifice for speed.
* **BAAI/bge-small-en-v1.5** (The Contender): Chosen as a potential "best of both worlds" solution. This newer model uses advanced pre-training techniques to achieve accuracy comparable to MPNet while maintaining a small footprint (133MB) similar to MiniLM, making it a strong candidate for optimal performance.
## Semantic Search & Model Selection
### 1. Understanding the Similarity Score
To retrieve the best-matching recipes, the system uses vector embeddings. The process works as follows:
1. **Vectorization:** The model transforms the user's search text (e.g., *"creamy italian rice"*) into a numerical array, which we define as **Vector A**.
2. **Dataset Mapping:** It performs the same operation on the recipe database (e.g., *"Mushroom Risotto"*), creating a **Vector B** for each entry.
3. **Comparison:** The function compares the single query vector against 2,000 recipe vectors simultaneously.
4. **Scoring:** It returns an array of similarity scores, such as `[0.12, 0.05, 0.88, 0.15, ...]`.
5. **Retrieval:** The algorithm identifies the highest value (e.g., `0.88`) and retrieves the corresponding recipe as the "winner."
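The five steps above condense into one scoring pass. The 3-dimensional vectors below are toys (real sentence embeddings have hundreds of dimensions), but the cosine-similarity logic is the same:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_scores(query_vec, recipe_vecs):
    """Steps 3-4: score the single query against every recipe vector."""
    return [cosine(query_vec, r) for r in recipe_vecs]

query = [0.9, 0.1, 0.0]    # Vector A, e.g. "creamy italian rice"
recipes = [
    [0.1, 0.9, 0.0],       # unrelated recipe
    [0.0, 0.1, 0.9],       # unrelated recipe
    [0.8, 0.2, 0.1],       # e.g. "Mushroom Risotto"
]
scores = cosine_scores(query, recipes)                     # step 4
winner = max(range(len(scores)), key=scores.__getitem__)   # step 5
```

In production this loop is replaced by a single vectorized matrix multiplication over the precomputed embedding matrix, which is why comparing thousands of recipes at once stays fast.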
### 2. Embedding Model Selection
We selected **BAAI/bge-small-en-v1.5** as the optimal embedding model for our recipe dataset.

* **Performance:** Crucially, it achieved the **highest similarity score** in our evaluation, demonstrating superior semantic understanding compared to the faster but less accurate `all-MiniLM-L6-v2`.
* **Efficiency:** It matched the precision of the resource-heavy `all-mpnet-base-v2` (which requires 420 MB) while maintaining a significantly lighter footprint.
* **Conclusion:** This specific balance allows our system to deliver the most relevant recipe recommendations without compromising on computational efficiency.
# Part 4: IO Pipeline
On our first try, we used an OCR model (TrOCRProcessor). It wasn't successful: the model couldn't read some handwritten recipes and sometimes even hallucinated. We then decided to try the Qwen2.5-VL Vision-Language Model, and the results were much better!

Comparison:
## Challenges
- **Synthetic Data:** Creating a synthetic dataset that accurately mimicked real-world distributions was difficult. We had to carefully tune the generation parameters to ensure the data was diverse and reliable for training.
- **Hugging Face Deployment & Optimization:**
  - **The Issue:** We initially struggled to configure the Hugging Face environment; model execution was too slow for a real-time application.
  - **The Solution:** After investigating various deployment strategies, we shifted to the Hugging Face Inference Client. This transition optimized our inference pipeline, allowing the model to run within acceptable time limits and ensuring a smooth user experience.
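A rough sketch of how a VLM call looks through the `huggingface_hub` Inference Client. The model id, image URL, and prompt text are illustrative, and the live call needs an HF token, so it is shown commented out:

```python
def build_messages(image_url: str, instruction: str):
    """Chat payload pairing the scanned image with a transcription request."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": instruction},
        ],
    }]

messages = build_messages(
    "https://example.com/recipe_scan.png",  # hypothetical scan URL
    "Transcribe this handwritten recipe exactly as written.",
)

# Sketch of the live call (requires huggingface_hub and an HF token):
# from huggingface_hub import InferenceClient
# client = InferenceClient(model="Qwen/Qwen2.5-VL-7B-Instruct")
# reply = client.chat_completion(messages=messages, max_tokens=1024)
# extracted_text = reply.choices[0].message.content
```

Offloading inference to hosted hardware is what keeps the Space responsive; the app itself only builds the payload and parses the reply.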
The App User Flow:

<img src="https://cdn-uploads.huggingface.co/production/uploads/6910977ace661438b728d763/j56UCBK1cLXDcQZpE3UcN.png" width="600" alt="Process Flow Diagram">