---
title: CookBook AI
emoji: 🥘
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
---
# Final Project: Data Science Course
### Project Overview
The goal of this project is to develop an app that takes a scanned image of a handwritten recipe as input, extracts its text using a VLM, and, based on the extracted text, suggests 3 similar recipes from a 10K dataset of synthetic recipes.
Our app bridges the gap between analog culinary heritage and digital discovery.
## Part 1: Synthetic Data Generation
### 1. The Strategy: "One-Shot" Prompting with Random Menus
Instead of asking the AI to "invent a recipe" from scratch (which risks repetition), we created a structured system:
* **The "Menu" Generator:** We defined four cuisine profiles (Italian, Mediterranean, Asian Fusion, Dessert) with lists of adjectives, main dishes, and extras.
* **Randomized Titles:** The script randomly combines these words (e.g., *"Spicy Italian Lasagna with Mushrooms"* or *"Golden Dessert Cake Supreme"*) to create 10,000 distinct prompts.
* **One-Shot Example:** Every prompt includes a single "perfect example" (Grandma's Meatballs) to teach the model exactly how to format the output (Title, Ingredients, Instructions).
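The title randomizer and one-shot prompt builder can be sketched roughly like this. The word lists and the example recipe below are illustrative placeholders, not the actual data from our script:

```python
import random

# Hypothetical word lists -- the real script defined four cuisine profiles
# (Italian, Mediterranean, Asian Fusion, Dessert), each with its own menus.
MENUS = {
    "Italian": {
        "adjectives": ["Spicy", "Creamy", "Rustic"],
        "dishes": ["Lasagna", "Risotto", "Gnocchi"],
        "extras": ["with Mushrooms", "with Basil", "Supreme"],
    },
    "Dessert": {
        "adjectives": ["Golden", "Velvety", "Classic"],
        "dishes": ["Cake", "Tart", "Pudding"],
        "extras": ["Supreme", "with Berries", "with Caramel"],
    },
}

# The single "perfect example" shown in every prompt (abridged placeholder).
ONE_SHOT_EXAMPLE = (
    "Title: Grandma's Meatballs\n"
    "Ingredients:\n- 500g ground beef\n- 1 egg\n"
    "Instructions:\n1. Mix everything.\n2. Roll into balls.\n3. Fry.\n"
    "<END_RECIPE>"
)

def random_title(rng: random.Random) -> str:
    """Combine a random adjective, cuisine, dish, and extra into one title."""
    cuisine = rng.choice(sorted(MENUS))
    menu = MENUS[cuisine]
    return (f"{rng.choice(menu['adjectives'])} {cuisine} "
            f"{rng.choice(menu['dishes'])} {rng.choice(menu['extras'])}")

def build_prompt(title: str) -> str:
    """One-shot prompt: show the example, then request the new title."""
    return (f"Here is an example recipe:\n{ONE_SHOT_EXAMPLE}\n\n"
            f"Now write a complete recipe titled '{title}' "
            f"in exactly the same format.")
```

Because the title is fixed before generation, every prompt is guaranteed distinct even when the model's outputs overlap stylistically.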
### 2. The Engine: Model Configuration
* **Model:** We used **Qwen 2.5 3B Instruct**, a model chosen for its balance of speed and logic.
* **Optimization:**
  * **Batch Processing:** We generated recipes in batches of 64 at a time (instead of one by one) to maximize the speed of the A100 GPU.
  * **The "Left-Padding" Fix:** We explicitly set `tokenizer.padding_side = "left"`. This is a critical technical fix for decoder-only models (like GPT/Qwen) to ensure they don't generate gibberish when processing multiple prompts simultaneously.
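A toy illustration of why the padding side matters (pure Python; in the actual pipeline the fix is just the `tokenizer.padding_side = "left"` line):

```python
def pad_batch(token_batches, pad_id, side):
    """Pad variable-length token-id lists to a rectangular batch."""
    width = max(len(toks) for toks in token_batches)
    padded = []
    for toks in token_batches:
        padding = [pad_id] * (width - len(toks))
        padded.append(padding + toks if side == "left" else toks + padding)
    return padded

batch = [[101, 7, 8], [101, 7, 8, 9, 10]]

# Left padding: every row ENDS with real text, so a decoder-only model's
# next-token prediction continues from the prompt for all rows.
left = pad_batch(batch, pad_id=0, side="left")

# Right padding: the short row ends in pad tokens, so generation would
# continue from padding -- the source of the "gibberish" outputs.
right = pad_batch(batch, pad_id=0, side="right")
```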
### 3. Robust Workflow
To manage the heavy computational load, the script features a **resume function** that detects existing progress (`RecipeData_10K.jsonl`) to avoid restarting if interrupted. It also includes an automated **OOM (Out of Memory) handler** that dynamically reduces the batch size (from 64 down to 8) if the GPU overloads, preventing crashes.
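The resume and OOM logic can be sketched as two small helpers (function names are illustrative; the real handler wraps the generation call and catches the GPU out-of-memory error):

```python
import os

def completed_count(path: str) -> int:
    """Resume support: count recipes already saved in the .jsonl file."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

def next_batch_size(current: int, floor: int = 8) -> int:
    """OOM handler: halve the batch size, never dropping below the floor."""
    return max(floor, current // 2)

# Sketch of the generation loop's recovery path:
#   try:
#       generate(batch)                            # the heavy GPU call
#   except torch.cuda.OutOfMemoryError:
#       torch.cuda.empty_cache()
#       batch_size = next_batch_size(batch_size)   # 64 -> 32 -> 16 -> 8
```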
### 4. Regex Parsing
We implemented a custom Regex-based parser to instantly structure the raw AI output. By targeting specific delimiters (`Title:`, `Ingredients:`, `<END_RECIPE>`), the parser splits the unstructured text block into three clean columns: **Title**, **Ingredients**, and **Instructions**.
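A regex along these lines does the splitting. The exact pattern in our script may differ; this is a working sketch against the delimiters named above:

```python
import re

# One recipe pattern keyed on the generation delimiters.
PATTERN = re.compile(
    r"Title:\s*(?P<Title>.*?)\s*"
    r"Ingredients:\s*(?P<Ingredients>.*?)\s*"
    r"Instructions:\s*(?P<Instructions>.*?)\s*<END_RECIPE>",
    re.DOTALL,
)

def parse_recipe(raw: str):
    """Split one raw generation into the three columns; None if malformed."""
    match = PATTERN.search(raw)
    return match.groupdict() if match else None

# Hypothetical raw output in the one-shot format:
raw = ("Title: Spicy Italian Lasagna with Mushrooms\n"
       "Ingredients:\n- 200g pasta sheets\n- 300g mushrooms\n"
       "Instructions:\n1. Layer.\n2. Bake.\n<END_RECIPE>")
row = parse_recipe(raw)
```

Returning `None` for malformed outputs lets the pipeline drop bad generations instead of crashing mid-run.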
### 5. "Crash-Proof" Saving
The pipeline uses a **streaming save strategy**: every batch is immediately appended to a `.jsonl` file to prevent data loss during long runs. Once the target of 10,000 recipes is reached, this file is converted into a CSV format for final analysis and cleaning.
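The streaming save and final conversion can be sketched as follows (function names are illustrative; the column names match our final CSV):

```python
import csv
import json

def append_batch(jsonl_path, rows):
    """Append each parsed recipe immediately; a crash loses at most one batch."""
    with open(jsonl_path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def jsonl_to_csv(jsonl_path, csv_path,
                 fields=("Title", "Ingredients", "Instructions")):
    """Once the 10K target is hit, convert the .jsonl into a clean CSV."""
    with open(jsonl_path, encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(fields),
                                extrasaction="ignore")
        writer.writeheader()
        for line in src:
            if line.strip():
                writer.writerow(json.loads(line))
```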
Eventually we generated 10K recipes and obtained a CSV containing the columns Title, Ingredients, Instructions, and Raw Data.

---
## Part 2: Exploratory Data Analysis (EDA)
The dataset and EDA are available at: https://huggingface.co/datasets/Liori25/10k_recipes
---
# Part 3: Embeddings
We selected three distinct Transformer models to evaluate the trade-off between semantic understanding and computational efficiency for our recipe recommendation engine:
* **sentence-transformers/all-MiniLM-L6-v2** (The Baseline): Chosen for its extreme speed and compact size (80MB). It represents the industry standard for lightweight CPU-based inference, serving as our baseline for "maximum efficiency."
* **sentence-transformers/all-mpnet-base-v2** (The Quality Benchmark): Chosen as the high-accuracy anchor. While significantly larger (420MB) and slower, it consistently ranks highest on semantic search benchmarks, allowing us to measure how much quality we might sacrifice for speed.
* **BAAI/bge-small-en-v1.5** (The Contender): Chosen as a potential "best of both worlds" solution. This newer model uses advanced pre-training techniques to achieve accuracy comparable to MPNet while maintaining a small footprint (133MB) similar to MiniLM, making it a strong candidate for optimal performance.
## Semantic Search & Model Selection
### 1. Understanding the Similarity Score
To retrieve the best-matching recipes, the system uses vector embeddings. The process works as follows:
1. **Vectorization:** The model transforms the user's search text (e.g., *"creamy italian rice"*) into a numerical array, which we define as **Vector A**.
2. **Dataset Mapping:** It performs the same operation on the recipe database (e.g., *"Mushroom Risotto"*), creating a **Vector B** for each entry.
3. **Comparison:** The function compares the single query vector against 2,000 recipe vectors simultaneously.
4. **Scoring:** It returns an array of similarity scores, such as `[0.12, 0.05, 0.88, 0.15, ...]`.
5. **Retrieval:** The algorithm identifies the highest value (e.g., `0.88`) and retrieves the corresponding recipe as the "winner."
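The five steps above condense into one scoring pass. The 3-dimensional vectors below are toys (real sentence embeddings have hundreds of dimensions), but the cosine-similarity logic is the same:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_scores(query_vec, recipe_vecs):
    """Steps 3-4: score the single query against every recipe vector."""
    return [cosine(query_vec, r) for r in recipe_vecs]

query = [0.9, 0.1, 0.0]    # Vector A, e.g. "creamy italian rice"
recipes = [
    [0.1, 0.9, 0.0],       # unrelated recipe
    [0.0, 0.1, 0.9],       # unrelated recipe
    [0.8, 0.2, 0.1],       # e.g. "Mushroom Risotto"
]
scores = cosine_scores(query, recipes)                     # step 4
winner = max(range(len(scores)), key=scores.__getitem__)   # step 5
```

In production this loop is replaced by a single vectorized matrix multiplication over the precomputed embedding matrix, which is why comparing thousands of recipes at once stays fast.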
### 2. Embedding Model Selection
We selected **BAAI/bge-small-en-v1.5** as the optimal embedding model for our recipe dataset.

* **Performance:** Crucially, it achieved the **highest similarity score** in our evaluation, demonstrating superior semantic understanding compared to the faster but less accurate `all-MiniLM-L6-v2`.
* **Efficiency:** It matched the precision of the resource-heavy `all-mpnet-base-v2` (which requires 420 MB) while maintaining a significantly lighter footprint.
* **Conclusion:** This specific balance allows our system to deliver the most relevant recipe recommendations without compromising on computational efficiency.
# Part 4: IO Pipeline
On our first try, we used an OCR model (TrOCRProcessor). It wasn't successful: the model couldn't read some handwritten recipes and sometimes even hallucinated. We then decided to try the Qwen2.5-VL Vision-Language Model, and the results were much better!

Comparison:
## Challenges
- **Synthetic Data:** Creating a synthetic dataset that accurately mimicked real-world distributions was difficult. We had to carefully tune the generation parameters to ensure the data was diverse and reliable for training.
- **Hugging Face Deployment & Optimization:**
  - **The Issue:** We initially struggled to configure the Hugging Face environment; model execution was too slow for a real-time application.
  - **The Solution:** After investigating various deployment strategies, we shifted to the Hugging Face Inference Client. This transition optimized our inference pipeline, allowing the model to run within acceptable time limits and ensuring a smooth user experience.
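A rough sketch of how a VLM call looks through the `huggingface_hub` Inference Client. The model id, image URL, and prompt text are illustrative, and the live call needs an HF token, so it is shown commented out:

```python
def build_messages(image_url: str, instruction: str):
    """Chat payload pairing the scanned image with a transcription request."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": instruction},
        ],
    }]

messages = build_messages(
    "https://example.com/recipe_scan.png",  # hypothetical scan URL
    "Transcribe this handwritten recipe exactly as written.",
)

# Sketch of the live call (requires huggingface_hub and an HF token):
# from huggingface_hub import InferenceClient
# client = InferenceClient(model="Qwen/Qwen2.5-VL-7B-Instruct")
# reply = client.chat_completion(messages=messages, max_tokens=1024)
# extracted_text = reply.choices[0].message.content
```

Offloading inference to hosted hardware is what keeps the Space responsive; the app itself only builds the payload and parses the reply.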
The App User Flow:

<img src="https://cdn-uploads.huggingface.co/production/uploads/6910977ace661438b728d763/j56UCBK1cLXDcQZpE3UcN.png" width="600" alt="Process Flow Diagram">