---
title: Legacy Kitchen
emoji: 👵
colorFrom: yellow
colorTo: red
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
pinned: false
---

# Final Project: Data Science Course

## Project Overview

This project involves building an AI-powered application that digitizes handwritten recipes from images using Optical Character Recognition (OCR) and Natural Language Processing (NLP). By generating vector embeddings of the extracted text, the system identifies and retrieves three semantically similar recipes from a synthetically generated dataset of 10,000 entries. The final solution is deployed as an interactive web interface on Hugging Face Spaces, bridging the gap between physical archives and digital accessibility.

## Part 1: Synthetic Data Generation

### 1. The Strategy: "One-Shot" Prompting with Random Menus

Instead of asking the AI to "invent a recipe" from scratch (which risks repetition), we created a structured system:

- **The "Menu" Generator:** We defined four cuisine profiles (Italian, Mediterranean, Asian Fusion, Dessert), each with lists of adjectives, main dishes, and extras.
- **Randomized Titles:** The script randomly combines these words (e.g., "Spicy Italian Lasagna with Mushrooms" or "Golden Dessert Cake Supreme") to create 10,000 distinct prompts.
- **One-Shot Example:** Every prompt includes a single "perfect example" (Grandma's Meatballs) to teach the model exactly how to format its output (Title, Ingredients, Instructions).
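The menu-driven prompt construction above can be sketched as follows. The menu contents, `random_title`, and `build_prompt` are illustrative stand-ins, not the project's actual code:

```python
import random

# Hypothetical cuisine "menus" -- the real lists are larger; these entries are illustrative.
MENUS = {
    "Italian": {"adjectives": ["Rustic", "Spicy"], "dishes": ["Lasagna", "Risotto"], "extras": ["with Mushrooms", "al Forno"]},
    "Dessert": {"adjectives": ["Golden", "Velvety"], "dishes": ["Cake", "Tart"], "extras": ["Supreme", "with Berries"]},
}

# The single "perfect example" included in every prompt (one-shot prompting).
ONE_SHOT_EXAMPLE = (
    "Title: Grandma's Meatballs\n"
    "Ingredients: ...\n"
    "Instructions: ...\n"
    "<END_RECIPE>"
)

def random_title(rng: random.Random) -> str:
    """Combine a random adjective, cuisine, dish, and extra into a title."""
    cuisine = rng.choice(list(MENUS))
    menu = MENUS[cuisine]
    return f"{rng.choice(menu['adjectives'])} {cuisine} {rng.choice(menu['dishes'])} {rng.choice(menu['extras'])}"

def build_prompt(title: str) -> str:
    """Prepend the one-shot example so the model imitates its format."""
    return (
        "You are a recipe writer. Follow the format of this example exactly:\n\n"
        f"{ONE_SHOT_EXAMPLE}\n\n"
        f"Now write a complete recipe titled: {title}"
    )

rng = random.Random(42)
prompts = [build_prompt(random_title(rng)) for _ in range(3)]
```

Seeding the generator makes the title stream reproducible while still spreading prompts across the menu combinations.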

### 2. The Engine: Model Configuration

- **Model:** We used Qwen 2.5 3B Instruct, chosen for its balance of speed and reasoning ability.
- **Optimization:**
  - **Batch Processing:** We generated recipes in batches of 64 (instead of one at a time) to maximize A100 GPU utilization.
  - **The "Left-Padding" Fix:** We explicitly set `tokenizer.padding_side = "left"`. This is a critical fix for decoder-only models (such as GPT and Qwen): without it, they generate gibberish when processing multiple prompts simultaneously.
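Why left padding matters can be shown with a toy example. The real fix is the single line `tokenizer.padding_side = "left"` on the Hugging Face tokenizer; the `pad_batch` helper below is purely illustrative:

```python
PAD = "<pad>"

def pad_batch(token_lists, side):
    """Pad a batch of token lists to equal length on the given side."""
    width = max(len(toks) for toks in token_lists)
    padded = []
    for toks in token_lists:
        pads = [PAD] * (width - len(toks))
        padded.append(pads + toks if side == "left" else toks + pads)
    return padded

batch = [["Write", "a", "recipe"], ["Write", "a", "dessert", "recipe", "now"]]
left = pad_batch(batch, "left")
# Left padding puts every prompt's final real token in the last position,
# exactly where a decoder-only model appends its next token. With right
# padding, short prompts would end in <pad> tokens and the model would
# continue generating from padding instead of from the prompt.
```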

### 3. Robust Workflow

To manage the heavy computational load, the script features a resume function that detects existing progress (`RecipeData_10K.jsonl`) and avoids restarting from scratch if interrupted. It also includes an automated OOM (Out of Memory) handler that dynamically reduces the batch size (from 64 down to 8) if the GPU overloads, preventing crashes.
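The OOM fallback can be sketched like this. Here `generate_batch` is a stand-in that simulates a CUDA out-of-memory error with Python's `MemoryError` (the real handler would catch `torch.cuda.OutOfMemoryError`), and the batch-halving schedule is an assumption about the implementation:

```python
def generate_batch(prompts):
    # Stand-in for the real GPU generation call; raises MemoryError for
    # oversized batches to simulate running out of GPU memory.
    if len(prompts) > 16:
        raise MemoryError("simulated CUDA OOM")
    return [f"recipe for: {p}" for p in prompts]

def generate_with_oom_fallback(prompts, batch_size=64, min_batch_size=8):
    """Halve the batch size on OOM until a floor is reached, then give up."""
    results = []
    i = 0
    while i < len(prompts):
        chunk = prompts[i:i + batch_size]
        try:
            results.extend(generate_batch(chunk))
            i += len(chunk)  # advance only after a successful batch
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # nothing left to shrink
            batch_size = max(batch_size // 2, min_batch_size)
    return results, batch_size

outputs, final_batch_size = generate_with_oom_fallback([f"p{i}" for i in range(40)])
```

Because the index only advances after a successful batch, no prompt is skipped when the batch size shrinks mid-run.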

### 4. Regex Parsing

We implemented a custom Regex-based parser to instantly structure the raw AI output. By targeting specific delimiters (`Title:`, `Ingredients:`, `<END_RECIPE>`), the parser splits the unstructured text block into three clean columns: Title, Ingredients, and Instructions.
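A minimal version of such a parser, assuming the delimiters above and a hypothetical sample of raw model output:

```python
import re

# Hypothetical raw output; the delimiters mirror the one-shot example's format.
RAW = """Title: Rustic Italian Pasta
Ingredients: pasta, garlic, olive oil
Instructions: Boil the pasta. Saute the garlic. Toss and serve.
<END_RECIPE>"""

RECIPE_RE = re.compile(
    r"Title:\s*(?P<title>.*?)\s*"
    r"Ingredients:\s*(?P<ingredients>.*?)\s*"
    r"Instructions:\s*(?P<instructions>.*?)\s*<END_RECIPE>",
    re.DOTALL,  # let .*? span newlines between delimiters
)

def parse_recipe(raw: str):
    """Split one raw recipe block into title/ingredients/instructions, or None."""
    m = RECIPE_RE.search(raw)
    return m.groupdict() if m else None

parsed = parse_recipe(RAW)
```

The lazy `.*?` groups stop at the next delimiter, and the surrounding `\s*` strips stray whitespace so the three columns come out clean.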

### 5. "Crash-Proof" Saving

The pipeline uses a streaming save strategy: every batch is immediately appended to a `.jsonl` file to prevent data loss during long runs. Once the target of 10,000 recipes is reached, the file is converted to CSV for final analysis and cleaning.
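A sketch of this streaming-save strategy; the helper names are hypothetical and a temporary directory stands in for the real output path:

```python
import csv
import json
import tempfile
from pathlib import Path

def append_batch(jsonl_path: Path, batch: list) -> None:
    # Append mode: each batch lands on disk immediately, so a crash
    # loses at most the batch currently being generated.
    with jsonl_path.open("a", encoding="utf-8") as f:
        for recipe in batch:
            f.write(json.dumps(recipe) + "\n")

def jsonl_to_csv(jsonl_path: Path, csv_path: Path) -> int:
    """Convert the line-delimited JSON file to CSV; return the row count."""
    rows = [json.loads(line) for line in jsonl_path.read_text(encoding="utf-8").splitlines()]
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Title", "Ingredients", "Instructions"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

tmp = Path(tempfile.mkdtemp())
jsonl_file, csv_file = tmp / "RecipeData_10K.jsonl", tmp / "RecipeData_10K.csv"
append_batch(jsonl_file, [{"Title": "A", "Ingredients": "x", "Instructions": "mix"}])
append_batch(jsonl_file, [{"Title": "B", "Ingredients": "y", "Instructions": "bake"}])
count = jsonl_to_csv(jsonl_file, csv_file)
```

The `.jsonl` format is what makes resuming cheap: counting lines in the existing file tells the script exactly how many recipes are already done.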


## Part 2: Exploratory Data Analysis (EDA)

### 1. Data Validation & Structure

We began by performing rigorous validation, checking for row duplicates, empty columns, and title repetitions.

- **Duplicate Analysis:** The dataset contains 6,932 duplicate titles (out of 10,000 entries), yet the duplicate row count is 0.
- **Interpretation:** This indicates that while many recipes share the same generated name (e.g., "Rustic Italian Pasta"), the actual content (Ingredients and Instructions) is unique for every single entry.

We verified that duplicates are distributed evenly across thousands of recipes rather than clustering around a single dish (e.g., "Pasta" appearing 5,000 times). This confirms the data generator successfully distributed prompts widely.
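With pandas, the two duplicate checks look like this on a hypothetical three-row miniature of the dataset (repeated titles, unique bodies):

```python
import pandas as pd

# Hypothetical miniature of the dataset: one repeated title, no identical rows.
df = pd.DataFrame({
    "Title": ["Rustic Italian Pasta", "Rustic Italian Pasta", "Golden Dessert Cake"],
    "Ingredients": ["pasta, basil", "pasta, capers", "flour, sugar"],
    "Instructions": ["Boil and toss.", "Boil and mix.", "Bake at 180C."],
})

duplicate_titles = int(df["Title"].duplicated().sum())  # repeated names only
duplicate_rows = int(df.duplicated().sum())             # fully identical entries
```

The gap between the two counts is exactly the pattern described above: shared titles with distinct content.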

*Figure: Top 20 titles distribution.*

**Verification of Unique Content:** We also manually verified that identical titles possess distinctly different recipe instructions. *Figure: examples of shared titles with unique instructions.*

### 2. Word Count Filtering

To ensure data quality, we inspected recipes with low word counts, using thresholds of 40 and 20 words.

- **Findings:** Entries with fewer than 20 words were consistently invalid (e.g., containing only "(INVALID INPUT PROVIDED)").
- **Action:** These rows were dropped from the dataset.
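A minimal sketch of the word-count filter, assuming pandas and the 20-word cutoff described above (the sample rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Title": ["Valid Recipe", "Broken Row"],
    "Instructions": [
        "Chop the onions finely, saute them in olive oil until golden, "
        "add the tomatoes and simmer for twenty minutes before serving "
        "over freshly cooked pasta with grated cheese on top today",
        "(INVALID INPUT PROVIDED)",
    ],
})

# Count whitespace-separated words and keep only sufficiently long entries.
df["word_count"] = df["Instructions"].str.split().str.len()
cleaned = df[df["word_count"] >= 20].reset_index(drop=True)
```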

### 3. AI Error Detection

We implemented a validation step to identify and remove non-recipe artifacts: AI refusals and system error messages that contain no culinary value.

- **Method:** We scanned for key error phrases such as "AI helper," "Invalid request," or apology text.
- **Example Identified:** "The instruction to generate an example recipe in a specific format has been misunderstood... I apologize for the oversight."
- **Action:** A total of 16 recipes were flagged and removed to ensure a clean dataset.
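The phrase scan can be sketched as follows; the phrase list and sample rows are illustrative, and the real list was tuned by manual inspection:

```python
import pandas as pd

# Example error phrases -- the actual list used by the project may differ.
ERROR_PHRASES = ["AI helper", "Invalid request", "I apologize"]

df = pd.DataFrame({
    "Title": ["Good Recipe", "Refusal"],
    "Instructions": [
        "Mix, bake, and serve warm.",
        "The instruction has been misunderstood... I apologize for the oversight.",
    ],
})

# Build one alternation pattern and flag any row matching an error phrase.
pattern = "|".join(ERROR_PHRASES)
mask = df["Instructions"].str.contains(pattern, case=False, regex=True)
flagged = int(mask.sum())
df_clean = df[~mask].reset_index(drop=True)
```

Matching case-insensitively catches variants like "i apologize" without extending the phrase list.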

### 4. Instruction Variety

Finally, we checked for duplicates specifically within the Instructions column. This confirmed that the generation model produced genuine variety in cooking methods and steps, rather than repeating generic text blocks.

### 5. Data Visualization


#### Outliers

1. **Word Count Analysis (Outliers & Distribution):** The data is heavily concentrated between 80 and 95 words, with significant low-end outliers suggesting a subset of very brief recipes.

   **Contextual Variation:** This spread is natural, as "short and descriptive" recipes focus on efficiency, while those nearing the 120-word mark likely include "the story behind the recipe" or extra cultural context.

2. **Character Length Analysis (Outliers & Distribution):** The character count distribution is smoother and centered around 500–550 characters, though it shows more high-end outliers than the word count plot.

   **Vocabulary Density:** These outliers highlight the difference between recipes using simple language and those utilizing longer, technical culinary terms or descriptive narratives.

These data points should not be classified as technical outliers because they reflect the natural stylistic variance found in real-world culinary writing. Shorter entries correspond to concise, descriptive instructions focused purely on efficiency, while longer entries rightfully include narrative elements or the "story behind the recipe." Therefore, this spread in word and character counts indicates a healthy, diverse dataset that mirrors authentic human authorship rather than data quality errors.


## Part 3: Embeddings

We selected three distinct Transformer models to evaluate the trade-off between semantic understanding and computational efficiency for our recipe recommendation engine:

- **`sentence-transformers/all-MiniLM-L6-v2` (The Baseline):** Chosen for its extreme speed and compact size (80 MB). It represents the industry standard for lightweight CPU-based inference, serving as our baseline for "maximum efficiency."
- **`sentence-transformers/all-mpnet-base-v2` (The Quality Benchmark):** Chosen as the high-accuracy anchor. While significantly larger (420 MB) and slower, it consistently ranks near the top of semantic search benchmarks, letting us measure how much quality we might sacrifice for speed.
- **`BAAI/bge-small-en-v1.5`:** Chosen as a potential "best of both worlds" solution. This newer model uses advanced pre-training techniques to achieve accuracy comparable to MPNet while keeping a small footprint (133 MB) close to MiniLM's, making it a strong candidate for optimal performance.

## Semantic Search & Model Selection

### 1. Understanding the Similarity Score

To retrieve the best-matching recipes, the system uses vector embeddings. The process works as follows:

1. **Vectorization:** The model transforms the user's search text (e.g., "creamy italian rice") into a numerical array, which we call Vector A.
2. **Dataset Mapping:** It performs the same operation on each recipe in the database (e.g., "Mushroom Risotto"), creating a Vector B for every entry.
3. **Comparison:** The function compares the single query vector against 2,000 recipe vectors simultaneously.
4. **Scoring:** It returns an array of similarity scores, such as `[0.12, 0.05, 0.88, 0.15, ...]`.
5. **Retrieval:** The algorithm identifies the highest value (e.g., 0.88) and retrieves the corresponding recipe as the "winner."
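The scoring-and-retrieval steps above can be sketched with cosine similarity over toy 4-dimensional vectors (real sentence embeddings have hundreds of dimensions, and the values below are invented for illustration):

```python
import numpy as np

def top_k(query_vec: np.ndarray, recipe_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k most similar recipes by cosine similarity."""
    # Normalize, then one matrix-vector product scores the query
    # against every recipe simultaneously.
    q = query_vec / np.linalg.norm(query_vec)
    R = recipe_vecs / np.linalg.norm(recipe_vecs, axis=1, keepdims=True)
    scores = R @ q
    return np.argsort(scores)[::-1][:k]  # highest scores first

# Hypothetical embeddings: index 0 is closest to the query, index 1 unrelated.
recipes = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.7, 0.0],
])
query = np.array([0.0, 0.9, 0.3, 0.0])
best = top_k(query, recipes, k=2)
```

Taking the top three indices instead of the single best is all that is needed to return the three similar recipes the app displays.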

### 2. Embedding Model Selection

We selected `BAAI/bge-small-en-v1.5` as the optimal embedding model for our recipe dataset.

- **Performance:** Crucially, it achieved the highest similarity score in our evaluation, demonstrating superior semantic understanding compared to the faster but less accurate `all-MiniLM-L6-v2`.
- **Efficiency:** It matched the precision of the resource-heavy `all-mpnet-base-v2` (which requires 420 MB) while maintaining a significantly lighter footprint.
- **Conclusion:** This balance allows our system to deliver the most relevant recipe recommendations without compromising computational efficiency.

## Part 4: I/O Pipeline

Our first attempt used an OCR model, `TrOCRProcessor`, but it was not successful: the model could not read some handwritten recipes and sometimes even hallucinated text.

We then switched to the Qwen2.5-VL vision-language model, and the results were much better. Comparison:

*Figure: comparison of TrOCR and Qwen2.5-VL output on handwritten recipes.*