Liori25 committed on
Commit cee0a18 · verified · 1 Parent(s): bca68c7

Update README.md

Files changed (1)
  1. README.md +29 -12
README.md CHANGED
@@ -1,12 +1,29 @@
- ---
- title: AppLegacyKitchen
- emoji: 📊
- colorFrom: red
- colorTo: pink
- sdk: gradio
- sdk_version: 6.3.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Final Project: Data Science Course

### Project Overview
This project builds an AI-powered application that digitizes handwritten recipes from images using Optical Character Recognition (OCR) and Natural Language Processing (NLP). By generating vector embeddings of the extracted text, the system identifies and retrieves the three most semantically similar recipes from a synthetically generated dataset of 10,000 entries. The final solution is deployed as an interactive web interface on Hugging Face Spaces, bridging the gap between physical recipe archives and digital accessibility.
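
The retrieval step described above boils down to ranking stored recipe embeddings by cosine similarity against the query embedding and keeping the top three. The README does not name the embedding library, so the sketch below uses tiny hand-made vectors purely to illustrate the ranking logic:

```python
import math

# Sketch of the retrieval idea only: the "embeddings" here are tiny
# hand-made vectors, not real model output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top3_similar(query, dataset):
    # Rank every stored recipe embedding against the query; keep the best three.
    scores = [(cosine(query, vec), i) for i, vec in enumerate(dataset)]
    return [i for _, i in sorted(scores, reverse=True)[:3]]

recipe_vecs = [
    [1.0, 0.0, 0.0],   # recipe 0
    [0.9, 0.1, 0.0],   # recipe 1
    [0.0, 1.0, 0.0],   # recipe 2
    [0.8, 0.2, 0.1],   # recipe 3
]
query = [1.0, 0.05, 0.0]
top = top3_similar(query, recipe_vecs)   # indices of the three closest recipes
```

In production one would precompute and L2-normalize the 10,000 dataset embeddings once, so each query reduces to a single matrix-vector product.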

## Part 1: Synthetic Data Generation

### 1. The Strategy: "One-Shot" Prompting with Random Menus
Instead of asking the AI to "invent a recipe" from scratch (which risks repetition), we created a structured system:

* **The "Menu" Generator:** We defined four cuisine profiles (Italian, Mediterranean, Asian Fusion, Dessert), each with lists of adjectives, main dishes, and extras.
* **Randomized Titles:** The script randomly combines these words (e.g., *"Spicy Italian Lasagna with Mushrooms"* or *"Golden Dessert Cake Supreme"*) to create 10,000 distinct prompts.
* **One-Shot Example:** Every prompt includes a single "perfect example" (Grandma's Meatballs) to teach the model exactly how to format its output (Title, Ingredients, Instructions).
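
The title randomization and one-shot prompt assembly described above could be sketched as follows; only two of the four cuisine profiles are shown, and the word lists and prompt wording are illustrative placeholders, not the project's actual data:

```python
import random

# Hypothetical cuisine profiles: the four categories come from the README,
# but these word lists are placeholders (two profiles shown for brevity).
MENUS = {
    "Italian": {
        "adjectives": ["Spicy", "Rustic", "Creamy"],
        "dishes": ["Lasagna", "Risotto", "Gnocchi"],
        "extras": ["with Mushrooms", "with Basil", "al Forno"],
    },
    "Dessert": {
        "adjectives": ["Golden", "Velvety", "Frozen"],
        "dishes": ["Cake", "Tart", "Mousse"],
        "extras": ["Supreme", "with Berries", "with Caramel"],
    },
}

# The single "perfect example" shown to the model in every prompt.
ONE_SHOT = (
    "Title: Grandma's Meatballs\n"
    "Ingredients: ...\n"
    "Instructions: ...\n"
    "<END_RECIPE>"
)

def random_title() -> str:
    # Combine adjective + cuisine + dish + extra into one distinct title.
    cuisine = random.choice(list(MENUS))
    menu = MENUS[cuisine]
    return " ".join([
        random.choice(menu["adjectives"]),
        cuisine,
        random.choice(menu["dishes"]),
        random.choice(menu["extras"]),
    ])

def build_prompt(title: str) -> str:
    # One-shot prompting: show one formatted example, then request a new recipe.
    return (
        f"Here is an example recipe:\n{ONE_SHOT}\n\n"
        f"Write a recipe titled '{title}' in exactly the same format."
    )

prompts = [build_prompt(random_title()) for _ in range(10_000)]
```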

### 2. The Engine: Model Configuration
* **Model:** We used **Qwen 2.5 3B Instruct**, chosen for its balance of speed and reasoning quality.
* **Optimization:**
  * **Batch Processing:** We generated recipes in batches of 64 (rather than one at a time) to maximize utilization of the A100 GPU.
  * **The "Left-Padding" Fix:** We explicitly set `tokenizer.padding_side = "left"`. This is a critical fix for decoder-only models (like GPT or Qwen): with the default right-padding, shorter prompts in a batch end in padding tokens, and the model generates gibberish by continuing from that padding.
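
Why left-padding matters can be shown without any ML library: a decoder-only model appends its next token after the *last* position of each sequence, so that position must be a real prompt token, not padding. A minimal illustration:

```python
# Minimal illustration (no ML libraries) of left- vs right-padding a batch.
PAD = 0

def pad_batch(batch, side):
    # Pad every sequence of token IDs to the width of the longest one.
    width = max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        fill = [PAD] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded

prompts = [[5, 7, 9], [3, 4]]        # token IDs of two prompts of unequal length
right = pad_batch(prompts, "right")  # shorter row ends in PAD -> bad for generation
left = pad_batch(prompts, "left")    # every row ends in a real token -> safe
```

With Hugging Face `transformers`, the fix is exactly the one-liner from the README: set `tokenizer.padding_side = "left"` before tokenizing the batch with `padding=True`.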

### 3. Robust Workflow
To manage the heavy computational load, the script features a **resume function** that detects existing progress (`RecipeData_10K.jsonl`) and continues from it instead of restarting after an interruption. It also includes an automated **OOM (Out of Memory) handler** that dynamically reduces the batch size (from 64 down to 8) if the GPU runs out of memory, preventing crashes.
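
A minimal sketch of the resume-plus-OOM logic, under stated assumptions: the model call is stubbed out with a dummy generator, and `MemoryError` stands in for the GPU out-of-memory exception the real script would catch:

```python
import json
import os

OUTPUT = "RecipeData_10K.jsonl"   # progress file named in the README
TARGET = 10_000

def count_done(path):
    # Resume: count recipes already on disk so an interrupted run picks up
    # where it left off instead of restarting.
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return sum(1 for _ in f)

def generate_batch(n):
    # Stand-in for the real model call; returns n dummy recipes.
    return [{"Title": f"Recipe {i}", "Ingredients": "...", "Instructions": "..."}
            for i in range(n)]

def run(path=OUTPUT, target=TARGET, batch_size=64):
    done = count_done(path)
    while done < target:
        n = min(batch_size, target - done)
        try:
            batch = generate_batch(n)
        except MemoryError:       # stand-in for the GPU OOM exception
            if batch_size > 8:
                batch_size = 8    # back off from 64 to 8, as in the README
                continue
            raise                 # already at the floor: give up
        with open(path, "a") as f:   # streaming save: append every batch
            for recipe in batch:
                f.write(json.dumps(recipe) + "\n")
        done += len(batch)
```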

### 4. Regex Parsing
We implemented a custom regex-based parser to structure the raw AI output instantly. By targeting specific delimiters (`Title:`, `Ingredients:`, `<END_RECIPE>`), the parser splits each unstructured text block into three clean columns: **Title**, **Ingredients**, and **Instructions**.
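
A sketch of such a parser; the delimiters come from the README, but the exact pattern below is an illustrative reconstruction, not the project's code:

```python
import re

# Non-greedy named groups between the delimiters; re.DOTALL lets each
# field span multiple lines.
RECIPE_RE = re.compile(
    r"Title:\s*(?P<title>.*?)\s*"
    r"Ingredients:\s*(?P<ingredients>.*?)\s*"
    r"Instructions:\s*(?P<instructions>.*?)\s*<END_RECIPE>",
    re.DOTALL,
)

def parse_recipe(raw: str):
    match = RECIPE_RE.search(raw)
    if match is None:
        return None   # malformed output can be dropped during cleaning
    return match.groupdict()

sample = (
    "Title: Spicy Italian Lasagna with Mushrooms\n"
    "Ingredients: pasta, mushrooms, tomato sauce\n"
    "Instructions: Layer and bake at 180 C.\n"
    "<END_RECIPE>"
)
parsed = parse_recipe(sample)
```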

### 5. "Crash-Proof" Saving
The pipeline uses a **streaming save strategy**: every batch is immediately appended to a `.jsonl` file, so a crash mid-run loses at most one batch. Once the target of 10,000 recipes is reached, the file is converted to CSV for final analysis and cleaning.
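
The final conversion step might look like the sketch below; the column names follow the three fields the parser extracts, and the function name is illustrative:

```python
import csv
import json

def jsonl_to_csv(jsonl_path: str, csv_path: str) -> int:
    # Convert the streamed .jsonl progress file into a CSV with one recipe
    # per row; returns the number of rows written.
    rows = 0
    with open(jsonl_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.DictWriter(
            dst, fieldnames=["Title", "Ingredients", "Instructions"]
        )
        writer.writeheader()
        for line in src:
            writer.writerow(json.loads(line))
            rows += 1
    return rows
```

Because the `.jsonl` file is append-only, this conversion can be re-run safely at any point; only the CSV is regenerated.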