Update README.md
README.md
CHANGED
|
@@ -46,69 +46,7 @@ Evantually we generated 10K recipes, and got a csv containing Title, Ingridiends
|
|
| 46 |
---
|
| 47 |
|
| 48 |
## Part 2: Exploratory Data Analysis (EDA)
|
| 49 |
-
|
| 50 |
-
### 1. Data Validation & Structure
|
| 51 |
-
We began by performing rigorous validation, checking for row duplicates, empty columns, and title repetitions.
|
| 52 |
-
|
| 53 |
-
* **Duplicate Analysis:** The dataset contains **6,932 duplicate titles** (out of 10,000 entries), yet the **duplicate row count is 0**.
|
| 54 |
-
* **Interpretation:** This indicates that while many recipes share the same generated name (e.g., *"Rustic Italian Pasta"*), the actual content (Ingredients and Instructions) is unique for every single entry.
|
| 55 |
-
|
| 56 |
-
We verified that duplicates are distributed evenly across thousands of recipes rather than clustering around a single dish (e.g., "Pasta" appearing 5,000 times). This confirms the data generator successfully distributed prompts widely.
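
The duplicate counts described above could be reproduced with a check along these lines. This is a minimal sketch using only the Python standard library, not necessarily the code used for the original analysis. The helper name `duplicate_stats` and the sample rows are hypothetical, the column names `Title`, `Ingredients`, and `Instructions` are assumed from the description of the CSV, and it assumes a "duplicate" means each repeat of an earlier value.

```python
from collections import Counter

def duplicate_stats(rows):
    """Return (duplicate_title_count, duplicate_row_count, title_counts).

    A title or row is counted as a duplicate each time it repeats an
    earlier one, so three identical titles contribute two duplicates.
    """
    title_counts = Counter(row["Title"] for row in rows)
    # A full-row duplicate must match on every column, not just the title.
    row_counts = Counter(tuple(sorted(row.items())) for row in rows)
    dup_titles = sum(n - 1 for n in title_counts.values())
    dup_rows = sum(n - 1 for n in row_counts.values())
    return dup_titles, dup_rows, title_counts

# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"Title": "Rustic Italian Pasta", "Ingredients": "pasta, tomato",
     "Instructions": "Boil pasta. Add sauce."},
    {"Title": "Rustic Italian Pasta", "Ingredients": "pasta, basil",
     "Instructions": "Toss pasta with basil oil."},
    {"Title": "Spicy Lentil Soup", "Ingredients": "lentils, chili",
     "Instructions": "Simmer lentils with chili."},
]

dup_titles, dup_rows, title_counts = duplicate_stats(sample_rows)
print(dup_titles, dup_rows)           # repeated titles vs. fully repeated rows
print(title_counts.most_common(20))   # the data behind a "top 20 titles" plot
```

A nonzero title count together with a zero row count is the pattern reported above: shared names, distinct content.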
|
| 57 |
-
|
| 58 |
-
**Top 20 Titles Distribution:**
|
| 59 |
-

|
| 60 |
-
|
| 61 |
-
**Verification of Unique Content:**
|
| 62 |
-
We also manually verified that identical titles possess distinctly different recipe instructions:
|
| 63 |
-

|
| 64 |
-
|
| 65 |
-
### 2. Word Count Filtering
|
| 66 |
-
To ensure data quality, we inspected recipes with low word counts, using two thresholds: fewer than 40 words and fewer than 20 words.
|
| 67 |
-
* **Findings:** Entries with fewer than 20 words were consistently invalid (e.g., containing only "(INVALID INPUT PROVIDED)").
|
| 68 |
-
* **Action:** These rows were dropped from the dataset.
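
A filter of this kind might look like the sketch below. It assumes words are counted by splitting on whitespace and that the count is taken over a single text string per recipe. Both are assumptions, as are the helper names `word_count` and `drop_short_recipes` and the sample texts.

```python
def word_count(text):
    """Count whitespace-separated tokens in a recipe text."""
    return len(text.split())

def drop_short_recipes(texts, min_words=20):
    """Keep only the texts that contain at least `min_words` words."""
    return [text for text in texts if word_count(text) >= min_words]

# Hypothetical sample texts: one invalid entry and one plausible recipe.
invalid_entry = "(INVALID INPUT PROVIDED)"
valid_entry = (
    "Heat the oil in a large pan, add the chopped onion and garlic, "
    "cook until soft, then stir in the tomatoes and simmer for ten minutes."
)

kept = drop_short_recipes([invalid_entry, valid_entry])
print(len(kept))
```

The threshold of 20 follows the finding above; recipes between 20 and 40 words stay in the dataset under this rule.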
|
| 69 |
-
|
| 70 |
-
### 3. AI Error Detection
|
| 71 |
-
We implemented a validation step to identify and remove generation artifacts: AI refusals or system error messages that contain no culinary value.
|
| 72 |
-
* **Method:** We scanned for key error phrases such as "AI helper," "Invalid request," or apology text.
|
| 73 |
-
* **Example Identified:** *"The instruction to generate an example recipe in a specific format has been misunderstood... I apologize for the oversight."*
|
| 74 |
-
* **Action:** A total of **16 recipes** were flagged and removed to ensure a clean dataset.
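
The phrase scan could be sketched as a case-insensitive substring match. The phrase list below is an assumption: "ai helper" and "invalid request" come from the method described above, while "i apologize" is a stand-in for "apology text", whose exact wording is not given. The helper name `looks_like_ai_error` is hypothetical.

```python
# Assumed phrase list; the exact phrases used in the analysis are not known.
ERROR_PHRASES = ("ai helper", "invalid request", "i apologize")

def looks_like_ai_error(text, phrases=ERROR_PHRASES):
    """Return True if the text contains any known error phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)

flagged_example = (
    "The instruction to generate an example recipe in a specific format "
    "has been misunderstood... I apologize for the oversight."
)
# Hypothetical valid recipe text.
clean_example = "Whisk the eggs, fold in the flour, and bake for twenty minutes."

print(looks_like_ai_error(flagged_example))
print(looks_like_ai_error(clean_example))
```

Substring matching is simple but can produce false positives on ordinary text, so flagged rows are worth reviewing by hand before removal.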
|
| 75 |
-
|
| 76 |
-
### 4. Instruction Variety
|
| 77 |
-
Finally, we checked for duplicates specifically within the `Instructions` column. This confirmed that the generation model produced a true variety of cooking methods and steps, rather than repeating generic text blocks.
|
| 78 |
-
|
| 79 |
-
### 5. Data Visualization
|
| 80 |
-
|
| 81 |
-

|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-

|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-

|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-

|
| 94 |
-
|
| 95 |
-
|
| 96 |
-

|
| 97 |
-
|
| 98 |
-
### Outliers
|
| 99 |
-

|
| 100 |
-
1. **Word Count Analysis**
|
| 101 |
-
Outliers & Distribution: The data is heavily concentrated between 80 and 95 words, with significant low-end outliers suggesting a subset of very brief recipes.
|
| 102 |
-
|
| 103 |
-
Contextual Variation: This spread is natural, as "short and descriptive" recipes focus on efficiency, while those nearing the 120-word mark likely include "the story behind the recipe" or extra cultural context.
|
| 104 |
-
|
| 105 |
-
2. **Character Length Analysis**
|
| 106 |
-
Outliers & Distribution: The character count distribution is smoother and centered around 500–550 characters, though it shows more high-end outliers than the word count plot.
|
| 107 |
-
|
| 108 |
-
Vocabulary Density: These outliers highlight the difference between recipes using simple language and those utilizing longer, technical culinary terms or descriptive narratives.
|
| 109 |
-
|
| 110 |
-
These data points should ***not*** be classified as technical outliers because they reflect the natural stylistic variance found in real-world culinary writing. Shorter entries correspond to concise, descriptive instructions focused purely on efficiency, while longer entries rightfully include narrative elements or the "story behind the recipe." Therefore, this spread in word and character counts indicates a healthy, diverse dataset that mirrors authentic human authorship rather than data quality errors.
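
For context, points drawn as outliers in a box plot are usually those beyond 1.5 times the interquartile range from the quartiles. The sketch below shows that rule on hypothetical word counts. It assumes the plots above used this standard whisker rule, which the README does not confirm, and the helper name `iqr_outliers` is illustrative.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (low_fence, high_fence, outliers) using the k * IQR rule."""
    # method="inclusive" interpolates between the sorted data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return low_fence, high_fence, outliers

# Hypothetical word counts, not taken from the dataset.
word_counts = [30, 80, 82, 85, 88, 90, 92, 95, 120]

low_fence, high_fence, outliers = iqr_outliers(word_counts)
print(low_fence, high_fence, outliers)
```

Being flagged by this rule is a statistical label only; as argued above, such recipes can still be valid entries.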
|
| 111 |
-
|
| 112 |
|
| 113 |
---
|
| 114 |
|
|
|
|
| 46 |
---
|
| 47 |
|
| 48 |
## Part 2: Exploratory Data Analysis (EDA)
|
| 49 |
+
Find the dataset and EDA on: https://huggingface.co/datasets/Liori25/10k_recipes
|
| 50 |
|
| 51 |
---
|
| 52 |
|