Outliers & Distribution: The character count distribution is smoother and centered.

  Vocabulary Density: These outliers highlight the difference between recipes using simple language and those utilizing longer, technical culinary terms or descriptive narratives.
 
  These data points should ***not*** be classified as technical outliers because they reflect the natural stylistic variance found in real-world culinary writing. Shorter entries correspond to concise, descriptive instructions focused purely on efficiency, while longer entries rightfully include narrative elements or the "story behind the recipe." Therefore, this spread in word and character counts indicates a healthy, diverse dataset that mirrors authentic human authorship rather than data quality errors.

---

## Part 3: Embeddings
We selected three distinct Transformer models to evaluate the trade-off between semantic understanding and computational efficiency for our recipe recommendation engine:

**sentence-transformers/all-MiniLM-L6-v2** (The Baseline): Chosen for its extreme speed and compact size (80 MB). It represents the industry standard for lightweight CPU-based inference, serving as our baseline for "maximum efficiency."

**sentence-transformers/all-mpnet-base-v2** (The Quality Benchmark): Chosen as the high-accuracy anchor. While significantly larger (420 MB) and slower, it consistently ranks highest on semantic search benchmarks, allowing us to measure how much quality we might sacrifice for speed.

**BAAI/bge-small-en-v1.5** (The Contender): Chosen as a potential "best of both worlds" solution. This newer model uses advanced pre-training techniques to achieve accuracy comparable to MPNet while maintaining a small footprint (133 MB) similar to MiniLM, making it a strong candidate for optimal performance.
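
For reference, the snippet below is a minimal sketch of how the three candidates can be loaded and compared side by side with the `sentence-transformers` library; the sample recipe text is illustrative and not taken from our dataset.

```python
# Minimal sketch: load each candidate model and embed a sample recipe.
# Assumes `pip install sentence-transformers`; the sample text is illustrative.
from sentence_transformers import SentenceTransformer

CANDIDATES = [
    "sentence-transformers/all-MiniLM-L6-v2",   # baseline: fastest, ~80 MB
    "sentence-transformers/all-mpnet-base-v2",  # quality anchor, ~420 MB
    "BAAI/bge-small-en-v1.5",                   # contender, ~133 MB
]

sample = "Mushroom Risotto: a creamy Italian rice dish simmered in broth."

for name in CANDIDATES:
    model = SentenceTransformer(name)
    vector = model.encode(sample)
    print(f"{name}: {vector.shape[0]}-dimensional embedding")
```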

## Part 3: Semantic Search & Model Selection

### 1. Understanding the Similarity Score
To retrieve the best-matching recipes, the system uses vector embeddings. The process works as follows (a short sketch appears after the list):

1. **Vectorization:** The model transforms the user's search text (e.g., *"creamy italian rice"*) into a numerical array, which we define as **Vector A**.
2. **Dataset Mapping:** It performs the same operation on each recipe in the database (e.g., *"Mushroom Risotto"*), creating a **Vector B** for every entry.
3. **Comparison:** The function compares the single query vector against all 2,000 recipe vectors simultaneously.
4. **Scoring:** It returns an array of similarity scores, such as `[0.12, 0.05, 0.88, 0.15, ...]`.
5. **Retrieval:** The algorithm identifies the highest value (e.g., `0.88`) and retrieves the corresponding recipe as the "winner."
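
Below is a minimal sketch of these five steps using the `sentence-transformers` utilities; the three recipe titles stand in for the full 2,000-recipe database.

```python
# Minimal sketch of the five-step search; the model choice and the tiny
# recipe list are illustrative stand-ins for the real pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Steps 1-2: vectorize the query (Vector A) and every recipe (Vector B).
recipes = ["Mushroom Risotto", "Beef Tacos", "Creamy Garlic Pasta"]
recipe_vectors = model.encode(recipes, convert_to_tensor=True)
query_vector = model.encode("creamy italian rice", convert_to_tensor=True)

# Steps 3-4: compare the query against all recipes at once; cos_sim
# returns one similarity score per recipe, shaped (1, len(recipes)).
scores = util.cos_sim(query_vector, recipe_vectors)

# Step 5: the highest score marks the "winner".
best = scores.argmax().item()
print(recipes[best], round(scores[0, best].item(), 2))
```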

### 2. Embedding Model Selection
We selected **BAAI/bge-small-en-v1.5** as the optimal embedding model for our recipe dataset.
![image](https://cdn-uploads.huggingface.co/production/uploads/6910977ace661438b728d763/qtXVG31H9aWg3B1XZ1ZU1.png)

* **Performance:** Crucially, it achieved the **highest similarity score** in our evaluation, demonstrating superior semantic understanding compared to the faster but less accurate `all-MiniLM-L6-v2`.
* **Efficiency:** It matched the precision of the resource-heavy `all-mpnet-base-v2` (which requires 420 MB) while maintaining a significantly lighter footprint (133 MB).
* **Conclusion:** This balance allows our system to deliver the most relevant recipe recommendations without compromising computational efficiency.
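
In production, the selected model loads in one line. Note that the BGE model card suggests an optional query-instruction prefix for retrieval; we treat that prefix as an assumption here, so verify it against the card before relying on it.

```python
# Minimal usage sketch for the selected model. The query prefix below is
# the optional retrieval instruction from the BGE model card -- an
# assumption here, so check the model card before depending on it.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

query = "Represent this sentence for searching relevant passages: creamy italian rice"
vector = model.encode(query, normalize_embeddings=True)  # unit-length vector
print(vector.shape)  # (384,) -- bge-small-en-v1.5 outputs 384-dim embeddings
```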