import streamlit as st

def render_report():
    st.title("Group 5: Term Project Report")

    # Title Page Information
    st.markdown("""
    **Course:** CSE 555 – Introduction to Pattern Recognition
    **Authors:** Saksham Lakhera and Ahmed Zaher
    **Date:** July 2025
    """)
    # Abstract
    st.header("Abstract")
    st.subheader("NLP Engineering Perspective")
    st.markdown("""
    This project addresses the challenge of improving recipe recommendation systems through
    advanced semantic search using transformer-based language models. Traditional
    keyword-based search often fails to capture the nuanced relationships between
    ingredients, cooking techniques, and user preferences in culinary contexts.

    Our approach fine-tunes BERT (Bidirectional Encoder Representations from Transformers)
    on a custom recipe dataset to develop a semantic understanding of culinary content.
    We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
    by food category (proteins, vegetables, legumes, etc.) to create training data suited to
    the BERT architecture.

    The model was fine-tuned to learn contextual embeddings that capture semantic relationships
    between ingredients and tags. At inference time we generate embeddings for all recipes in our
    dataset and perform cosine-similarity retrieval to return the top-K most relevant recipes
    for a user query.
    """)
    # Introduction
    st.header("Introduction")
    st.markdown("""
    This term project serves primarily as an educational exercise aimed at giving students
    end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
    recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
    substantially improve retrieval quality over simple keyword matching.

    **Key Contributions:**
    - A cleaned, category-labelled recipe subset of 15,000 recipes
    - Training scripts that yield domain-adapted contextual embeddings
    - A production-ready retrieval service that returns the top-K most relevant recipes
    - Comparative evaluation against classical baselines
    """)
    # Dataset and Preprocessing
    st.header("Dataset and Pre-processing")
    st.subheader("Data Sources")
    st.markdown("""
    The project draws on two CSV files:
    - **Raw_recipes.csv** – 231,637 rows, one per recipe, with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
    - **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1–5), review text*
    """)
| st.subheader("Corpus Filtering and Subset Selection") | |
| st.markdown(""" | |
| 1. **Invalid rows removed** β recipes with empty ingredient lists, missing tags, or fewer than three total tags | |
| 2. **Random sampling** β 15,000 recipes selected for NLP fine-tuning | |
| 3. **Positive/negative pairs** β generated for contrastive learning using ratings and tag similarity | |
| 4. **Train/test split** β 80/20 stratified split (12,000/3,000 pairs) | |
| """) | |
| st.subheader("Text Pre-processing Pipeline") | |
| st.markdown(""" | |
| - **Lower-casing & punctuation removal** β normalized to lowercase, special characters stripped | |
| - **Stop-descriptor removal** β culinary modifiers (*fresh, chopped, minced*) and measurements removed | |
| - **Ingredient ordering** β re-ordered into sequence: **protein β vegetables β grains β dairy β other** | |
| - **Tag normalization** β mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion* | |
| - **Tokenization** β standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens | |
| """) | |
    # Methodology
    st.header("Methodology")
    st.subheader("Model Architecture")
    st.markdown("""
    - **Base Model:** `bert-base-uncased` checkpoint
    - **Additional Layers:** single linear classification layer (768 → 1) with dropout (p = 0.1)
    - **Training Objective:** triplet-margin loss with a margin of 1.0
    """)
| st.subheader("Hyperparameters") | |
| col1, col2 = st.columns(2) | |
| with col1: | |
| st.markdown(""" | |
| - **Batch size:** 8 | |
| - **Max sequence length:** 128 tokens | |
| - **Learning rate:** 2 Γ 10β»β΅ | |
| - **Weight decay:** 0.01 | |
| """) | |
| with col2: | |
| st.markdown(""" | |
| - **Optimizer:** AdamW | |
| - **Epochs:** 3 | |
| - **Hardware:** Google Colab A100 GPU (40 GB VRAM) | |
| - **Training time:** ~75 minutes per run | |
| """) | |
    # Mathematical Formulations
    st.header("Mathematical Formulations")
    st.subheader("Query Embedding and Similarity Calculation")
    st.latex(r"""
    \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\| \, \|\hat{r}_i\|}
    """)
    st.markdown("where $\\hat{q}$ is the BERT embedding of the query and $\\hat{r}_i$ is the embedding of the $i$-th recipe.")
    st.subheader("Final Score Calculation")
    st.latex(r"""
    \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
    """)
    # Results
    st.header("Results")
    st.subheader("Training and Validation Loss")
    results_data = {
        "Run": [1, 2, 3, 4],
        "Configuration": [
            "Raw, no cleaning/ordering",
            "Cleaned text, unordered",
            "Cleaned text + dropout",
            "Cleaned text + dropout + ordering",
        ],
        "Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
        "Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067],
    }
    st.table(results_data)
    st.markdown("""
    **Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance
    between low validation loss and meaningful retrieval quality.
    """)
| st.subheader("Qualitative Retrieval Examples") | |
| st.markdown(""" | |
| **Query: "beef steak dinner"** | |
| - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans* | |
| - Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre* | |
| **Query: "chicken italian pasta"** | |
| - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans* | |
| - Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake* | |
| **Query: "vegetarian salad healthy"** | |
| - Run 1 (Raw): (irrelevant hits) | |
| - Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad* | |
| """) | |
    # Discussion and Conclusion
    st.header("Discussion and Conclusion")
    st.markdown("""
    The experimental evidence underscores the importance of disciplined pre-processing when
    adapting large language models to niche domains. The breakthrough came with **ingredient ordering**
    (protein → vegetables → grains → dairy → other), which supplied consistent positional signals.

    **Key Achievements:**
    - End-to-end recipe recommendation system with semantic search
    - Sub-second query latency across 231k recipes
    - Meaningful semantic understanding of culinary content
    - Reproducible blueprint for domain-specific NLP applications

    **Limitations:**
    - Fine-tuning subset is relatively small (15k recipes) compared to public corpora
    - Minimal hyperparameter search conducted
    - Only single-machine deployment tested
    """)
    # Technical Specifications
    st.header("Technical Specifications")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
        **Dataset:**
        - Total recipes: 231,637
        - Fine-tuning subset: 15,000 recipes
        - Average tags per recipe: ~6
        - Ingredients per recipe: 3–20
        """)
    with col2:
        st.markdown("""
        **Infrastructure:**
        - Python 3.10
        - PyTorch 2.1 (CUDA 11.8)
        - Transformers 4.38
        - Google Colab A100 GPU
        """)
    # References
    st.header("References")
    st.markdown("""
    [1] A. Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.

    [2] J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.

    [3] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.

    [4] Hugging Face, "BERT Model Documentation," 2024.
    """)
| st.markdown("---") | |
| st.markdown("Β© 2025 CSE 555 Term Project. All rights reserved.") | |
| # Render the report | |
| render_report() | |