import streamlit as st

def render_report():
    st.title("Group 5: Term Project Report")

    # Title Page Information
    st.markdown("""
    **Course:** CSE 555 – Introduction to Pattern Recognition
    **Authors:** Saksham Lakhera and Ahmed Zaher
    **Date:** July 2025
    """)
    # Abstract
    st.header("Abstract")
    st.subheader("NLP Engineering Perspective")
    st.markdown("""
    This project addresses the challenge of improving recipe recommendation systems through
    advanced semantic search using transformer-based language models. Traditional
    keyword-based search often fails to capture the nuanced relationships between
    ingredients, cooking techniques, and user preferences in culinary contexts.

    Our approach fine-tunes BERT (Bidirectional Encoder Representations from Transformers)
    on a custom recipe dataset to develop a semantic understanding of culinary content.
    We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
    by food category (proteins, vegetables, legumes, etc.) to create training data suited to
    the BERT architecture.

    The model was fine-tuned to learn contextual embeddings that capture semantic relationships
    between ingredients and tags. At inference time we generate embeddings for all recipes in our
    dataset and perform cosine-similarity retrieval to return the top-K most relevant recipes
    for a user query.
    """)
    # Introduction
    st.header("Introduction")
    st.markdown("""
    This term project serves primarily as an educational exercise aimed at giving students
    end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
    recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
    substantially improve retrieval quality over simple keyword matching.

    **Key Contributions:**
    - A cleaned, category-labelled recipe subset of 15,000 recipes
    - Training scripts that yield domain-adapted contextual embeddings
    - A production-ready retrieval service that returns the top-K most relevant recipes
    - Comparative evaluation against classical baselines
    """)
    # Dataset and Preprocessing
    st.header("Dataset and Pre-processing")
    st.subheader("Data Sources")
    st.markdown("""
    The project draws on two CSV files:
    - **Raw_recipes.csv** – 231,637 rows, one per recipe, with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
    - **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1–5), review text*
    """)
| st.subheader("Corpus Filtering and Subset Selection") | |
| st.markdown(""" | |
| 1. **Invalid rows removed** β recipes with empty ingredient lists, missing tags, or fewer than three total tags | |
| 2. **Random sampling** β 15,000 recipes selected for NLP fine-tuning | |
| 3. **Positive/negative pairs** β generated for contrastive learning using ratings and tag similarity | |
| 4. **Train/test split** β 80/20 stratified split (12,000/3,000 pairs) | |
| """) | |
| st.subheader("Text Pre-processing Pipeline") | |
| st.markdown(""" | |
| - **Lower-casing & punctuation removal** β normalized to lowercase, special characters stripped | |
| - **Stop-descriptor removal** β culinary modifiers (*fresh, chopped, minced*) and measurements removed | |
| - **Ingredient ordering** β re-ordered into sequence: **protein β vegetables β grains β dairy β other** | |
| - **Tag normalization** β mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion* | |
| - **Tokenization** β standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens | |
| """) | |
    # Methodology
    st.header("Methodology")
    st.subheader("Model Architecture")
    st.markdown("""
    - **Base Model:** `bert-base-uncased` checkpoint
    - **Additional Layers:** single linear classification layer (768 → 1) with dropout (p = 0.1)
    - **Training Objective:** triplet-margin loss with a margin of 1.0
    """)
| st.subheader("Hyperparameters") | |
| col1, col2 = st.columns(2) | |
| with col1: | |
| st.markdown(""" | |
| - **Batch size:** 8 | |
| - **Max sequence length:** 128 tokens | |
| - **Learning rate:** 2 Γ 10β»β΅ | |
| - **Weight decay:** 0.01 | |
| """) | |
| with col2: | |
| st.markdown(""" | |
| - **Optimizer:** AdamW | |
| - **Epochs:** 3 | |
| - **Hardware:** Google Colab A100 GPU (40 GB VRAM) | |
| - **Training time:** ~75 minutes per run | |
| """) | |
    # Mathematical Formulations
    st.header("Mathematical Formulations")
    st.subheader("Query Embedding and Similarity Calculation")
    st.latex(r"""
    \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\| \, \|\hat{r}_i\|}
    """)
    st.markdown("where $\\hat{q}$ is the BERT embedding of the query and $\\hat{r}_i$ is the embedding of the $i$-th recipe.")
    st.subheader("Final Score Calculation")
    st.latex(r"""
    \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
    """)
    # Results
    st.header("Results")
    st.subheader("Training and Validation Loss")
    results_data = {
        "Run": [1, 2, 3, 4],
        "Configuration": [
            "Raw, no cleaning/ordering",
            "Cleaned text, unordered",
            "Cleaned text + dropout",
            "Cleaned text + dropout + ordering",
        ],
        "Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
        "Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067],
    }
    st.table(results_data)
    st.markdown("""
    **Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance
    between low validation loss and meaningful retrieval quality.
    """)
| st.subheader("Qualitative Retrieval Examples") | |
| st.markdown(""" | |
| **Query: "beef steak dinner"** | |
| - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans* | |
| - Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre* | |
| **Query: "chicken italian pasta"** | |
| - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans* | |
| - Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake* | |
| **Query: "vegetarian salad healthy"** | |
| - Run 1 (Raw): (irrelevant hits) | |
| - Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad* | |
| """) | |
    # Discussion and Conclusion
    st.header("Discussion and Conclusion")
    st.markdown("""
    The experimental evidence underscores the importance of disciplined pre-processing when
    adapting large language models to niche domains. The breakthrough came with **ingredient ordering**
    (protein → vegetables → grains → dairy → other), which supplied consistent positional signals.

    **Key Achievements:**
    - End-to-end recipe recommendation system with semantic search
    - Sub-second query latency across 231k recipes
    - Meaningful semantic understanding of culinary content
    - Reproducible blueprint for domain-specific NLP applications

    **Limitations:**
    - Fine-tuning subset is relatively small (15k recipes) compared to public corpora
    - Minimal hyperparameter search conducted
    - Only single-machine deployment tested
    """)
    # Technical Specifications
    st.header("Technical Specifications")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
        **Dataset:**
        - Total recipes: 231,637
        - Fine-tuning subset: 15,000 recipes
        - Average tags per recipe: ~6
        - Ingredients per recipe: 3–20
        """)
    with col2:
        st.markdown("""
        **Infrastructure:**
        - Python 3.10
        - PyTorch 2.1 (CUDA 11.8)
        - Transformers 4.38
        - Google Colab A100 GPU
        """)
    # References
    st.header("References")
    st.markdown("""
    [1] A. Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.

    [2] J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.

    [3] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.

    [4] Hugging Face, "BERT Model Documentation," 2024.
    """)
| st.markdown("---") | |
| st.markdown("Β© 2025 CSE 555 Term Project. All rights reserved.") | |
| # Render the report | |
| render_report() | |