Spaces:

aliarafat-stack-ml
/

ml-demo

Sleeping

App Files Files Community

ml-demo / pages /7_Rec_Models.py

aliarafat-stack-ml

first commit

f3a6f24 about 2 months ago

raw

history blame contribute delete

95.5 kB

	import streamlit as st
	import pandas as pd
	import numpy as np
	import plotly.graph_objects as go

	from utils.rec_data_loader import build_interaction_matrix, get_rec_train_test
	from utils.rec_models import (
	train_svd, train_als, train_sgd, train_nmf, train_item_cf,
	get_top_n_recommendations, evaluate_recommendations,
	)

	st.set_page_config(page_title="Rec Models", page_icon="🎯", layout="wide")
	st.title("Product Recommendation — Models")
	st.markdown("---")

	tab_how, tab_modern, tab_compare, tab_demo = st.tabs([
	"How The Algorithms Work", "Modern Architecture & Industry", "Model Comparison", "Interactive Recommendations"
	])

	# ═══════════════════════════════════════════════════════════════════════════════
	# Tab 1 — How The Algorithms Work
	# ═══════════════════════════════════════════════════════════════════════════════
	with tab_how:
	st.subheader("Recommendation Algorithms — A Visual Guide")
	st.markdown(
	"All these algorithms solve the same problem: fill in the empty cells "
	"of the user-item matrix. They just approach it differently."
	)

	# ── Matrix Factorization overview ────────────────────────────────────────
	st.markdown("---")
	st.markdown("### The Core Idea: Matrix Factorization")
	st.markdown(
	"Matrix factorization is like discovering the hidden reasons behind purchases."
	)

	st.graphviz_chart("""
	digraph mf {
	rankdir=LR
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	R [label="User-Item Matrix\\nR (4338 × 3665)\\nMostly empty", fillcolor="#dbeafe", color="#3b82f6"]
	approx [label="≈", shape=plaintext, fontsize=24]
	U [label="User Matrix\\nU (4338 × k)\\nEach user as k\\nlatent preferences", fillcolor="#d1fae5", color="#10b981"]
	times [label="×", shape=plaintext, fontsize=24]
	V [label="Item Matrix\\nV (k × 3665)\\nEach item as k\\nlatent attributes", fillcolor="#fce7f3", color="#ec4899"]

	R -> approx [arrowhead=none]
	approx -> U [arrowhead=none]
	U -> times [arrowhead=none]
	times -> V [arrowhead=none]
	}
	""")

	st.markdown(
	"""
	Example:

	Suppose there are 3 hidden factors: "kitchen items", "vintage style", "party supplies".
	- User profile (U): Customer 12345 = [0.9 kitchen, 0.1 vintage, 0.7 party]
	- Item profile (V): "Cake Stand" = [0.8 kitchen, 0.3 vintage, 0.5 party]
	- Predicted score = 0.9×0.8 + 0.1×0.3 + 0.7×0.5 = 1.10 → strong recommendation!

	The model discovers these latent factors automatically from the purchase patterns.
	We set the number of factors (k), and the algorithm figures out what they represent.
	"""
	)

	with st.expander("Where do the vectors come from? (Step-by-step from scratch)", expanded=False):
	st.markdown(
	"""
	At the start, no vectors exist. We only have raw purchase data:

	\| \| Cake Stand \| Candles \| Baking Set \| Bunting \|
	\|---\|---\|---\|---\|---\|
	\| Customer A \| 5 \| 3 \| ? \| 1 \|
	\| Customer B \| 4 \| ? \| 2 \| ? \|
	\| Customer C \| ? \| 4 \| 5 \| 3 \|

	The `?` entries are what we want to predict.

	---

	Step 1 — Create random vectors

	We just assign random numbers. With k=2 factors:

	\| User \| f1 \| f2 \|
	\|---\|---\|---\|
	\| A \| 0.3 \| 0.7 \|
	\| B \| 0.5 \| 0.2 \|
	\| C \| 0.1 \| 0.9 \|

	\| Item \| f1 \| f2 \|
	\|---\|---\|---\|
	\| Cake Stand \| 0.4 \| 0.6 \|
	\| Candles \| 0.8 \| 0.3 \|
	\| Baking Set \| 0.2 \| 0.7 \|
	\| Bunting \| 0.6 \| 0.1 \|

	These are random guesses. They are wrong. That's the starting point for training.

	---

	Step 2 — Predict using the full formula

	The raw dot product gives small numbers (e.g. 0.54), but purchase counts are on
	a different scale. So the full prediction formula includes bias terms:
	"""
	)
	st.latex(r"\text{predicted} = \underbrace{\mu}_{\text{global mean}} + \underbrace{b_u}_{\text{user bias}} + \underbrace{b_i}_{\text{item bias}} + \underbrace{\vec{u} \cdot \vec{v}}_{\text{dot product}}")
	st.markdown(
	"""
	\| Term \| Value \| What it captures \|
	\|---\|---\|---\|
	\| μ (global mean) \| 3.4 \| Average of all known values in the matrix \|
	\| bᵤ (user bias) \| +0.5 \| Customer A buys more than the average customer \|
	\| bᵢ (item bias) \| +0.4 \| Cake Stand is purchased more than the average item \|
	\| dot product \| 0.54 \| Specific affinity between this user and this item \|

	`predicted = 3.4 + 0.5 + 0.4 + 0.54 = 4.84 ≈ 5` ✓

	Without biases, the model would see enormous errors during early training
	and the vectors would take much longer to converge — or might not converge at all.

	---

	Step 3 — Calculate error, adjust, repeat

	`error = actual − predicted = 5 − 4.84 = 0.16`

	The algorithm nudges the vectors and biases slightly to reduce this error,
	then moves to the next rating and does the same. After thousands of passes
	(epochs), the vectors converge to values that accurately reconstruct all
	the known ratings — and can predict the `?` cells.

	The individual algorithm walkthroughs below show exactly how each method
	does this differently.
	"""
	)

	# ── SVD ──────────────────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### 1. SVD (Singular Value Decomposition)")
	st.markdown("The classic matrix factorization technique from linear algebra.")

	st.graphviz_chart("""
	digraph svd {
	rankdir=LR
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	R [label="R\\nUser-Item\\nMatrix", fillcolor="#dbeafe", color="#3b82f6"]
	eq [label="=", shape=plaintext, fontsize=24]
	U [label="U\\nUser factors\\n(users × k)", fillcolor="#d1fae5", color="#10b981"]
	S [label="Σ\\nSingular values\\n(k × k diagonal)\\nimportance weights", fillcolor="#fef3c7", color="#f59e0b"]
	Vt [label="Vᵀ\\nItem factors\\n(k × items)", fillcolor="#fce7f3", color="#ec4899"]

	R -> eq [arrowhead=none]
	eq -> U [arrowhead=none]
	U -> S [label=" × "]
	S -> Vt [label=" × "]
	}
	""")

	st.markdown(
	"""
	How it works:
	1. SVD decomposes the user-item matrix R into three matrices: U × Σ × Vᵀ
	2. U captures user preferences in k-dimensional "latent space"
	3. Σ (Sigma) is a diagonal matrix of importance weights — the first few factors
	capture the most important patterns, the rest are noise
	4. Vᵀ captures item characteristics in the same latent space
	5. We keep only the top k factors (truncated SVD) — this filters noise and
	fills in the missing values with predictions

	Strengths: Mathematically elegant, well-understood, fast with sparse matrices.
	Weaknesses: Treats missing entries as zero (problematic for implicit feedback —
	"didn't buy" ≠ "doesn't like").
	"""
	)

	with st.expander("Step-by-step numerical walkthrough — SVD", expanded=False):
	st.markdown(
	"""
	Starting point — the user-item matrix (purchase counts):

	\| \| Cake Stand \| Candles \| Baking Set \| Bunting \|
	\|---\|---\|---\|---\|---\|
	\| Customer A \| 5 \| 3 \| ? \| 1 \|
	\| Customer B \| 4 \| ? \| 2 \| ? \|
	\| Customer C \| ? \| 4 \| 5 \| 3 \|

	`?` = never purchased (stored as 0 in the matrix).

	Step 1 — Decompose the matrix

	SVD splits this matrix into three pieces: R = U × Σ × Vᵀ

	With k=2 factors, we get:
	- U (3 users × 2 factors) — each user as 2 numbers
	- Σ (2 × 2 diagonal) — importance weights
	- Vᵀ (2 factors × 4 items) — each item as 2 numbers

	SVD computes this directly using linear algebra (no iterative training).

	Step 2 — The decomposition gives us vectors

	After SVD:

	\| User \| Factor 1 \| Factor 2 \|
	\|---\|---\|---\|
	\| A \| 0.72 \| -0.31 \|
	\| B \| 0.58 \| -0.44 \|
	\| C \| 0.85 \| 0.52 \|

	\| Item \| Factor 1 \| Factor 2 \|
	\|---\|---\|---\|
	\| Cake Stand \| 0.68 \| -0.25 \|
	\| Candles \| 0.55 \| 0.41 \|
	\| Baking Set \| 0.71 \| 0.50 \|
	\| Bunting \| 0.43 \| 0.38 \|

	Step 3 — Predict missing values

	Customer B, Baking Set (was `?`):

	`predicted = [0.58, -0.44] · [0.71, 0.50] = (0.58×0.71) + (-0.44×0.50) = 0.41 - 0.22 = 0.19`

	Low score → weak recommendation.

	Customer A, Baking Set (was `?`):

	`predicted = [0.72, -0.31] · [0.71, 0.50] = (0.72×0.71) + (-0.31×0.50) = 0.51 - 0.16 = 0.36`

	Higher score → stronger recommendation than for B.

	Key point: SVD fills the `?` cells using the learned vectors. The math
	ensures that users with similar purchase patterns get similar vectors, and
	therefore similar predictions for unseen items.

	Limitation: SVD treats every `?` as 0 during decomposition. So "never bought"
	is treated the same as "bought 0 times" — which loses the distinction between
	"not interested" and "never seen this product."
	"""
	)

	# ── ALS ──────────────────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### 2. ALS (Alternating Least Squares)")
	st.markdown("The industry standard for implicit feedback — used at Spotify and Netflix.")

	st.graphviz_chart("""
	digraph als {
	rankdir=TB
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	init [label="Initialize\\nRandom user & item\\nfactor matrices", fillcolor="#dbeafe", color="#3b82f6"]
	fix_u [label="Fix User factors\\nOptimize Item factors\\n(least squares)", fillcolor="#d1fae5", color="#10b981"]
	fix_i [label="Fix Item factors\\nOptimize User factors\\n(least squares)", fillcolor="#fce7f3", color="#ec4899"]
	check [label="Check convergence\\n(error decreasing?)", fillcolor="#fef3c7", color="#f59e0b"]
	done [label="Final U and V\\nPredict: U × Vᵀ", fillcolor="#e0e7ff", color="#6366f1"]

	init -> fix_u -> fix_i -> check
	check -> fix_u [label="not converged", style=dashed]
	check -> done [label="converged"]
	}
	""")

	st.markdown(
	"""
	How it works:
	1. Start with random user factors (U) and item factors (V)
	2. Fix U, solve for the best V (this is a standard least squares problem)
	3. Fix V, solve for the best U (again, least squares)
	4. Alternate between steps 2 and 3 until the error stops decreasing
	5. Multiply U × Vᵀ to get predicted scores for every user-item pair

	Why "Alternating"? Optimizing U and V simultaneously is hard (non-convex).
	But if you fix one, optimizing the other becomes a simple, solvable least-squares
	problem. By alternating, we make steady progress toward a good solution.

	Key advantage for implicit feedback: ALS treats observed interactions (purchases)
	and unobserved ones (empty cells) differently. It assigns confidence weights —
	a purchase of 10 units gets higher confidence than 1 unit, while empty cells get
	low (but non-zero) confidence. This is exactly right for our e-commerce data.

	Strengths: Designed for implicit data, parallelizable, scales to millions of users.
	Weaknesses: More complex to implement, requires tuning regularization.
	"""
	)

	with st.expander("Step-by-step numerical walkthrough — How ALS Actually Works", expanded=False):
	st.markdown(
	"""
	ALS (Alternating Least Squares) is different from gradient descent methods.
	Let me show you the real ALS algorithm with numbers.

	Starting point — same matrix:

	\| \| Cake Stand \| Candles \| Baking Set \| Bunting \|
	\|---\|---\|---\|---\|---\|
	\| Customer A \| 5 \| 3 \| ? \| 1 \|
	\| Customer B \| 4 \| ? \| 2 \| ? \|
	\| Customer C \| ? \| 4 \| 5 \| 3 \|

	---

	Step 1 — Initialize random vectors (k=2 factors)

	\| User \| f1 \| f2 \|
	\|---\|---\|---\|
	\| A \| 0.3 \| 0.7 \|
	\| B \| 0.5 \| 0.2 \|
	\| C \| 0.1 \| 0.9 \|

	\| Item \| f1 \| f2 \|
	\|---\|---\|---\|
	\| Cake Stand \| 0.4 \| 0.6 \|
	\| Candles \| 0.8 \| 0.3 \|
	\| Baking Set \| 0.2 \| 0.7 \|
	\| Bunting \| 0.6 \| 0.1 \|

	---

	Step 2 — Fix User vectors, Solve for ALL Item vectors at once

	This is where ALS is different! Instead of adjusting one vector at a time,
	we solve for ALL item vectors simultaneously using a mathematical formula.

	For Cake Stand (bought by A and B):

	Users who bought it:
	- Customer A: rating=5, vector=[0.3, 0.7]
	- Customer B: rating=4, vector=[0.5, 0.2]

	Build matrices:
	```
	U = [[0.3, 0.7], (users who bought Cake Stand)
	[0.5, 0.2]]

	r = [5, 4] (their ratings)
	```

	The closed-form solution:
	"""
	)
	st.latex(r"v_{\text{CakeStand}} = (U^T U + \lambda I)^{-1} U^T r")
	st.markdown(
	"""
	With regularization λ=0.1:

	```
	U^T U = [[0.34, 0.31],
	[0.31, 0.53]]

	U^T U + λI = [[0.44, 0.31],
	[0.31, 0.63]]

	U^T r = [3.5, 4.3]

	Invert and multiply:
	v_CakeStand = [2.8, 5.1] ← New vector!
	```

	Key difference: We didn't use a learning rate or take small steps.
	We solved the equation exactly using matrix math!

	Do this for ALL items:
	- Cake Stand → [2.8, 5.1]
	- Candles → [1.9, 3.2]
	- Baking Set → [2.1, 4.8]
	- Bunting → [1.5, 2.9]

	---

	Step 3 — Fix Item vectors, Solve for ALL User vectors at once

	Now flip it! Fix the new item vectors and solve for user vectors.

	For Customer A (bought Cake Stand, Candles, Bunting):

	Items they bought:
	- Cake Stand: rating=5, vector=[2.8, 5.1]
	- Candles: rating=3, vector=[1.9, 3.2]
	- Bunting: rating=1, vector=[1.5, 2.9]

	Build matrices:
	```
	V = [[2.8, 5.1], (items Customer A bought)
	[1.9, 3.2],
	[1.5, 2.9]]

	r = [5, 3, 1] (Customer A's ratings)
	```

	Same formula:
	"""
	)
	st.latex(r"u_A = (V^T V + \lambda I)^{-1} V^T r")
	st.markdown(
	"""
	Result: `u_A = [0.82, 0.91]` ← New user vector!

	Do this for ALL users:
	- Customer A → [0.82, 0.91]
	- Customer B → [0.73, 0.45]
	- Customer C → [0.31, 0.88]

	---

	Step 4 — Alternate until convergence

	\| Iteration \| What happens \| Error \|
	\|---\|---\|---\|
	\| 1 \| Fix U, solve all V \| 3.2 \|
	\| 2 \| Fix V, solve all U \| 1.8 \|
	\| 3 \| Fix U, solve all V \| 0.9 \|
	\| 4 \| Fix V, solve all U \| 0.4 \|
	\| 5 \| Fix U, solve all V \| 0.2 \|
	\| 6 \| Fix V, solve all U \| 0.1 \|

	After ~6 iterations, vectors converge!

	(Compare this to SGD which needs 100+ iterations)

	---

	Step 5 — Make predictions

	Customer A → Baking Set (`?`):

	`predicted = [0.82, 0.91] · [2.1, 4.8] = 0.82×2.1 + 0.91×4.8 = 1.72 + 4.37 = 6.09`

	Recommend Baking Set to Customer A!

	---

	Why ALS is different:

	\| Aspect \| ALS \| SGD (shown in next section) \|
	\|---\|---\|---\|
	\| Updates \| All users at once (or all items at once) \| One rating at a time \|
	\| Math \| Matrix inversion: `(A^T A)^(-1) A^T b` \| Gradient descent: `v += α × error × u` \|
	\| Learning rate \| Not needed (exact solution) \| Required (α = 0.01) \|
	\| Iterations \| Few (~10-20) \| Many (100-200) \|
	\| Per iteration \| Slow (matrix operations) \| Fast (simple arithmetic) \|
	\| Parallelization \| Easy (each user/item independent) \| Hard (sequential updates) \|
	\| Best for \| Large-scale systems (Spotify, Netflix) \| Small datasets, research \|

	---

	Simple analogy:

	ALS: You're a teacher grading 100 students. You collect ALL math tests,
	grade them all at once using an answer key, then collect ALL English tests
	and grade those. You alternate subjects, but grade all students per subject.

	SGD: You grade one student's math test, then their English test, then
	the next student's math test, then their English test. One at a time.

	ALS is much faster when you have many students (users/items) because you
	can split the grading across multiple teachers (parallel processing).
	"""
	)

	# ── SGD ──────────────────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### 3. SGD Matrix Factorization (Funk SVD)")
	st.markdown("The algorithm that powered the Netflix Prize — learns vectors one rating at a time.")

	st.graphviz_chart("""
	digraph sgd {
	rankdir=TB
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	init [label="Initialize random\\nuser vectors, item vectors,\\nuser biases, item biases", fillcolor="#dbeafe", color="#3b82f6"]
	pick [label="Pick one observed\\nrating (u, i, r)", fillcolor="#e0e7ff", color="#6366f1"]
	pred [label="Predict\\nμ + bᵤ + bᵢ + u⃗·v⃗", fillcolor="#d1fae5", color="#10b981"]
	err [label="Error = actual − predicted", fillcolor="#fef3c7", color="#f59e0b"]
	upd [label="Nudge vectors & biases\\nin direction that reduces error", fillcolor="#fce7f3", color="#ec4899"]
	check [label="More ratings\\nin this epoch?", fillcolor="#e0e7ff", color="#6366f1"]
	epoch [label="More epochs?", fillcolor="#dbeafe", color="#3b82f6"]
	done [label="Final prediction matrix\\nμ + bᵤ + bᵢ + U×Vᵀ", fillcolor="#d1fae5", color="#10b981"]

	init -> pick -> pred -> err -> upd -> check
	check -> pick [label="yes", style=dashed]
	check -> epoch [label="no (epoch done)"]
	epoch -> pick [label="yes", style=dashed]
	epoch -> done [label="no (converged)"]
	}
	""")

	st.markdown(
	"""
	How it works:
	1. Start with random user vectors, item vectors, and bias terms
	2. For each observed interaction, predict the score using the full formula:
	μ + bᵤ + bᵢ + user_vec · item_vec
	3. Compute the error (actual − predicted)
	4. Nudge the vectors and biases slightly to reduce that error (gradient descent)
	5. Repeat for every observed interaction (one epoch), then repeat for many epochs

	Why this is different from pure SVD:
	- Pure SVD decomposes the entire matrix including zeros (missing = dislike)
	- SGD only trains on observed interactions — it skips empty cells entirely
	- SGD includes bias terms that anchor predictions to the right scale

	Why "Funk SVD"? Simon Funk published this approach during the Netflix Prize
	(2006). It's not technically SVD — it's gradient descent — but the name stuck.
	This was the breakthrough that made matrix factorization practical for recommendation.

	Strengths: Handles missing data correctly, includes bias terms, flexible, well-understood.
	Weaknesses: Slow (one rating at a time), sensitive to learning rate, can overfit without regularization.
	"""
	)

	with st.expander("Step-by-step numerical walkthrough — SGD (Funk SVD)", expanded=False):
	st.markdown(
	"""
	Simple explanation first:

	SGD (Stochastic Gradient Descent) learns by looking at one purchase at a time:
	1. "Customer A bought 5 Cake Stands"
	2. Predict how many they'd buy using current vectors: "I predict 3.9"
	3. Calculate error: "I was off by 1.1 (too low)"
	4. Adjust vectors slightly: "Make Customer A's vector a bit more like Cake Stand's vector"
	5. Move to next purchase and repeat

	After seeing all purchases many times (100+ epochs), the vectors learn the patterns.

	Key insight: It's like learning to cook by tasting after each ingredient.
	You add salt, taste, adjust. Add pepper, taste, adjust. Small corrections each time.

	---

	Now the detailed math:

	Starting point — same matrix, but we only use observed entries:

	\| \| Cake Stand \| Candles \| Baking Set \| Bunting \|
	\|---\|---\|---\|---\|---\|
	\| Customer A \| 5 \| 3 \| skip \| 1 \|
	\| Customer B \| 4 \| skip \| 2 \| skip \|
	\| Customer C \| skip \| 4 \| 5 \| 3 \|

	Unlike SVD, the `?` cells are ignored during training — not treated as zero.

	---

	Step 1 — Initialize everything

	Global mean: μ = (5+3+1+4+2+4+5+3) / 8 = 3.375

	Random biases (start near zero):

	\| \| Bias \|
	\|---\|---\|
	\| b_A \| +0.01 \|
	\| b_B \| -0.02 \|
	\| b_C \| +0.01 \|
	\| b_CakeStand \| +0.03 \|
	\| b_Candles \| -0.01 \|
	\| b_BakingSet \| +0.02 \|
	\| b_Bunting \| -0.01 \|

	Random vectors (k=2):

	\| User \| f1 \| f2 \|
	\|---\|---\|---\|
	\| A \| 0.30 \| 0.70 \|
	\| B \| 0.50 \| 0.20 \|
	\| C \| 0.10 \| 0.90 \|

	\| Item \| f1 \| f2 \|
	\|---\|---\|---\|
	\| Cake Stand \| 0.40 \| 0.60 \|
	\| Candles \| 0.80 \| 0.30 \|
	\| Baking Set \| 0.20 \| 0.70 \|
	\| Bunting \| 0.60 \| 0.10 \|

	---

	Step 2 — Process first rating: Customer A → Cake Stand (actual = 5)
	"""
	)
	st.latex(r"\text{pred} = \mu + b_A + b_{\text{CakeStand}} + \vec{u}_A \cdot \vec{v}_{\text{CakeStand}}")
	st.markdown(
	"""
	`pred = 3.375 + 0.01 + 0.03 + (0.30×0.40 + 0.70×0.60)`
	`= 3.375 + 0.01 + 0.03 + (0.12 + 0.42)`
	`= 3.375 + 0.04 + 0.54 = 3.955`

	`error = 5 - 3.955 = 1.045`

	---

	Step 3 — Update using learning rate α=0.01 and regularization λ=0.02
	"""
	)
	st.latex(r"b_u \leftarrow b_u + \alpha \cdot (\text{error} - \lambda \cdot b_u)")
	st.latex(r"\vec{u} \leftarrow \vec{u} + \alpha \cdot (\text{error} \cdot \vec{v} - \lambda \cdot \vec{u})")
	st.markdown(
	"""
	Bias updates:
	`b_A = 0.01 + 0.01 × (1.045 - 0.02 × 0.01) = 0.01 + 0.01044 = 0.0204`
	`b_CakeStand = 0.03 + 0.01 × (1.045 - 0.02 × 0.03) = 0.03 + 0.01044 = 0.0404`

	Vector updates:
	`A_f1 = 0.30 + 0.01 × (1.045 × 0.40 - 0.02 × 0.30) = 0.30 + 0.00412 = 0.3041`
	`A_f2 = 0.70 + 0.01 × (1.045 × 0.60 - 0.02 × 0.70) = 0.70 + 0.00613 = 0.7061`

	`CakeStand_f1 = 0.40 + 0.01 × (1.045 × 0.30 - 0.02 × 0.40) = 0.40 + 0.00306 = 0.4031`
	`CakeStand_f2 = 0.60 + 0.01 × (1.045 × 0.70 - 0.02 × 0.60) = 0.60 + 0.00720 = 0.6072`

	Small adjustments, but in the right direction. The regularization term
	(λ × current_value) prevents weights from growing too large.

	---

	Step 4 — Process next rating: Customer A → Candles (actual = 3)

	Using updated vectors for A (but original vectors for Candles):

	`pred = 3.375 + 0.0204 + (-0.01) + (0.3041×0.80 + 0.7061×0.30)`
	`= 3.375 + 0.0104 + (0.2433 + 0.2118) = 3.375 + 0.0104 + 0.4551 = 3.840`

	`error = 3 - 3.840 = -0.840` (prediction too high — nudge down)

	Updates push vectors in the opposite direction this time.

	---

	Step 5 — After one epoch (all 8 ratings processed)

	\| Rating processed \| Error \|
	\|---\|---\|
	\| A → Cake Stand \| +1.045 \|
	\| A → Candles \| -0.840 \|
	\| A → Bunting \| -2.125 \|
	\| B → Cake Stand \| +0.612 \|
	\| B → Baking Set \| -1.389 \|
	\| C → Candles \| +0.738 \|
	\| C → Baking Set \| +1.455 \|
	\| C → Bunting \| -0.204 \|

	Average absolute error ≈ 1.05

	---

	Step 6 — After many epochs

	\| Epoch \| Avg Absolute Error \|
	\|---\|---\|
	\| 1 \| 1.05 \|
	\| 10 \| 0.52 \|
	\| 50 \| 0.12 \|
	\| 100 \| 0.04 \|

	---

	Step 7 — Predict missing values

	Customer A → Baking Set (`?`):
	`pred = μ + b_A + b_BakingSet + u_A · v_BakingSet`

	With trained values: `= 3.375 + 0.45 + 0.38 + 0.62 = 4.83` → recommend!

	Customer B → Bunting (`?`):
	`= 3.375 + (-0.15) + (-0.28) + 0.31 = 3.26` → moderate recommendation

	The bias terms are critical — they handle the difference between
	heavy buyers (high b_u) and light buyers (low b_u), and popular
	items (high b_i) versus niche items (low b_i).
	"""
	)

	# ── Simple Comparison: ALS vs SGD ────────────────────────────────────────
	st.markdown("---")
	with st.expander("🎯 ALS vs SGD — Simple Comparison for Non-Technical Audience", expanded=False):
	st.markdown(
	"""
	Both ALS and SGD learn user and item "profiles" (vectors) to make recommendations.
	They both end up with similar results, but they learn differently.

	---

	### The Restaurant Analogy

	Scenario: You're training staff to predict how much customers will tip.
	You have 100 customers and 50 waiters. Each customer-waiter pair has a tip amount.

	SGD approach (One at a time):
	- Look at Customer 1 with Waiter A: tip was $5, you predicted $3
	- "Oops, too low! Adjust Customer 1's 'generosity score' up a bit"
	- "Also adjust Waiter A's 'service score' up a bit"
	- Now look at Customer 1 with Waiter B: tip was $2, you predicted $4
	- "Oops, too high! Adjust both scores down a bit"
	- Continue one interaction at a time through all 5,000 customer-waiter pairs
	- Repeat this process 100 times until predictions are accurate

	ALS approach (All at once, alternating):
	- Round 1: Fix all customer scores, solve for ALL waiter scores at once using math
	- Round 2: Fix all waiter scores, solve for ALL customer scores at once using math
	- Round 3: Fix all customer scores, solve for ALL waiter scores at once
	- After ~10 rounds, you're done!

	---

	### Key Differences in Plain English

	\| Aspect \| SGD \| ALS \|
	\|--------\|-----\|-----\|
	\| Learning style \| One interaction at a time, like studying flashcards \| All at once per round, like grading all tests together \|
	\| Speed per round \| Fast (simple math) \| Slow (complex matrix math) \|
	\| Rounds needed \| Many (100-200) \| Few (10-20) \|
	\| Total time \| Moderate \| Fast for large datasets \|
	\| Parallelization \| Hard (must go in order) \| Easy (can split work across computers) \|
	\| Best for \| Small datasets, research, when you need bias terms \| Large-scale production (Spotify, Netflix) \|

	---

	### When to Use Which?

	Use SGD when:
	- You have a small dataset (< 100K interactions)
	- You're doing research or prototyping
	- You need explicit bias terms (μ + b_u + b_i)
	- You want to understand the algorithm step-by-step

	Use ALS when:
	- You have a large dataset (millions of interactions)
	- You're building a production system
	- You have multiple servers/cores to parallelize
	- Your data is implicit feedback (purchases, clicks, not ratings)

	---

	### The Math Difference (For Technical Readers)

	SGD update rule:
	```
	error = actual - predicted
	u ← u + α × (error × v - λ × u)
	v ← v + α × (error × u - λ × v)
	```
	Small steps guided by the error gradient.

	ALS update rule:
	```
	Fix U, solve for all V:
	v_i = (U^T U + λI)^(-1) U^T r_i

	Fix V, solve for all U:
	u_j = (V^T V + λI)^(-1) V^T r_j
	```
	Closed-form solution using matrix inversion.

	---

	### Real-World Example

	Spotify's Discover Weekly (uses ALS):
	- 500 million users
	- 100 million songs
	- Billions of listening events per day
	- Needs to update recommendations daily
	- Solution: ALS parallelized across thousands of servers

	Netflix Prize (used SGD):
	- Research competition (2006)
	- 500K users, 18K movies
	- 100 million ratings (explicit 1-5 stars)
	- Goal: Best accuracy, not fastest training
	- Solution: SGD with carefully tuned learning rate and regularization

	---

	### Bottom Line

	SGD: Like a student who learns by doing practice problems one at a time.
	Slow but steady, easy to understand.

	ALS: Like a teacher who grades all tests at once using an answer key.
	Fewer iterations, but each iteration is more work. Much faster at scale.

	Both get you to the same destination — accurate recommendations!
	"""
	)

	# ── NMF ──────────────────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### 4. NMF (Non-negative Matrix Factorization)")
	st.markdown("Matrix factorization with a twist: everything stays positive.")

	st.graphviz_chart("""
	digraph nmf {
	rankdir=LR
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	R [label="R\\nUser-Item Matrix\\n(all values ≥ 0)", fillcolor="#dbeafe", color="#3b82f6"]
	approx [label="≈", shape=plaintext, fontsize=24]
	W [label="W\\nUser factors\\n(all values ≥ 0)\\nadditive parts only", fillcolor="#d1fae5", color="#10b981"]
	times [label="×", shape=plaintext, fontsize=24]
	H [label="H\\nItem factors\\n(all values ≥ 0)\\nadditive parts only", fillcolor="#fce7f3", color="#ec4899"]

	R -> approx [arrowhead=none]
	approx -> W [arrowhead=none]
	W -> times [arrowhead=none]
	times -> H [arrowhead=none]
	}
	""")

	st.markdown(
	"""
	How it's different from SVD:
	- SVD factors can be negative (hard to interpret: what does a "-0.5 kitchen" preference mean?)
	- NMF forces all values to be ≥ 0 — factors are purely additive

	Why non-negative is useful:
	- A user is "0.8 kitchen + 0.3 vintage + 0 party" — easy to interpret
	- Each factor represents an additive "part" of the user's taste
	- The results are naturally sparse — many factors are exactly zero
	- This makes NMF factors more interpretable as "topics" or "themes"

	For example, if SVD says "you like kitchen items but with a negative party component",
	NMF says "you're 80% kitchen enthusiast, 30% vintage lover, and don't care about party
	supplies." The NMF interpretation is more natural for recommendation.

	Strengths: Interpretable factors, naturally sparse, good for parts-based decomposition.
	Weaknesses: Only works with non-negative data, sensitive to initialization.
	"""
	)

	with st.expander("Step-by-step numerical walkthrough — NMF", expanded=False):
	st.markdown(
	"""
	NMF follows the same learn-vectors-from-data approach, but with one strict rule:
	every number must be ≥ 0. No negatives allowed.

	Starting point — same matrix:

	\| \| Cake Stand \| Candles \| Baking Set \| Bunting \|
	\|---\|---\|---\|---\|---\|
	\| Customer A \| 5 \| 3 \| ? \| 1 \|
	\| Customer B \| 4 \| ? \| 2 \| ? \|
	\| Customer C \| ? \| 4 \| 5 \| 3 \|

	All values are already ≥ 0 (purchase counts), so NMF can work with this directly.

	---

	Step 1 — Create random non-negative vectors (k=2)

	\| User \| f1 ("kitchen") \| f2 ("decor") \|
	\|---\|---\|---\|
	\| A \| 0.8 \| 0.2 \|
	\| B \| 0.6 \| 0.3 \|
	\| C \| 0.1 \| 0.9 \|

	\| Item \| f1 ("kitchen") \| f2 ("decor") \|
	\|---\|---\|---\|
	\| Cake Stand \| 0.9 \| 0.1 \|
	\| Candles \| 0.3 \| 0.7 \|
	\| Baking Set \| 0.2 \| 0.8 \|
	\| Bunting \| 0.1 \| 0.6 \|

	Notice: no negative values anywhere. The factors labeled "kitchen" and "decor"
	are what the algorithm might discover — we don't name them, but after training
	they often correspond to interpretable themes.

	---

	Step 2 — Predict

	Customer A → Cake Stand:

	`predicted = [0.8, 0.2] · [0.9, 0.1] = 0.72 + 0.02 = 0.74`

	Actual = 5. Error = 5 - 0.74 = 4.26 (large).

	---

	Step 3 — Update rule (multiplicative, not additive)

	Here's what makes NMF different from SGD. Instead of adding/subtracting a gradient,
	NMF uses multiplicative updates that guarantee values stay non-negative:
	"""
	)
	st.latex(r"W_{ij} \leftarrow W_{ij} \times \frac{(R \cdot H^T)_{ij}}{(W \cdot H \cdot H^T)_{ij}}")
	st.latex(r"H_{ij} \leftarrow H_{ij} \times \frac{(W^T \cdot R)_{ij}}{(W^T \cdot W \cdot H)_{ij}}")
	st.markdown(
	"""
	The key insight: we multiply by a ratio. Since all values are positive,
	the ratio is positive, so the result stays positive. If the ratio > 1,
	the value increases; if < 1, it decreases. Values can approach zero but
	never go negative.

	---

	Step 4 — After many iterations

	\| User \| f1 ("kitchen") \| f2 ("decor") \|
	\|---\|---\|---\|
	\| A \| 2.1 \| 0.0 \|
	\| B \| 1.7 \| 0.4 \|
	\| C \| 0.0 \| 2.3 \|

	\| Item \| f1 ("kitchen") \| f2 ("decor") \|
	\|---\|---\|---\|
	\| Cake Stand \| 2.3 \| 0.1 \|
	\| Candles \| 0.5 \| 1.6 \|
	\| Baking Set \| 0.2 \| 2.1 \|
	\| Bunting \| 0.3 \| 1.2 \|

	Notice how some values are exactly 0 — NMF produces sparse factors.

	Reading the result:
	- Customer A: [2.1, 0.0] → purely a "kitchen" buyer, no "decor" interest
	- Customer C: [0.0, 2.3] → purely a "decor" buyer
	- Cake Stand: [2.3, 0.1] → almost entirely a "kitchen" item
	- Baking Set: [0.2, 2.1] → mostly a "decor" item

	---

	Step 5 — Predict missing values

	Customer A → Baking Set (`?`):

	`predicted = [2.1, 0.0] · [0.2, 2.1] = 0.42 + 0.0 = 0.42` → low score

	Makes sense — A is a "kitchen" buyer, Baking Set is mostly "decor."

	Customer C → Cake Stand (`?`):

	`predicted = [0.0, 2.3] · [2.3, 0.1] = 0.0 + 0.23 = 0.23` → low score

	Makes sense — C is a "decor" buyer, Cake Stand is mostly "kitchen."

	---

	Why NMF factors are more interpretable:

	\| Method \| Factor example \| Interpretation \|
	\|---\|---\|---\|
	\| SVD \| [0.7, -0.3] \| "0.7 of something, minus 0.3 of something else" — hard to explain \|
	\| NMF \| [2.1, 0.0] \| "Strong kitchen interest, zero decor interest" — clear and additive \|

	Each factor in NMF represents a positive contribution. A user's taste is a
	sum of parts, not a mix of additions and subtractions.
	"""
	)

	# ── Item-Based CF ────────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### 5. Item-Based Collaborative Filtering")
	st.markdown("No factorization — just find similar items.")

	st.graphviz_chart("""
	digraph itemcf {
	rankdir=TB
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	bought [label="Customer bought\\nCake Stand", fillcolor="#dbeafe", color="#3b82f6"]
	sim [label="Find items with\\nsimilar purchase patterns\\n(cosine similarity)", fillcolor="#e0e7ff", color="#6366f1"]
	items [label="Similar items:\\n1. Cake Tins (0.85)\\n2. Baking Set (0.79)\\n3. Cookie Cutters (0.72)", fillcolor="#d1fae5", color="#10b981"]
	rec [label="Recommend\\nthese items!", fillcolor="#fef3c7", color="#f59e0b"]

	bought -> sim -> items -> rec
	}
	""")

	st.markdown(
	"""
	How it works:
	1. For every pair of items, compute cosine similarity based on their purchase patterns
	(items bought by similar sets of customers are similar)
	2. When recommending for a user, look at items they've already bought
	3. For each purchased item, find its most similar items
	4. Rank by similarity score and recommend the top ones (excluding already-purchased)

	Why cosine similarity? It measures the angle between two vectors, ignoring magnitude.
	Two items with purchase patterns [1, 0, 1, 1] and [3, 0, 2, 4] have high cosine similarity
	(same direction) even though the quantities differ.

	Strengths: Simple, intuitive, no training needed, easy to explain.
	Weaknesses: Doesn't discover latent factors, doesn't work well with very sparse data,
	O(n²) item-pair computation.
	"""
	)

	with st.expander("Step-by-step numerical walkthrough — Item-Based CF", expanded=False):
	st.markdown(
	"""
	Item-CF takes a completely different approach: no vectors, no training,
	no latent factors. It works purely on similarity between items.

	Starting point — the same user-item matrix:

	\| \| Cake Stand \| Candles \| Baking Set \| Bunting \|
	\|---\|---\|---\|---\|---\|
	\| Customer A \| 5 \| 3 \| 0 \| 1 \|
	\| Customer B \| 4 \| 0 \| 2 \| 0 \|
	\| Customer C \| 0 \| 4 \| 5 \| 3 \|

	(0 = not purchased)

	---

	Step 1 — Represent each item as a vector of who bought it

	Each column is already a vector:
	- Cake Stand: [5, 4, 0]
	- Candles: [3, 0, 4]
	- Baking Set: [0, 2, 5]
	- Bunting: [1, 0, 3]

	---

	Step 2 — Compute cosine similarity between every pair of items

	Cosine similarity measures the angle between two vectors (ignoring magnitude):
	"""
	)
	st.latex(r"\text{cosine}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \times \|\vec{b}\|}")
	st.markdown(
	"""
	Example — Cake Stand vs Candles:

	`dot product = (5×3) + (4×0) + (0×4) = 15`
	`\|Cake Stand\| = √(5² + 4² + 0²) = √41 = 6.40`
	`\|Candles\| = √(3² + 0² + 4²) = √25 = 5.00`

	`cosine = 15 / (6.40 × 5.00) = 15 / 32.0 = 0.47`

	Example — Candles vs Bunting:

	`dot product = (3×1) + (0×0) + (4×3) = 15`
	`\|Candles\| = 5.00`
	`\|Bunting\| = √(1² + 0² + 3²) = √10 = 3.16`

	`cosine = 15 / (5.00 × 3.16) = 15 / 15.81 = 0.95`

	High similarity — Candles and Bunting are bought by the same types of customers.

	---

	Step 3 — Build the similarity matrix

	After computing all pairs:

	\| \| Cake Stand \| Candles \| Baking Set \| Bunting \|
	\|---\|---\|---\|---\|---\|
	\| Cake Stand \| 1.00 \| 0.47 \| 0.31 \| 0.16 \|
	\| Candles \| 0.47 \| 1.00 \| 0.88 \| 0.95 \|
	\| Baking Set \| 0.31 \| 0.88 \| 1.00 \| 0.93 \|
	\| Bunting \| 0.16 \| 0.95 \| 0.93 \| 1.00 \|

	Candles, Baking Set, and Bunting cluster together (high mutual similarity).
	Cake Stand is somewhat different.

	---

	Step 4 — Generate recommendations for Customer B

	Customer B bought: Cake Stand (4), Baking Set (2)

	For each item B hasn't bought, score it by similarity to what B already bought:

	Candles:
	`score = (sim(Candles, Cake Stand) × 4) + (sim(Candles, Baking Set) × 2)`
	`= (0.47 × 4) + (0.88 × 2) = 1.88 + 1.76 = 3.64`

	Bunting:
	`score = (sim(Bunting, Cake Stand) × 4) + (sim(Bunting, Baking Set) × 2)`
	`= (0.16 × 4) + (0.93 × 2) = 0.64 + 1.86 = 2.50`

	Ranking: Candles (3.64) > Bunting (2.50)
	→ Recommend Candles first.

	---

	Why this works without any training:

	There are no vectors to learn. The similarity matrix is computed directly
	from the data. When the data changes (new purchases), you just recompute
	the similarity scores — no model to retrain.

	Trade-off: This simplicity comes at a cost. With 10,000 products, you
	need to compute 50 million similarity pairs. Matrix factorization methods
	reduce each item to k numbers (e.g., 50), making everything faster.
	"""
	)

	# ── Comparison Table ─────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### Quick Comparison")
	st.markdown(
	"""
	\| \| SVD \| ALS \| SGD (Funk SVD) \| NMF \| Item-Based CF \|
	\|---\|---\|---\|---\|---\|---\|
	\| Approach \| Decompose R = UΣVᵀ \| Alternate optimizing U and V \| Gradient descent + bias terms \| Decompose R ≈ W×H (non-negative) \| Find similar items via cosine \|
	\| Feedback type \| Explicit (treats 0 as 0) \| Implicit (0 = unknown, not dislike) \| Either (skips missing entries) \| Non-negative values \| Either \|
	\| Bias terms \| No \| Yes (implicit) \| Yes (μ + bᵤ + bᵢ) \| No \| No \|
	\| Interpretability \| Low (negative factors) \| Low \| Low \| High (additive parts) \| Very high (similar items) \|
	\| Scalability \| Good \| Excellent (parallelizable) \| Moderate (sequential updates) \| Moderate \| Poor for many items \|
	\| Used by \| Research, baselines \| Spotify, Netflix, LinkedIn \| Netflix Prize, baselines \| Topic modeling, parts decomposition \| Amazon ("customers also bought") \|
	"""
	)

	# ── Why Each Model ────────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### Why We Chose Each Model")

	st.markdown(
	"""
	#### SVD — The Mathematical Baseline

	\| \| \|
	\|---\|---\|
	\| Why we included it \| Same role as Logistic Regression in churn — the simplest factorization approach. If other models can't beat it, their complexity isn't adding value. \|
	\| Where it's used \| Dimensionality reduction (PCA is SVD under the hood), text analysis (Latent Semantic Analysis), image compression. In recommendations: primarily as a baseline. \|
	\| Key benefit \| One-step computation with no iterative training. Mathematically exact decomposition. \|
	\| Limitation \| Treats every empty cell as zero — "didn't buy" is treated as "dislikes." This is a fundamental problem for e-commerce data. \|
	\| When to pick this \| When you have explicit ratings (1–5 stars) with few missing values, or as a quick baseline before trying more sophisticated methods. \|

	---

	#### ALS — The Industry Standard

	\| \| \|
	\|---\|---\|
	\| Why we included it \| It's the go-to algorithm for implicit feedback (purchases, clicks, views) — which is exactly what our e-commerce data is. \|
	\| Where it's used \| Spotify (Discover Weekly playlists), Netflix (movie recommendations), LinkedIn (job recommendations), Apple Music. \|
	\| Key benefit \| Designed specifically for implicit data. Treats "didn't buy" as "unknown" (not "dislike"). Highly parallelizable — scales to billions of interactions. \|
	\| Limitation \| Harder to implement from scratch. Requires careful tuning of regularization and confidence weights. \|
	\| When to pick this \| When your data is implicit feedback (purchases, clicks, views) and you need to scale to large datasets. This is the default choice for most recommendation systems. \|

	---

	#### SGD (Funk SVD) — The Netflix Prize Winner

	\| \| \|
	\|---\|---\|
	\| Why we included it \| It's the algorithm that proved matrix factorization works for recommendations (Netflix Prize, 2006). It introduces bias terms that the other methods lack. \|
	\| Where it's used \| The Netflix Prize competition. Research baselines. Systems where explicit ratings exist. Foundation for modern embedding-based models. \|
	\| Key benefit \| Only trains on observed interactions (skips missing entries). Bias terms (μ + bᵤ + bᵢ) anchor predictions to the correct scale. Flexible and well-understood. \|
	\| Limitation \| Sequential updates (one rating at a time) — hard to parallelize. Slower than ALS on large datasets. Sensitive to learning rate. \|
	\| When to pick this \| When you have explicit ratings and want the most theoretically grounded approach. Or when you need bias terms for better calibration. \|

	---

	#### NMF — The Interpretable Option

	\| \| \|
	\|---\|---\|
	\| Why we included it \| It produces factors you can actually explain — "this customer is 80% kitchen enthusiast, 20% vintage lover" — because all values are non-negative. \|
	\| Where it's used \| Topic modeling in text analysis (discovering themes in documents), audio source separation, parts-based image decomposition, recommendation with interpretability requirements. \|
	\| Key benefit \| The non-negativity constraint produces sparse, additive factors. Each factor represents a "part" of the user's taste. Easier to explain to business stakeholders than SVD or ALS. \|
	\| Limitation \| Only works with non-negative data. Treats missing values as zero (same problem as SVD). Sensitive to initialization. \|
	\| When to pick this \| When you need to explain why an item was recommended in terms of interpretable themes, or when factor analysis is a goal alongside recommendation. \|

	---

	#### Item-Based CF — The Simplest Approach

	\| \| \|
	\|---\|---\|
	\| Why we included it \| It's fundamentally different from the other four (no matrix factorization at all). Including it shows whether learning latent vectors actually helps versus just using raw similarity. \|
	\| Where it's used \| Amazon ("Customers who bought this also bought..."), e-commerce product pages, any "similar items" widget. \|
	\| Key benefit \| No training needed — just compute cosine similarity. Easy to explain: "We recommend this because people who bought what you bought also bought this." New items with a few purchases can be recommended immediately. \|
	\| Limitation \| Requires computing similarity between every pair of items — O(n²). Doesn't discover hidden patterns. Struggles with very sparse data. \|
	\| When to pick this \| When you need a fast, explainable system without a training pipeline, or for "similar items" sections on product pages. \|

	---

	#### How They Complement Each Other

	\| Role \| Model \| Why \|
	\|---\|---\|---\|
	\| Baseline \| SVD \| Sets the floor — simplest factorization \|
	\| Production model \| ALS \| Best for implicit feedback at scale \|
	\| Calibrated predictions \| SGD \| Bias terms give better-scaled scores \|
	\| Interpretable factors \| NMF \| When you need to explain why \|
	\| Product page recs \| Item-CF \| "Customers also bought" — no training needed \|
	\| Live updates \| ALS (Live page) \| Fast enough for repeated retraining \|

	In a real deployment, you might use ALS for homepage "recommended for you,"
	Item-CF for product pages ("similar items"), and NMF for internal analysis
	of customer segments.
	"""
	)

	# ── Full Picture ─────────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### The Complete Training Pipeline")

	st.graphviz_chart("""
	digraph pipeline {
	rankdir=TB
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	s1 [label="Start with random vectors\\n(user vectors + item vectors)", fillcolor="#dbeafe", color="#3b82f6"]
	s2 [label="Predict a rating\\nμ + b_user + b_item + (u⃗ · v⃗)", fillcolor="#d1fae5", color="#10b981"]
	s3 [label="Calculate error\\n(actual − predicted)", fillcolor="#fef3c7", color="#f59e0b"]
	s4 [label="Adjust vectors\\nto reduce error", fillcolor="#fce7f3", color="#ec4899"]
	s5 [label="Repeat for all ratings\\n(1 epoch)", fillcolor="#e0e7ff", color="#6366f1"]
	s6 [label="Repeat for many epochs\\nuntil error is small", fillcolor="#dbeafe", color="#3b82f6"]
	s7 [label="Vectors are now trained\\nUse them to predict missing values", fillcolor="#d1fae5", color="#10b981"]

	s1 -> s2 -> s3 -> s4 -> s5 -> s6 -> s7
	s6 -> s2 [label="next epoch", style=dashed, constraint=false]
	}
	""")

	st.markdown(
	"""
	Where each algorithm differs:

	\| Method \| How it adjusts vectors (Step 4) \| Bias terms? \| Skips missing entries? \|
	\|---\|---\|---\|---\|
	\| SGD (Funk SVD) \| Adjust one rating at a time using gradient descent \| Yes (μ + bᵤ + bᵢ) \| Yes — only trains on observed data \|
	\| ALS \| Fix all user vectors, solve all item vectors exactly (then swap) \| Yes (μ + bᵤ + bᵢ) \| Yes — uses confidence weights \|
	\| SVD \| No training loop — computes vectors mathematically in one step \| No (operates on raw matrix) \| No — treats missing as 0 \|
	\| NMF \| Multiplicative update: multiply by a ratio to stay non-negative \| No (factors absorb the scale) \| No — treats missing as 0 \|
	\| Item-CF \| No vectors at all — computes similarity directly from purchase patterns \| No (score is weighted similarity sum) \| N/A \|

	All matrix factorization methods share the same goal: **find vectors for users and items
	such that their dot product approximates the observed ratings/purchase counts.**
	They just use different math to get there.
	"""
	)

	# ═══════════════════════════════════════════════════════════════════════════════
	# Tab 2 — Modern Architecture & Industry
	# ═══════════════════════════════════════════════════════════════════════════════
	with tab_modern:
	st.subheader("Modern Recommendation Architecture")
	st.markdown(
	"The algorithms we've built (SVD, SGD, ALS, NMF) are the foundation of recommendation systems. "
	"Here's how they evolved into what companies like YouTube, Spotify, and Netflix use today."
	)

	# ── Two-Stage Architecture ───────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### Two-Stage Architecture")
	st.markdown(
	"Modern production recommenders don't use a single model. They split the problem "
	"into two stages to handle scale (millions of items) efficiently."
	)

	st.graphviz_chart("""
	digraph twostage {
	rankdir=TB
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	catalog [label="Full Product Catalog\\n(millions of items)", fillcolor="#dbeafe", color="#3b82f6"]

	subgraph cluster_stage1 {
	label="Stage 1 — Candidate Generation"
	style=dashed
	color="#6366f1"
	fontname="Helvetica"
	fontsize=11

	tower [label="Two-Tower Neural Network\\nUser tower encodes user\\nItem tower encodes item", fillcolor="#e0e7ff", color="#6366f1"]
	embed [label="Embedding Similarity\\n(fast approximate\\nnearest neighbor search)", fillcolor="#e0e7ff", color="#6366f1"]
	cands [label="~1,000 Candidate Items\\n(rough filter, fast)", fillcolor="#fce7f3", color="#ec4899"]

	tower -> cands
	embed -> cands
	}

	subgraph cluster_stage2 {
	label="Stage 2 — Ranking Model"
	style=dashed
	color="#10b981"
	fontname="Helvetica"
	fontsize=11

	features [label="Rich Features\\nuser profile + item metadata\\n+ context + history", fillcolor="#d1fae5", color="#10b981"]
	dnn [label="Deep Neural Network\\nscores each candidate", fillcolor="#d1fae5", color="#10b981"]
	ranked [label="Top 10-50 items\\n(precise ranking)", fillcolor="#fef3c7", color="#f59e0b"]

	features -> dnn -> ranked
	}

	catalog -> tower
	catalog -> embed
	cands -> features
	}
	""")

	st.markdown(
	"""
	\| Stage \| Purpose \| Speed vs Quality \| What we built \|
	\|---\|---\|---\|---\|
	\| Candidate Generation \| Quickly narrow millions of items to ~1,000 plausible candidates \| Fast, lower precision \| Our SVD/SGD/ALS/NMF — they produce a rough score for every item \|
	\| Ranking \| Precisely rank those ~1,000 using rich features \| Slower, high precision \| Not in our demo — requires deep learning infrastructure \|

	Why two stages? Scoring all items with a complex deep network is too slow.
	The first stage (where our algorithms live) acts as a fast filter. The second
	stage applies a heavyweight model only to the surviving candidates.
	"""
	)

	# ── What Matrix Factorization Became ─────────────────────────────────────
	st.markdown("---")
	st.markdown("### How Matrix Factorization Evolved Into Embeddings")

	st.graphviz_chart("""
	digraph evolution {
	rankdir=LR
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	mf [label="Matrix Factorization\\n(2006-2015)\\n\\nuser_vector (k dims)\\nitem_vector (k dims)\\n\\nScore = dot product", fillcolor="#dbeafe", color="#3b82f6"]
	embed [label="Embedding Learning\\n(2015-present)\\n\\nuser_embedding (k dims)\\nitem_embedding (k dims)\\n\\nLearned inside neural nets", fillcolor="#d1fae5", color="#10b981"]
	deep [label="Deep Recommender\\n(current state)\\n\\nuser_embedding\\nitem_embedding\\n+ context features\\n+ sequence history\\n+ non-linear layers", fillcolor="#fef3c7", color="#f59e0b"]

	mf -> embed [label=" same concept,\\n new training"]
	embed -> deep [label=" add more\\n signals"]
	}
	""")

	st.markdown(
	"""
	Matrix factorization didn't disappear — it became the foundation of modern embeddings:

	\| Concept \| Matrix Factorization (what we built) \| Deep Recommender (industry today) \|
	\|---\|---\|---\|
	\| User representation \| User vector from SGD/ALS (k numbers) \| User embedding learned by neural network \|
	\| Item representation \| Item vector from SGD/ALS (k numbers) \| Item embedding learned by neural network \|
	\| Score computation \| Dot product: `user_vec · item_vec` \| Neural network with multiple layers \|
	\| Input data \| Interactions only (purchases, clicks) \| Interactions + user profile + item metadata + time + context \|
	\| Relationships \| Linear (dot product) \| Non-linear (neural network can learn complex patterns) \|
	\| New items \| Cannot handle (no interactions = no vector) \| Can use item features (title, category, image) \|

	The user/item vectors from our ALS model are conceptually identical to the
	embeddings inside a modern deep recommender — the difference is how they're
	trained and what additional signals they incorporate.
	"""
	)

	# ── YouTube Example ──────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### Real Industry Example — YouTube")
	st.markdown(
	"YouTube's recommendation system (described in their 2016 paper *Deep Neural Networks "
	"for YouTube Recommendations*) follows the exact two-stage pattern:"
	)

	st.graphviz_chart("""
	digraph youtube {
	rankdir=TB
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	corpus [label="YouTube Video Corpus\\n(hundreds of millions of videos)", fillcolor="#dbeafe", color="#3b82f6"]

	subgraph cluster_yt1 {
	label="Candidate Generation"
	style=dashed
	color="#6366f1"
	fontname="Helvetica"

	yt_tower [label="Two-Tower DNN\\n\\nInputs: watch history,\\nsearch history, demographics\\n\\nOutput: user embedding", fillcolor="#e0e7ff", color="#6366f1"]
	yt_ann [label="Approximate Nearest\\nNeighbor Search\\n\\n→ ~hundreds of\\nvideo candidates", fillcolor="#fce7f3", color="#ec4899"]
	}

	subgraph cluster_yt2 {
	label="Ranking"
	style=dashed
	color="#10b981"
	fontname="Helvetica"

	yt_features [label="Hundreds of Features\\n\\nvideo age, channel quality,\\nuser language, device,\\ntime of day, watch %...", fillcolor="#d1fae5", color="#10b981"]
	yt_rank [label="Deep Ranking Network\\n\\nPredicts: P(watch)\\nP(like), P(share)...", fillcolor="#fef3c7", color="#f59e0b"]
	}

	yt_result [label="Top ~20 videos\\nshown to user", fillcolor="#d1fae5", color="#10b981"]

	corpus -> yt_tower -> yt_ann -> yt_features -> yt_rank -> yt_result
	}
	""")

	st.markdown(
	"""
	\| Component \| YouTube's Approach \| Our Demo Equivalent \|
	\|---\|---\|---\|
	\| User representation \| Learned embedding from watch/search history \| User vector from SGD/ALS/SVD \|
	\| Item representation \| Video embedding from features \| Item vector from SGD/ALS/SVD \|
	\| Candidate generation \| Two-tower DNN + ANN search \| Our full prediction matrix \|
	\| Ranking \| Deep network with hundreds of features \| Not in our demo \|
	\| Scale \| Hundreds of millions of videos, billions of users \| 3,665 items, 4,338 users \|
	"""
	)

	# ── Limitations of MF ────────────────────────────────────────────────────
	st.markdown("---")
	st.markdown("### Limitations of Matrix Factorization vs Deep Learning")

	st.markdown(
	"""
	\| Limitation \| Matrix Factorization (our models) \| Deep Learning Solution \|
	\|---\|---\|---\|
	\| Only uses interactions \| Can only learn from purchases — knows nothing about the product itself \| Can incorporate item metadata (description, category, price, images) \|
	\| Ignores context \| Same recommendation regardless of time, device, or situation \| Uses context: time of day, device, location, season \|
	\| No sequence awareness \| Treats all purchases equally — doesn't know the order \| Sequence models (RNN, Transformer) understand "watched A, then B, likely wants C" \|
	\| Linear relationships \| Score = dot product (linear combination of factors) \| Neural networks learn non-linear patterns and complex interactions \|
	\| Cold start \| New item with no purchases = no vector = cannot recommend \| Can use item features to generate embedding even with zero interactions \|
	\| Static \| User vector doesn't change between retraining \| Real-time models update user embedding with every interaction \|
	"""
	)

	st.markdown("---")
	st.markdown("### Where Our Demo Fits")

	st.graphviz_chart("""
	digraph spectrum {
	rankdir=LR
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.2, style=dashed]

	pop [label="Popularity\\nBaseline\\n(recommend\\nbest-sellers)", fillcolor="#fee2e2", color="#ef4444"]
	cf [label="Item-Based CF\\n(cosine similarity)\\n\\nOur demo ✓", fillcolor="#fef3c7", color="#f59e0b"]
	mf [label="Matrix Factorization\\n(SVD, SGD, ALS, NMF)\\n\\nOur demo ✓", fillcolor="#d1fae5", color="#10b981"]
	fm [label="Factorization\\nMachines\\n(adds features\\nto MF)", fillcolor="#e0e7ff", color="#6366f1"]
	deep [label="Deep Learning\\n(two-tower,\\nTransformer)\\n\\nIndustry today", fillcolor="#dbeafe", color="#3b82f6"]

	pop -> cf -> mf -> fm -> deep
	}
	""")

	st.markdown(
	"""
	Our demo implements the middle of the spectrum — the algorithms that
	established the field and are still used as candidate generators in production
	systems. Matrix factorization remains the conceptual backbone: the user and item
	vectors we learn with ALS are the direct ancestors of the embeddings inside
	YouTube's, Spotify's, and Netflix's deep recommenders.

	For a data science project, this is the right starting point. You demonstrate
	the core concepts (latent factors, matrix decomposition, bias terms, implicit feedback),
	then scale up to deep learning when the business case justifies the infrastructure.
	"""
	)

	st.markdown("---")
	st.info(
	"Note: The deep learning model described above (YouTube's Two-Tower Deep Neural Network "
	"with separate Candidate Generation and Ranking stages) is explained but not implemented "
	"in this demo. It requires TensorFlow/PyTorch, GPU resources, and significantly more data "
	"(YouTube trains on billions of interactions — our dataset has ~215K). Our 5 implemented models "
	"(SVD, ALS, SGD, NMF, Item-CF) are the foundational algorithms that YouTube's deep system evolved from."
	)

	# ═══════════════════════════════════════════════════════════════════════════════
	# Tab 3 — Model Comparison
	# ═══════════════════════════════════════════════════════════════════════════════
	with tab_compare:
	st.subheader("Model Performance Comparison")
	st.markdown(
	"We train all five algorithms on the same training set and evaluate "
	"on held-out test interactions using Precision@10, Recall@10, and Hit Rate."
	)

	interactions, user_map, item_map, item_desc, _ = build_interaction_matrix()
	train_df, train_sparse, test_df, _, _, _ = get_rec_train_test()
	train_dense = train_df.values.astype(np.float32)

	user_index = {cust: i for i, cust in enumerate(train_df.index)}
	item_columns = list(train_df.columns)

	n_factors = st.slider("Number of latent factors (k)", 10, 100, 50, step=10)

	st.caption(
	"💡 Note: Training results are cached by Streamlit. When you change k and click train, "
	"the models will retrain with the new value. Subsequent clicks with the same k will be instant (cached). "
	"Watch the timing indicators to see actual training vs. cache hits."
	)

	# Show if results exist from previous training
	if "rec_results" in st.session_state and "rec_n_factors" in st.session_state:
	prev_k = st.session_state.rec_n_factors
	if prev_k == n_factors:
	st.info(f"📊 Results from previous training with k={n_factors} are shown below. Click train button to refresh.")
	else:
	st.warning(f"⚠️ Current results are from k={prev_k}. Click train button to retrain with k={n_factors}.")

	if st.button("Train & Evaluate All Models", type="primary", use_container_width=True):
	import time
	start_time = time.time()
	results = {}

	progress_bar = st.progress(0, text="Starting training...")

	with st.spinner("Training SVD..."):
	t0 = time.time()
	svd_pred, *_ = train_svd(train_sparse, n_factors=n_factors)
	results["SVD"] = evaluate_recommendations(svd_pred, train_sparse, test_df, user_index, item_columns)
	elapsed = time.time() - t0
	st.success(f"SVD trained ({elapsed:.2f}s)")
	progress_bar.progress(0.2, text="SVD complete, training ALS...")

	with st.spinner("Training ALS..."):
	t0 = time.time()
	als_model = train_als(train_sparse, n_factors=n_factors)
	als_user = als_model.user_factors
	als_item = als_model.item_factors
	als_pred = als_user @ als_item.T
	results["ALS"] = evaluate_recommendations(als_pred, train_sparse, test_df, user_index, item_columns)
	elapsed = time.time() - t0
	st.success(f"ALS trained ({elapsed:.2f}s)")
	progress_bar.progress(0.4, text="ALS complete, training SGD...")

	with st.spinner("Training SGD (Funk SVD) — this may take ~60s on first run..."):
	t0 = time.time()
	sgd_pred, *_ = train_sgd(train_sparse, n_factors=n_factors, n_epochs=30)
	results["SGD"] = evaluate_recommendations(sgd_pred, train_sparse, test_df, user_index, item_columns)
	elapsed = time.time() - t0
	st.success(f"SGD trained ({elapsed:.2f}s)")
	progress_bar.progress(0.6, text="SGD complete, training NMF...")

	with st.spinner("Training NMF..."):
	t0 = time.time()
	nmf_pred, *_ = train_nmf(train_dense, n_factors=n_factors)
	results["NMF"] = evaluate_recommendations(nmf_pred, train_sparse, test_df, user_index, item_columns)
	elapsed = time.time() - t0
	st.success(f"NMF trained ({elapsed:.2f}s)")
	progress_bar.progress(0.8, text="NMF complete, training Item-CF...")

	with st.spinner("Training Item-Based CF..."):
	t0 = time.time()
	cf_pred, _ = train_item_cf(train_sparse)
	results["Item-CF"] = evaluate_recommendations(cf_pred, train_sparse, test_df, user_index, item_columns)
	elapsed = time.time() - t0
	st.success(f"Item-Based CF trained ({elapsed:.2f}s)")
	progress_bar.progress(1.0, text="All models trained!")

	total_time = time.time() - start_time
	st.info(f"✅ All 5 models trained with k={n_factors} latent factors in {total_time:.2f}s total")

	# Store in session state for persistence
	st.session_state.rec_results = results
	st.session_state.rec_n_factors = n_factors
	st.session_state.rec_svd_pred = svd_pred
	st.session_state.rec_als_pred = als_pred
	st.session_state.rec_sgd_pred = sgd_pred
	st.session_state.rec_nmf_pred = nmf_pred
	st.session_state.rec_cf_pred = cf_pred
	st.session_state["rec_models_trained"] = True
	st.session_state["svd_pred"] = svd_pred
	st.session_state["als_pred"] = als_pred
	st.session_state["sgd_pred"] = sgd_pred
	st.session_state["nmf_pred"] = nmf_pred
	st.session_state["cf_pred"] = cf_pred
	st.session_state["train_df"] = train_df
	st.session_state["item_desc"] = item_desc
	st.session_state["item_map"] = item_map
	st.session_state["user_map"] = user_map

	st.rerun()

	# Display results and metrics explanation section
	if "rec_results" in st.session_state and st.session_state.get("rec_models_trained", False):
	st.markdown("---")
	st.markdown("### Results")

	results = st.session_state.rec_results
	k_used = st.session_state.rec_n_factors

	st.caption(f"Results from training with k={k_used} latent factors")

	metrics_df = pd.DataFrame(results).T
	metrics_df = metrics_df[["Precision@K", "Recall@K", "Hit Rate", "RMSE", "R²", "Users Evaluated"]]

	# Format for display
	display_df = metrics_df.copy()
	display_df["Precision@K"] = display_df["Precision@K"].map("{:.4f}".format)
	display_df["Recall@K"] = display_df["Recall@K"].map("{:.4f}".format)
	display_df["Hit Rate"] = display_df["Hit Rate"].map("{:.2%}".format)
	display_df["RMSE"] = display_df["RMSE"].map("{:.2f}".format)
	display_df["R²"] = display_df["R²"].map("{:.4f}".format)
	display_df["Users Evaluated"] = display_df["Users Evaluated"].apply(lambda x: str(int(x)))

	st.dataframe(display_df, use_container_width=True)

	st.info(
	"Metric Guide:\n\n"
	"- Precision@K: Of the K items recommended, what % were actually purchased? (Higher is better)\n"
	"- Recall@K: Of all items the user purchased, what % did we recommend? (Higher is better)\n"
	"- Hit Rate: % of users who got at least one relevant recommendation (Higher is better)\n"
	"- RMSE: Root Mean Squared Error - measures prediction accuracy for quantities (Lower is better)\n"
	"- R²: How well the model explains variance in purchase quantities (1.0 = perfect, 0 = baseline, negative = worse than baseline)"
	)

	raw_results = pd.DataFrame(results).T

	# Create two separate charts: one for ranking metrics, one for prediction metrics
	col1, col2 = st.columns(2)

	with col1:
	st.markdown("Ranking Metrics (Higher is Better)")
	fig_ranking = go.Figure()
	for metric in ["Precision@K", "Recall@K", "Hit Rate"]:
	fig_ranking.add_trace(go.Bar(
	name=metric,
	x=list(results.keys()),
	y=raw_results[metric].values,
	text=[f"{v:.4f}" for v in raw_results[metric].values],
	textposition="auto",
	))
	fig_ranking.update_layout(
	barmode="group",
	yaxis_title="Score",
	height=400,
	showlegend=True,
	)
	st.plotly_chart(fig_ranking, use_container_width=True)

	with col2:
	st.markdown("Prediction Metrics")
	fig_pred = go.Figure()

	# RMSE (lower is better) - show on left y-axis
	fig_pred.add_trace(go.Bar(
	name="RMSE (lower is better)",
	x=list(results.keys()),
	y=raw_results["RMSE"].values,
	text=[f"{v:.2f}" for v in raw_results["RMSE"].values],
	textposition="auto",
	marker_color="lightcoral",
	))

	fig_pred.update_layout(
	yaxis_title="RMSE (lower is better)",
	height=400,
	showlegend=True,
	)
	st.plotly_chart(fig_pred, use_container_width=True)

	# R² in a separate small chart
	st.markdown("R² Score (1.0 = perfect, 0 = baseline)")
	fig_r2 = go.Figure()
	fig_r2.add_trace(go.Bar(
	x=list(results.keys()),
	y=raw_results["R²"].values,
	text=[f"{v:.3f}" for v in raw_results["R²"].values],
	textposition="auto",
	marker_color="lightblue",
	showlegend=False,
	))
	fig_r2.update_layout(
	yaxis_title="R² Score",
	height=300,
	)
	st.plotly_chart(fig_r2, use_container_width=True)

	st.markdown("---")
	st.markdown("### Understanding These Metrics")

	st.markdown(
	"""
	#### How We Test

	Before training, we hide some items each customer actually purchased (the test set).
	After training, we ask each model: "What are your top 10 recommendations for this customer?"
	Then we check how many of those recommendations match the hidden items.
	"""
	)

	st.graphviz_chart("""
	digraph eval {
	rankdir=LR
	node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
	edge [color="#888888", penwidth=1.5]

	full [label="Customer's full\\npurchase history\\n(100 items)", fillcolor="#dbeafe", color="#3b82f6"]
	train [label="Training set\\n(80 items)\\nModel sees these", fillcolor="#d1fae5", color="#10b981"]
	test [label="Test set\\n(20 items)\\nHidden from model", fillcolor="#fee2e2", color="#ef4444"]
	rec [label="Model's top 10\\nrecommendations", fillcolor="#fef3c7", color="#f59e0b"]
	check [label="How many of the 10\\nare in the hidden 20?", fillcolor="#e0e7ff", color="#6366f1"]

	full -> train
	full -> test
	train -> rec [label=" model trained"]
	rec -> check
	test -> check [label=" compare"]
	}
	""")

	st.markdown(
	"""
	#### Example — Customer #12345

	This customer bought 100 products total. We hide 20 of them for testing.
	The model trains on the remaining 80. Then we ask for 10 recommendations.

	\| Model's Top 10 Recommendations \| In the hidden 20? \|
	\|---\|---\|
	\| 1. Cake Stand \| No \|
	\| 2. Vintage Teacup Set \| Yes (was in hidden test set) \|
	\| 3. Baking Tins \| No \|
	\| 4. Candle Holder \| No \|
	\| 5. Recipe Book Stand \| Yes (was in hidden test set) \|
	\| 6. Napkin Set \| No \|
	\| 7. Cookie Cutters \| No \|
	\| 8. Storage Jar \| No \|
	\| 9. Tea Towel \| No \|
	\| 10. Egg Cup Set \| No \|

	2 out of 10 recommendations matched hidden purchases.

	---

	#### What Each Metric Means for This Customer
	"""
	)

	st.latex(r"\text{Precision@10} = \frac{\text{correct recommendations}}{\text{total recommendations}} = \frac{2}{10} = 0.20")
	st.markdown(
	"""
	"20% of what we recommended was actually relevant."

	If you showed these 10 products on a homepage, 2 of them would be items
	the customer genuinely wanted. The other 8 might still be interesting
	(the customer just hasn't bought them yet), but we can only verify the ones in our test set.
	"""
	)

	st.latex(r"\text{Recall@10} = \frac{\text{correct recommendations}}{\text{total hidden items}} = \frac{2}{20} = 0.10")
	st.markdown(
	"""
	"We found 10% of the items the customer actually wanted."

	The customer had 20 hidden items we needed to find. We found 2 of them in our top 10.
	With only 10 recommendation slots and 3,665 total products, finding even 2 out of 20
	is meaningful — pure random guessing would find ~0.05 items.
	"""
	)

	st.markdown(
	"""
	Hit Rate asks a simpler question: "Did the customer get at least one good recommendation?"
	For this customer, the answer is Yes (2 hits > 0). If we repeat this across all 4,080
	evaluated customers:
	"""
	)

	st.latex(r"\text{Hit Rate} = \frac{\text{customers with} \geq 1 \text{ hit}}{\text{total customers evaluated}}")

	st.markdown(
	"""
	---

	#### Full Calculation — All 3 Metrics Across 5 Customers

	Here's how the final numbers are computed. Suppose we evaluate 5 customers:

	\| Customer \| Hidden Items \| Model's Top 10 \| Hits \| Precision@10 \| Recall@10 \| Hit? \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| A \| 20 items \| 2 matched \| 2 \| 2/10 = 0.20 \| 2/20 = 0.10 \| Yes \|
	\| B \| 8 items \| 0 matched \| 0 \| 0/10 = 0.00 \| 0/8 = 0.00 \| No \|
	\| C \| 15 items \| 1 matched \| 1 \| 1/10 = 0.10 \| 1/15 = 0.07 \| Yes \|
	\| D \| 5 items \| 0 matched \| 0 \| 0/10 = 0.00 \| 0/5 = 0.00 \| No \|
	\| E \| 12 items \| 3 matched \| 3 \| 3/10 = 0.30 \| 3/12 = 0.25 \| Yes \|
	"""
	)

	st.latex(r"\text{Precision@10} = \text{average} = \frac{0.20 + 0.00 + 0.10 + 0.00 + 0.30}{5} = 0.12")
	st.latex(r"\text{Recall@10} = \text{average} = \frac{0.10 + 0.00 + 0.07 + 0.00 + 0.25}{5} = 0.084")
	st.latex(r"\text{Hit Rate} = \frac{\text{3 customers with hits}}{\text{5 total}} = 60\%")

	st.markdown(
	"""
	In our actual evaluation, this calculation runs across ~4,080 customers instead of 5.

	How to present these numbers:
	- Precision@10 = 0.12 means "on average, 1.2 out of every 10 recommendations are items the customer actually wanted"
	- Recall@10 = 0.084 means "on average, the model finds 8.4% of a customer's wanted items in just 10 slots"
	- Hit Rate = 60% means "60% of customers got at least one relevant recommendation"

	---

	#### Why the Numbers Look Small — And Why That's Normal

	\| Metric \| Typical Value \| Seems low? \| Why it's actually good \|
	\|---\|---\|---\|---\|
	\| Precision@10 \| 0.01 – 0.10 \| Yes \| With 3,665 products, finding any relevant item in 10 slots is hard. Random guessing gives ~0.003. \|
	\| Recall@10 \| 0.005 – 0.05 \| Yes \| A customer may have 20+ hidden items. Finding even 1 in 10 slots means the model understood their taste. \|
	\| Hit Rate \| 3% – 15% \| Maybe \| This means 3–15% of customers got at least one useful recommendation out of 10. At scale, that drives revenue. \|

	---

	#### RMSE and R² — Prediction Accuracy Metrics

	While Precision/Recall/Hit Rate measure ranking quality ("Did we recommend the right items?"),
	RMSE and R² measure prediction accuracy ("How well did we predict purchase quantities?").

	RMSE (Root Mean Squared Error):
	"""
	)

	st.latex(r"\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{actual}_i - \text{predicted}_i)^2}")

	st.markdown(
	"""
	What it measures: Average error in predicting how many units a customer will buy.

	Example:

	\| Customer \| Item \| Actual Quantity \| Predicted Quantity \| Error \| Squared Error \|
	\|---\|---\|---\|---\|---\|---\|
	\| A \| Teacup \| 5 \| 4.2 \| -0.8 \| 0.64 \|
	\| A \| Candles \| 3 \| 5.1 \| +2.1 \| 4.41 \|
	\| B \| Cake Stand \| 2 \| 1.8 \| -0.2 \| 0.04 \|
	\| B \| Bunting \| 8 \| 6.5 \| -1.5 \| 2.25 \|
	\| C \| Baking Set \| 1 \| 1.3 \| +0.3 \| 0.09 \|

	RMSE = √[(0.64 + 4.41 + 0.04 + 2.25 + 0.09) / 5] = √(7.43 / 5) = √1.486 = 1.22 units

	Interpretation: On average, our predictions are off by about 1.22 units.

	- Lower is better (0 = perfect predictions)
	- RMSE = 1.5 means predictions are typically within ±1.5 units of actual
	- Useful when quantity matters (inventory planning, revenue forecasting)

	When RMSE is important:
	- Predicting how much inventory to stock
	- Forecasting revenue from recommendations
	- Understanding if the model captures purchase patterns accurately

	---

	R² (R-squared / Coefficient of Determination):
	"""
	)

	st.latex(r"R^2 = 1 - \frac{\sum (\text{actual} - \text{predicted})^2}{\sum (\text{actual} - \text{mean})^2}")

	st.markdown(
	"""
	What it measures: How much variance in purchase quantities the model explains.

	Think of it as: "How much better is the model than just guessing the average?"

	Example:

	Suppose the average purchase quantity across all items is 3.5 units.

	\| Customer \| Item \| Actual \| Mean Baseline \| Model Prediction \| Baseline Error² \| Model Error² \|
	\|---\|---\|---\|---\|---\|---\|---\|
	\| A \| Teacup \| 5 \| 3.5 \| 4.2 \| (5-3.5)² = 2.25 \| (5-4.2)² = 0.64 \|
	\| A \| Candles \| 3 \| 3.5 \| 5.1 \| (3-3.5)² = 0.25 \| (3-5.1)² = 4.41 \|
	\| B \| Cake Stand \| 2 \| 3.5 \| 1.8 \| (2-3.5)² = 2.25 \| (2-1.8)² = 0.04 \|
	\| B \| Bunting \| 8 \| 3.5 \| 6.5 \| (8-3.5)² = 20.25 \| (8-6.5)² = 2.25 \|
	\| C \| Baking Set \| 1 \| 3.5 \| 1.3 \| (1-3.5)² = 6.25 \| (1-1.3)² = 0.09 \|

	Sum of baseline errors² = 31.25
	Sum of model errors² = 7.43

	R² = 1 - (7.43 / 31.25) = 1 - 0.238 = 0.762

	Interpretation:

	\| R² Value \| Meaning \|
	\|---\|---\|
	\| 1.0 \| Perfect predictions (model explains 100% of variance) \|
	\| 0.76 \| Model explains 76% of variance (good!) \|
	\| 0.5 \| Model explains 50% of variance (moderate) \|
	\| 0.0 \| Model is no better than just predicting the mean \|
	\| Negative \| Model is worse than predicting the mean (bad fit!) \|

	Why R² can be negative in recommendations:

	Unlike regression on continuous data, recommendation systems face challenges:
	- Sparse data (most user-item pairs are empty)
	- Implicit feedback (purchases don't mean "liked X amount")
	- Cold start (new users/items have no history)

	A negative R² means the model's predictions are worse than just guessing the average
	purchase quantity. This often happens with:
	- SVD (treats missing as 0, distorts predictions)
	- NMF (non-negativity constraint can be too restrictive)
	- Models that overfit to training data

	When R² is important:
	- Comparing models on the same dataset
	- Understanding if latent factors capture meaningful patterns
	- Deciding if a complex model justifies its complexity

	---

	#### Ranking Metrics vs Prediction Metrics — Which to Use?

	\| Scenario \| Use These Metrics \| Why \|
	\|---\|---\|---\|
	\| Homepage "Recommended for You" \| Precision@K, Recall@K, Hit Rate \| You care about showing relevant items, not predicting exact quantities \|
	\| "You might also like" section \| Precision@K, Hit Rate \| Ranking quality matters most \|
	\| Inventory planning \| RMSE, R² \| You need accurate quantity predictions \|
	\| Revenue forecasting \| RMSE, R² \| Dollar amounts depend on predicted quantities \|
	\| A/B testing recommendations \| All metrics \| Compare both ranking quality and prediction accuracy \|
	\| Model selection \| Precision@K + RMSE \| Balance relevance with prediction accuracy \|

	In practice:
	- Precision@K and Hit Rate are most important for user-facing recommendations
	- RMSE and R² are useful for business planning and understanding model quality
	- A model can have good ranking (high Precision@K) but poor quantity prediction (low R²)
	- For e-commerce, ranking usually matters more than exact quantities

	---

	#### Simple Explanation for Non-Technical Stakeholders

	Precision@K: "Out of 10 recommendations, how many did the customer actually want?"
	→ Measures recommendation relevance

	Hit Rate: "What % of customers got at least one good recommendation?"
	→ Measures overall success rate

	RMSE: "When we predict a customer will buy 5 units, how far off are we on average?"
	→ Measures prediction error in units

	R²: "How much better are our predictions than just guessing the average?"
	→ Measures model quality (1.0 = perfect, 0 = no better than average, negative = worse than average)

	The business perspective: If you have 100,000 customers and Hit Rate is 8%, that's
	8,000 customers who received a relevant product suggestion they wouldn't have found otherwise.
	Even a 2% conversion rate on those turns into 160 additional sales — from a fully automated system.

	Comparing models: The absolute numbers matter less than the relative ranking.
	If ALS gets 8% Hit Rate and SVD gets 5%, ALS is clearly better at learning from
	this data — even though both numbers seem small in isolation.

	---

	#### Why Random Baselines Help Contextualize

	For reference, a random recommender on our dataset would achieve approximately:
	- Precision@10 ≈ 0.003 (3,665 items, ~20 relevant → 20/3,665 chance per slot)
	- Hit Rate ≈ 5% (10 random tries at a 0.5% per-item hit rate)

	Any model scoring above these baselines has learned real patterns from the data.
	"""
	)

	# ═══════════════════════════════════════════════════════════════════════════════
	# Tab 3 — Interactive Recommendations
	# ═══════════════════════════════════════════════════════════════════════════════
	with tab_demo:
	st.subheader("Interactive Recommendation Demo")
	st.markdown(
	"Select a customer and see what each model recommends. "
	"You can also see what the customer has actually purchased to judge quality."
	)

	if not st.session_state.get("rec_models_trained"):
	st.warning("Please train the models first in the Model Comparison tab.")
	else:
	train_df = st.session_state["train_df"]
	item_desc = st.session_state["item_desc"]
	item_map = st.session_state["item_map"]

	customer_list = list(train_df.index)
	active_customers = [c for c in customer_list if (train_df.loc[c] > 0).sum() >= 10]

	selected_customer = st.selectbox(
	"Select a Customer",
	active_customers[:100],
	format_func=lambda x: f"Customer {int(x)} ({int((train_df.loc[x] > 0).sum())} products purchased)",
	)

	user_idx = customer_list.index(selected_customer)

	st.markdown("#### What this customer has purchased")
	purchased = train_df.loc[selected_customer]
	purchased_items = purchased[purchased > 0].sort_values(ascending=False).head(15)
	purch_display = pd.DataFrame({
	"StockCode": purchased_items.index,
	"Quantity": purchased_items.values.astype(int),
	"Description": [item_desc.get(sc, "N/A") for sc in purchased_items.index],
	})
	st.dataframe(purch_display, use_container_width=True, hide_index=True)

	st.markdown("---")
	st.markdown("#### Recommendations from each model")

	model_preds = {
	"SVD": st.session_state["svd_pred"],
	"ALS": st.session_state["als_pred"],
	"SGD (Funk SVD)": st.session_state["sgd_pred"],
	"NMF": st.session_state["nmf_pred"],
	"Item-CF": st.session_state["cf_pred"],
	}

	n_recs = st.slider("Number of recommendations", 5, 20, 10)

	cols = st.columns(2)
	for i, (model_name, pred_matrix) in enumerate(model_preds.items()):
	with cols[i % 2]:
	st.markdown(f"{model_name}")
	top_items, top_scores = get_top_n_recommendations(
	pred_matrix, train_df.values, user_idx, n=n_recs,
	)
	rec_display = pd.DataFrame({
	"Rank": range(1, len(top_items) + 1),
	"StockCode": [train_df.columns[j] for j in top_items],
	"Score": [f"{s:.3f}" for s in top_scores],
	"Description": [item_desc.get(train_df.columns[j], "N/A") for j in top_items],
	})
	st.dataframe(rec_display, use_container_width=True, hide_index=True)

	st.markdown("---")
	st.markdown(
	"How to evaluate these recommendations: Look at the customer's purchase "
	"history above and check if the recommended items make sense. For example, if "
	"the customer bought a lot of kitchen items, good recommendations would include "
	"other kitchen products they haven't purchased yet."
	)