ml-demo / pages /7_Rec_Models.py
aliarafat-stack-ml's picture
first commit
f3a6f24
import streamlit as st
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from utils.rec_data_loader import build_interaction_matrix, get_rec_train_test
from utils.rec_models import (
train_svd, train_als, train_sgd, train_nmf, train_item_cf,
get_top_n_recommendations, evaluate_recommendations,
)
st.set_page_config(page_title="Rec Models", page_icon="🎯", layout="wide")
st.title("Product Recommendation β€” Models")
st.markdown("---")
tab_how, tab_modern, tab_compare, tab_demo = st.tabs([
"How The Algorithms Work", "Modern Architecture & Industry", "Model Comparison", "Interactive Recommendations"
])
# ═══════════════════════════════════════════════════════════════════════════════
# Tab 1 β€” How The Algorithms Work
# ═══════════════════════════════════════════════════════════════════════════════
with tab_how:
st.subheader("Recommendation Algorithms β€” A Visual Guide")
st.markdown(
"All these algorithms solve the same problem: **fill in the empty cells** "
"of the user-item matrix. They just approach it differently."
)
# ── Matrix Factorization overview ────────────────────────────────────────
st.markdown("---")
st.markdown("### The Core Idea: Matrix Factorization")
st.markdown(
"Matrix factorization is like discovering the **hidden reasons** behind purchases."
)
st.graphviz_chart("""
digraph mf {
rankdir=LR
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
R [label="User-Item Matrix\\nR (4338 Γ— 3665)\\nMostly empty", fillcolor="#dbeafe", color="#3b82f6"]
approx [label="β‰ˆ", shape=plaintext, fontsize=24]
U [label="User Matrix\\nU (4338 Γ— k)\\nEach user as k\\nlatent preferences", fillcolor="#d1fae5", color="#10b981"]
times [label="Γ—", shape=plaintext, fontsize=24]
V [label="Item Matrix\\nV (k Γ— 3665)\\nEach item as k\\nlatent attributes", fillcolor="#fce7f3", color="#ec4899"]
R -> approx [arrowhead=none]
approx -> U [arrowhead=none]
U -> times [arrowhead=none]
times -> V [arrowhead=none]
}
""")
st.markdown(
"""
**Example:**
Suppose there are 3 hidden factors: "kitchen items", "vintage style", "party supplies".
- **User profile** (U): Customer 12345 = [0.9 kitchen, 0.1 vintage, 0.7 party]
- **Item profile** (V): "Cake Stand" = [0.8 kitchen, 0.3 vintage, 0.5 party]
- **Predicted score** = 0.9Γ—0.8 + 0.1Γ—0.3 + 0.7Γ—0.5 = **1.10** β†’ strong recommendation!
The model **discovers** these latent factors automatically from the purchase patterns.
We set the number of factors (k), and the algorithm figures out what they represent.
"""
)
with st.expander("Where do the vectors come from? (Step-by-step from scratch)", expanded=False):
st.markdown(
"""
At the start, **no vectors exist**. We only have raw purchase data:
| | Cake Stand | Candles | Baking Set | Bunting |
|---|---|---|---|---|
| **Customer A** | 5 | 3 | ? | 1 |
| **Customer B** | 4 | ? | 2 | ? |
| **Customer C** | ? | 4 | 5 | 3 |
The `?` entries are what we want to predict.
---
**Step 1 β€” Create random vectors**
We just assign random numbers. With k=2 factors:
| User | f1 | f2 |
|---|---|---|
| A | 0.3 | 0.7 |
| B | 0.5 | 0.2 |
| C | 0.1 | 0.9 |
| Item | f1 | f2 |
|---|---|---|
| Cake Stand | 0.4 | 0.6 |
| Candles | 0.8 | 0.3 |
| Baking Set | 0.2 | 0.7 |
| Bunting | 0.6 | 0.1 |
These are random guesses. They are wrong. That's the starting point for training.
---
**Step 2 β€” Predict using the full formula**
The raw dot product gives small numbers (e.g. 0.54), but purchase counts are on
a different scale. So the full prediction formula includes **bias terms**:
"""
)
st.latex(r"\text{predicted} = \underbrace{\mu}_{\text{global mean}} + \underbrace{b_u}_{\text{user bias}} + \underbrace{b_i}_{\text{item bias}} + \underbrace{\vec{u} \cdot \vec{v}}_{\text{dot product}}")
st.markdown(
"""
| Term | Value | What it captures |
|---|---|---|
| **ΞΌ** (global mean) | 3.4 | Average of all known values in the matrix |
| **bα΅€** (user bias) | +0.5 | Customer A buys more than the average customer |
| **bα΅’** (item bias) | +0.4 | Cake Stand is purchased more than the average item |
| **dot product** | 0.54 | Specific affinity between this user and this item |
`predicted = 3.4 + 0.5 + 0.4 + 0.54 = 4.84 β‰ˆ 5` βœ“
Without biases, the model would see enormous errors during early training
and the vectors would take much longer to converge β€” or might not converge at all.
---
**Step 3 β€” Calculate error, adjust, repeat**
`error = actual βˆ’ predicted = 5 βˆ’ 4.84 = 0.16`
The algorithm nudges the vectors and biases slightly to reduce this error,
then moves to the next rating and does the same. After thousands of passes
(epochs), the vectors converge to values that accurately reconstruct all
the known ratings β€” and can predict the `?` cells.
The individual algorithm walkthroughs below show exactly how each method
does this differently.
"""
)
# ── SVD ──────────────────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### 1. SVD (Singular Value Decomposition)")
st.markdown("*The classic matrix factorization technique from linear algebra.*")
st.graphviz_chart("""
digraph svd {
rankdir=LR
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
R [label="R\\nUser-Item\\nMatrix", fillcolor="#dbeafe", color="#3b82f6"]
eq [label="=", shape=plaintext, fontsize=24]
U [label="U\\nUser factors\\n(users Γ— k)", fillcolor="#d1fae5", color="#10b981"]
S [label="Ξ£\\nSingular values\\n(k Γ— k diagonal)\\nimportance weights", fillcolor="#fef3c7", color="#f59e0b"]
Vt [label="Vα΅€\\nItem factors\\n(k Γ— items)", fillcolor="#fce7f3", color="#ec4899"]
R -> eq [arrowhead=none]
eq -> U [arrowhead=none]
U -> S [label=" Γ— "]
S -> Vt [label=" Γ— "]
}
""")
st.markdown(
"""
**How it works:**
1. SVD decomposes the user-item matrix R into three matrices: **U Γ— Ξ£ Γ— Vα΅€**
2. **U** captures user preferences in k-dimensional "latent space"
3. **Ξ£** (Sigma) is a diagonal matrix of importance weights β€” the first few factors
capture the most important patterns, the rest are noise
4. **Vα΅€** captures item characteristics in the same latent space
5. We keep only the top **k** factors (truncated SVD) β€” this filters noise and
fills in the missing values with predictions
**Strengths:** Mathematically elegant, well-understood, fast with sparse matrices.
**Weaknesses:** Treats missing entries as zero (problematic for implicit feedback β€”
"didn't buy" β‰  "doesn't like").
"""
)
with st.expander("Step-by-step numerical walkthrough β€” SVD", expanded=False):
st.markdown(
"""
**Starting point β€” the user-item matrix (purchase counts):**
| | Cake Stand | Candles | Baking Set | Bunting |
|---|---|---|---|---|
| **Customer A** | 5 | 3 | ? | 1 |
| **Customer B** | 4 | ? | 2 | ? |
| **Customer C** | ? | 4 | 5 | 3 |
`?` = never purchased (stored as 0 in the matrix).
**Step 1 β€” Decompose the matrix**
SVD splits this matrix into three pieces: R = U Γ— Ξ£ Γ— Vα΅€
With k=2 factors, we get:
- **U** (3 users Γ— 2 factors) β€” each user as 2 numbers
- **Ξ£** (2 Γ— 2 diagonal) β€” importance weights
- **Vα΅€** (2 factors Γ— 4 items) β€” each item as 2 numbers
SVD computes this directly using linear algebra (no iterative training).
**Step 2 β€” The decomposition gives us vectors**
After SVD:
| User | Factor 1 | Factor 2 |
|---|---|---|
| A | 0.72 | -0.31 |
| B | 0.58 | -0.44 |
| C | 0.85 | 0.52 |
| Item | Factor 1 | Factor 2 |
|---|---|---|
| Cake Stand | 0.68 | -0.25 |
| Candles | 0.55 | 0.41 |
| Baking Set | 0.71 | 0.50 |
| Bunting | 0.43 | 0.38 |
**Step 3 β€” Predict missing values**
Customer B, Baking Set (was `?`):
`predicted = [0.58, -0.44] Β· [0.71, 0.50] = (0.58Γ—0.71) + (-0.44Γ—0.50) = 0.41 - 0.22 = 0.19`
Low score β†’ weak recommendation.
Customer A, Baking Set (was `?`):
`predicted = [0.72, -0.31] Β· [0.71, 0.50] = (0.72Γ—0.71) + (-0.31Γ—0.50) = 0.51 - 0.16 = 0.36`
Higher score β†’ stronger recommendation than for B.
**Key point:** SVD fills the `?` cells using the learned vectors. The math
ensures that users with similar purchase patterns get similar vectors, and
therefore similar predictions for unseen items.
**Limitation:** SVD treats every `?` as 0 during decomposition. So "never bought"
is treated the same as "bought 0 times" β€” which loses the distinction between
"not interested" and "never seen this product."
"""
)
# ── ALS ──────────────────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### 2. ALS (Alternating Least Squares)")
st.markdown("*The industry standard for implicit feedback β€” used at Spotify and Netflix.*")
st.graphviz_chart("""
digraph als {
rankdir=TB
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
init [label="Initialize\\nRandom user & item\\nfactor matrices", fillcolor="#dbeafe", color="#3b82f6"]
fix_u [label="Fix User factors\\nOptimize Item factors\\n(least squares)", fillcolor="#d1fae5", color="#10b981"]
fix_i [label="Fix Item factors\\nOptimize User factors\\n(least squares)", fillcolor="#fce7f3", color="#ec4899"]
check [label="Check convergence\\n(error decreasing?)", fillcolor="#fef3c7", color="#f59e0b"]
done [label="Final U and V\\nPredict: U Γ— Vα΅€", fillcolor="#e0e7ff", color="#6366f1"]
init -> fix_u -> fix_i -> check
check -> fix_u [label="not converged", style=dashed]
check -> done [label="converged"]
}
""")
st.markdown(
"""
**How it works:**
1. Start with random user factors (U) and item factors (V)
2. **Fix U**, solve for the best V (this is a standard least squares problem)
3. **Fix V**, solve for the best U (again, least squares)
4. **Alternate** between steps 2 and 3 until the error stops decreasing
5. Multiply U Γ— Vα΅€ to get predicted scores for every user-item pair
**Why "Alternating"?** Optimizing U and V simultaneously is hard (non-convex).
But if you fix one, optimizing the other becomes a simple, solvable least-squares
problem. By alternating, we make steady progress toward a good solution.
**Key advantage for implicit feedback:** ALS treats observed interactions (purchases)
and unobserved ones (empty cells) differently. It assigns **confidence weights** β€”
a purchase of 10 units gets higher confidence than 1 unit, while empty cells get
low (but non-zero) confidence. This is exactly right for our e-commerce data.
**Strengths:** Designed for implicit data, parallelizable, scales to millions of users.
**Weaknesses:** More complex to implement, requires tuning regularization.
"""
)
with st.expander("Step-by-step numerical walkthrough β€” How ALS Actually Works", expanded=False):
st.markdown(
"""
ALS (Alternating Least Squares) is different from gradient descent methods.
Let me show you the real ALS algorithm with numbers.
**Starting point β€” same matrix:**
| | Cake Stand | Candles | Baking Set | Bunting |
|---|---|---|---|---|
| **Customer A** | 5 | 3 | ? | 1 |
| **Customer B** | 4 | ? | 2 | ? |
| **Customer C** | ? | 4 | 5 | 3 |
---
**Step 1 β€” Initialize random vectors (k=2 factors)**
| User | f1 | f2 |
|---|---|---|
| A | 0.3 | 0.7 |
| B | 0.5 | 0.2 |
| C | 0.1 | 0.9 |
| Item | f1 | f2 |
|---|---|---|
| Cake Stand | 0.4 | 0.6 |
| Candles | 0.8 | 0.3 |
| Baking Set | 0.2 | 0.7 |
| Bunting | 0.6 | 0.1 |
---
**Step 2 β€” Fix User vectors, Solve for ALL Item vectors at once**
This is where ALS is different! Instead of adjusting one vector at a time,
we solve for ALL item vectors simultaneously using a mathematical formula.
For **Cake Stand** (bought by A and B):
Users who bought it:
- Customer A: rating=5, vector=[0.3, 0.7]
- Customer B: rating=4, vector=[0.5, 0.2]
Build matrices:
```
U = [[0.3, 0.7], (users who bought Cake Stand)
[0.5, 0.2]]
r = [5, 4] (their ratings)
```
The closed-form solution:
"""
)
st.latex(r"v_{\text{CakeStand}} = (U^T U + \lambda I)^{-1} U^T r")
st.markdown(
"""
With regularization Ξ»=0.1:
```
U^T U = [[0.34, 0.31],
[0.31, 0.53]]
U^T U + Ξ»I = [[0.44, 0.31],
[0.31, 0.63]]
U^T r = [3.5, 4.3]
Invert and multiply:
v_CakeStand = [2.8, 5.1] ← New vector!
```
**Key difference:** We didn't use a learning rate or take small steps.
We solved the equation exactly using matrix math!
Do this for ALL items:
- Cake Stand β†’ [2.8, 5.1]
- Candles β†’ [1.9, 3.2]
- Baking Set β†’ [2.1, 4.8]
- Bunting β†’ [1.5, 2.9]
---
**Step 3 β€” Fix Item vectors, Solve for ALL User vectors at once**
Now flip it! Fix the new item vectors and solve for user vectors.
For **Customer A** (bought Cake Stand, Candles, Bunting):
Items they bought:
- Cake Stand: rating=5, vector=[2.8, 5.1]
- Candles: rating=3, vector=[1.9, 3.2]
- Bunting: rating=1, vector=[1.5, 2.9]
Build matrices:
```
V = [[2.8, 5.1], (items Customer A bought)
[1.9, 3.2],
[1.5, 2.9]]
r = [5, 3, 1] (Customer A's ratings)
```
Same formula:
"""
)
st.latex(r"u_A = (V^T V + \lambda I)^{-1} V^T r")
st.markdown(
"""
Result: `u_A = [0.82, 0.91]` ← New user vector!
Do this for ALL users:
- Customer A β†’ [0.82, 0.91]
- Customer B β†’ [0.73, 0.45]
- Customer C β†’ [0.31, 0.88]
---
**Step 4 β€” Alternate until convergence**
| Iteration | What happens | Error |
|---|---|---|
| 1 | Fix U, solve all V | 3.2 |
| 2 | Fix V, solve all U | 1.8 |
| 3 | Fix U, solve all V | 0.9 |
| 4 | Fix V, solve all U | 0.4 |
| 5 | Fix U, solve all V | 0.2 |
| 6 | Fix V, solve all U | 0.1 |
After ~6 iterations, vectors converge!
(Compare this to SGD which needs 100+ iterations)
---
**Step 5 β€” Make predictions**
Customer A β†’ Baking Set (`?`):
`predicted = [0.82, 0.91] Β· [2.1, 4.8] = 0.82Γ—2.1 + 0.91Γ—4.8 = 1.72 + 4.37 = 6.09`
Recommend Baking Set to Customer A!
---
**Why ALS is different:**
| Aspect | ALS | SGD (shown in next section) |
|---|---|---|
| **Updates** | All users at once (or all items at once) | One rating at a time |
| **Math** | Matrix inversion: `(A^T A)^(-1) A^T b` | Gradient descent: `v += Ξ± Γ— error Γ— u` |
| **Learning rate** | Not needed (exact solution) | Required (Ξ± = 0.01) |
| **Iterations** | Few (~10-20) | Many (100-200) |
| **Per iteration** | Slow (matrix operations) | Fast (simple arithmetic) |
| **Parallelization** | Easy (each user/item independent) | Hard (sequential updates) |
| **Best for** | Large-scale systems (Spotify, Netflix) | Small datasets, research |
---
**Simple analogy:**
**ALS:** You're a teacher grading 100 students. You collect ALL math tests,
grade them all at once using an answer key, then collect ALL English tests
and grade those. You alternate subjects, but grade all students per subject.
**SGD:** You grade one student's math test, then their English test, then
the next student's math test, then their English test. One at a time.
ALS is much faster when you have many students (users/items) because you
can split the grading across multiple teachers (parallel processing).
"""
)
# ── SGD ──────────────────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### 3. SGD Matrix Factorization (Funk SVD)")
st.markdown("*The algorithm that powered the Netflix Prize β€” learns vectors one rating at a time.*")
st.graphviz_chart("""
digraph sgd {
rankdir=TB
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
init [label="Initialize random\\nuser vectors, item vectors,\\nuser biases, item biases", fillcolor="#dbeafe", color="#3b82f6"]
pick [label="Pick one observed\\nrating (u, i, r)", fillcolor="#e0e7ff", color="#6366f1"]
pred [label="Predict\\nμ + bᡀ + bᡒ + u⃗·v⃗", fillcolor="#d1fae5", color="#10b981"]
err [label="Error = actual βˆ’ predicted", fillcolor="#fef3c7", color="#f59e0b"]
upd [label="Nudge vectors & biases\\nin direction that reduces error", fillcolor="#fce7f3", color="#ec4899"]
check [label="More ratings\\nin this epoch?", fillcolor="#e0e7ff", color="#6366f1"]
epoch [label="More epochs?", fillcolor="#dbeafe", color="#3b82f6"]
done [label="Final prediction matrix\\nΞΌ + bα΅€ + bα΅’ + UΓ—Vα΅€", fillcolor="#d1fae5", color="#10b981"]
init -> pick -> pred -> err -> upd -> check
check -> pick [label="yes", style=dashed]
check -> epoch [label="no (epoch done)"]
epoch -> pick [label="yes", style=dashed]
epoch -> done [label="no (converged)"]
}
""")
st.markdown(
"""
**How it works:**
1. Start with random user vectors, item vectors, and bias terms
2. For each observed interaction, predict the score using the full formula:
**ΞΌ + bα΅€ + bα΅’ + user_vec Β· item_vec**
3. Compute the error (actual βˆ’ predicted)
4. Nudge the vectors and biases slightly to reduce that error (gradient descent)
5. Repeat for every observed interaction (one epoch), then repeat for many epochs
**Why this is different from pure SVD:**
- Pure SVD decomposes the *entire* matrix including zeros (missing = dislike)
- SGD only trains on *observed interactions* β€” it skips empty cells entirely
- SGD includes bias terms that anchor predictions to the right scale
**Why "Funk SVD"?** Simon Funk published this approach during the Netflix Prize
(2006). It's not technically SVD β€” it's gradient descent β€” but the name stuck.
This was the breakthrough that made matrix factorization practical for recommendation.
**Strengths:** Handles missing data correctly, includes bias terms, flexible, well-understood.
**Weaknesses:** Slow (one rating at a time), sensitive to learning rate, can overfit without regularization.
"""
)
with st.expander("Step-by-step numerical walkthrough β€” SGD (Funk SVD)", expanded=False):
st.markdown(
"""
**Simple explanation first:**
SGD (Stochastic Gradient Descent) learns by looking at one purchase at a time:
1. "Customer A bought 5 Cake Stands"
2. Predict how many they'd buy using current vectors: "I predict 3.9"
3. Calculate error: "I was off by 1.1 (too low)"
4. Adjust vectors slightly: "Make Customer A's vector a bit more like Cake Stand's vector"
5. Move to next purchase and repeat
After seeing all purchases many times (100+ epochs), the vectors learn the patterns.
**Key insight:** It's like learning to cook by tasting after each ingredient.
You add salt, taste, adjust. Add pepper, taste, adjust. Small corrections each time.
---
**Now the detailed math:**
**Starting point β€” same matrix, but we only use observed entries:**
| | Cake Stand | Candles | Baking Set | Bunting |
|---|---|---|---|---|
| **Customer A** | 5 | 3 | *skip* | 1 |
| **Customer B** | 4 | *skip* | 2 | *skip* |
| **Customer C** | *skip* | 4 | 5 | 3 |
Unlike SVD, the `?` cells are ignored during training β€” not treated as zero.
---
**Step 1 β€” Initialize everything**
Global mean: ΞΌ = (5+3+1+4+2+4+5+3) / 8 = **3.375**
Random biases (start near zero):
| | Bias |
|---|---|
| b_A | +0.01 |
| b_B | -0.02 |
| b_C | +0.01 |
| b_CakeStand | +0.03 |
| b_Candles | -0.01 |
| b_BakingSet | +0.02 |
| b_Bunting | -0.01 |
Random vectors (k=2):
| User | f1 | f2 |
|---|---|---|
| A | 0.30 | 0.70 |
| B | 0.50 | 0.20 |
| C | 0.10 | 0.90 |
| Item | f1 | f2 |
|---|---|---|
| Cake Stand | 0.40 | 0.60 |
| Candles | 0.80 | 0.30 |
| Baking Set | 0.20 | 0.70 |
| Bunting | 0.60 | 0.10 |
---
**Step 2 β€” Process first rating: Customer A β†’ Cake Stand (actual = 5)**
"""
)
st.latex(r"\text{pred} = \mu + b_A + b_{\text{CakeStand}} + \vec{u}_A \cdot \vec{v}_{\text{CakeStand}}")
st.markdown(
"""
`pred = 3.375 + 0.01 + 0.03 + (0.30Γ—0.40 + 0.70Γ—0.60)`
`= 3.375 + 0.01 + 0.03 + (0.12 + 0.42)`
`= 3.375 + 0.04 + 0.54 = 3.955`
`error = 5 - 3.955 = 1.045`
---
**Step 3 β€” Update using learning rate Ξ±=0.01 and regularization Ξ»=0.02**
"""
)
st.latex(r"b_u \leftarrow b_u + \alpha \cdot (\text{error} - \lambda \cdot b_u)")
st.latex(r"\vec{u} \leftarrow \vec{u} + \alpha \cdot (\text{error} \cdot \vec{v} - \lambda \cdot \vec{u})")
st.markdown(
"""
**Bias updates:**
`b_A = 0.01 + 0.01 Γ— (1.045 - 0.02 Γ— 0.01) = 0.01 + 0.01044 = 0.0204`
`b_CakeStand = 0.03 + 0.01 Γ— (1.045 - 0.02 Γ— 0.03) = 0.03 + 0.01044 = 0.0404`
**Vector updates:**
`A_f1 = 0.30 + 0.01 Γ— (1.045 Γ— 0.40 - 0.02 Γ— 0.30) = 0.30 + 0.00412 = 0.3041`
`A_f2 = 0.70 + 0.01 Γ— (1.045 Γ— 0.60 - 0.02 Γ— 0.70) = 0.70 + 0.00613 = 0.7061`
`CakeStand_f1 = 0.40 + 0.01 Γ— (1.045 Γ— 0.30 - 0.02 Γ— 0.40) = 0.40 + 0.00306 = 0.4031`
`CakeStand_f2 = 0.60 + 0.01 Γ— (1.045 Γ— 0.70 - 0.02 Γ— 0.60) = 0.60 + 0.00720 = 0.6072`
Small adjustments, but in the right direction. The regularization term
(Ξ» Γ— current_value) prevents weights from growing too large.
---
**Step 4 β€” Process next rating: Customer A β†’ Candles (actual = 3)**
Using updated vectors for A (but original vectors for Candles):
`pred = 3.375 + 0.0204 + (-0.01) + (0.3041Γ—0.80 + 0.7061Γ—0.30)`
`= 3.375 + 0.0104 + (0.2433 + 0.2118) = 3.375 + 0.0104 + 0.4551 = 3.840`
`error = 3 - 3.840 = -0.840` (prediction too high β€” nudge down)
Updates push vectors in the opposite direction this time.
---
**Step 5 β€” After one epoch (all 8 ratings processed)**
| Rating processed | Error |
|---|---|
| A β†’ Cake Stand | +1.045 |
| A β†’ Candles | -0.840 |
| A β†’ Bunting | -2.125 |
| B β†’ Cake Stand | +0.612 |
| B β†’ Baking Set | -1.389 |
| C β†’ Candles | +0.738 |
| C β†’ Baking Set | +1.455 |
| C β†’ Bunting | -0.204 |
Average absolute error β‰ˆ 1.05
---
**Step 6 β€” After many epochs**
| Epoch | Avg Absolute Error |
|---|---|
| 1 | 1.05 |
| 10 | 0.52 |
| 50 | 0.12 |
| 100 | 0.04 |
---
**Step 7 β€” Predict missing values**
Customer A β†’ Baking Set (`?`):
`pred = ΞΌ + b_A + b_BakingSet + u_A Β· v_BakingSet`
With trained values: `= 3.375 + 0.45 + 0.38 + 0.62 = 4.83` β†’ recommend!
Customer B β†’ Bunting (`?`):
`= 3.375 + (-0.15) + (-0.28) + 0.31 = 3.26` β†’ moderate recommendation
The bias terms are critical β€” they handle the difference between
heavy buyers (high b_u) and light buyers (low b_u), and popular
items (high b_i) versus niche items (low b_i).
"""
)
# ── Simple Comparison: ALS vs SGD ────────────────────────────────────────
st.markdown("---")
with st.expander("🎯 ALS vs SGD β€” Simple Comparison for Non-Technical Audience", expanded=False):
st.markdown(
"""
Both ALS and SGD learn user and item "profiles" (vectors) to make recommendations.
They both end up with similar results, but they learn differently.
---
### The Restaurant Analogy
**Scenario:** You're training staff to predict how much customers will tip.
You have 100 customers and 50 waiters. Each customer-waiter pair has a tip amount.
**SGD approach (One at a time):**
- Look at Customer 1 with Waiter A: tip was $5, you predicted $3
- "Oops, too low! Adjust Customer 1's 'generosity score' up a bit"
- "Also adjust Waiter A's 'service score' up a bit"
- Now look at Customer 1 with Waiter B: tip was $2, you predicted $4
- "Oops, too high! Adjust both scores down a bit"
- Continue one interaction at a time through all 5,000 customer-waiter pairs
- Repeat this process 100 times until predictions are accurate
**ALS approach (All at once, alternating):**
- Round 1: Fix all customer scores, solve for ALL waiter scores at once using math
- Round 2: Fix all waiter scores, solve for ALL customer scores at once using math
- Round 3: Fix all customer scores, solve for ALL waiter scores at once
- After ~10 rounds, you're done!
---
### Key Differences in Plain English
| Aspect | SGD | ALS |
|--------|-----|-----|
| **Learning style** | One interaction at a time, like studying flashcards | All at once per round, like grading all tests together |
| **Speed per round** | Fast (simple math) | Slow (complex matrix math) |
| **Rounds needed** | Many (100-200) | Few (10-20) |
| **Total time** | Moderate | Fast for large datasets |
| **Parallelization** | Hard (must go in order) | Easy (can split work across computers) |
| **Best for** | Small datasets, research, when you need bias terms | Large-scale production (Spotify, Netflix) |
---
### When to Use Which?
**Use SGD when:**
- You have a small dataset (< 100K interactions)
- You're doing research or prototyping
- You need explicit bias terms (ΞΌ + b_u + b_i)
- You want to understand the algorithm step-by-step
**Use ALS when:**
- You have a large dataset (millions of interactions)
- You're building a production system
- You have multiple servers/cores to parallelize
- Your data is implicit feedback (purchases, clicks, not ratings)
---
### The Math Difference (For Technical Readers)
**SGD update rule:**
```
error = actual - predicted
u ← u + Ξ± Γ— (error Γ— v - Ξ» Γ— u)
v ← v + Ξ± Γ— (error Γ— u - Ξ» Γ— v)
```
Small steps guided by the error gradient.
**ALS update rule:**
```
Fix U, solve for all V:
v_i = (U^T U + Ξ»I)^(-1) U^T r_i
Fix V, solve for all U:
u_j = (V^T V + Ξ»I)^(-1) V^T r_j
```
Closed-form solution using matrix inversion.
---
### Real-World Example
**Spotify's Discover Weekly (uses ALS):**
- 500 million users
- 100 million songs
- Billions of listening events per day
- Needs to update recommendations daily
- Solution: ALS parallelized across thousands of servers
**Netflix Prize (used SGD):**
- Research competition (2006)
- 500K users, 18K movies
- 100 million ratings (explicit 1-5 stars)
- Goal: Best accuracy, not fastest training
- Solution: SGD with carefully tuned learning rate and regularization
---
### Bottom Line
**SGD:** Like a student who learns by doing practice problems one at a time.
Slow but steady, easy to understand.
**ALS:** Like a teacher who grades all tests at once using an answer key.
Fewer iterations, but each iteration is more work. Much faster at scale.
Both get you to the same destination β€” accurate recommendations!
"""
)
# ── NMF ──────────────────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### 4. NMF (Non-negative Matrix Factorization)")
st.markdown("*Matrix factorization with a twist: everything stays positive.*")
st.graphviz_chart("""
digraph nmf {
rankdir=LR
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
R [label="R\\nUser-Item Matrix\\n(all values β‰₯ 0)", fillcolor="#dbeafe", color="#3b82f6"]
approx [label="β‰ˆ", shape=plaintext, fontsize=24]
W [label="W\\nUser factors\\n(all values β‰₯ 0)\\nadditive parts only", fillcolor="#d1fae5", color="#10b981"]
times [label="Γ—", shape=plaintext, fontsize=24]
H [label="H\\nItem factors\\n(all values β‰₯ 0)\\nadditive parts only", fillcolor="#fce7f3", color="#ec4899"]
R -> approx [arrowhead=none]
approx -> W [arrowhead=none]
W -> times [arrowhead=none]
times -> H [arrowhead=none]
}
""")
st.markdown(
"""
**How it's different from SVD:**
- SVD factors can be negative (hard to interpret: what does a "-0.5 kitchen" preference mean?)
- NMF forces **all values to be β‰₯ 0** β€” factors are purely additive
**Why non-negative is useful:**
- A user is "0.8 kitchen + 0.3 vintage + 0 party" β€” easy to interpret
- Each factor represents an additive "part" of the user's taste
- The results are **naturally sparse** β€” many factors are exactly zero
- This makes NMF factors more interpretable as "topics" or "themes"
For example, if SVD says "you like kitchen items but with a negative party component",
NMF says "you're 80% kitchen enthusiast, 30% vintage lover, and don't care about party
supplies." The NMF interpretation is more natural for recommendation.
**Strengths:** Interpretable factors, naturally sparse, good for parts-based decomposition.
**Weaknesses:** Only works with non-negative data, sensitive to initialization.
"""
)
with st.expander("Step-by-step numerical walkthrough β€” NMF", expanded=False):
st.markdown(
"""
NMF follows the same learn-vectors-from-data approach, but with one strict rule:
**every number must be β‰₯ 0**. No negatives allowed.
**Starting point β€” same matrix:**
| | Cake Stand | Candles | Baking Set | Bunting |
|---|---|---|---|---|
| **Customer A** | 5 | 3 | ? | 1 |
| **Customer B** | 4 | ? | 2 | ? |
| **Customer C** | ? | 4 | 5 | 3 |
All values are already β‰₯ 0 (purchase counts), so NMF can work with this directly.
---
**Step 1 β€” Create random non-negative vectors (k=2)**
| User | f1 ("kitchen") | f2 ("decor") |
|---|---|---|
| A | 0.8 | 0.2 |
| B | 0.6 | 0.3 |
| C | 0.1 | 0.9 |
| Item | f1 ("kitchen") | f2 ("decor") |
|---|---|---|
| Cake Stand | 0.9 | 0.1 |
| Candles | 0.3 | 0.7 |
| Baking Set | 0.2 | 0.8 |
| Bunting | 0.1 | 0.6 |
Notice: no negative values anywhere. The factors labeled "kitchen" and "decor"
are what the algorithm might discover β€” we don't name them, but after training
they often correspond to interpretable themes.
---
**Step 2 β€” Predict**
Customer A β†’ Cake Stand:
`predicted = [0.8, 0.2] Β· [0.9, 0.1] = 0.72 + 0.02 = 0.74`
Actual = 5. Error = 5 - 0.74 = 4.26 (large).
---
**Step 3 β€” Update rule (multiplicative, not additive)**
Here's what makes NMF different from SGD. Instead of adding/subtracting a gradient,
NMF uses **multiplicative updates** that guarantee values stay non-negative:
"""
)
st.latex(r"W_{ij} \leftarrow W_{ij} \times \frac{(R \cdot H^T)_{ij}}{(W \cdot H \cdot H^T)_{ij}}")
st.latex(r"H_{ij} \leftarrow H_{ij} \times \frac{(W^T \cdot R)_{ij}}{(W^T \cdot W \cdot H)_{ij}}")
st.markdown(
"""
The key insight: we **multiply** by a ratio. Since all values are positive,
the ratio is positive, so the result stays positive. If the ratio > 1,
the value increases; if < 1, it decreases. Values can approach zero but
never go negative.
---
**Step 4 β€” After many iterations**
| User | f1 ("kitchen") | f2 ("decor") |
|---|---|---|
| A | 2.1 | 0.0 |
| B | 1.7 | 0.4 |
| C | 0.0 | 2.3 |
| Item | f1 ("kitchen") | f2 ("decor") |
|---|---|---|
| Cake Stand | 2.3 | 0.1 |
| Candles | 0.5 | 1.6 |
| Baking Set | 0.2 | 2.1 |
| Bunting | 0.3 | 1.2 |
Notice how some values are exactly 0 β€” NMF produces **sparse factors**.
**Reading the result:**
- Customer A: [2.1, 0.0] β†’ purely a "kitchen" buyer, no "decor" interest
- Customer C: [0.0, 2.3] β†’ purely a "decor" buyer
- Cake Stand: [2.3, 0.1] β†’ almost entirely a "kitchen" item
- Baking Set: [0.2, 2.1] β†’ mostly a "decor" item
---
**Step 5 β€” Predict missing values**
Customer A β†’ Baking Set (`?`):
`predicted = [2.1, 0.0] Β· [0.2, 2.1] = 0.42 + 0.0 = 0.42` β†’ low score
Makes sense β€” A is a "kitchen" buyer, Baking Set is mostly "decor."
Customer C β†’ Cake Stand (`?`):
`predicted = [0.0, 2.3] Β· [2.3, 0.1] = 0.0 + 0.23 = 0.23` β†’ low score
Makes sense β€” C is a "decor" buyer, Cake Stand is mostly "kitchen."
---
**Why NMF factors are more interpretable:**
| Method | Factor example | Interpretation |
|---|---|---|
| SVD | [0.7, -0.3] | "0.7 of something, minus 0.3 of something else" β€” hard to explain |
| NMF | [2.1, 0.0] | "Strong kitchen interest, zero decor interest" β€” clear and additive |
Each factor in NMF represents a positive contribution. A user's taste is a
sum of parts, not a mix of additions and subtractions.
"""
)
# ── Item-Based CF ────────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### 5. Item-Based Collaborative Filtering")
st.markdown("*No factorization β€” just find similar items.*")
st.graphviz_chart("""
digraph itemcf {
rankdir=TB
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
bought [label="Customer bought\\nCake Stand", fillcolor="#dbeafe", color="#3b82f6"]
sim [label="Find items with\\nsimilar purchase patterns\\n(cosine similarity)", fillcolor="#e0e7ff", color="#6366f1"]
items [label="Similar items:\\n1. Cake Tins (0.85)\\n2. Baking Set (0.79)\\n3. Cookie Cutters (0.72)", fillcolor="#d1fae5", color="#10b981"]
rec [label="Recommend\\nthese items!", fillcolor="#fef3c7", color="#f59e0b"]
bought -> sim -> items -> rec
}
""")
st.markdown(
"""
**How it works:**
1. For every pair of items, compute **cosine similarity** based on their purchase patterns
(items bought by similar sets of customers are similar)
2. When recommending for a user, look at items they've already bought
3. For each purchased item, find its most similar items
4. Rank by similarity score and recommend the top ones (excluding already-purchased)
**Why cosine similarity?** It measures the angle between two vectors, ignoring magnitude.
Two items with purchase patterns [1, 0, 1, 1] and [3, 0, 2, 4] have high cosine similarity
(same direction) even though the quantities differ.
**Strengths:** Simple, intuitive, no training needed, easy to explain.
**Weaknesses:** Doesn't discover latent factors, doesn't work well with very sparse data,
O(nΒ²) item-pair computation.
"""
)
with st.expander("Step-by-step numerical walkthrough β€” Item-Based CF", expanded=False):
st.markdown(
"""
Item-CF takes a completely different approach: no vectors, no training,
no latent factors. It works purely on similarity between items.
**Starting point β€” the same user-item matrix:**
| | Cake Stand | Candles | Baking Set | Bunting |
|---|---|---|---|---|
| **Customer A** | 5 | 3 | 0 | 1 |
| **Customer B** | 4 | 0 | 2 | 0 |
| **Customer C** | 0 | 4 | 5 | 3 |
(0 = not purchased)
---
**Step 1 β€” Represent each item as a vector of who bought it**
Each column is already a vector:
- Cake Stand: [5, 4, 0]
- Candles: [3, 0, 4]
- Baking Set: [0, 2, 5]
- Bunting: [1, 0, 3]
---
**Step 2 β€” Compute cosine similarity between every pair of items**
Cosine similarity measures the angle between two vectors (ignoring magnitude):
"""
)
st.latex(r"\text{cosine}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| \times |\vec{b}|}")
st.markdown(
"""
**Example β€” Cake Stand vs Candles:**
`dot product = (5Γ—3) + (4Γ—0) + (0Γ—4) = 15`
`|Cake Stand| = √(5² + 4² + 0²) = √41 = 6.40`
`|Candles| = √(3² + 0² + 4²) = √25 = 5.00`
`cosine = 15 / (6.40 Γ— 5.00) = 15 / 32.0 = 0.47`
**Example β€” Candles vs Bunting:**
`dot product = (3Γ—1) + (0Γ—0) + (4Γ—3) = 15`
`|Candles| = 5.00`
`|Bunting| = √(1² + 0² + 3²) = √10 = 3.16`
`cosine = 15 / (5.00 Γ— 3.16) = 15 / 15.81 = 0.95`
High similarity β€” Candles and Bunting are bought by the same types of customers.
---
**Step 3 β€” Build the similarity matrix**
After computing all pairs:
| | Cake Stand | Candles | Baking Set | Bunting |
|---|---|---|---|---|
| **Cake Stand** | 1.00 | 0.47 | 0.31 | 0.16 |
| **Candles** | 0.47 | 1.00 | 0.88 | 0.95 |
| **Baking Set** | 0.31 | 0.88 | 1.00 | 0.93 |
| **Bunting** | 0.16 | 0.95 | 0.93 | 1.00 |
Candles, Baking Set, and Bunting cluster together (high mutual similarity).
Cake Stand is somewhat different.
---
**Step 4 β€” Generate recommendations for Customer B**
Customer B bought: Cake Stand (4), Baking Set (2)
For each item B hasn't bought, score it by similarity to what B already bought:
**Candles:**
`score = (sim(Candles, Cake Stand) Γ— 4) + (sim(Candles, Baking Set) Γ— 2)`
`= (0.47 Γ— 4) + (0.88 Γ— 2) = 1.88 + 1.76 = 3.64`
**Bunting:**
`score = (sim(Bunting, Cake Stand) Γ— 4) + (sim(Bunting, Baking Set) Γ— 2)`
`= (0.16 Γ— 4) + (0.93 Γ— 2) = 0.64 + 1.86 = 2.50`
**Ranking:** Candles (3.64) > Bunting (2.50)
β†’ Recommend Candles first.
---
**Why this works without any training:**
There are no vectors to learn. The similarity matrix is computed directly
from the data. When the data changes (new purchases), you just recompute
the similarity scores β€” no model to retrain.
**Trade-off:** This simplicity comes at a cost. With 10,000 products, you
need to compute 50 million similarity pairs. Matrix factorization methods
reduce each item to k numbers (e.g., 50), making everything faster.
"""
)
# ── Comparison Table ─────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### Quick Comparison")
st.markdown(
"""
| | SVD | ALS | SGD (Funk SVD) | NMF | Item-Based CF |
|---|---|---|---|---|---|
| **Approach** | Decompose R = UΞ£Vα΅€ | Alternate optimizing U and V | Gradient descent + bias terms | Decompose R β‰ˆ WΓ—H (non-negative) | Find similar items via cosine |
| **Feedback type** | Explicit (treats 0 as 0) | Implicit (0 = unknown, not dislike) | Either (skips missing entries) | Non-negative values | Either |
| **Bias terms** | No | Yes (implicit) | Yes (ΞΌ + bα΅€ + bα΅’) | No | No |
| **Interpretability** | Low (negative factors) | Low | Low | High (additive parts) | Very high (similar items) |
| **Scalability** | Good | Excellent (parallelizable) | Moderate (sequential updates) | Moderate | Poor for many items |
| **Used by** | Research, baselines | Spotify, Netflix, LinkedIn | Netflix Prize, baselines | Topic modeling, parts decomposition | Amazon ("customers also bought") |
"""
)
# ── Why Each Model ────────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### Why We Chose Each Model")
st.markdown(
"""
#### SVD β€” The Mathematical Baseline
| | |
|---|---|
| **Why we included it** | Same role as Logistic Regression in churn β€” the simplest factorization approach. If other models can't beat it, their complexity isn't adding value. |
| **Where it's used** | Dimensionality reduction (PCA is SVD under the hood), text analysis (Latent Semantic Analysis), image compression. In recommendations: primarily as a baseline. |
| **Key benefit** | One-step computation with no iterative training. Mathematically exact decomposition. |
| **Limitation** | Treats every empty cell as zero β€” "didn't buy" is treated as "dislikes." This is a fundamental problem for e-commerce data. |
| **When to pick this** | When you have explicit ratings (1–5 stars) with few missing values, or as a quick baseline before trying more sophisticated methods. |
---
#### ALS β€” The Industry Standard
| | |
|---|---|
| **Why we included it** | It's the go-to algorithm for implicit feedback (purchases, clicks, views) β€” which is exactly what our e-commerce data is. |
| **Where it's used** | Spotify (Discover Weekly playlists), Netflix (movie recommendations), LinkedIn (job recommendations), Apple Music. |
| **Key benefit** | Designed specifically for implicit data. Treats "didn't buy" as "unknown" (not "dislike"). Highly parallelizable β€” scales to billions of interactions. |
| **Limitation** | Harder to implement from scratch. Requires careful tuning of regularization and confidence weights. |
| **When to pick this** | When your data is implicit feedback (purchases, clicks, views) and you need to scale to large datasets. This is the default choice for most recommendation systems. |
---
#### SGD (Funk SVD) β€” The Netflix Prize Winner
| | |
|---|---|
| **Why we included it** | It's the algorithm that proved matrix factorization works for recommendations (Netflix Prize, 2006). It introduces bias terms that the other methods lack. |
| **Where it's used** | The Netflix Prize competition. Research baselines. Systems where explicit ratings exist. Foundation for modern embedding-based models. |
| **Key benefit** | Only trains on *observed* interactions (skips missing entries). Bias terms (ΞΌ + bα΅€ + bα΅’) anchor predictions to the correct scale. Flexible and well-understood. |
| **Limitation** | Sequential updates (one rating at a time) β€” hard to parallelize. Slower than ALS on large datasets. Sensitive to learning rate. |
| **When to pick this** | When you have explicit ratings and want the most theoretically grounded approach. Or when you need bias terms for better calibration. |
---
#### NMF β€” The Interpretable Option
| | |
|---|---|
| **Why we included it** | It produces factors you can actually explain β€” "this customer is 80% kitchen enthusiast, 20% vintage lover" β€” because all values are non-negative. |
| **Where it's used** | Topic modeling in text analysis (discovering themes in documents), audio source separation, parts-based image decomposition, recommendation with interpretability requirements. |
| **Key benefit** | The non-negativity constraint produces sparse, additive factors. Each factor represents a "part" of the user's taste. Easier to explain to business stakeholders than SVD or ALS. |
| **Limitation** | Only works with non-negative data. Treats missing values as zero (same problem as SVD). Sensitive to initialization. |
| **When to pick this** | When you need to explain *why* an item was recommended in terms of interpretable themes, or when factor analysis is a goal alongside recommendation. |
---
#### Item-Based CF β€” The Simplest Approach
| | |
|---|---|
| **Why we included it** | It's fundamentally different from the other four (no matrix factorization at all). Including it shows whether learning latent vectors actually helps versus just using raw similarity. |
| **Where it's used** | Amazon ("Customers who bought this also bought..."), e-commerce product pages, any "similar items" widget. |
| **Key benefit** | No training needed β€” just compute cosine similarity. Easy to explain: "We recommend this because people who bought what you bought also bought this." New items with a few purchases can be recommended immediately. |
| **Limitation** | Requires computing similarity between every pair of items β€” O(nΒ²). Doesn't discover hidden patterns. Struggles with very sparse data. |
| **When to pick this** | When you need a fast, explainable system without a training pipeline, or for "similar items" sections on product pages. |
---
#### How They Complement Each Other
| Role | Model | Why |
|---|---|---|
| **Baseline** | SVD | Sets the floor β€” simplest factorization |
| **Production model** | ALS | Best for implicit feedback at scale |
| **Calibrated predictions** | SGD | Bias terms give better-scaled scores |
| **Interpretable factors** | NMF | When you need to explain *why* |
| **Product page recs** | Item-CF | "Customers also bought" β€” no training needed |
| **Live updates** | ALS (Live page) | Fast enough for repeated retraining |
In a real deployment, you might use ALS for homepage "recommended for you,"
Item-CF for product pages ("similar items"), and NMF for internal analysis
of customer segments.
"""
)
# ── Full Picture ─────────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### The Complete Training Pipeline")
st.graphviz_chart("""
digraph pipeline {
rankdir=TB
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
s1 [label="Start with random vectors\\n(user vectors + item vectors)", fillcolor="#dbeafe", color="#3b82f6"]
s2 [label="Predict a rating\\nμ + b_user + b_item + (u⃗ · v⃗)", fillcolor="#d1fae5", color="#10b981"]
s3 [label="Calculate error\\n(actual βˆ’ predicted)", fillcolor="#fef3c7", color="#f59e0b"]
s4 [label="Adjust vectors\\nto reduce error", fillcolor="#fce7f3", color="#ec4899"]
s5 [label="Repeat for all ratings\\n(1 epoch)", fillcolor="#e0e7ff", color="#6366f1"]
s6 [label="Repeat for many epochs\\nuntil error is small", fillcolor="#dbeafe", color="#3b82f6"]
s7 [label="Vectors are now trained\\nUse them to predict missing values", fillcolor="#d1fae5", color="#10b981"]
s1 -> s2 -> s3 -> s4 -> s5 -> s6 -> s7
s6 -> s2 [label="next epoch", style=dashed, constraint=false]
}
""")
st.markdown(
"""
**Where each algorithm differs:**
| Method | How it adjusts vectors (Step 4) | Bias terms? | Skips missing entries? |
|---|---|---|---|
| **SGD (Funk SVD)** | Adjust one rating at a time using gradient descent | Yes (ΞΌ + bα΅€ + bα΅’) | Yes β€” only trains on observed data |
| **ALS** | Fix all user vectors, solve all item vectors exactly (then swap) | Yes (ΞΌ + bα΅€ + bα΅’) | Yes β€” uses confidence weights |
| **SVD** | No training loop β€” computes vectors mathematically in one step | No (operates on raw matrix) | No β€” treats missing as 0 |
| **NMF** | Multiplicative update: multiply by a ratio to stay non-negative | No (factors absorb the scale) | No β€” treats missing as 0 |
| **Item-CF** | No vectors at all β€” computes similarity directly from purchase patterns | No (score is weighted similarity sum) | N/A |
All matrix factorization methods share the same goal: **find vectors for users and items
such that their dot product approximates the observed ratings/purchase counts.**
They just use different math to get there.
"""
)
# ═══════════════════════════════════════════════════════════════════════════════
# Tab 2 β€” Modern Architecture & Industry
# ═══════════════════════════════════════════════════════════════════════════════
with tab_modern:
st.subheader("Modern Recommendation Architecture")
st.markdown(
"The algorithms we've built (SVD, SGD, ALS, NMF) are the **foundation** of recommendation systems. "
"Here's how they evolved into what companies like YouTube, Spotify, and Netflix use today."
)
# ── Two-Stage Architecture ───────────────────────────────────────────────
st.markdown("---")
st.markdown("### Two-Stage Architecture")
st.markdown(
"Modern production recommenders don't use a single model. They split the problem "
"into two stages to handle scale (millions of items) efficiently."
)
st.graphviz_chart("""
digraph twostage {
rankdir=TB
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
catalog [label="Full Product Catalog\\n(millions of items)", fillcolor="#dbeafe", color="#3b82f6"]
subgraph cluster_stage1 {
label="Stage 1 β€” Candidate Generation"
style=dashed
color="#6366f1"
fontname="Helvetica"
fontsize=11
tower [label="Two-Tower Neural Network\\nUser tower encodes user\\nItem tower encodes item", fillcolor="#e0e7ff", color="#6366f1"]
embed [label="Embedding Similarity\\n(fast approximate\\nnearest neighbor search)", fillcolor="#e0e7ff", color="#6366f1"]
cands [label="~1,000 Candidate Items\\n(rough filter, fast)", fillcolor="#fce7f3", color="#ec4899"]
tower -> cands
embed -> cands
}
subgraph cluster_stage2 {
label="Stage 2 β€” Ranking Model"
style=dashed
color="#10b981"
fontname="Helvetica"
fontsize=11
features [label="Rich Features\\nuser profile + item metadata\\n+ context + history", fillcolor="#d1fae5", color="#10b981"]
dnn [label="Deep Neural Network\\nscores each candidate", fillcolor="#d1fae5", color="#10b981"]
ranked [label="Top 10-50 items\\n(precise ranking)", fillcolor="#fef3c7", color="#f59e0b"]
features -> dnn -> ranked
}
catalog -> tower
catalog -> embed
cands -> features
}
""")
st.markdown(
"""
| Stage | Purpose | Speed vs Quality | What we built |
|---|---|---|---|
| **Candidate Generation** | Quickly narrow millions of items to ~1,000 plausible candidates | Fast, lower precision | Our SVD/SGD/ALS/NMF β€” they produce a rough score for every item |
| **Ranking** | Precisely rank those ~1,000 using rich features | Slower, high precision | Not in our demo β€” requires deep learning infrastructure |
**Why two stages?** Scoring all items with a complex deep network is too slow.
The first stage (where our algorithms live) acts as a fast filter. The second
stage applies a heavyweight model only to the surviving candidates.
"""
)
# ── What Matrix Factorization Became ─────────────────────────────────────
st.markdown("---")
st.markdown("### How Matrix Factorization Evolved Into Embeddings")
st.graphviz_chart("""
digraph evolution {
rankdir=LR
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
mf [label="Matrix Factorization\\n(2006-2015)\\n\\nuser_vector (k dims)\\nitem_vector (k dims)\\n\\nScore = dot product", fillcolor="#dbeafe", color="#3b82f6"]
embed [label="Embedding Learning\\n(2015-present)\\n\\nuser_embedding (k dims)\\nitem_embedding (k dims)\\n\\nLearned inside neural nets", fillcolor="#d1fae5", color="#10b981"]
deep [label="Deep Recommender\\n(current state)\\n\\nuser_embedding\\nitem_embedding\\n+ context features\\n+ sequence history\\n+ non-linear layers", fillcolor="#fef3c7", color="#f59e0b"]
mf -> embed [label=" same concept,\\n new training"]
embed -> deep [label=" add more\\n signals"]
}
""")
st.markdown(
"""
Matrix factorization didn't disappear β€” it became the foundation of modern embeddings:
| Concept | Matrix Factorization (what we built) | Deep Recommender (industry today) |
|---|---|---|
| **User representation** | User vector from SGD/ALS (k numbers) | User embedding learned by neural network |
| **Item representation** | Item vector from SGD/ALS (k numbers) | Item embedding learned by neural network |
| **Score computation** | Dot product: `user_vec Β· item_vec` | Neural network with multiple layers |
| **Input data** | Interactions only (purchases, clicks) | Interactions + user profile + item metadata + time + context |
| **Relationships** | Linear (dot product) | Non-linear (neural network can learn complex patterns) |
| **New items** | Cannot handle (no interactions = no vector) | Can use item features (title, category, image) |
The user/item vectors from our ALS model are conceptually identical to the
embeddings inside a modern deep recommender β€” the difference is how they're
trained and what additional signals they incorporate.
"""
)
# ── YouTube Example ──────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### Real Industry Example β€” YouTube")
st.markdown(
"YouTube's recommendation system (described in their 2016 paper *Deep Neural Networks "
"for YouTube Recommendations*) follows the exact two-stage pattern:"
)
st.graphviz_chart("""
digraph youtube {
rankdir=TB
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
corpus [label="YouTube Video Corpus\\n(hundreds of millions of videos)", fillcolor="#dbeafe", color="#3b82f6"]
subgraph cluster_yt1 {
label="Candidate Generation"
style=dashed
color="#6366f1"
fontname="Helvetica"
yt_tower [label="Two-Tower DNN\\n\\nInputs: watch history,\\nsearch history, demographics\\n\\nOutput: user embedding", fillcolor="#e0e7ff", color="#6366f1"]
yt_ann [label="Approximate Nearest\\nNeighbor Search\\n\\n→ ~hundreds of\\nvideo candidates", fillcolor="#fce7f3", color="#ec4899"]
}
subgraph cluster_yt2 {
label="Ranking"
style=dashed
color="#10b981"
fontname="Helvetica"
yt_features [label="Hundreds of Features\\n\\nvideo age, channel quality,\\nuser language, device,\\ntime of day, watch %...", fillcolor="#d1fae5", color="#10b981"]
yt_rank [label="Deep Ranking Network\\n\\nPredicts: P(watch)\\nP(like), P(share)...", fillcolor="#fef3c7", color="#f59e0b"]
}
yt_result [label="Top ~20 videos\\nshown to user", fillcolor="#d1fae5", color="#10b981"]
corpus -> yt_tower -> yt_ann -> yt_features -> yt_rank -> yt_result
}
""")
st.markdown(
"""
| Component | YouTube's Approach | Our Demo Equivalent |
|---|---|---|
| User representation | Learned embedding from watch/search history | User vector from SGD/ALS/SVD |
| Item representation | Video embedding from features | Item vector from SGD/ALS/SVD |
| Candidate generation | Two-tower DNN + ANN search | Our full prediction matrix |
| Ranking | Deep network with hundreds of features | Not in our demo |
| Scale | Hundreds of millions of videos, billions of users | 3,665 items, 4,338 users |
"""
)
# ── Limitations of MF ────────────────────────────────────────────────────
st.markdown("---")
st.markdown("### Limitations of Matrix Factorization vs Deep Learning")
st.markdown(
"""
| Limitation | Matrix Factorization (our models) | Deep Learning Solution |
|---|---|---|
| **Only uses interactions** | Can only learn from purchases β€” knows nothing about the product itself | Can incorporate item metadata (description, category, price, images) |
| **Ignores context** | Same recommendation regardless of time, device, or situation | Uses context: time of day, device, location, season |
| **No sequence awareness** | Treats all purchases equally β€” doesn't know the order | Sequence models (RNN, Transformer) understand "watched A, then B, likely wants C" |
| **Linear relationships** | Score = dot product (linear combination of factors) | Neural networks learn non-linear patterns and complex interactions |
| **Cold start** | New item with no purchases = no vector = cannot recommend | Can use item features to generate embedding even with zero interactions |
| **Static** | User vector doesn't change between retraining | Real-time models update user embedding with every interaction |
"""
)
st.markdown("---")
st.markdown("### Where Our Demo Fits")
st.graphviz_chart("""
digraph spectrum {
rankdir=LR
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.2, style=dashed]
pop [label="Popularity\\nBaseline\\n(recommend\\nbest-sellers)", fillcolor="#fee2e2", color="#ef4444"]
cf [label="Item-Based CF\\n(cosine similarity)\\n\\nOur demo βœ“", fillcolor="#fef3c7", color="#f59e0b"]
mf [label="Matrix Factorization\\n(SVD, SGD, ALS, NMF)\\n\\nOur demo βœ“", fillcolor="#d1fae5", color="#10b981"]
fm [label="Factorization\\nMachines\\n(adds features\\nto MF)", fillcolor="#e0e7ff", color="#6366f1"]
deep [label="Deep Learning\\n(two-tower,\\nTransformer)\\n\\nIndustry today", fillcolor="#dbeafe", color="#3b82f6"]
pop -> cf -> mf -> fm -> deep
}
""")
st.markdown(
"""
Our demo implements the **middle of the spectrum** β€” the algorithms that
established the field and are still used as candidate generators in production
systems. Matrix factorization remains the conceptual backbone: the user and item
vectors we learn with ALS are the direct ancestors of the embeddings inside
YouTube's, Spotify's, and Netflix's deep recommenders.
**For a data science project, this is the right starting point.** You demonstrate
the core concepts (latent factors, matrix decomposition, bias terms, implicit feedback),
then scale up to deep learning when the business case justifies the infrastructure.
"""
)
st.markdown("---")
st.info(
"**Note:** The deep learning model described above (YouTube's **Two-Tower Deep Neural Network** "
"with separate Candidate Generation and Ranking stages) is **explained but not implemented** "
"in this demo. It requires TensorFlow/PyTorch, GPU resources, and significantly more data "
"(YouTube trains on billions of interactions β€” our dataset has ~215K). Our 5 implemented models "
"(SVD, ALS, SGD, NMF, Item-CF) are the foundational algorithms that YouTube's deep system evolved from."
)
# ═══════════════════════════════════════════════════════════════════════════════
# Tab 3 β€” Model Comparison
# ═══════════════════════════════════════════════════════════════════════════════
with tab_compare:
st.subheader("Model Performance Comparison")
st.markdown(
"We train all five algorithms on the same training set and evaluate "
"on held-out test interactions using Precision@10, Recall@10, and Hit Rate."
)
interactions, user_map, item_map, item_desc, _ = build_interaction_matrix()
train_df, train_sparse, test_df, _, _, _ = get_rec_train_test()
train_dense = train_df.values.astype(np.float32)
user_index = {cust: i for i, cust in enumerate(train_df.index)}
item_columns = list(train_df.columns)
n_factors = st.slider("Number of latent factors (k)", 10, 100, 50, step=10)
st.caption(
"πŸ’‘ **Note:** Training results are cached by Streamlit. When you change k and click train, "
"the models will retrain with the new value. Subsequent clicks with the same k will be instant (cached). "
"Watch the timing indicators to see actual training vs. cache hits."
)
# Show if results exist from previous training
if "rec_results" in st.session_state and "rec_n_factors" in st.session_state:
prev_k = st.session_state.rec_n_factors
if prev_k == n_factors:
st.info(f"πŸ“Š Results from previous training with k={n_factors} are shown below. Click train button to refresh.")
else:
st.warning(f"⚠️ Current results are from k={prev_k}. Click train button to retrain with k={n_factors}.")
if st.button("Train & Evaluate All Models", type="primary", use_container_width=True):
import time
start_time = time.time()
results = {}
progress_bar = st.progress(0, text="Starting training...")
with st.spinner("Training SVD..."):
t0 = time.time()
svd_pred, *_ = train_svd(train_sparse, n_factors=n_factors)
results["SVD"] = evaluate_recommendations(svd_pred, train_sparse, test_df, user_index, item_columns)
elapsed = time.time() - t0
st.success(f"SVD trained ({elapsed:.2f}s)")
progress_bar.progress(0.2, text="SVD complete, training ALS...")
with st.spinner("Training ALS..."):
t0 = time.time()
als_model = train_als(train_sparse, n_factors=n_factors)
als_user = als_model.user_factors
als_item = als_model.item_factors
als_pred = als_user @ als_item.T
results["ALS"] = evaluate_recommendations(als_pred, train_sparse, test_df, user_index, item_columns)
elapsed = time.time() - t0
st.success(f"ALS trained ({elapsed:.2f}s)")
progress_bar.progress(0.4, text="ALS complete, training SGD...")
with st.spinner("Training SGD (Funk SVD) β€” this may take ~60s on first run..."):
t0 = time.time()
sgd_pred, *_ = train_sgd(train_sparse, n_factors=n_factors, n_epochs=30)
results["SGD"] = evaluate_recommendations(sgd_pred, train_sparse, test_df, user_index, item_columns)
elapsed = time.time() - t0
st.success(f"SGD trained ({elapsed:.2f}s)")
progress_bar.progress(0.6, text="SGD complete, training NMF...")
with st.spinner("Training NMF..."):
t0 = time.time()
nmf_pred, *_ = train_nmf(train_dense, n_factors=n_factors)
results["NMF"] = evaluate_recommendations(nmf_pred, train_sparse, test_df, user_index, item_columns)
elapsed = time.time() - t0
st.success(f"NMF trained ({elapsed:.2f}s)")
progress_bar.progress(0.8, text="NMF complete, training Item-CF...")
with st.spinner("Training Item-Based CF..."):
t0 = time.time()
cf_pred, _ = train_item_cf(train_sparse)
results["Item-CF"] = evaluate_recommendations(cf_pred, train_sparse, test_df, user_index, item_columns)
elapsed = time.time() - t0
st.success(f"Item-Based CF trained ({elapsed:.2f}s)")
progress_bar.progress(1.0, text="All models trained!")
total_time = time.time() - start_time
st.info(f"βœ… All 5 models trained with **k={n_factors} latent factors** in {total_time:.2f}s total")
# Store in session state for persistence
st.session_state.rec_results = results
st.session_state.rec_n_factors = n_factors
st.session_state.rec_svd_pred = svd_pred
st.session_state.rec_als_pred = als_pred
st.session_state.rec_sgd_pred = sgd_pred
st.session_state.rec_nmf_pred = nmf_pred
st.session_state.rec_cf_pred = cf_pred
st.session_state["rec_models_trained"] = True
st.session_state["svd_pred"] = svd_pred
st.session_state["als_pred"] = als_pred
st.session_state["sgd_pred"] = sgd_pred
st.session_state["nmf_pred"] = nmf_pred
st.session_state["cf_pred"] = cf_pred
st.session_state["train_df"] = train_df
st.session_state["item_desc"] = item_desc
st.session_state["item_map"] = item_map
st.session_state["user_map"] = user_map
st.rerun()
# Display results and metrics explanation section
if "rec_results" in st.session_state and st.session_state.get("rec_models_trained", False):
st.markdown("---")
st.markdown("### Results")
results = st.session_state.rec_results
k_used = st.session_state.rec_n_factors
st.caption(f"Results from training with k={k_used} latent factors")
metrics_df = pd.DataFrame(results).T
metrics_df = metrics_df[["Precision@K", "Recall@K", "Hit Rate", "RMSE", "RΒ²", "Users Evaluated"]]
# Format for display
display_df = metrics_df.copy()
display_df["Precision@K"] = display_df["Precision@K"].map("{:.4f}".format)
display_df["Recall@K"] = display_df["Recall@K"].map("{:.4f}".format)
display_df["Hit Rate"] = display_df["Hit Rate"].map("{:.2%}".format)
display_df["RMSE"] = display_df["RMSE"].map("{:.2f}".format)
display_df["RΒ²"] = display_df["RΒ²"].map("{:.4f}".format)
display_df["Users Evaluated"] = display_df["Users Evaluated"].apply(lambda x: str(int(x)))
st.dataframe(display_df, use_container_width=True)
st.info(
"**Metric Guide:**\n\n"
"- **Precision@K**: Of the K items recommended, what % were actually purchased? (Higher is better)\n"
"- **Recall@K**: Of all items the user purchased, what % did we recommend? (Higher is better)\n"
"- **Hit Rate**: % of users who got at least one relevant recommendation (Higher is better)\n"
"- **RMSE**: Root Mean Squared Error - measures prediction accuracy for quantities (Lower is better)\n"
"- **RΒ²**: How well the model explains variance in purchase quantities (1.0 = perfect, 0 = baseline, negative = worse than baseline)"
)
raw_results = pd.DataFrame(results).T
# Create two separate charts: one for ranking metrics, one for prediction metrics
col1, col2 = st.columns(2)
with col1:
st.markdown("**Ranking Metrics** (Higher is Better)")
fig_ranking = go.Figure()
for metric in ["Precision@K", "Recall@K", "Hit Rate"]:
fig_ranking.add_trace(go.Bar(
name=metric,
x=list(results.keys()),
y=raw_results[metric].values,
text=[f"{v:.4f}" for v in raw_results[metric].values],
textposition="auto",
))
fig_ranking.update_layout(
barmode="group",
yaxis_title="Score",
height=400,
showlegend=True,
)
st.plotly_chart(fig_ranking, use_container_width=True)
with col2:
st.markdown("**Prediction Metrics**")
fig_pred = go.Figure()
# RMSE (lower is better) - show on left y-axis
fig_pred.add_trace(go.Bar(
name="RMSE (lower is better)",
x=list(results.keys()),
y=raw_results["RMSE"].values,
text=[f"{v:.2f}" for v in raw_results["RMSE"].values],
textposition="auto",
marker_color="lightcoral",
))
fig_pred.update_layout(
yaxis_title="RMSE (lower is better)",
height=400,
showlegend=True,
)
st.plotly_chart(fig_pred, use_container_width=True)
# RΒ² in a separate small chart
st.markdown("**RΒ² Score** (1.0 = perfect, 0 = baseline)")
fig_r2 = go.Figure()
fig_r2.add_trace(go.Bar(
x=list(results.keys()),
y=raw_results["RΒ²"].values,
text=[f"{v:.3f}" for v in raw_results["RΒ²"].values],
textposition="auto",
marker_color="lightblue",
showlegend=False,
))
fig_r2.update_layout(
yaxis_title="RΒ² Score",
height=300,
)
st.plotly_chart(fig_r2, use_container_width=True)
st.markdown("---")
st.markdown("### Understanding These Metrics")
st.markdown(
"""
#### How We Test
Before training, we **hide** some items each customer actually purchased (the test set).
After training, we ask each model: *"What are your top 10 recommendations for this customer?"*
Then we check how many of those recommendations match the hidden items.
"""
)
st.graphviz_chart("""
digraph eval {
rankdir=LR
node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"]
edge [color="#888888", penwidth=1.5]
full [label="Customer's full\\npurchase history\\n(100 items)", fillcolor="#dbeafe", color="#3b82f6"]
train [label="Training set\\n(80 items)\\nModel sees these", fillcolor="#d1fae5", color="#10b981"]
test [label="Test set\\n(20 items)\\nHidden from model", fillcolor="#fee2e2", color="#ef4444"]
rec [label="Model's top 10\\nrecommendations", fillcolor="#fef3c7", color="#f59e0b"]
check [label="How many of the 10\\nare in the hidden 20?", fillcolor="#e0e7ff", color="#6366f1"]
full -> train
full -> test
train -> rec [label=" model trained"]
rec -> check
test -> check [label=" compare"]
}
""")
st.markdown(
"""
#### Example β€” Customer #12345
This customer bought 100 products total. We hide 20 of them for testing.
The model trains on the remaining 80. Then we ask for 10 recommendations.
| Model's Top 10 Recommendations | In the hidden 20? |
|---|---|
| 1. Cake Stand | No |
| 2. Vintage Teacup Set | **Yes** (was in hidden test set) |
| 3. Baking Tins | No |
| 4. Candle Holder | No |
| 5. Recipe Book Stand | **Yes** (was in hidden test set) |
| 6. Napkin Set | No |
| 7. Cookie Cutters | No |
| 8. Storage Jar | No |
| 9. Tea Towel | No |
| 10. Egg Cup Set | No |
**2 out of 10** recommendations matched hidden purchases.
---
#### What Each Metric Means for This Customer
"""
)
st.latex(r"\text{Precision@10} = \frac{\text{correct recommendations}}{\text{total recommendations}} = \frac{2}{10} = 0.20")
st.markdown(
"""
*"20% of what we recommended was actually relevant."*
If you showed these 10 products on a homepage, 2 of them would be items
the customer genuinely wanted. The other 8 might still be interesting
(the customer just hasn't bought them *yet*), but we can only verify the ones in our test set.
"""
)
st.latex(r"\text{Recall@10} = \frac{\text{correct recommendations}}{\text{total hidden items}} = \frac{2}{20} = 0.10")
st.markdown(
"""
*"We found 10% of the items the customer actually wanted."*
The customer had 20 hidden items we needed to find. We found 2 of them in our top 10.
With only 10 recommendation slots and 3,665 total products, finding even 2 out of 20
is meaningful β€” pure random guessing would find ~0.05 items.
"""
)
st.markdown(
"""
**Hit Rate** asks a simpler question: *"Did the customer get at least one good recommendation?"*
For this customer, the answer is **Yes** (2 hits > 0). If we repeat this across all 4,080
evaluated customers:
"""
)
st.latex(r"\text{Hit Rate} = \frac{\text{customers with} \geq 1 \text{ hit}}{\text{total customers evaluated}}")
st.markdown(
"""
---
#### Full Calculation β€” All 3 Metrics Across 5 Customers
Here's how the final numbers are computed. Suppose we evaluate 5 customers:
| Customer | Hidden Items | Model's Top 10 | Hits | Precision@10 | Recall@10 | Hit? |
|---|---|---|---|---|---|---|
| A | 20 items | 2 matched | 2 | 2/10 = 0.20 | 2/20 = 0.10 | Yes |
| B | 8 items | 0 matched | 0 | 0/10 = 0.00 | 0/8 = 0.00 | No |
| C | 15 items | 1 matched | 1 | 1/10 = 0.10 | 1/15 = 0.07 | Yes |
| D | 5 items | 0 matched | 0 | 0/10 = 0.00 | 0/5 = 0.00 | No |
| E | 12 items | 3 matched | 3 | 3/10 = 0.30 | 3/12 = 0.25 | Yes |
"""
)
st.latex(r"\text{Precision@10} = \text{average} = \frac{0.20 + 0.00 + 0.10 + 0.00 + 0.30}{5} = 0.12")
st.latex(r"\text{Recall@10} = \text{average} = \frac{0.10 + 0.00 + 0.07 + 0.00 + 0.25}{5} = 0.084")
st.latex(r"\text{Hit Rate} = \frac{\text{3 customers with hits}}{\text{5 total}} = 60\%")
st.markdown(
"""
In our actual evaluation, this calculation runs across **~4,080 customers** instead of 5.
**How to present these numbers:**
- Precision@10 = 0.12 means "on average, 1.2 out of every 10 recommendations are items the customer actually wanted"
- Recall@10 = 0.084 means "on average, the model finds 8.4% of a customer's wanted items in just 10 slots"
- Hit Rate = 60% means "60% of customers got at least one relevant recommendation"
---
#### Why the Numbers Look Small β€” And Why That's Normal
| Metric | Typical Value | Seems low? | Why it's actually good |
|---|---|---|---|
| Precision@10 | 0.01 – 0.10 | Yes | With 3,665 products, finding *any* relevant item in 10 slots is hard. Random guessing gives ~0.003. |
| Recall@10 | 0.005 – 0.05 | Yes | A customer may have 20+ hidden items. Finding even 1 in 10 slots means the model understood their taste. |
| Hit Rate | 3% – 15% | Maybe | This means 3–15% of customers got at least one useful recommendation out of 10. At scale, that drives revenue. |
---
#### RMSE and RΒ² β€” Prediction Accuracy Metrics
While Precision/Recall/Hit Rate measure **ranking quality** ("Did we recommend the right items?"),
RMSE and RΒ² measure **prediction accuracy** ("How well did we predict purchase quantities?").
**RMSE (Root Mean Squared Error):**
"""
)
st.latex(r"\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{actual}_i - \text{predicted}_i)^2}")
st.markdown(
"""
**What it measures:** Average error in predicting how many units a customer will buy.
**Example:**
| Customer | Item | Actual Quantity | Predicted Quantity | Error | Squared Error |
|---|---|---|---|---|---|
| A | Teacup | 5 | 4.2 | -0.8 | 0.64 |
| A | Candles | 3 | 5.1 | +2.1 | 4.41 |
| B | Cake Stand | 2 | 1.8 | -0.2 | 0.04 |
| B | Bunting | 8 | 6.5 | -1.5 | 2.25 |
| C | Baking Set | 1 | 1.3 | +0.3 | 0.09 |
RMSE = √[(0.64 + 4.41 + 0.04 + 2.25 + 0.09) / 5] = √(7.43 / 5) = √1.486 = **1.22 units**
**Interpretation:** On average, our predictions are off by about 1.22 units.
- **Lower is better** (0 = perfect predictions)
- RMSE = 1.5 means predictions are typically within Β±1.5 units of actual
- Useful when quantity matters (inventory planning, revenue forecasting)
**When RMSE is important:**
- Predicting how much inventory to stock
- Forecasting revenue from recommendations
- Understanding if the model captures purchase patterns accurately
---
**RΒ² (R-squared / Coefficient of Determination):**
"""
)
st.latex(r"R^2 = 1 - \frac{\sum (\text{actual} - \text{predicted})^2}{\sum (\text{actual} - \text{mean})^2}")
st.markdown(
"""
**What it measures:** How much variance in purchase quantities the model explains.
**Think of it as:** "How much better is the model than just guessing the average?"
**Example:**
Suppose the average purchase quantity across all items is 3.5 units.
| Customer | Item | Actual | Mean Baseline | Model Prediction | Baseline ErrorΒ² | Model ErrorΒ² |
|---|---|---|---|---|---|---|
| A | Teacup | 5 | 3.5 | 4.2 | (5-3.5)Β² = 2.25 | (5-4.2)Β² = 0.64 |
| A | Candles | 3 | 3.5 | 5.1 | (3-3.5)Β² = 0.25 | (3-5.1)Β² = 4.41 |
| B | Cake Stand | 2 | 3.5 | 1.8 | (2-3.5)Β² = 2.25 | (2-1.8)Β² = 0.04 |
| B | Bunting | 8 | 3.5 | 6.5 | (8-3.5)Β² = 20.25 | (8-6.5)Β² = 2.25 |
| C | Baking Set | 1 | 3.5 | 1.3 | (1-3.5)Β² = 6.25 | (1-1.3)Β² = 0.09 |
Sum of baseline errorsΒ² = 31.25
Sum of model errorsΒ² = 7.43
RΒ² = 1 - (7.43 / 31.25) = 1 - 0.238 = **0.762**
**Interpretation:**
| RΒ² Value | Meaning |
|---|---|
| **1.0** | Perfect predictions (model explains 100% of variance) |
| **0.76** | Model explains 76% of variance (good!) |
| **0.5** | Model explains 50% of variance (moderate) |
| **0.0** | Model is no better than just predicting the mean |
| **Negative** | Model is worse than predicting the mean (bad fit!) |
**Why RΒ² can be negative in recommendations:**
Unlike regression on continuous data, recommendation systems face challenges:
- Sparse data (most user-item pairs are empty)
- Implicit feedback (purchases don't mean "liked X amount")
- Cold start (new users/items have no history)
A negative RΒ² means the model's predictions are worse than just guessing the average
purchase quantity. This often happens with:
- SVD (treats missing as 0, distorts predictions)
- NMF (non-negativity constraint can be too restrictive)
- Models that overfit to training data
**When RΒ² is important:**
- Comparing models on the same dataset
- Understanding if latent factors capture meaningful patterns
- Deciding if a complex model justifies its complexity
---
#### Ranking Metrics vs Prediction Metrics β€” Which to Use?
| Scenario | Use These Metrics | Why |
|---|---|---|
| **Homepage "Recommended for You"** | Precision@K, Recall@K, Hit Rate | You care about showing relevant items, not predicting exact quantities |
| **"You might also like" section** | Precision@K, Hit Rate | Ranking quality matters most |
| **Inventory planning** | RMSE, RΒ² | You need accurate quantity predictions |
| **Revenue forecasting** | RMSE, RΒ² | Dollar amounts depend on predicted quantities |
| **A/B testing recommendations** | All metrics | Compare both ranking quality and prediction accuracy |
| **Model selection** | Precision@K + RMSE | Balance relevance with prediction accuracy |
**In practice:**
- **Precision@K** and **Hit Rate** are most important for user-facing recommendations
- **RMSE** and **RΒ²** are useful for business planning and understanding model quality
- A model can have good ranking (high Precision@K) but poor quantity prediction (low RΒ²)
- For e-commerce, ranking usually matters more than exact quantities
---
#### Simple Explanation for Non-Technical Stakeholders
**Precision@K:** "Out of 10 recommendations, how many did the customer actually want?"
β†’ Measures recommendation relevance
**Hit Rate:** "What % of customers got at least one good recommendation?"
β†’ Measures overall success rate
**RMSE:** "When we predict a customer will buy 5 units, how far off are we on average?"
β†’ Measures prediction error in units
**RΒ²:** "How much better are our predictions than just guessing the average?"
β†’ Measures model quality (1.0 = perfect, 0 = no better than average, negative = worse than average)
**The business perspective:** If you have 100,000 customers and Hit Rate is 8%, that's
8,000 customers who received a relevant product suggestion they wouldn't have found otherwise.
Even a 2% conversion rate on those turns into 160 additional sales β€” from a fully automated system.
**Comparing models:** The absolute numbers matter less than the *relative ranking*.
If ALS gets 8% Hit Rate and SVD gets 5%, ALS is clearly better at learning from
this data β€” even though both numbers seem small in isolation.
---
#### Why Random Baselines Help Contextualize
For reference, a random recommender on our dataset would achieve approximately:
- **Precision@10 β‰ˆ 0.003** (3,665 items, ~20 relevant β†’ 20/3,665 chance per slot)
- **Hit Rate β‰ˆ 5%** (10 random tries at a 0.5% per-item hit rate)
Any model scoring above these baselines has learned real patterns from the data.
"""
)
# ═══════════════════════════════════════════════════════════════════════════════
# Tab 3 β€” Interactive Recommendations
# ═══════════════════════════════════════════════════════════════════════════════
with tab_demo:
st.subheader("Interactive Recommendation Demo")
st.markdown(
"Select a customer and see what each model recommends. "
"You can also see what the customer has actually purchased to judge quality."
)
if not st.session_state.get("rec_models_trained"):
st.warning("Please train the models first in the **Model Comparison** tab.")
else:
train_df = st.session_state["train_df"]
item_desc = st.session_state["item_desc"]
item_map = st.session_state["item_map"]
customer_list = list(train_df.index)
active_customers = [c for c in customer_list if (train_df.loc[c] > 0).sum() >= 10]
selected_customer = st.selectbox(
"Select a Customer",
active_customers[:100],
format_func=lambda x: f"Customer {int(x)} ({int((train_df.loc[x] > 0).sum())} products purchased)",
)
user_idx = customer_list.index(selected_customer)
st.markdown("#### What this customer has purchased")
purchased = train_df.loc[selected_customer]
purchased_items = purchased[purchased > 0].sort_values(ascending=False).head(15)
purch_display = pd.DataFrame({
"StockCode": purchased_items.index,
"Quantity": purchased_items.values.astype(int),
"Description": [item_desc.get(sc, "N/A") for sc in purchased_items.index],
})
st.dataframe(purch_display, use_container_width=True, hide_index=True)
st.markdown("---")
st.markdown("#### Recommendations from each model")
model_preds = {
"SVD": st.session_state["svd_pred"],
"ALS": st.session_state["als_pred"],
"SGD (Funk SVD)": st.session_state["sgd_pred"],
"NMF": st.session_state["nmf_pred"],
"Item-CF": st.session_state["cf_pred"],
}
n_recs = st.slider("Number of recommendations", 5, 20, 10)
cols = st.columns(2)
for i, (model_name, pred_matrix) in enumerate(model_preds.items()):
with cols[i % 2]:
st.markdown(f"**{model_name}**")
top_items, top_scores = get_top_n_recommendations(
pred_matrix, train_df.values, user_idx, n=n_recs,
)
rec_display = pd.DataFrame({
"Rank": range(1, len(top_items) + 1),
"StockCode": [train_df.columns[j] for j in top_items],
"Score": [f"{s:.3f}" for s in top_scores],
"Description": [item_desc.get(train_df.columns[j], "N/A") for j in top_items],
})
st.dataframe(rec_display, use_container_width=True, hide_index=True)
st.markdown("---")
st.markdown(
"**How to evaluate these recommendations:** Look at the customer's purchase "
"history above and check if the recommended items make sense. For example, if "
"the customer bought a lot of kitchen items, good recommendations would include "
"other kitchen products they haven't purchased yet."
)