Spaces:
Sleeping
Sleeping
| import streamlit as st | |
| import pandas as pd | |
| import numpy as np | |
| import plotly.graph_objects as go | |
| from utils.rec_data_loader import build_interaction_matrix, get_rec_train_test | |
| from utils.rec_models import ( | |
| train_svd, train_als, train_sgd, train_nmf, train_item_cf, | |
| get_top_n_recommendations, evaluate_recommendations, | |
| ) | |
| st.set_page_config(page_title="Rec Models", page_icon="π―", layout="wide") | |
| st.title("Product Recommendation β Models") | |
| st.markdown("---") | |
| tab_how, tab_modern, tab_compare, tab_demo = st.tabs([ | |
| "How The Algorithms Work", "Modern Architecture & Industry", "Model Comparison", "Interactive Recommendations" | |
| ]) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 1 β How The Algorithms Work | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_how: | |
| st.subheader("Recommendation Algorithms β A Visual Guide") | |
| st.markdown( | |
| "All these algorithms solve the same problem: **fill in the empty cells** " | |
| "of the user-item matrix. They just approach it differently." | |
| ) | |
| # ββ Matrix Factorization overview ββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### The Core Idea: Matrix Factorization") | |
| st.markdown( | |
| "Matrix factorization is like discovering the **hidden reasons** behind purchases." | |
| ) | |
| st.graphviz_chart(""" | |
| digraph mf { | |
| rankdir=LR | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| R [label="User-Item Matrix\\nR (4338 Γ 3665)\\nMostly empty", fillcolor="#dbeafe", color="#3b82f6"] | |
| approx [label="β", shape=plaintext, fontsize=24] | |
| U [label="User Matrix\\nU (4338 Γ k)\\nEach user as k\\nlatent preferences", fillcolor="#d1fae5", color="#10b981"] | |
| times [label="Γ", shape=plaintext, fontsize=24] | |
| V [label="Item Matrix\\nV (k Γ 3665)\\nEach item as k\\nlatent attributes", fillcolor="#fce7f3", color="#ec4899"] | |
| R -> approx [arrowhead=none] | |
| approx -> U [arrowhead=none] | |
| U -> times [arrowhead=none] | |
| times -> V [arrowhead=none] | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| **Example:** | |
| Suppose there are 3 hidden factors: "kitchen items", "vintage style", "party supplies". | |
| - **User profile** (U): Customer 12345 = [0.9 kitchen, 0.1 vintage, 0.7 party] | |
| - **Item profile** (V): "Cake Stand" = [0.8 kitchen, 0.3 vintage, 0.5 party] | |
| - **Predicted score** = 0.9Γ0.8 + 0.1Γ0.3 + 0.7Γ0.5 = **1.10** β strong recommendation! | |
| The model **discovers** these latent factors automatically from the purchase patterns. | |
| We set the number of factors (k), and the algorithm figures out what they represent. | |
| """ | |
| ) | |
| with st.expander("Where do the vectors come from? (Step-by-step from scratch)", expanded=False): | |
| st.markdown( | |
| """ | |
| At the start, **no vectors exist**. We only have raw purchase data: | |
| | | Cake Stand | Candles | Baking Set | Bunting | | |
| |---|---|---|---|---| | |
| | **Customer A** | 5 | 3 | ? | 1 | | |
| | **Customer B** | 4 | ? | 2 | ? | | |
| | **Customer C** | ? | 4 | 5 | 3 | | |
| The `?` entries are what we want to predict. | |
| --- | |
| **Step 1 β Create random vectors** | |
| We just assign random numbers. With k=2 factors: | |
| | User | f1 | f2 | | |
| |---|---|---| | |
| | A | 0.3 | 0.7 | | |
| | B | 0.5 | 0.2 | | |
| | C | 0.1 | 0.9 | | |
| | Item | f1 | f2 | | |
| |---|---|---| | |
| | Cake Stand | 0.4 | 0.6 | | |
| | Candles | 0.8 | 0.3 | | |
| | Baking Set | 0.2 | 0.7 | | |
| | Bunting | 0.6 | 0.1 | | |
| These are random guesses. They are wrong. That's the starting point for training. | |
| --- | |
| **Step 2 β Predict using the full formula** | |
| The raw dot product gives small numbers (e.g. 0.54), but purchase counts are on | |
| a different scale. So the full prediction formula includes **bias terms**: | |
| """ | |
| ) | |
| st.latex(r"\text{predicted} = \underbrace{\mu}_{\text{global mean}} + \underbrace{b_u}_{\text{user bias}} + \underbrace{b_i}_{\text{item bias}} + \underbrace{\vec{u} \cdot \vec{v}}_{\text{dot product}}") | |
| st.markdown( | |
| """ | |
| | Term | Value | What it captures | | |
| |---|---|---| | |
| | **ΞΌ** (global mean) | 3.4 | Average of all known values in the matrix | | |
| | **bα΅€** (user bias) | +0.5 | Customer A buys more than the average customer | | |
| | **bα΅’** (item bias) | +0.4 | Cake Stand is purchased more than the average item | | |
| | **dot product** | 0.54 | Specific affinity between this user and this item | | |
| `predicted = 3.4 + 0.5 + 0.4 + 0.54 = 4.84 β 5` β | |
| Without biases, the model would see enormous errors during early training | |
| and the vectors would take much longer to converge β or might not converge at all. | |
| --- | |
| **Step 3 β Calculate error, adjust, repeat** | |
| `error = actual β predicted = 5 β 4.84 = 0.16` | |
| The algorithm nudges the vectors and biases slightly to reduce this error, | |
| then moves to the next rating and does the same. After thousands of passes | |
| (epochs), the vectors converge to values that accurately reconstruct all | |
| the known ratings β and can predict the `?` cells. | |
| The individual algorithm walkthroughs below show exactly how each method | |
| does this differently. | |
| """ | |
| ) | |
| # ββ SVD ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### 1. SVD (Singular Value Decomposition)") | |
| st.markdown("*The classic matrix factorization technique from linear algebra.*") | |
| st.graphviz_chart(""" | |
| digraph svd { | |
| rankdir=LR | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| R [label="R\\nUser-Item\\nMatrix", fillcolor="#dbeafe", color="#3b82f6"] | |
| eq [label="=", shape=plaintext, fontsize=24] | |
| U [label="U\\nUser factors\\n(users Γ k)", fillcolor="#d1fae5", color="#10b981"] | |
| S [label="Ξ£\\nSingular values\\n(k Γ k diagonal)\\nimportance weights", fillcolor="#fef3c7", color="#f59e0b"] | |
| Vt [label="Vα΅\\nItem factors\\n(k Γ items)", fillcolor="#fce7f3", color="#ec4899"] | |
| R -> eq [arrowhead=none] | |
| eq -> U [arrowhead=none] | |
| U -> S [label=" Γ "] | |
| S -> Vt [label=" Γ "] | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| **How it works:** | |
| 1. SVD decomposes the user-item matrix R into three matrices: **U Γ Ξ£ Γ Vα΅** | |
| 2. **U** captures user preferences in k-dimensional "latent space" | |
| 3. **Ξ£** (Sigma) is a diagonal matrix of importance weights β the first few factors | |
| capture the most important patterns, the rest are noise | |
| 4. **Vα΅** captures item characteristics in the same latent space | |
| 5. We keep only the top **k** factors (truncated SVD) β this filters noise and | |
| fills in the missing values with predictions | |
| **Strengths:** Mathematically elegant, well-understood, fast with sparse matrices. | |
| **Weaknesses:** Treats missing entries as zero (problematic for implicit feedback β | |
| "didn't buy" β "doesn't like"). | |
| """ | |
| ) | |
| with st.expander("Step-by-step numerical walkthrough β SVD", expanded=False): | |
| st.markdown( | |
| """ | |
| **Starting point β the user-item matrix (purchase counts):** | |
| | | Cake Stand | Candles | Baking Set | Bunting | | |
| |---|---|---|---|---| | |
| | **Customer A** | 5 | 3 | ? | 1 | | |
| | **Customer B** | 4 | ? | 2 | ? | | |
| | **Customer C** | ? | 4 | 5 | 3 | | |
| `?` = never purchased (stored as 0 in the matrix). | |
| **Step 1 β Decompose the matrix** | |
| SVD splits this matrix into three pieces: R = U Γ Ξ£ Γ Vα΅ | |
| With k=2 factors, we get: | |
| - **U** (3 users Γ 2 factors) β each user as 2 numbers | |
| - **Ξ£** (2 Γ 2 diagonal) β importance weights | |
| - **Vα΅** (2 factors Γ 4 items) β each item as 2 numbers | |
| SVD computes this directly using linear algebra (no iterative training). | |
| **Step 2 β The decomposition gives us vectors** | |
| After SVD: | |
| | User | Factor 1 | Factor 2 | | |
| |---|---|---| | |
| | A | 0.72 | -0.31 | | |
| | B | 0.58 | -0.44 | | |
| | C | 0.85 | 0.52 | | |
| | Item | Factor 1 | Factor 2 | | |
| |---|---|---| | |
| | Cake Stand | 0.68 | -0.25 | | |
| | Candles | 0.55 | 0.41 | | |
| | Baking Set | 0.71 | 0.50 | | |
| | Bunting | 0.43 | 0.38 | | |
| **Step 3 β Predict missing values** | |
| Customer B, Baking Set (was `?`): | |
| `predicted = [0.58, -0.44] Β· [0.71, 0.50] = (0.58Γ0.71) + (-0.44Γ0.50) = 0.41 - 0.22 = 0.19` | |
| Low score β weak recommendation. | |
| Customer A, Baking Set (was `?`): | |
| `predicted = [0.72, -0.31] Β· [0.71, 0.50] = (0.72Γ0.71) + (-0.31Γ0.50) = 0.51 - 0.16 = 0.36` | |
| Higher score β stronger recommendation than for B. | |
| **Key point:** SVD fills the `?` cells using the learned vectors. The math | |
| ensures that users with similar purchase patterns get similar vectors, and | |
| therefore similar predictions for unseen items. | |
| **Limitation:** SVD treats every `?` as 0 during decomposition. So "never bought" | |
| is treated the same as "bought 0 times" β which loses the distinction between | |
| "not interested" and "never seen this product." | |
| """ | |
| ) | |
| # ββ ALS ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### 2. ALS (Alternating Least Squares)") | |
| st.markdown("*The industry standard for implicit feedback β used at Spotify and Netflix.*") | |
| st.graphviz_chart(""" | |
| digraph als { | |
| rankdir=TB | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| init [label="Initialize\\nRandom user & item\\nfactor matrices", fillcolor="#dbeafe", color="#3b82f6"] | |
| fix_u [label="Fix User factors\\nOptimize Item factors\\n(least squares)", fillcolor="#d1fae5", color="#10b981"] | |
| fix_i [label="Fix Item factors\\nOptimize User factors\\n(least squares)", fillcolor="#fce7f3", color="#ec4899"] | |
| check [label="Check convergence\\n(error decreasing?)", fillcolor="#fef3c7", color="#f59e0b"] | |
| done [label="Final U and V\\nPredict: U Γ Vα΅", fillcolor="#e0e7ff", color="#6366f1"] | |
| init -> fix_u -> fix_i -> check | |
| check -> fix_u [label="not converged", style=dashed] | |
| check -> done [label="converged"] | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| **How it works:** | |
| 1. Start with random user factors (U) and item factors (V) | |
| 2. **Fix U**, solve for the best V (this is a standard least squares problem) | |
| 3. **Fix V**, solve for the best U (again, least squares) | |
| 4. **Alternate** between steps 2 and 3 until the error stops decreasing | |
| 5. Multiply U Γ Vα΅ to get predicted scores for every user-item pair | |
| **Why "Alternating"?** Optimizing U and V simultaneously is hard (non-convex). | |
| But if you fix one, optimizing the other becomes a simple, solvable least-squares | |
| problem. By alternating, we make steady progress toward a good solution. | |
| **Key advantage for implicit feedback:** ALS treats observed interactions (purchases) | |
| and unobserved ones (empty cells) differently. It assigns **confidence weights** β | |
| a purchase of 10 units gets higher confidence than 1 unit, while empty cells get | |
| low (but non-zero) confidence. This is exactly right for our e-commerce data. | |
| **Strengths:** Designed for implicit data, parallelizable, scales to millions of users. | |
| **Weaknesses:** More complex to implement, requires tuning regularization. | |
| """ | |
| ) | |
| with st.expander("Step-by-step numerical walkthrough β How ALS Actually Works", expanded=False): | |
| st.markdown( | |
| """ | |
| ALS (Alternating Least Squares) is different from gradient descent methods. | |
| Let me show you the real ALS algorithm with numbers. | |
| **Starting point β same matrix:** | |
| | | Cake Stand | Candles | Baking Set | Bunting | | |
| |---|---|---|---|---| | |
| | **Customer A** | 5 | 3 | ? | 1 | | |
| | **Customer B** | 4 | ? | 2 | ? | | |
| | **Customer C** | ? | 4 | 5 | 3 | | |
| --- | |
| **Step 1 β Initialize random vectors (k=2 factors)** | |
| | User | f1 | f2 | | |
| |---|---|---| | |
| | A | 0.3 | 0.7 | | |
| | B | 0.5 | 0.2 | | |
| | C | 0.1 | 0.9 | | |
| | Item | f1 | f2 | | |
| |---|---|---| | |
| | Cake Stand | 0.4 | 0.6 | | |
| | Candles | 0.8 | 0.3 | | |
| | Baking Set | 0.2 | 0.7 | | |
| | Bunting | 0.6 | 0.1 | | |
| --- | |
| **Step 2 β Fix User vectors, Solve for ALL Item vectors at once** | |
| This is where ALS is different! Instead of adjusting one vector at a time, | |
| we solve for ALL item vectors simultaneously using a mathematical formula. | |
| For **Cake Stand** (bought by A and B): | |
| Users who bought it: | |
| - Customer A: rating=5, vector=[0.3, 0.7] | |
| - Customer B: rating=4, vector=[0.5, 0.2] | |
| Build matrices: | |
| ``` | |
| U = [[0.3, 0.7], (users who bought Cake Stand) | |
| [0.5, 0.2]] | |
| r = [5, 4] (their ratings) | |
| ``` | |
| The closed-form solution: | |
| """ | |
| ) | |
| st.latex(r"v_{\text{CakeStand}} = (U^T U + \lambda I)^{-1} U^T r") | |
| st.markdown( | |
| """ | |
| With regularization Ξ»=0.1: | |
| ``` | |
| U^T U = [[0.34, 0.31], | |
| [0.31, 0.53]] | |
| U^T U + Ξ»I = [[0.44, 0.31], | |
| [0.31, 0.63]] | |
| U^T r = [3.5, 4.3] | |
| Invert and multiply: | |
| v_CakeStand = [2.8, 5.1] β New vector! | |
| ``` | |
| **Key difference:** We didn't use a learning rate or take small steps. | |
| We solved the equation exactly using matrix math! | |
| Do this for ALL items: | |
| - Cake Stand β [2.8, 5.1] | |
| - Candles β [1.9, 3.2] | |
| - Baking Set β [2.1, 4.8] | |
| - Bunting β [1.5, 2.9] | |
| --- | |
| **Step 3 β Fix Item vectors, Solve for ALL User vectors at once** | |
| Now flip it! Fix the new item vectors and solve for user vectors. | |
| For **Customer A** (bought Cake Stand, Candles, Bunting): | |
| Items they bought: | |
| - Cake Stand: rating=5, vector=[2.8, 5.1] | |
| - Candles: rating=3, vector=[1.9, 3.2] | |
| - Bunting: rating=1, vector=[1.5, 2.9] | |
| Build matrices: | |
| ``` | |
| V = [[2.8, 5.1], (items Customer A bought) | |
| [1.9, 3.2], | |
| [1.5, 2.9]] | |
| r = [5, 3, 1] (Customer A's ratings) | |
| ``` | |
| Same formula: | |
| """ | |
| ) | |
| st.latex(r"u_A = (V^T V + \lambda I)^{-1} V^T r") | |
| st.markdown( | |
| """ | |
| Result: `u_A = [0.82, 0.91]` β New user vector! | |
| Do this for ALL users: | |
| - Customer A β [0.82, 0.91] | |
| - Customer B β [0.73, 0.45] | |
| - Customer C β [0.31, 0.88] | |
| --- | |
| **Step 4 β Alternate until convergence** | |
| | Iteration | What happens | Error | | |
| |---|---|---| | |
| | 1 | Fix U, solve all V | 3.2 | | |
| | 2 | Fix V, solve all U | 1.8 | | |
| | 3 | Fix U, solve all V | 0.9 | | |
| | 4 | Fix V, solve all U | 0.4 | | |
| | 5 | Fix U, solve all V | 0.2 | | |
| | 6 | Fix V, solve all U | 0.1 | | |
| After ~6 iterations, vectors converge! | |
| (Compare this to SGD which needs 100+ iterations) | |
| --- | |
| **Step 5 β Make predictions** | |
| Customer A β Baking Set (`?`): | |
| `predicted = [0.82, 0.91] Β· [2.1, 4.8] = 0.82Γ2.1 + 0.91Γ4.8 = 1.72 + 4.37 = 6.09` | |
| Recommend Baking Set to Customer A! | |
| --- | |
| **Why ALS is different:** | |
| | Aspect | ALS | SGD (shown in next section) | | |
| |---|---|---| | |
| | **Updates** | All users at once (or all items at once) | One rating at a time | | |
| | **Math** | Matrix inversion: `(A^T A)^(-1) A^T b` | Gradient descent: `v += Ξ± Γ error Γ u` | | |
| | **Learning rate** | Not needed (exact solution) | Required (Ξ± = 0.01) | | |
| | **Iterations** | Few (~10-20) | Many (100-200) | | |
| | **Per iteration** | Slow (matrix operations) | Fast (simple arithmetic) | | |
| | **Parallelization** | Easy (each user/item independent) | Hard (sequential updates) | | |
| | **Best for** | Large-scale systems (Spotify, Netflix) | Small datasets, research | | |
| --- | |
| **Simple analogy:** | |
| **ALS:** You're a teacher grading 100 students. You collect ALL math tests, | |
| grade them all at once using an answer key, then collect ALL English tests | |
| and grade those. You alternate subjects, but grade all students per subject. | |
| **SGD:** You grade one student's math test, then their English test, then | |
| the next student's math test, then their English test. One at a time. | |
| ALS is much faster when you have many students (users/items) because you | |
| can split the grading across multiple teachers (parallel processing). | |
| """ | |
| ) | |
| # ββ SGD ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### 3. SGD Matrix Factorization (Funk SVD)") | |
| st.markdown("*The algorithm that powered the Netflix Prize β learns vectors one rating at a time.*") | |
| st.graphviz_chart(""" | |
| digraph sgd { | |
| rankdir=TB | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| init [label="Initialize random\\nuser vectors, item vectors,\\nuser biases, item biases", fillcolor="#dbeafe", color="#3b82f6"] | |
| pick [label="Pick one observed\\nrating (u, i, r)", fillcolor="#e0e7ff", color="#6366f1"] | |
| pred [label="Predict\\nΞΌ + bα΅€ + bα΅’ + uβΒ·vβ", fillcolor="#d1fae5", color="#10b981"] | |
| err [label="Error = actual β predicted", fillcolor="#fef3c7", color="#f59e0b"] | |
| upd [label="Nudge vectors & biases\\nin direction that reduces error", fillcolor="#fce7f3", color="#ec4899"] | |
| check [label="More ratings\\nin this epoch?", fillcolor="#e0e7ff", color="#6366f1"] | |
| epoch [label="More epochs?", fillcolor="#dbeafe", color="#3b82f6"] | |
| done [label="Final prediction matrix\\nΞΌ + bα΅€ + bα΅’ + UΓVα΅", fillcolor="#d1fae5", color="#10b981"] | |
| init -> pick -> pred -> err -> upd -> check | |
| check -> pick [label="yes", style=dashed] | |
| check -> epoch [label="no (epoch done)"] | |
| epoch -> pick [label="yes", style=dashed] | |
| epoch -> done [label="no (converged)"] | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| **How it works:** | |
| 1. Start with random user vectors, item vectors, and bias terms | |
| 2. For each observed interaction, predict the score using the full formula: | |
| **ΞΌ + bα΅€ + bα΅’ + user_vec Β· item_vec** | |
| 3. Compute the error (actual β predicted) | |
| 4. Nudge the vectors and biases slightly to reduce that error (gradient descent) | |
| 5. Repeat for every observed interaction (one epoch), then repeat for many epochs | |
| **Why this is different from pure SVD:** | |
| - Pure SVD decomposes the *entire* matrix including zeros (missing = dislike) | |
| - SGD only trains on *observed interactions* β it skips empty cells entirely | |
| - SGD includes bias terms that anchor predictions to the right scale | |
| **Why "Funk SVD"?** Simon Funk published this approach during the Netflix Prize | |
| (2006). It's not technically SVD β it's gradient descent β but the name stuck. | |
| This was the breakthrough that made matrix factorization practical for recommendation. | |
| **Strengths:** Handles missing data correctly, includes bias terms, flexible, well-understood. | |
| **Weaknesses:** Slow (one rating at a time), sensitive to learning rate, can overfit without regularization. | |
| """ | |
| ) | |
| with st.expander("Step-by-step numerical walkthrough β SGD (Funk SVD)", expanded=False): | |
| st.markdown( | |
| """ | |
| **Simple explanation first:** | |
| SGD (Stochastic Gradient Descent) learns by looking at one purchase at a time: | |
| 1. "Customer A bought 5 Cake Stands" | |
| 2. Predict how many they'd buy using current vectors: "I predict 3.9" | |
| 3. Calculate error: "I was off by 1.1 (too low)" | |
| 4. Adjust vectors slightly: "Make Customer A's vector a bit more like Cake Stand's vector" | |
| 5. Move to next purchase and repeat | |
| After seeing all purchases many times (100+ epochs), the vectors learn the patterns. | |
| **Key insight:** It's like learning to cook by tasting after each ingredient. | |
| You add salt, taste, adjust. Add pepper, taste, adjust. Small corrections each time. | |
| --- | |
| **Now the detailed math:** | |
| **Starting point β same matrix, but we only use observed entries:** | |
| | | Cake Stand | Candles | Baking Set | Bunting | | |
| |---|---|---|---|---| | |
| | **Customer A** | 5 | 3 | *skip* | 1 | | |
| | **Customer B** | 4 | *skip* | 2 | *skip* | | |
| | **Customer C** | *skip* | 4 | 5 | 3 | | |
| Unlike SVD, the `?` cells are ignored during training β not treated as zero. | |
| --- | |
| **Step 1 β Initialize everything** | |
| Global mean: ΞΌ = (5+3+1+4+2+4+5+3) / 8 = **3.375** | |
| Random biases (start near zero): | |
| | | Bias | | |
| |---|---| | |
| | b_A | +0.01 | | |
| | b_B | -0.02 | | |
| | b_C | +0.01 | | |
| | b_CakeStand | +0.03 | | |
| | b_Candles | -0.01 | | |
| | b_BakingSet | +0.02 | | |
| | b_Bunting | -0.01 | | |
| Random vectors (k=2): | |
| | User | f1 | f2 | | |
| |---|---|---| | |
| | A | 0.30 | 0.70 | | |
| | B | 0.50 | 0.20 | | |
| | C | 0.10 | 0.90 | | |
| | Item | f1 | f2 | | |
| |---|---|---| | |
| | Cake Stand | 0.40 | 0.60 | | |
| | Candles | 0.80 | 0.30 | | |
| | Baking Set | 0.20 | 0.70 | | |
| | Bunting | 0.60 | 0.10 | | |
| --- | |
| **Step 2 β Process first rating: Customer A β Cake Stand (actual = 5)** | |
| """ | |
| ) | |
| st.latex(r"\text{pred} = \mu + b_A + b_{\text{CakeStand}} + \vec{u}_A \cdot \vec{v}_{\text{CakeStand}}") | |
| st.markdown( | |
| """ | |
| `pred = 3.375 + 0.01 + 0.03 + (0.30Γ0.40 + 0.70Γ0.60)` | |
| `= 3.375 + 0.01 + 0.03 + (0.12 + 0.42)` | |
| `= 3.375 + 0.04 + 0.54 = 3.955` | |
| `error = 5 - 3.955 = 1.045` | |
| --- | |
| **Step 3 β Update using learning rate Ξ±=0.01 and regularization Ξ»=0.02** | |
| """ | |
| ) | |
| st.latex(r"b_u \leftarrow b_u + \alpha \cdot (\text{error} - \lambda \cdot b_u)") | |
| st.latex(r"\vec{u} \leftarrow \vec{u} + \alpha \cdot (\text{error} \cdot \vec{v} - \lambda \cdot \vec{u})") | |
| st.markdown( | |
| """ | |
| **Bias updates:** | |
| `b_A = 0.01 + 0.01 Γ (1.045 - 0.02 Γ 0.01) = 0.01 + 0.01044 = 0.0204` | |
| `b_CakeStand = 0.03 + 0.01 Γ (1.045 - 0.02 Γ 0.03) = 0.03 + 0.01044 = 0.0404` | |
| **Vector updates:** | |
| `A_f1 = 0.30 + 0.01 Γ (1.045 Γ 0.40 - 0.02 Γ 0.30) = 0.30 + 0.00412 = 0.3041` | |
| `A_f2 = 0.70 + 0.01 Γ (1.045 Γ 0.60 - 0.02 Γ 0.70) = 0.70 + 0.00613 = 0.7061` | |
| `CakeStand_f1 = 0.40 + 0.01 Γ (1.045 Γ 0.30 - 0.02 Γ 0.40) = 0.40 + 0.00306 = 0.4031` | |
| `CakeStand_f2 = 0.60 + 0.01 Γ (1.045 Γ 0.70 - 0.02 Γ 0.60) = 0.60 + 0.00720 = 0.6072` | |
| Small adjustments, but in the right direction. The regularization term | |
| (Ξ» Γ current_value) prevents weights from growing too large. | |
| --- | |
| **Step 4 β Process next rating: Customer A β Candles (actual = 3)** | |
| Using updated vectors for A (but original vectors for Candles): | |
| `pred = 3.375 + 0.0204 + (-0.01) + (0.3041Γ0.80 + 0.7061Γ0.30)` | |
| `= 3.375 + 0.0104 + (0.2433 + 0.2118) = 3.375 + 0.0104 + 0.4551 = 3.840` | |
| `error = 3 - 3.840 = -0.840` (prediction too high β nudge down) | |
| Updates push vectors in the opposite direction this time. | |
| --- | |
| **Step 5 β After one epoch (all 8 ratings processed)** | |
| | Rating processed | Error | | |
| |---|---| | |
| | A β Cake Stand | +1.045 | | |
| | A β Candles | -0.840 | | |
| | A β Bunting | -2.125 | | |
| | B β Cake Stand | +0.612 | | |
| | B β Baking Set | -1.389 | | |
| | C β Candles | +0.738 | | |
| | C β Baking Set | +1.455 | | |
| | C β Bunting | -0.204 | | |
| Average absolute error β 1.05 | |
| --- | |
| **Step 6 β After many epochs** | |
| | Epoch | Avg Absolute Error | | |
| |---|---| | |
| | 1 | 1.05 | | |
| | 10 | 0.52 | | |
| | 50 | 0.12 | | |
| | 100 | 0.04 | | |
| --- | |
| **Step 7 β Predict missing values** | |
| Customer A β Baking Set (`?`): | |
| `pred = ΞΌ + b_A + b_BakingSet + u_A Β· v_BakingSet` | |
| With trained values: `= 3.375 + 0.45 + 0.38 + 0.62 = 4.83` β recommend! | |
| Customer B β Bunting (`?`): | |
| `= 3.375 + (-0.15) + (-0.28) + 0.31 = 3.26` β moderate recommendation | |
| The bias terms are critical β they handle the difference between | |
| heavy buyers (high b_u) and light buyers (low b_u), and popular | |
| items (high b_i) versus niche items (low b_i). | |
| """ | |
| ) | |
| # ββ Simple Comparison: ALS vs SGD ββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| with st.expander("π― ALS vs SGD β Simple Comparison for Non-Technical Audience", expanded=False): | |
| st.markdown( | |
| """ | |
| Both ALS and SGD learn user and item "profiles" (vectors) to make recommendations. | |
| They both end up with similar results, but they learn differently. | |
| --- | |
| ### The Restaurant Analogy | |
| **Scenario:** You're training staff to predict how much customers will tip. | |
| You have 100 customers and 50 waiters. Each customer-waiter pair has a tip amount. | |
| **SGD approach (One at a time):** | |
| - Look at Customer 1 with Waiter A: tip was $5, you predicted $3 | |
| - "Oops, too low! Adjust Customer 1's 'generosity score' up a bit" | |
| - "Also adjust Waiter A's 'service score' up a bit" | |
| - Now look at Customer 1 with Waiter B: tip was $2, you predicted $4 | |
| - "Oops, too high! Adjust both scores down a bit" | |
| - Continue one interaction at a time through all 5,000 customer-waiter pairs | |
| - Repeat this process 100 times until predictions are accurate | |
| **ALS approach (All at once, alternating):** | |
| - Round 1: Fix all customer scores, solve for ALL waiter scores at once using math | |
| - Round 2: Fix all waiter scores, solve for ALL customer scores at once using math | |
| - Round 3: Fix all customer scores, solve for ALL waiter scores at once | |
| - After ~10 rounds, you're done! | |
| --- | |
| ### Key Differences in Plain English | |
| | Aspect | SGD | ALS | | |
| |--------|-----|-----| | |
| | **Learning style** | One interaction at a time, like studying flashcards | All at once per round, like grading all tests together | | |
| | **Speed per round** | Fast (simple math) | Slow (complex matrix math) | | |
| | **Rounds needed** | Many (100-200) | Few (10-20) | | |
| | **Total time** | Moderate | Fast for large datasets | | |
| | **Parallelization** | Hard (must go in order) | Easy (can split work across computers) | | |
| | **Best for** | Small datasets, research, when you need bias terms | Large-scale production (Spotify, Netflix) | | |
| --- | |
| ### When to Use Which? | |
| **Use SGD when:** | |
| - You have a small dataset (< 100K interactions) | |
| - You're doing research or prototyping | |
| - You need explicit bias terms (ΞΌ + b_u + b_i) | |
| - You want to understand the algorithm step-by-step | |
| **Use ALS when:** | |
| - You have a large dataset (millions of interactions) | |
| - You're building a production system | |
| - You have multiple servers/cores to parallelize | |
| - Your data is implicit feedback (purchases, clicks, not ratings) | |
| --- | |
| ### The Math Difference (For Technical Readers) | |
| **SGD update rule:** | |
| ``` | |
| error = actual - predicted | |
| u β u + Ξ± Γ (error Γ v - Ξ» Γ u) | |
| v β v + Ξ± Γ (error Γ u - Ξ» Γ v) | |
| ``` | |
| Small steps guided by the error gradient. | |
| **ALS update rule:** | |
| ``` | |
| Fix U, solve for all V: | |
| v_i = (U^T U + Ξ»I)^(-1) U^T r_i | |
| Fix V, solve for all U: | |
| u_j = (V^T V + Ξ»I)^(-1) V^T r_j | |
| ``` | |
| Closed-form solution using matrix inversion. | |
| --- | |
| ### Real-World Example | |
| **Spotify's Discover Weekly (uses ALS):** | |
| - 500 million users | |
| - 100 million songs | |
| - Billions of listening events per day | |
| - Needs to update recommendations daily | |
| - Solution: ALS parallelized across thousands of servers | |
| **Netflix Prize (used SGD):** | |
| - Research competition (2006) | |
| - 500K users, 18K movies | |
| - 100 million ratings (explicit 1-5 stars) | |
| - Goal: Best accuracy, not fastest training | |
| - Solution: SGD with carefully tuned learning rate and regularization | |
| --- | |
| ### Bottom Line | |
| **SGD:** Like a student who learns by doing practice problems one at a time. | |
| Slow but steady, easy to understand. | |
| **ALS:** Like a teacher who grades all tests at once using an answer key. | |
| Fewer iterations, but each iteration is more work. Much faster at scale. | |
| Both get you to the same destination β accurate recommendations! | |
| """ | |
| ) | |
| # ββ NMF ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### 4. NMF (Non-negative Matrix Factorization)") | |
| st.markdown("*Matrix factorization with a twist: everything stays positive.*") | |
| st.graphviz_chart(""" | |
| digraph nmf { | |
| rankdir=LR | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| R [label="R\\nUser-Item Matrix\\n(all values β₯ 0)", fillcolor="#dbeafe", color="#3b82f6"] | |
| approx [label="β", shape=plaintext, fontsize=24] | |
| W [label="W\\nUser factors\\n(all values β₯ 0)\\nadditive parts only", fillcolor="#d1fae5", color="#10b981"] | |
| times [label="Γ", shape=plaintext, fontsize=24] | |
| H [label="H\\nItem factors\\n(all values β₯ 0)\\nadditive parts only", fillcolor="#fce7f3", color="#ec4899"] | |
| R -> approx [arrowhead=none] | |
| approx -> W [arrowhead=none] | |
| W -> times [arrowhead=none] | |
| times -> H [arrowhead=none] | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| **How it's different from SVD:** | |
| - SVD factors can be negative (hard to interpret: what does a "-0.5 kitchen" preference mean?) | |
| - NMF forces **all values to be β₯ 0** β factors are purely additive | |
| **Why non-negative is useful:** | |
| - A user is "0.8 kitchen + 0.3 vintage + 0 party" β easy to interpret | |
| - Each factor represents an additive "part" of the user's taste | |
| - The results are **naturally sparse** β many factors are exactly zero | |
| - This makes NMF factors more interpretable as "topics" or "themes" | |
| For example, if SVD says "you like kitchen items but with a negative party component", | |
| NMF says "you're 80% kitchen enthusiast, 30% vintage lover, and don't care about party | |
| supplies." The NMF interpretation is more natural for recommendation. | |
| **Strengths:** Interpretable factors, naturally sparse, good for parts-based decomposition. | |
| **Weaknesses:** Only works with non-negative data, sensitive to initialization. | |
| """ | |
| ) | |
| with st.expander("Step-by-step numerical walkthrough β NMF", expanded=False): | |
| st.markdown( | |
| """ | |
| NMF follows the same learn-vectors-from-data approach, but with one strict rule: | |
| **every number must be β₯ 0**. No negatives allowed. | |
| **Starting point β same matrix:** | |
| | | Cake Stand | Candles | Baking Set | Bunting | | |
| |---|---|---|---|---| | |
| | **Customer A** | 5 | 3 | ? | 1 | | |
| | **Customer B** | 4 | ? | 2 | ? | | |
| | **Customer C** | ? | 4 | 5 | 3 | | |
| All values are already β₯ 0 (purchase counts), so NMF can work with this directly. | |
| --- | |
| **Step 1 β Create random non-negative vectors (k=2)** | |
| | User | f1 ("kitchen") | f2 ("decor") | | |
| |---|---|---| | |
| | A | 0.8 | 0.2 | | |
| | B | 0.6 | 0.3 | | |
| | C | 0.1 | 0.9 | | |
| | Item | f1 ("kitchen") | f2 ("decor") | | |
| |---|---|---| | |
| | Cake Stand | 0.9 | 0.1 | | |
| | Candles | 0.3 | 0.7 | | |
| | Baking Set | 0.2 | 0.8 | | |
| | Bunting | 0.1 | 0.6 | | |
| Notice: no negative values anywhere. The factors labeled "kitchen" and "decor" | |
| are what the algorithm might discover β we don't name them, but after training | |
| they often correspond to interpretable themes. | |
| --- | |
| **Step 2 β Predict** | |
| Customer A β Cake Stand: | |
| `predicted = [0.8, 0.2] Β· [0.9, 0.1] = 0.72 + 0.02 = 0.74` | |
| Actual = 5. Error = 5 - 0.74 = 4.26 (large). | |
| --- | |
| **Step 3 β Update rule (multiplicative, not additive)** | |
| Here's what makes NMF different from SGD. Instead of adding/subtracting a gradient, | |
| NMF uses **multiplicative updates** that guarantee values stay non-negative: | |
| """ | |
| ) | |
| st.latex(r"W_{ij} \leftarrow W_{ij} \times \frac{(R \cdot H^T)_{ij}}{(W \cdot H \cdot H^T)_{ij}}") | |
| st.latex(r"H_{ij} \leftarrow H_{ij} \times \frac{(W^T \cdot R)_{ij}}{(W^T \cdot W \cdot H)_{ij}}") | |
| st.markdown( | |
| """ | |
| The key insight: we **multiply** by a ratio. Since all values are positive, | |
| the ratio is positive, so the result stays positive. If the ratio > 1, | |
| the value increases; if < 1, it decreases. Values can approach zero but | |
| never go negative. | |
| --- | |
| **Step 4 β After many iterations** | |
| | User | f1 ("kitchen") | f2 ("decor") | | |
| |---|---|---| | |
| | A | 2.1 | 0.0 | | |
| | B | 1.7 | 0.4 | | |
| | C | 0.0 | 2.3 | | |
| | Item | f1 ("kitchen") | f2 ("decor") | | |
| |---|---|---| | |
| | Cake Stand | 2.3 | 0.1 | | |
| | Candles | 0.5 | 1.6 | | |
| | Baking Set | 0.2 | 2.1 | | |
| | Bunting | 0.3 | 1.2 | | |
| Notice how some values are exactly 0 β NMF produces **sparse factors**. | |
| **Reading the result:** | |
| - Customer A: [2.1, 0.0] β purely a "kitchen" buyer, no "decor" interest | |
| - Customer C: [0.0, 2.3] β purely a "decor" buyer | |
| - Cake Stand: [2.3, 0.1] β almost entirely a "kitchen" item | |
| - Baking Set: [0.2, 2.1] β mostly a "decor" item | |
| --- | |
| **Step 5 β Predict missing values** | |
| Customer A β Baking Set (`?`): | |
| `predicted = [2.1, 0.0] Β· [0.2, 2.1] = 0.42 + 0.0 = 0.42` β low score | |
| Makes sense β A is a "kitchen" buyer, Baking Set is mostly "decor." | |
| Customer C β Cake Stand (`?`): | |
| `predicted = [0.0, 2.3] Β· [2.3, 0.1] = 0.0 + 0.23 = 0.23` β low score | |
| Makes sense β C is a "decor" buyer, Cake Stand is mostly "kitchen." | |
| --- | |
| **Why NMF factors are more interpretable:** | |
| | Method | Factor example | Interpretation | | |
| |---|---|---| | |
| | SVD | [0.7, -0.3] | "0.7 of something, minus 0.3 of something else" β hard to explain | | |
| | NMF | [2.1, 0.0] | "Strong kitchen interest, zero decor interest" β clear and additive | | |
| Each factor in NMF represents a positive contribution. A user's taste is a | |
| sum of parts, not a mix of additions and subtractions. | |
| """ | |
| ) | |
| # ββ Item-Based CF ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### 5. Item-Based Collaborative Filtering") | |
| st.markdown("*No factorization β just find similar items.*") | |
| st.graphviz_chart(""" | |
| digraph itemcf { | |
| rankdir=TB | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| bought [label="Customer bought\\nCake Stand", fillcolor="#dbeafe", color="#3b82f6"] | |
| sim [label="Find items with\\nsimilar purchase patterns\\n(cosine similarity)", fillcolor="#e0e7ff", color="#6366f1"] | |
| items [label="Similar items:\\n1. Cake Tins (0.85)\\n2. Baking Set (0.79)\\n3. Cookie Cutters (0.72)", fillcolor="#d1fae5", color="#10b981"] | |
| rec [label="Recommend\\nthese items!", fillcolor="#fef3c7", color="#f59e0b"] | |
| bought -> sim -> items -> rec | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| **How it works:** | |
| 1. For every pair of items, compute **cosine similarity** based on their purchase patterns | |
| (items bought by similar sets of customers are similar) | |
| 2. When recommending for a user, look at items they've already bought | |
| 3. For each purchased item, find its most similar items | |
| 4. Rank by similarity score and recommend the top ones (excluding already-purchased) | |
| **Why cosine similarity?** It measures the angle between two vectors, ignoring magnitude. | |
| Two items with purchase patterns [1, 0, 1, 1] and [3, 0, 2, 4] have high cosine similarity | |
| (same direction) even though the quantities differ. | |
| **Strengths:** Simple, intuitive, no training needed, easy to explain. | |
| **Weaknesses:** Doesn't discover latent factors, doesn't work well with very sparse data, | |
| O(nΒ²) item-pair computation. | |
| """ | |
| ) | |
| with st.expander("Step-by-step numerical walkthrough β Item-Based CF", expanded=False): | |
| st.markdown( | |
| """ | |
| Item-CF takes a completely different approach: no vectors, no training, | |
| no latent factors. It works purely on similarity between items. | |
| **Starting point β the same user-item matrix:** | |
| | | Cake Stand | Candles | Baking Set | Bunting | | |
| |---|---|---|---|---| | |
| | **Customer A** | 5 | 3 | 0 | 1 | | |
| | **Customer B** | 4 | 0 | 2 | 0 | | |
| | **Customer C** | 0 | 4 | 5 | 3 | | |
| (0 = not purchased) | |
| --- | |
| **Step 1 β Represent each item as a vector of who bought it** | |
| Each column is already a vector: | |
| - Cake Stand: [5, 4, 0] | |
| - Candles: [3, 0, 4] | |
| - Baking Set: [0, 2, 5] | |
| - Bunting: [1, 0, 3] | |
| --- | |
| **Step 2 β Compute cosine similarity between every pair of items** | |
| Cosine similarity measures the angle between two vectors (ignoring magnitude): | |
| """ | |
| ) | |
| st.latex(r"\text{cosine}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| \times |\vec{b}|}") | |
| st.markdown( | |
| """ | |
| **Example β Cake Stand vs Candles:** | |
| `dot product = (5Γ3) + (4Γ0) + (0Γ4) = 15` | |
| `|Cake Stand| = β(5Β² + 4Β² + 0Β²) = β41 = 6.40` | |
| `|Candles| = β(3Β² + 0Β² + 4Β²) = β25 = 5.00` | |
| `cosine = 15 / (6.40 Γ 5.00) = 15 / 32.0 = 0.47` | |
| **Example β Candles vs Bunting:** | |
| `dot product = (3Γ1) + (0Γ0) + (4Γ3) = 15` | |
| `|Candles| = 5.00` | |
| `|Bunting| = β(1Β² + 0Β² + 3Β²) = β10 = 3.16` | |
| `cosine = 15 / (5.00 Γ 3.16) = 15 / 15.81 = 0.95` | |
| High similarity β Candles and Bunting are bought by the same types of customers. | |
| --- | |
| **Step 3 β Build the similarity matrix** | |
| After computing all pairs: | |
| | | Cake Stand | Candles | Baking Set | Bunting | | |
| |---|---|---|---|---| | |
| | **Cake Stand** | 1.00 | 0.47 | 0.31 | 0.16 | | |
| | **Candles** | 0.47 | 1.00 | 0.88 | 0.95 | | |
| | **Baking Set** | 0.31 | 0.88 | 1.00 | 0.93 | | |
| | **Bunting** | 0.16 | 0.95 | 0.93 | 1.00 | | |
| Candles, Baking Set, and Bunting cluster together (high mutual similarity). | |
| Cake Stand is somewhat different. | |
| --- | |
| **Step 4 β Generate recommendations for Customer B** | |
| Customer B bought: Cake Stand (4), Baking Set (2) | |
| For each item B hasn't bought, score it by similarity to what B already bought: | |
| **Candles:** | |
| `score = (sim(Candles, Cake Stand) Γ 4) + (sim(Candles, Baking Set) Γ 2)` | |
| `= (0.47 Γ 4) + (0.88 Γ 2) = 1.88 + 1.76 = 3.64` | |
| **Bunting:** | |
| `score = (sim(Bunting, Cake Stand) Γ 4) + (sim(Bunting, Baking Set) Γ 2)` | |
| `= (0.16 Γ 4) + (0.93 Γ 2) = 0.64 + 1.86 = 2.50` | |
| **Ranking:** Candles (3.64) > Bunting (2.50) | |
| β Recommend Candles first. | |
| --- | |
| **Why this works without any training:** | |
| There are no vectors to learn. The similarity matrix is computed directly | |
| from the data. When the data changes (new purchases), you just recompute | |
| the similarity scores β no model to retrain. | |
| **Trade-off:** This simplicity comes at a cost. With 10,000 products, you | |
| need to compute 50 million similarity pairs. Matrix factorization methods | |
| reduce each item to k numbers (e.g., 50), making everything faster. | |
| """ | |
| ) | |
| # ββ Comparison Table βββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### Quick Comparison") | |
| st.markdown( | |
| """ | |
| | | SVD | ALS | SGD (Funk SVD) | NMF | Item-Based CF | | |
| |---|---|---|---|---|---| | |
| | **Approach** | Decompose R = UΞ£Vα΅ | Alternate optimizing U and V | Gradient descent + bias terms | Decompose R β WΓH (non-negative) | Find similar items via cosine | | |
| | **Feedback type** | Explicit (treats 0 as 0) | Implicit (0 = unknown, not dislike) | Either (skips missing entries) | Non-negative values | Either | | |
| | **Bias terms** | No | Yes (implicit) | Yes (ΞΌ + bα΅€ + bα΅’) | No | No | | |
| | **Interpretability** | Low (negative factors) | Low | Low | High (additive parts) | Very high (similar items) | | |
| | **Scalability** | Good | Excellent (parallelizable) | Moderate (sequential updates) | Moderate | Poor for many items | | |
| | **Used by** | Research, baselines | Spotify, Netflix, LinkedIn | Netflix Prize, baselines | Topic modeling, parts decomposition | Amazon ("customers also bought") | | |
| """ | |
| ) | |
| # ββ Why Each Model ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### Why We Chose Each Model") | |
| st.markdown( | |
| """ | |
| #### SVD β The Mathematical Baseline | |
| | | | | |
| |---|---| | |
| | **Why we included it** | Same role as Logistic Regression in churn β the simplest factorization approach. If other models can't beat it, their complexity isn't adding value. | | |
| | **Where it's used** | Dimensionality reduction (PCA is SVD under the hood), text analysis (Latent Semantic Analysis), image compression. In recommendations: primarily as a baseline. | | |
| | **Key benefit** | One-step computation with no iterative training. Mathematically exact decomposition. | | |
| | **Limitation** | Treats every empty cell as zero β "didn't buy" is treated as "dislikes." This is a fundamental problem for e-commerce data. | | |
| | **When to pick this** | When you have explicit ratings (1β5 stars) with few missing values, or as a quick baseline before trying more sophisticated methods. | | |
| --- | |
| #### ALS β The Industry Standard | |
| | | | | |
| |---|---| | |
| | **Why we included it** | It's the go-to algorithm for implicit feedback (purchases, clicks, views) β which is exactly what our e-commerce data is. | | |
| | **Where it's used** | Spotify (Discover Weekly playlists), Netflix (movie recommendations), LinkedIn (job recommendations), Apple Music. | | |
| | **Key benefit** | Designed specifically for implicit data. Treats "didn't buy" as "unknown" (not "dislike"). Highly parallelizable β scales to billions of interactions. | | |
| | **Limitation** | Harder to implement from scratch. Requires careful tuning of regularization and confidence weights. | | |
| | **When to pick this** | When your data is implicit feedback (purchases, clicks, views) and you need to scale to large datasets. This is the default choice for most recommendation systems. | | |
| --- | |
| #### SGD (Funk SVD) β The Netflix Prize Winner | |
| | | | | |
| |---|---| | |
| | **Why we included it** | It's the algorithm that proved matrix factorization works for recommendations (Netflix Prize, 2006). It introduces bias terms that the other methods lack. | | |
| | **Where it's used** | The Netflix Prize competition. Research baselines. Systems where explicit ratings exist. Foundation for modern embedding-based models. | | |
| | **Key benefit** | Only trains on *observed* interactions (skips missing entries). Bias terms (ΞΌ + bα΅€ + bα΅’) anchor predictions to the correct scale. Flexible and well-understood. | | |
| | **Limitation** | Sequential updates (one rating at a time) β hard to parallelize. Slower than ALS on large datasets. Sensitive to learning rate. | | |
| | **When to pick this** | When you have explicit ratings and want the most theoretically grounded approach. Or when you need bias terms for better calibration. | | |
| --- | |
| #### NMF β The Interpretable Option | |
| | | | | |
| |---|---| | |
| | **Why we included it** | It produces factors you can actually explain β "this customer is 80% kitchen enthusiast, 20% vintage lover" β because all values are non-negative. | | |
| | **Where it's used** | Topic modeling in text analysis (discovering themes in documents), audio source separation, parts-based image decomposition, recommendation with interpretability requirements. | | |
| | **Key benefit** | The non-negativity constraint produces sparse, additive factors. Each factor represents a "part" of the user's taste. Easier to explain to business stakeholders than SVD or ALS. | | |
| | **Limitation** | Only works with non-negative data. Treats missing values as zero (same problem as SVD). Sensitive to initialization. | | |
| | **When to pick this** | When you need to explain *why* an item was recommended in terms of interpretable themes, or when factor analysis is a goal alongside recommendation. | | |
| --- | |
| #### Item-Based CF β The Simplest Approach | |
| | | | | |
| |---|---| | |
| | **Why we included it** | It's fundamentally different from the other four (no matrix factorization at all). Including it shows whether learning latent vectors actually helps versus just using raw similarity. | | |
| | **Where it's used** | Amazon ("Customers who bought this also bought..."), e-commerce product pages, any "similar items" widget. | | |
| | **Key benefit** | No training needed β just compute cosine similarity. Easy to explain: "We recommend this because people who bought what you bought also bought this." New items with a few purchases can be recommended immediately. | | |
| | **Limitation** | Requires computing similarity between every pair of items β O(nΒ²). Doesn't discover hidden patterns. Struggles with very sparse data. | | |
| | **When to pick this** | When you need a fast, explainable system without a training pipeline, or for "similar items" sections on product pages. | | |
| --- | |
| #### How They Complement Each Other | |
| | Role | Model | Why | | |
| |---|---|---| | |
| | **Baseline** | SVD | Sets the floor β simplest factorization | | |
| | **Production model** | ALS | Best for implicit feedback at scale | | |
| | **Calibrated predictions** | SGD | Bias terms give better-scaled scores | | |
| | **Interpretable factors** | NMF | When you need to explain *why* | | |
| | **Product page recs** | Item-CF | "Customers also bought" β no training needed | | |
| | **Live updates** | ALS (Live page) | Fast enough for repeated retraining | | |
| In a real deployment, you might use ALS for homepage "recommended for you," | |
| Item-CF for product pages ("similar items"), and NMF for internal analysis | |
| of customer segments. | |
| """ | |
| ) | |
| # ββ Full Picture βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### The Complete Training Pipeline") | |
| st.graphviz_chart(""" | |
| digraph pipeline { | |
| rankdir=TB | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| s1 [label="Start with random vectors\\n(user vectors + item vectors)", fillcolor="#dbeafe", color="#3b82f6"] | |
| s2 [label="Predict a rating\\nΞΌ + b_user + b_item + (uβ Β· vβ)", fillcolor="#d1fae5", color="#10b981"] | |
| s3 [label="Calculate error\\n(actual β predicted)", fillcolor="#fef3c7", color="#f59e0b"] | |
| s4 [label="Adjust vectors\\nto reduce error", fillcolor="#fce7f3", color="#ec4899"] | |
| s5 [label="Repeat for all ratings\\n(1 epoch)", fillcolor="#e0e7ff", color="#6366f1"] | |
| s6 [label="Repeat for many epochs\\nuntil error is small", fillcolor="#dbeafe", color="#3b82f6"] | |
| s7 [label="Vectors are now trained\\nUse them to predict missing values", fillcolor="#d1fae5", color="#10b981"] | |
| s1 -> s2 -> s3 -> s4 -> s5 -> s6 -> s7 | |
| s6 -> s2 [label="next epoch", style=dashed, constraint=false] | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| **Where each algorithm differs:** | |
| | Method | How it adjusts vectors (Step 4) | Bias terms? | Skips missing entries? | | |
| |---|---|---|---| | |
| | **SGD (Funk SVD)** | Adjust one rating at a time using gradient descent | Yes (ΞΌ + bα΅€ + bα΅’) | Yes β only trains on observed data | | |
| | **ALS** | Fix all user vectors, solve all item vectors exactly (then swap) | Yes (ΞΌ + bα΅€ + bα΅’) | Yes β uses confidence weights | | |
| | **SVD** | No training loop β computes vectors mathematically in one step | No (operates on raw matrix) | No β treats missing as 0 | | |
| | **NMF** | Multiplicative update: multiply by a ratio to stay non-negative | No (factors absorb the scale) | No β treats missing as 0 | | |
| | **Item-CF** | No vectors at all β computes similarity directly from purchase patterns | No (score is weighted similarity sum) | N/A | | |
| All matrix factorization methods share the same goal: **find vectors for users and items | |
| such that their dot product approximates the observed ratings/purchase counts.** | |
| They just use different math to get there. | |
| """ | |
| ) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 2 β Modern Architecture & Industry | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_modern: | |
| st.subheader("Modern Recommendation Architecture") | |
| st.markdown( | |
| "The algorithms we've built (SVD, SGD, ALS, NMF) are the **foundation** of recommendation systems. " | |
| "Here's how they evolved into what companies like YouTube, Spotify, and Netflix use today." | |
| ) | |
| # ββ Two-Stage Architecture βββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### Two-Stage Architecture") | |
| st.markdown( | |
| "Modern production recommenders don't use a single model. They split the problem " | |
| "into two stages to handle scale (millions of items) efficiently." | |
| ) | |
| st.graphviz_chart(""" | |
| digraph twostage { | |
| rankdir=TB | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| catalog [label="Full Product Catalog\\n(millions of items)", fillcolor="#dbeafe", color="#3b82f6"] | |
| subgraph cluster_stage1 { | |
| label="Stage 1 β Candidate Generation" | |
| style=dashed | |
| color="#6366f1" | |
| fontname="Helvetica" | |
| fontsize=11 | |
| tower [label="Two-Tower Neural Network\\nUser tower encodes user\\nItem tower encodes item", fillcolor="#e0e7ff", color="#6366f1"] | |
| embed [label="Embedding Similarity\\n(fast approximate\\nnearest neighbor search)", fillcolor="#e0e7ff", color="#6366f1"] | |
| cands [label="~1,000 Candidate Items\\n(rough filter, fast)", fillcolor="#fce7f3", color="#ec4899"] | |
| tower -> cands | |
| embed -> cands | |
| } | |
| subgraph cluster_stage2 { | |
| label="Stage 2 β Ranking Model" | |
| style=dashed | |
| color="#10b981" | |
| fontname="Helvetica" | |
| fontsize=11 | |
| features [label="Rich Features\\nuser profile + item metadata\\n+ context + history", fillcolor="#d1fae5", color="#10b981"] | |
| dnn [label="Deep Neural Network\\nscores each candidate", fillcolor="#d1fae5", color="#10b981"] | |
| ranked [label="Top 10-50 items\\n(precise ranking)", fillcolor="#fef3c7", color="#f59e0b"] | |
| features -> dnn -> ranked | |
| } | |
| catalog -> tower | |
| catalog -> embed | |
| cands -> features | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| | Stage | Purpose | Speed vs Quality | What we built | | |
| |---|---|---|---| | |
| | **Candidate Generation** | Quickly narrow millions of items to ~1,000 plausible candidates | Fast, lower precision | Our SVD/SGD/ALS/NMF β they produce a rough score for every item | | |
| | **Ranking** | Precisely rank those ~1,000 using rich features | Slower, high precision | Not in our demo β requires deep learning infrastructure | | |
| **Why two stages?** Scoring all items with a complex deep network is too slow. | |
| The first stage (where our algorithms live) acts as a fast filter. The second | |
| stage applies a heavyweight model only to the surviving candidates. | |
| """ | |
| ) | |
| # ββ What Matrix Factorization Became βββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### How Matrix Factorization Evolved Into Embeddings") | |
| st.graphviz_chart(""" | |
| digraph evolution { | |
| rankdir=LR | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| mf [label="Matrix Factorization\\n(2006-2015)\\n\\nuser_vector (k dims)\\nitem_vector (k dims)\\n\\nScore = dot product", fillcolor="#dbeafe", color="#3b82f6"] | |
| embed [label="Embedding Learning\\n(2015-present)\\n\\nuser_embedding (k dims)\\nitem_embedding (k dims)\\n\\nLearned inside neural nets", fillcolor="#d1fae5", color="#10b981"] | |
| deep [label="Deep Recommender\\n(current state)\\n\\nuser_embedding\\nitem_embedding\\n+ context features\\n+ sequence history\\n+ non-linear layers", fillcolor="#fef3c7", color="#f59e0b"] | |
| mf -> embed [label=" same concept,\\n new training"] | |
| embed -> deep [label=" add more\\n signals"] | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| Matrix factorization didn't disappear β it became the foundation of modern embeddings: | |
| | Concept | Matrix Factorization (what we built) | Deep Recommender (industry today) | | |
| |---|---|---| | |
| | **User representation** | User vector from SGD/ALS (k numbers) | User embedding learned by neural network | | |
| | **Item representation** | Item vector from SGD/ALS (k numbers) | Item embedding learned by neural network | | |
| | **Score computation** | Dot product: `user_vec Β· item_vec` | Neural network with multiple layers | | |
| | **Input data** | Interactions only (purchases, clicks) | Interactions + user profile + item metadata + time + context | | |
| | **Relationships** | Linear (dot product) | Non-linear (neural network can learn complex patterns) | | |
| | **New items** | Cannot handle (no interactions = no vector) | Can use item features (title, category, image) | | |
| The user/item vectors from our ALS model are conceptually identical to the | |
| embeddings inside a modern deep recommender β the difference is how they're | |
| trained and what additional signals they incorporate. | |
| """ | |
| ) | |
| # ββ YouTube Example ββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### Real Industry Example β YouTube") | |
| st.markdown( | |
| "YouTube's recommendation system (described in their 2016 paper *Deep Neural Networks " | |
| "for YouTube Recommendations*) follows the exact two-stage pattern:" | |
| ) | |
| st.graphviz_chart(""" | |
| digraph youtube { | |
| rankdir=TB | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| corpus [label="YouTube Video Corpus\\n(hundreds of millions of videos)", fillcolor="#dbeafe", color="#3b82f6"] | |
| subgraph cluster_yt1 { | |
| label="Candidate Generation" | |
| style=dashed | |
| color="#6366f1" | |
| fontname="Helvetica" | |
| yt_tower [label="Two-Tower DNN\\n\\nInputs: watch history,\\nsearch history, demographics\\n\\nOutput: user embedding", fillcolor="#e0e7ff", color="#6366f1"] | |
| yt_ann [label="Approximate Nearest\\nNeighbor Search\\n\\nβ ~hundreds of\\nvideo candidates", fillcolor="#fce7f3", color="#ec4899"] | |
| } | |
| subgraph cluster_yt2 { | |
| label="Ranking" | |
| style=dashed | |
| color="#10b981" | |
| fontname="Helvetica" | |
| yt_features [label="Hundreds of Features\\n\\nvideo age, channel quality,\\nuser language, device,\\ntime of day, watch %...", fillcolor="#d1fae5", color="#10b981"] | |
| yt_rank [label="Deep Ranking Network\\n\\nPredicts: P(watch)\\nP(like), P(share)...", fillcolor="#fef3c7", color="#f59e0b"] | |
| } | |
| yt_result [label="Top ~20 videos\\nshown to user", fillcolor="#d1fae5", color="#10b981"] | |
| corpus -> yt_tower -> yt_ann -> yt_features -> yt_rank -> yt_result | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| | Component | YouTube's Approach | Our Demo Equivalent | | |
| |---|---|---| | |
| | User representation | Learned embedding from watch/search history | User vector from SGD/ALS/SVD | | |
| | Item representation | Video embedding from features | Item vector from SGD/ALS/SVD | | |
| | Candidate generation | Two-tower DNN + ANN search | Our full prediction matrix | | |
| | Ranking | Deep network with hundreds of features | Not in our demo | | |
| | Scale | Hundreds of millions of videos, billions of users | 3,665 items, 4,338 users | | |
| """ | |
| ) | |
| # ββ Limitations of MF ββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| st.markdown("---") | |
| st.markdown("### Limitations of Matrix Factorization vs Deep Learning") | |
| st.markdown( | |
| """ | |
| | Limitation | Matrix Factorization (our models) | Deep Learning Solution | | |
| |---|---|---| | |
| | **Only uses interactions** | Can only learn from purchases β knows nothing about the product itself | Can incorporate item metadata (description, category, price, images) | | |
| | **Ignores context** | Same recommendation regardless of time, device, or situation | Uses context: time of day, device, location, season | | |
| | **No sequence awareness** | Treats all purchases equally β doesn't know the order | Sequence models (RNN, Transformer) understand "watched A, then B, likely wants C" | | |
| | **Linear relationships** | Score = dot product (linear combination of factors) | Neural networks learn non-linear patterns and complex interactions | | |
| | **Cold start** | New item with no purchases = no vector = cannot recommend | Can use item features to generate embedding even with zero interactions | | |
| | **Static** | User vector doesn't change between retraining | Real-time models update user embedding with every interaction | | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.markdown("### Where Our Demo Fits") | |
| st.graphviz_chart(""" | |
| digraph spectrum { | |
| rankdir=LR | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.2, style=dashed] | |
| pop [label="Popularity\\nBaseline\\n(recommend\\nbest-sellers)", fillcolor="#fee2e2", color="#ef4444"] | |
| cf [label="Item-Based CF\\n(cosine similarity)\\n\\nOur demo β", fillcolor="#fef3c7", color="#f59e0b"] | |
| mf [label="Matrix Factorization\\n(SVD, SGD, ALS, NMF)\\n\\nOur demo β", fillcolor="#d1fae5", color="#10b981"] | |
| fm [label="Factorization\\nMachines\\n(adds features\\nto MF)", fillcolor="#e0e7ff", color="#6366f1"] | |
| deep [label="Deep Learning\\n(two-tower,\\nTransformer)\\n\\nIndustry today", fillcolor="#dbeafe", color="#3b82f6"] | |
| pop -> cf -> mf -> fm -> deep | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| Our demo implements the **middle of the spectrum** β the algorithms that | |
| established the field and are still used as candidate generators in production | |
| systems. Matrix factorization remains the conceptual backbone: the user and item | |
| vectors we learn with ALS are the direct ancestors of the embeddings inside | |
| YouTube's, Spotify's, and Netflix's deep recommenders. | |
| **For a data science project, this is the right starting point.** You demonstrate | |
| the core concepts (latent factors, matrix decomposition, bias terms, implicit feedback), | |
| then scale up to deep learning when the business case justifies the infrastructure. | |
| """ | |
| ) | |
| st.markdown("---") | |
| st.info( | |
| "**Note:** The deep learning model described above (YouTube's **Two-Tower Deep Neural Network** " | |
| "with separate Candidate Generation and Ranking stages) is **explained but not implemented** " | |
| "in this demo. It requires TensorFlow/PyTorch, GPU resources, and significantly more data " | |
| "(YouTube trains on billions of interactions β our dataset has ~215K). Our 5 implemented models " | |
| "(SVD, ALS, SGD, NMF, Item-CF) are the foundational algorithms that YouTube's deep system evolved from." | |
| ) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 3 β Model Comparison | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_compare: | |
| st.subheader("Model Performance Comparison") | |
| st.markdown( | |
| "We train all five algorithms on the same training set and evaluate " | |
| "on held-out test interactions using Precision@10, Recall@10, and Hit Rate." | |
| ) | |
| interactions, user_map, item_map, item_desc, _ = build_interaction_matrix() | |
| train_df, train_sparse, test_df, _, _, _ = get_rec_train_test() | |
| train_dense = train_df.values.astype(np.float32) | |
| user_index = {cust: i for i, cust in enumerate(train_df.index)} | |
| item_columns = list(train_df.columns) | |
| n_factors = st.slider("Number of latent factors (k)", 10, 100, 50, step=10) | |
| st.caption( | |
| "π‘ **Note:** Training results are cached by Streamlit. When you change k and click train, " | |
| "the models will retrain with the new value. Subsequent clicks with the same k will be instant (cached). " | |
| "Watch the timing indicators to see actual training vs. cache hits." | |
| ) | |
| # Show if results exist from previous training | |
| if "rec_results" in st.session_state and "rec_n_factors" in st.session_state: | |
| prev_k = st.session_state.rec_n_factors | |
| if prev_k == n_factors: | |
| st.info(f"π Results from previous training with k={n_factors} are shown below. Click train button to refresh.") | |
| else: | |
| st.warning(f"β οΈ Current results are from k={prev_k}. Click train button to retrain with k={n_factors}.") | |
| if st.button("Train & Evaluate All Models", type="primary", use_container_width=True): | |
| import time | |
| start_time = time.time() | |
| results = {} | |
| progress_bar = st.progress(0, text="Starting training...") | |
| with st.spinner("Training SVD..."): | |
| t0 = time.time() | |
| svd_pred, *_ = train_svd(train_sparse, n_factors=n_factors) | |
| results["SVD"] = evaluate_recommendations(svd_pred, train_sparse, test_df, user_index, item_columns) | |
| elapsed = time.time() - t0 | |
| st.success(f"SVD trained ({elapsed:.2f}s)") | |
| progress_bar.progress(0.2, text="SVD complete, training ALS...") | |
| with st.spinner("Training ALS..."): | |
| t0 = time.time() | |
| als_model = train_als(train_sparse, n_factors=n_factors) | |
| als_user = als_model.user_factors | |
| als_item = als_model.item_factors | |
| als_pred = als_user @ als_item.T | |
| results["ALS"] = evaluate_recommendations(als_pred, train_sparse, test_df, user_index, item_columns) | |
| elapsed = time.time() - t0 | |
| st.success(f"ALS trained ({elapsed:.2f}s)") | |
| progress_bar.progress(0.4, text="ALS complete, training SGD...") | |
| with st.spinner("Training SGD (Funk SVD) β this may take ~60s on first run..."): | |
| t0 = time.time() | |
| sgd_pred, *_ = train_sgd(train_sparse, n_factors=n_factors, n_epochs=30) | |
| results["SGD"] = evaluate_recommendations(sgd_pred, train_sparse, test_df, user_index, item_columns) | |
| elapsed = time.time() - t0 | |
| st.success(f"SGD trained ({elapsed:.2f}s)") | |
| progress_bar.progress(0.6, text="SGD complete, training NMF...") | |
| with st.spinner("Training NMF..."): | |
| t0 = time.time() | |
| nmf_pred, *_ = train_nmf(train_dense, n_factors=n_factors) | |
| results["NMF"] = evaluate_recommendations(nmf_pred, train_sparse, test_df, user_index, item_columns) | |
| elapsed = time.time() - t0 | |
| st.success(f"NMF trained ({elapsed:.2f}s)") | |
| progress_bar.progress(0.8, text="NMF complete, training Item-CF...") | |
| with st.spinner("Training Item-Based CF..."): | |
| t0 = time.time() | |
| cf_pred, _ = train_item_cf(train_sparse) | |
| results["Item-CF"] = evaluate_recommendations(cf_pred, train_sparse, test_df, user_index, item_columns) | |
| elapsed = time.time() - t0 | |
| st.success(f"Item-Based CF trained ({elapsed:.2f}s)") | |
| progress_bar.progress(1.0, text="All models trained!") | |
| total_time = time.time() - start_time | |
| st.info(f"β All 5 models trained with **k={n_factors} latent factors** in {total_time:.2f}s total") | |
| # Store in session state for persistence | |
| st.session_state.rec_results = results | |
| st.session_state.rec_n_factors = n_factors | |
| st.session_state.rec_svd_pred = svd_pred | |
| st.session_state.rec_als_pred = als_pred | |
| st.session_state.rec_sgd_pred = sgd_pred | |
| st.session_state.rec_nmf_pred = nmf_pred | |
| st.session_state.rec_cf_pred = cf_pred | |
| st.session_state["rec_models_trained"] = True | |
| st.session_state["svd_pred"] = svd_pred | |
| st.session_state["als_pred"] = als_pred | |
| st.session_state["sgd_pred"] = sgd_pred | |
| st.session_state["nmf_pred"] = nmf_pred | |
| st.session_state["cf_pred"] = cf_pred | |
| st.session_state["train_df"] = train_df | |
| st.session_state["item_desc"] = item_desc | |
| st.session_state["item_map"] = item_map | |
| st.session_state["user_map"] = user_map | |
| st.rerun() | |
| # Display results and metrics explanation section | |
| if "rec_results" in st.session_state and st.session_state.get("rec_models_trained", False): | |
| st.markdown("---") | |
| st.markdown("### Results") | |
| results = st.session_state.rec_results | |
| k_used = st.session_state.rec_n_factors | |
| st.caption(f"Results from training with k={k_used} latent factors") | |
| metrics_df = pd.DataFrame(results).T | |
| metrics_df = metrics_df[["Precision@K", "Recall@K", "Hit Rate", "RMSE", "RΒ²", "Users Evaluated"]] | |
| # Format for display | |
| display_df = metrics_df.copy() | |
| display_df["Precision@K"] = display_df["Precision@K"].map("{:.4f}".format) | |
| display_df["Recall@K"] = display_df["Recall@K"].map("{:.4f}".format) | |
| display_df["Hit Rate"] = display_df["Hit Rate"].map("{:.2%}".format) | |
| display_df["RMSE"] = display_df["RMSE"].map("{:.2f}".format) | |
| display_df["RΒ²"] = display_df["RΒ²"].map("{:.4f}".format) | |
| display_df["Users Evaluated"] = display_df["Users Evaluated"].apply(lambda x: str(int(x))) | |
| st.dataframe(display_df, use_container_width=True) | |
| st.info( | |
| "**Metric Guide:**\n\n" | |
| "- **Precision@K**: Of the K items recommended, what % were actually purchased? (Higher is better)\n" | |
| "- **Recall@K**: Of all items the user purchased, what % did we recommend? (Higher is better)\n" | |
| "- **Hit Rate**: % of users who got at least one relevant recommendation (Higher is better)\n" | |
| "- **RMSE**: Root Mean Squared Error - measures prediction accuracy for quantities (Lower is better)\n" | |
| "- **RΒ²**: How well the model explains variance in purchase quantities (1.0 = perfect, 0 = baseline, negative = worse than baseline)" | |
| ) | |
| raw_results = pd.DataFrame(results).T | |
| # Create two separate charts: one for ranking metrics, one for prediction metrics | |
| col1, col2 = st.columns(2) | |
| with col1: | |
| st.markdown("**Ranking Metrics** (Higher is Better)") | |
| fig_ranking = go.Figure() | |
| for metric in ["Precision@K", "Recall@K", "Hit Rate"]: | |
| fig_ranking.add_trace(go.Bar( | |
| name=metric, | |
| x=list(results.keys()), | |
| y=raw_results[metric].values, | |
| text=[f"{v:.4f}" for v in raw_results[metric].values], | |
| textposition="auto", | |
| )) | |
| fig_ranking.update_layout( | |
| barmode="group", | |
| yaxis_title="Score", | |
| height=400, | |
| showlegend=True, | |
| ) | |
| st.plotly_chart(fig_ranking, use_container_width=True) | |
| with col2: | |
| st.markdown("**Prediction Metrics**") | |
| fig_pred = go.Figure() | |
| # RMSE (lower is better) - show on left y-axis | |
| fig_pred.add_trace(go.Bar( | |
| name="RMSE (lower is better)", | |
| x=list(results.keys()), | |
| y=raw_results["RMSE"].values, | |
| text=[f"{v:.2f}" for v in raw_results["RMSE"].values], | |
| textposition="auto", | |
| marker_color="lightcoral", | |
| )) | |
| fig_pred.update_layout( | |
| yaxis_title="RMSE (lower is better)", | |
| height=400, | |
| showlegend=True, | |
| ) | |
| st.plotly_chart(fig_pred, use_container_width=True) | |
| # RΒ² in a separate small chart | |
| st.markdown("**RΒ² Score** (1.0 = perfect, 0 = baseline)") | |
| fig_r2 = go.Figure() | |
| fig_r2.add_trace(go.Bar( | |
| x=list(results.keys()), | |
| y=raw_results["RΒ²"].values, | |
| text=[f"{v:.3f}" for v in raw_results["RΒ²"].values], | |
| textposition="auto", | |
| marker_color="lightblue", | |
| showlegend=False, | |
| )) | |
| fig_r2.update_layout( | |
| yaxis_title="RΒ² Score", | |
| height=300, | |
| ) | |
| st.plotly_chart(fig_r2, use_container_width=True) | |
| st.markdown("---") | |
| st.markdown("### Understanding These Metrics") | |
| st.markdown( | |
| """ | |
| #### How We Test | |
| Before training, we **hide** some items each customer actually purchased (the test set). | |
| After training, we ask each model: *"What are your top 10 recommendations for this customer?"* | |
| Then we check how many of those recommendations match the hidden items. | |
| """ | |
| ) | |
| st.graphviz_chart(""" | |
| digraph eval { | |
| rankdir=LR | |
| node [shape=box, style="rounded,filled", fontname="Helvetica", fontsize=10, margin="0.25,0.12"] | |
| edge [color="#888888", penwidth=1.5] | |
| full [label="Customer's full\\npurchase history\\n(100 items)", fillcolor="#dbeafe", color="#3b82f6"] | |
| train [label="Training set\\n(80 items)\\nModel sees these", fillcolor="#d1fae5", color="#10b981"] | |
| test [label="Test set\\n(20 items)\\nHidden from model", fillcolor="#fee2e2", color="#ef4444"] | |
| rec [label="Model's top 10\\nrecommendations", fillcolor="#fef3c7", color="#f59e0b"] | |
| check [label="How many of the 10\\nare in the hidden 20?", fillcolor="#e0e7ff", color="#6366f1"] | |
| full -> train | |
| full -> test | |
| train -> rec [label=" model trained"] | |
| rec -> check | |
| test -> check [label=" compare"] | |
| } | |
| """) | |
| st.markdown( | |
| """ | |
| #### Example β Customer #12345 | |
| This customer bought 100 products total. We hide 20 of them for testing. | |
| The model trains on the remaining 80. Then we ask for 10 recommendations. | |
| | Model's Top 10 Recommendations | In the hidden 20? | | |
| |---|---| | |
| | 1. Cake Stand | No | | |
| | 2. Vintage Teacup Set | **Yes** (was in hidden test set) | | |
| | 3. Baking Tins | No | | |
| | 4. Candle Holder | No | | |
| | 5. Recipe Book Stand | **Yes** (was in hidden test set) | | |
| | 6. Napkin Set | No | | |
| | 7. Cookie Cutters | No | | |
| | 8. Storage Jar | No | | |
| | 9. Tea Towel | No | | |
| | 10. Egg Cup Set | No | | |
| **2 out of 10** recommendations matched hidden purchases. | |
| --- | |
| #### What Each Metric Means for This Customer | |
| """ | |
| ) | |
| st.latex(r"\text{Precision@10} = \frac{\text{correct recommendations}}{\text{total recommendations}} = \frac{2}{10} = 0.20") | |
| st.markdown( | |
| """ | |
| *"20% of what we recommended was actually relevant."* | |
| If you showed these 10 products on a homepage, 2 of them would be items | |
| the customer genuinely wanted. The other 8 might still be interesting | |
| (the customer just hasn't bought them *yet*), but we can only verify the ones in our test set. | |
| """ | |
| ) | |
| st.latex(r"\text{Recall@10} = \frac{\text{correct recommendations}}{\text{total hidden items}} = \frac{2}{20} = 0.10") | |
| st.markdown( | |
| """ | |
| *"We found 10% of the items the customer actually wanted."* | |
| The customer had 20 hidden items we needed to find. We found 2 of them in our top 10. | |
| With only 10 recommendation slots and 3,665 total products, finding even 2 out of 20 | |
| is meaningful β pure random guessing would find ~0.05 items. | |
| """ | |
| ) | |
| st.markdown( | |
| """ | |
| **Hit Rate** asks a simpler question: *"Did the customer get at least one good recommendation?"* | |
| For this customer, the answer is **Yes** (2 hits > 0). If we repeat this across all 4,080 | |
| evaluated customers: | |
| """ | |
| ) | |
| st.latex(r"\text{Hit Rate} = \frac{\text{customers with} \geq 1 \text{ hit}}{\text{total customers evaluated}}") | |
| st.markdown( | |
| """ | |
| --- | |
| #### Full Calculation β All 3 Metrics Across 5 Customers | |
| Here's how the final numbers are computed. Suppose we evaluate 5 customers: | |
| | Customer | Hidden Items | Model's Top 10 | Hits | Precision@10 | Recall@10 | Hit? | | |
| |---|---|---|---|---|---|---| | |
| | A | 20 items | 2 matched | 2 | 2/10 = 0.20 | 2/20 = 0.10 | Yes | | |
| | B | 8 items | 0 matched | 0 | 0/10 = 0.00 | 0/8 = 0.00 | No | | |
| | C | 15 items | 1 matched | 1 | 1/10 = 0.10 | 1/15 = 0.07 | Yes | | |
| | D | 5 items | 0 matched | 0 | 0/10 = 0.00 | 0/5 = 0.00 | No | | |
| | E | 12 items | 3 matched | 3 | 3/10 = 0.30 | 3/12 = 0.25 | Yes | | |
| """ | |
| ) | |
| st.latex(r"\text{Precision@10} = \text{average} = \frac{0.20 + 0.00 + 0.10 + 0.00 + 0.30}{5} = 0.12") | |
| st.latex(r"\text{Recall@10} = \text{average} = \frac{0.10 + 0.00 + 0.07 + 0.00 + 0.25}{5} = 0.084") | |
| st.latex(r"\text{Hit Rate} = \frac{\text{3 customers with hits}}{\text{5 total}} = 60\%") | |
| st.markdown( | |
| """ | |
| In our actual evaluation, this calculation runs across **~4,080 customers** instead of 5. | |
| **How to present these numbers:** | |
| - Precision@10 = 0.12 means "on average, 1.2 out of every 10 recommendations are items the customer actually wanted" | |
| - Recall@10 = 0.084 means "on average, the model finds 8.4% of a customer's wanted items in just 10 slots" | |
| - Hit Rate = 60% means "60% of customers got at least one relevant recommendation" | |
| --- | |
| #### Why the Numbers Look Small β And Why That's Normal | |
| | Metric | Typical Value | Seems low? | Why it's actually good | | |
| |---|---|---|---| | |
| | Precision@10 | 0.01 β 0.10 | Yes | With 3,665 products, finding *any* relevant item in 10 slots is hard. Random guessing gives ~0.003. | | |
| | Recall@10 | 0.005 β 0.05 | Yes | A customer may have 20+ hidden items. Finding even 1 in 10 slots means the model understood their taste. | | |
| | Hit Rate | 3% β 15% | Maybe | This means 3β15% of customers got at least one useful recommendation out of 10. At scale, that drives revenue. | | |
| --- | |
| #### RMSE and RΒ² β Prediction Accuracy Metrics | |
| While Precision/Recall/Hit Rate measure **ranking quality** ("Did we recommend the right items?"), | |
| RMSE and RΒ² measure **prediction accuracy** ("How well did we predict purchase quantities?"). | |
| **RMSE (Root Mean Squared Error):** | |
| """ | |
| ) | |
| st.latex(r"\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{actual}_i - \text{predicted}_i)^2}") | |
| st.markdown( | |
| """ | |
| **What it measures:** Average error in predicting how many units a customer will buy. | |
| **Example:** | |
| | Customer | Item | Actual Quantity | Predicted Quantity | Error | Squared Error | | |
| |---|---|---|---|---|---| | |
| | A | Teacup | 5 | 4.2 | -0.8 | 0.64 | | |
| | A | Candles | 3 | 5.1 | +2.1 | 4.41 | | |
| | B | Cake Stand | 2 | 1.8 | -0.2 | 0.04 | | |
| | B | Bunting | 8 | 6.5 | -1.5 | 2.25 | | |
| | C | Baking Set | 1 | 1.3 | +0.3 | 0.09 | | |
| RMSE = β[(0.64 + 4.41 + 0.04 + 2.25 + 0.09) / 5] = β(7.43 / 5) = β1.486 = **1.22 units** | |
| **Interpretation:** On average, our predictions are off by about 1.22 units. | |
| - **Lower is better** (0 = perfect predictions) | |
| - RMSE = 1.5 means predictions are typically within Β±1.5 units of actual | |
| - Useful when quantity matters (inventory planning, revenue forecasting) | |
| **When RMSE is important:** | |
| - Predicting how much inventory to stock | |
| - Forecasting revenue from recommendations | |
| - Understanding if the model captures purchase patterns accurately | |
| --- | |
| **RΒ² (R-squared / Coefficient of Determination):** | |
| """ | |
| ) | |
| st.latex(r"R^2 = 1 - \frac{\sum (\text{actual} - \text{predicted})^2}{\sum (\text{actual} - \text{mean})^2}") | |
| st.markdown( | |
| """ | |
| **What it measures:** How much variance in purchase quantities the model explains. | |
| **Think of it as:** "How much better is the model than just guessing the average?" | |
| **Example:** | |
| Suppose the average purchase quantity across all items is 3.5 units. | |
| | Customer | Item | Actual | Mean Baseline | Model Prediction | Baseline ErrorΒ² | Model ErrorΒ² | | |
| |---|---|---|---|---|---|---| | |
| | A | Teacup | 5 | 3.5 | 4.2 | (5-3.5)Β² = 2.25 | (5-4.2)Β² = 0.64 | | |
| | A | Candles | 3 | 3.5 | 5.1 | (3-3.5)Β² = 0.25 | (3-5.1)Β² = 4.41 | | |
| | B | Cake Stand | 2 | 3.5 | 1.8 | (2-3.5)Β² = 2.25 | (2-1.8)Β² = 0.04 | | |
| | B | Bunting | 8 | 3.5 | 6.5 | (8-3.5)Β² = 20.25 | (8-6.5)Β² = 2.25 | | |
| | C | Baking Set | 1 | 3.5 | 1.3 | (1-3.5)Β² = 6.25 | (1-1.3)Β² = 0.09 | | |
| Sum of baseline errorsΒ² = 31.25 | |
| Sum of model errorsΒ² = 7.43 | |
| RΒ² = 1 - (7.43 / 31.25) = 1 - 0.238 = **0.762** | |
| **Interpretation:** | |
| | RΒ² Value | Meaning | | |
| |---|---| | |
| | **1.0** | Perfect predictions (model explains 100% of variance) | | |
| | **0.76** | Model explains 76% of variance (good!) | | |
| | **0.5** | Model explains 50% of variance (moderate) | | |
| | **0.0** | Model is no better than just predicting the mean | | |
| | **Negative** | Model is worse than predicting the mean (bad fit!) | | |
| **Why RΒ² can be negative in recommendations:** | |
| Unlike regression on continuous data, recommendation systems face challenges: | |
| - Sparse data (most user-item pairs are empty) | |
| - Implicit feedback (purchases don't mean "liked X amount") | |
| - Cold start (new users/items have no history) | |
| A negative RΒ² means the model's predictions are worse than just guessing the average | |
| purchase quantity. This often happens with: | |
| - SVD (treats missing as 0, distorts predictions) | |
| - NMF (non-negativity constraint can be too restrictive) | |
| - Models that overfit to training data | |
| **When RΒ² is important:** | |
| - Comparing models on the same dataset | |
| - Understanding if latent factors capture meaningful patterns | |
| - Deciding if a complex model justifies its complexity | |
| --- | |
| #### Ranking Metrics vs Prediction Metrics β Which to Use? | |
| | Scenario | Use These Metrics | Why | | |
| |---|---|---| | |
| | **Homepage "Recommended for You"** | Precision@K, Recall@K, Hit Rate | You care about showing relevant items, not predicting exact quantities | | |
| | **"You might also like" section** | Precision@K, Hit Rate | Ranking quality matters most | | |
| | **Inventory planning** | RMSE, RΒ² | You need accurate quantity predictions | | |
| | **Revenue forecasting** | RMSE, RΒ² | Dollar amounts depend on predicted quantities | | |
| | **A/B testing recommendations** | All metrics | Compare both ranking quality and prediction accuracy | | |
| | **Model selection** | Precision@K + RMSE | Balance relevance with prediction accuracy | | |
| **In practice:** | |
| - **Precision@K** and **Hit Rate** are most important for user-facing recommendations | |
| - **RMSE** and **RΒ²** are useful for business planning and understanding model quality | |
| - A model can have good ranking (high Precision@K) but poor quantity prediction (low RΒ²) | |
| - For e-commerce, ranking usually matters more than exact quantities | |
| --- | |
| #### Simple Explanation for Non-Technical Stakeholders | |
| **Precision@K:** "Out of 10 recommendations, how many did the customer actually want?" | |
| β Measures recommendation relevance | |
| **Hit Rate:** "What % of customers got at least one good recommendation?" | |
| β Measures overall success rate | |
| **RMSE:** "When we predict a customer will buy 5 units, how far off are we on average?" | |
| β Measures prediction error in units | |
| **RΒ²:** "How much better are our predictions than just guessing the average?" | |
| β Measures model quality (1.0 = perfect, 0 = no better than average, negative = worse than average) | |
| **The business perspective:** If you have 100,000 customers and Hit Rate is 8%, that's | |
| 8,000 customers who received a relevant product suggestion they wouldn't have found otherwise. | |
| Even a 2% conversion rate on those turns into 160 additional sales β from a fully automated system. | |
| **Comparing models:** The absolute numbers matter less than the *relative ranking*. | |
| If ALS gets 8% Hit Rate and SVD gets 5%, ALS is clearly better at learning from | |
| this data β even though both numbers seem small in isolation. | |
| --- | |
| #### Why Random Baselines Help Contextualize | |
| For reference, a random recommender on our dataset would achieve approximately: | |
| - **Precision@10 β 0.003** (3,665 items, ~20 relevant β 20/3,665 chance per slot) | |
| - **Hit Rate β 5%** (10 random tries at a 0.5% per-item hit rate) | |
| Any model scoring above these baselines has learned real patterns from the data. | |
| """ | |
| ) | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| # Tab 3 β Interactive Recommendations | |
| # βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| with tab_demo: | |
| st.subheader("Interactive Recommendation Demo") | |
| st.markdown( | |
| "Select a customer and see what each model recommends. " | |
| "You can also see what the customer has actually purchased to judge quality." | |
| ) | |
| if not st.session_state.get("rec_models_trained"): | |
| st.warning("Please train the models first in the **Model Comparison** tab.") | |
| else: | |
| train_df = st.session_state["train_df"] | |
| item_desc = st.session_state["item_desc"] | |
| item_map = st.session_state["item_map"] | |
| customer_list = list(train_df.index) | |
| active_customers = [c for c in customer_list if (train_df.loc[c] > 0).sum() >= 10] | |
| selected_customer = st.selectbox( | |
| "Select a Customer", | |
| active_customers[:100], | |
| format_func=lambda x: f"Customer {int(x)} ({int((train_df.loc[x] > 0).sum())} products purchased)", | |
| ) | |
| user_idx = customer_list.index(selected_customer) | |
| st.markdown("#### What this customer has purchased") | |
| purchased = train_df.loc[selected_customer] | |
| purchased_items = purchased[purchased > 0].sort_values(ascending=False).head(15) | |
| purch_display = pd.DataFrame({ | |
| "StockCode": purchased_items.index, | |
| "Quantity": purchased_items.values.astype(int), | |
| "Description": [item_desc.get(sc, "N/A") for sc in purchased_items.index], | |
| }) | |
| st.dataframe(purch_display, use_container_width=True, hide_index=True) | |
| st.markdown("---") | |
| st.markdown("#### Recommendations from each model") | |
| model_preds = { | |
| "SVD": st.session_state["svd_pred"], | |
| "ALS": st.session_state["als_pred"], | |
| "SGD (Funk SVD)": st.session_state["sgd_pred"], | |
| "NMF": st.session_state["nmf_pred"], | |
| "Item-CF": st.session_state["cf_pred"], | |
| } | |
| n_recs = st.slider("Number of recommendations", 5, 20, 10) | |
| cols = st.columns(2) | |
| for i, (model_name, pred_matrix) in enumerate(model_preds.items()): | |
| with cols[i % 2]: | |
| st.markdown(f"**{model_name}**") | |
| top_items, top_scores = get_top_n_recommendations( | |
| pred_matrix, train_df.values, user_idx, n=n_recs, | |
| ) | |
| rec_display = pd.DataFrame({ | |
| "Rank": range(1, len(top_items) + 1), | |
| "StockCode": [train_df.columns[j] for j in top_items], | |
| "Score": [f"{s:.3f}" for s in top_scores], | |
| "Description": [item_desc.get(train_df.columns[j], "N/A") for j in top_items], | |
| }) | |
| st.dataframe(rec_display, use_container_width=True, hide_index=True) | |
| st.markdown("---") | |
| st.markdown( | |
| "**How to evaluate these recommendations:** Look at the customer's purchase " | |
| "history above and check if the recommended items make sense. For example, if " | |
| "the customer bought a lot of kitchen items, good recommendations would include " | |
| "other kitchen products they haven't purchased yet." | |
| ) | |