{ "cells": [ { "cell_type": "markdown", "id": "ae798d43", "metadata": {}, "source": [ "# 🚀 End-to-End Sequential Recommender System \n", "\n", "This project implements and evaluates a series of recommender system models, culminating in a state-of-the-art **SASRec (Self-Attentive Sequential Recommendation)** model for Top-N next-item prediction. The system is trained on the [RetailRocket e-commerce dataset](https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset) and includes an interactive web demo built with Gradio. " ] }, { "cell_type": "markdown", "id": "338759e6", "metadata": {}, "source": [ "## EDA" ] }, { "cell_type": "code", "execution_count": null, "id": "dcc5a23b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Loading events.csv...\n", "Data Head:\n", " timestamp visitorid event itemid transactionid\n", "0 1433221332117 257597 view 355908 NaN\n", "1 1433224214164 992329 view 248676 NaN\n", "2 1433221999827 111016 view 318965 NaN\n", "3 1433221955914 483717 view 253185 NaN\n", "4 1433221337106 951259 view 367447 NaN\n", "\n", "Data Info:\n", "\n", "RangeIndex: 2756101 entries, 0 to 2756100\n", "Data columns (total 5 columns):\n", " # Column Dtype \n", "--- ------ ----- \n", " 0 timestamp int64 \n", " 1 visitorid int64 \n", " 2 event object \n", " 3 itemid int64 \n", " 4 transactionid float64\n", "dtypes: float64(1), int64(3), object(1)\n", "memory usage: 105.1+ MB\n", "\n", "Missing Values:\n", "timestamp 0\n", "visitorid 0\n", "event 0\n", "itemid 0\n", "transactionid 2733644\n", "dtype: int64\n" ] } ], "source": [ "import pandas as pd\n", "import time\n", "from datetime import datetime\n", "\n", "# Define the path to your data folder\n", "DATA_FOLDER = 'data/'\n", "\n", "# Load the events data\n", "print(\"Loading events.csv...\")\n", "events_df = pd.read_csv(DATA_FOLDER + 'events.csv')\n", "\n", "# --- Initial Inspection ---\n", "\n", "# See the first few rows\n", "print(\"Data Head:\")\n", 
"print(events_df.head())\n", "\n", "# Get a summary of the dataframe (columns, data types, memory usage)\n", "print(\"\\nData Info:\")\n", "events_df.info()\n", "\n", "# Check for any missing values\n", "print(\"\\nMissing Values:\")\n", "print(events_df.isnull().sum())" ] }, { "cell_type": "code", "execution_count": 2, "id": "dd89bf40", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Data timeframe is from 2015-05-03 03:00:04.384000 to 2015-09-18 02:59:47.788000\n", "\n", "Event Counts:\n", "event\n", "view 2664312\n", "addtocart 69332\n", "transaction 22457\n", "Name: count, dtype: int64\n", "\n", "Number of unique visitors: 1407580\n", "Number of unique items: 235061\n" ] } ], "source": [ "# --- Data Cleaning and Understanding ---\n", "\n", "# 1. Convert timestamp to datetime\n", "# The timestamp is in milliseconds since the Unix epoch, so we pass unit='ms' (no manual division is needed)\n", "events_df['timestamp_dt'] = pd.to_datetime(events_df['timestamp'], unit='ms')\n", "print(f\"\\nData timeframe is from {events_df['timestamp_dt'].min()} to {events_df['timestamp_dt'].max()}\")\n", "\n", "\n", "# 2. Analyze the distribution of event types\n", "print(\"\\nEvent Counts:\")\n", "event_counts = events_df['event'].value_counts()\n", "print(event_counts)\n", "\n", "\n", "# 3. 
Calculate number of unique users and items\n", "n_users = events_df['visitorid'].nunique()\n", "n_items = events_df['itemid'].nunique()\n", "\n", "print(f\"\\nNumber of unique visitors: {n_users}\")\n", "print(f\"Number of unique items: {n_items}\")" ] }, { "cell_type": "markdown", "id": "90fbfb19", "metadata": {}, "source": [ "## Preparing the data" ] }, { "cell_type": "code", "execution_count": null, "id": "8639638d", "metadata": {}, "outputs": [], "source": [ "import os\n", "import zipfile\n", "import pandas as pd\n", "from datetime import datetime, timedelta\n", "import numpy as np\n", "from scipy.sparse import csr_matrix\n", "import math\n", "import torch\n", "import torch.nn as nn\n", "from torch.utils.data import Dataset, DataLoader\n", "import pytorch_lightning as pl\n", "\n", "def prepare_data(data_folder='data/', val_days=7, test_days=7):\n", "    \"\"\"\n", "    Loads, preprocesses, and splits the events data into train, validation, and test sets.\n", "    \n", "    Args:\n", "        data_folder: str, path to the folder containing 'events.csv' (with or without a trailing slash)\n", "        val_days: int, number of days for the validation set\n", "        test_days: int, number of days for the test set\n", "\n", "    Returns:\n", "        tuple: (train_df, val_df, test_df), or (None, None, None) if 'events.csv' is missing.\n", "    \"\"\"\n", "    # --- Load Data ---\n", "    print(f\"Loading events.csv from folder: {data_folder}\")\n", "    try:\n", "        # os.path.join handles folders given with or without a trailing separator\n", "        events_df = pd.read_csv(os.path.join(data_folder, 'events.csv'))\n", "        print(\"Successfully loaded events.csv.\")\n", "        events_df['timestamp_dt'] = pd.to_datetime(events_df['timestamp'], unit='ms')\n", "        print(\"\\n--- Initial Data Summary ---\")\n", "        print(f\"Data shape: {events_df.shape}\")\n", "        print(f\"Full timeframe: {events_df['timestamp_dt'].min()} to {events_df['timestamp_dt'].max()}\")\n", "        print(\"----------------------------\\n\")\n", "    except FileNotFoundError:\n", "        print(f\"Error: 'events.csv' not found in '{data_folder}'. 
Please check the path.\")\n", " return None, None, None\n", "\n", " # --- Split Data ---\n", " sorted_df = events_df.sort_values('timestamp_dt').reset_index(drop=True)\n", " print(f\"Splitting data: {test_days} days for test, {val_days} for validation.\")\n", " end_time = sorted_df['timestamp_dt'].max()\n", " test_start_time = end_time - timedelta(days=test_days)\n", " val_start_time = test_start_time - timedelta(days=val_days)\n", "\n", " test_df = sorted_df[sorted_df['timestamp_dt'] >= test_start_time]\n", " val_df = sorted_df[(sorted_df['timestamp_dt'] >= val_start_time) & (sorted_df['timestamp_dt'] < test_start_time)]\n", " train_df = sorted_df[sorted_df['timestamp_dt'] < val_start_time]\n", "\n", " print(\"--- Data Splitting Summary ---\")\n", " print(f\"Training set: {train_df.shape[0]:>8} records | from {train_df['timestamp_dt'].min()} to {train_df['timestamp_dt'].max()}\")\n", " print(f\"Validation set: {val_df.shape[0]:>8} records | from {val_df['timestamp_dt'].min()} to {val_df['timestamp_dt'].max()}\")\n", " print(f\"Test set: {test_df.shape[0]:>8} records | from {test_df['timestamp_dt'].min()} to {test_df['timestamp_dt'].max()}\")\n", " print(\"------------------------------\")\n", " \n", " return train_df, val_df, test_df\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f99e4498", "metadata": {}, "outputs": [], "source": [ "DATA_PATH = \"data\"\n", "\n", "train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)" ] }, { "cell_type": "code", "execution_count": null, "id": "96b137cb", "metadata": {}, "outputs": [], "source": [ "class SASRecDataset(Dataset):\n", " \"\"\"\n", " SASRec Dataset.\n", " - Precomputes (sequence_id, cutoff_idx) pairs for O(1) __getitem__.\n", " - Supports 'last' or 'all' target modes.\n", " \"\"\"\n", " def __init__(self, sequences, max_len, target_mode=\"last\"):\n", " \"\"\"\n", " Args:\n", " sequences: list of user sequences (list of item IDs).\n", " max_len: maximum sequence length (padding 
applied).\n", " target_mode: 'last' (only last prediction) or 'all' (predict at every step).\n", " \"\"\"\n", " self.sequences = sequences\n", " self.max_len = max_len\n", " self.target_mode = target_mode\n", "\n", " # Build index once\n", " self.index = []\n", " for seq_id, seq in enumerate(sequences):\n", " for i in range(1, len(seq)):\n", " self.index.append((seq_id, i))\n", "\n", " def __len__(self):\n", " return len(self.index)\n", "\n", " def __getitem__(self, idx):\n", " seq_id, cutoff = self.index[idx]\n", " seq = self.sequences[seq_id][:cutoff]\n", "\n", " # Truncate & pad\n", " seq = seq[-self.max_len:]\n", " pad_len = self.max_len - len(seq)\n", "\n", " input_seq = np.zeros(self.max_len, dtype=np.int64)\n", " input_seq[pad_len:] = seq\n", "\n", " if self.target_mode == \"last\":\n", " target = self.sequences[seq_id][cutoff]\n", " return torch.LongTensor(input_seq), torch.LongTensor([target])\n", "\n", " elif self.target_mode == \"all\":\n", " # Predict next item at each step\n", " target_seq = self.sequences[seq_id][1:cutoff+1]\n", " target_seq = target_seq[-self.max_len:]\n", " target = np.zeros(self.max_len, dtype=np.int64)\n", " target[-len(target_seq):] = target_seq\n", " return torch.LongTensor(input_seq), torch.LongTensor(target)\n", "\n", "class SASRecDataModule(pl.LightningDataModule):\n", " \"\"\"\n", " PyTorch Lightning DataModule for preparing the RetailRocket dataset for the SASRec model.\n", "\n", " This class handles all aspects of data preparation, including:\n", " - Filtering out infrequent users and items to reduce noise.\n", " - Building a consistent item vocabulary.\n", " - Converting user event histories into sequential data.\n", " - Creating and providing `DataLoader` instances for training, validation, and testing.\n", " \"\"\"\n", " def __init__(self, train_df, val_df, test_df, min_item_interactions=5, \n", " min_user_interactions=5, max_len=50, batch_size=256):\n", " \"\"\"\n", " Initializes the DataModule.\n", "\n", " Args:\n", " 
train_df (pd.DataFrame): DataFrame for training.\n", " val_df (pd.DataFrame): DataFrame for validation.\n", " test_df (pd.DataFrame): DataFrame for testing.\n", " min_item_interactions (int): Minimum number of interactions for an item to be kept.\n", " min_user_interactions (int): Minimum number of interactions for a user to be kept.\n", " max_len (int): The maximum length of a user sequence fed to the model.\n", " batch_size (int): The batch size for the DataLoaders.\n", " \"\"\"\n", " super().__init__()\n", " self.train_df = train_df\n", " self.val_df = val_df\n", " self.test_df = test_df\n", " self.min_item_interactions = min_item_interactions\n", " self.min_user_interactions = min_user_interactions\n", " self.max_len = max_len\n", " self.batch_size = batch_size\n", "\n", " self.item_map = None\n", " self.inverse_item_map = None\n", " self.vocab_size = 0\n", " self.user_history = None\n", "\n", " def setup(self, stage=None):\n", " \"\"\"\n", " Prepares the data for training, validation, and testing.\n", "\n", " This method is called automatically by PyTorch Lightning. It performs the following steps:\n", " 1. Determines filtering criteria (which users and items to keep) based on the training set only\n", " to prevent data leakage.\n", " 2. Applies these filters to the train, validation, and test sets.\n", " 3. Builds an item vocabulary (mapping item IDs to integer indices) from the combined\n", " training and validation sets to ensure consistency for model checkpointing.\n", " 4. 
Converts the event logs into sequences of item indices for each user in each data split.\n", " \"\"\"\n", " item_counts = self.train_df['itemid'].value_counts()\n", " user_counts = self.train_df['visitorid'].value_counts()\n", " items_to_keep = item_counts[item_counts >= self.min_item_interactions].index\n", " users_to_keep = user_counts[user_counts >= self.min_user_interactions].index\n", "\n", " self.filtered_train_df = self.train_df[\n", " (self.train_df['itemid'].isin(items_to_keep)) & \n", " (self.train_df['visitorid'].isin(users_to_keep))\n", " ].copy()\n", " self.filtered_val_df = self.val_df[\n", " (self.val_df['itemid'].isin(items_to_keep)) & \n", " (self.val_df['visitorid'].isin(users_to_keep))\n", " ].copy()\n", " self.filtered_test_df = self.test_df[\n", " (self.test_df['itemid'].isin(items_to_keep)) & \n", " (self.test_df['visitorid'].isin(users_to_keep))\n", " ].copy()\n", "\n", " all_known_items_df = pd.concat([self.filtered_train_df, self.filtered_val_df])\n", " unique_items = all_known_items_df['itemid'].unique()\n", " self.item_map = {item_id: i + 1 for i, item_id in enumerate(unique_items)}\n", " self.inverse_item_map = {i: item_id for item_id, i in self.item_map.items()}\n", " self.vocab_size = len(self.item_map) + 1 # +1 for padding token 0\n", "\n", " self.user_history = self.filtered_train_df.groupby('visitorid')['itemid'].apply(list)\n", " \n", " self.train_sequences = self._create_sequences(self.filtered_train_df)\n", " self.val_sequences = self._create_sequences(self.filtered_val_df)\n", " self.test_sequences = self._create_sequences(self.filtered_test_df)\n", "\n", " def _create_sequences(self, df):\n", " \"\"\"\n", " Helper function to convert a DataFrame of events into user interaction sequences.\n", " \n", " Args:\n", " df (pd.DataFrame): The input DataFrame to process.\n", "\n", " Returns:\n", " list[list[int]]: A list of user sequences, where each sequence is a list of item indices.\n", " \"\"\"\n", " df_sorted = 
df.sort_values(['visitorid', 'timestamp_dt'])\n", " sequences = df_sorted.groupby('visitorid')['itemid'].apply(\n", " lambda x: [self.item_map[i] for i in x if i in self.item_map]\n", " ).tolist()\n", " return [s for s in sequences if len(s) > 1]\n", "\n", " def train_dataloader(self):\n", " \"\"\"Creates the DataLoader for the training set.\"\"\"\n", " dataset = SASRecDataset(self.train_sequences, self.max_len)\n", " return DataLoader(dataset, batch_size=self.batch_size, shuffle=True, num_workers=0)\n", "\n", " def val_dataloader(self):\n", " \"\"\"Creates the DataLoader for the validation set.\"\"\"\n", " dataset = SASRecDataset(self.val_sequences, self.max_len)\n", " return DataLoader(dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)\n", " \n", " def test_dataloader(self):\n", " \"\"\"Creates the DataLoader for the test set.\"\"\"\n", " dataset = SASRecDataset(self.test_sequences, self.max_len)\n", " return DataLoader(dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)" ] }, { "cell_type": "code", "execution_count": null, "id": "56bdc81c", "metadata": {}, "outputs": [], "source": [ "BATCH_SIZE = 256 \n", "MAX_TOKEN_LEN = 50 # 50–100 is standard for SASRec\n", "\n", "# --- 1. Prepare the data into train, validation, and test sets ---\n", "train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)\n", "\n", "# --- 2. 
Initialize DataModule ---\n", "print(\"Initializing DataModule...\")\n", "datamodule = SASRecDataModule(\n", " train_df=train_set,\n", " val_df=validation_set,\n", " test_df=test_set,\n", " batch_size=BATCH_SIZE,\n", " max_len=MAX_TOKEN_LEN\n", ")\n", "datamodule.setup()" ] }, { "cell_type": "markdown", "id": "0529207a", "metadata": {}, "source": [ "## Define train and evaluate the base models " ] }, { "cell_type": "code", "execution_count": null, "id": "8d899a5a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from datetime import datetime, timedelta\n", "import numpy as np\n", "from scipy.sparse import csr_matrix\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "import implicit\n", "import torch\n", "import torch.nn as nn\n", "from torch.utils.data import Dataset, DataLoader\n", "import pytorch_lightning as pl\n", "\n", "# --- 1. Evaluation Helper Functions ---\n", "\n", "def prepare_ground_truth(df, mode=\"purchase\", event_weights=None):\n", " \"\"\"\n", " Prepares ground truth dictionaries for evaluation.\n", "\n", " Parameters\n", " ----------\n", " df : pd.DataFrame\n", " Test dataframe containing at least ['visitorid', 'itemid', 'event'].\n", " mode : str, default=\"purchase\"\n", " - \"purchase\" : Only use transactions as ground truth.\n", " - \"all\" : Use all events. 
Optionally weight them.\n", " event_weights : dict, optional\n", " Example: {\"view\": 1, \"addtocart\": 3, \"transaction\": 5}.\n", " Used only if mode == \"all\".\n", "\n", " Returns\n", " -------\n", " dict : {user_id: set of item_ids}\n", " \"\"\"\n", " if mode == \"purchase\":\n", " df_filtered = df[df[\"event\"] == \"transaction\"]\n", " ground_truth = df_filtered.groupby(\"visitorid\")[\"itemid\"].apply(set).to_dict()\n", "\n", " elif mode == \"all\":\n", " if event_weights is None:\n", " # Default: treat all events equally\n", " ground_truth = df.groupby(\"visitorid\")[\"itemid\"].apply(set).to_dict()\n", " else:\n", " # Weighted ground truth (for more advanced eval)\n", " ground_truth = {}\n", " for uid, user_df in df.groupby(\"visitorid\"):\n", " weighted_items = []\n", " for _, row in user_df.iterrows():\n", " weight = event_weights.get(row[\"event\"], 1)\n", " weighted_items.extend([row[\"itemid\"]] * weight)\n", " ground_truth[uid] = set(weighted_items)\n", " else:\n", " raise ValueError(\"mode must be 'purchase' or 'all'\")\n", "\n", " return ground_truth\n", "\n", "def calculate_metrics(recommendations_dict, ground_truth_dict, k):\n", " \"\"\"\n", " Calculates Precision@k, Recall@k, and HitRate@k.\n", "\n", " args:\n", " ----------\n", " recommendations_dict : {user_id: [recommended_item_ids]}\n", " ground_truth_dict : {user_id: set of ground truth item_ids}\n", " k : int\n", "\n", " Returns\n", " -------\n", " dict with mean precision, recall, and hit rate\n", " \"\"\"\n", " all_precisions, all_recalls, all_hits = [], [], []\n", "\n", " for user_id, true_items in ground_truth_dict.items():\n", " recs = recommendations_dict.get(user_id, [])[:k]\n", " if not true_items:\n", " continue\n", " hits = len(set(recs) & true_items)\n", "\n", " precision = hits / k if k > 0 else 0\n", " recall = hits / len(true_items)\n", " hit_rate = 1.0 if hits > 0 else 0.0\n", "\n", " all_precisions.append(precision)\n", " all_recalls.append(recall)\n", " 
all_hits.append(hit_rate)\n", "\n", "    if not all_precisions:\n", "        return {\"mean_precision@k\": 0, \"mean_recall@k\": 0, \"mean_hitrate@k\": 0}\n", "\n", "    return {\n", "        \"mean_precision@k\": np.mean(all_precisions),\n", "        \"mean_recall@k\": np.mean(all_recalls),\n", "        \"mean_hitrate@k\": np.mean(all_hits)\n", "    }\n", "\n", "# --- 2. Model Functions (Popularity, Item-Item, ALS) ---\n", "\n", "def recommend_popular_items_and_evaluate(train_df, test_df, k=10, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):\n", "    \"\"\"\n", "    Trains a non-personalized Popularity model and evaluates its performance.\n", "\n", "    This model recommends the top-k most frequently transacted items from the training\n", "    set to every user. It serves as a simple but strong baseline.\n", "\n", "    Args:\n", "        train_df (pd.DataFrame): The training dataset.\n", "        test_df (pd.DataFrame): The test dataset for evaluation.\n", "        k (int): The number of items to recommend.\n", "        prepare_ground_truth (function): A function to process the test_df into a ground truth dict.\n", "            Defaults to the `prepare_ground_truth` helper defined above.\n", "        calculate_metrics (function): A function to compute ranking metrics.\n", "            Defaults to the `calculate_metrics` helper defined above.\n", "\n", "    Returns:\n", "        dict: A dictionary containing the calculated evaluation metrics (e.g., precision, recall).\n", "    \"\"\"\n", "    print(f\"\\n--- Evaluating Popularity Model (Top {k} items) ---\")\n", "    \n", "    # 1. \"Train\" the model by finding the most popular items based on transactions\n", "    purchase_counts = train_df[train_df['event'] == 'transaction']['itemid'].value_counts()\n", "    popular_items = purchase_counts.head(k).index.tolist()\n", "    print(f\"Top {k} popular items identified from training data.\")\n", "\n", "    # 2. 
Evaluate the model\n", "    ground_truth = prepare_ground_truth(test_df)\n", "    # Every user receives the same list of popular items\n", "    recommendations = {user_id: popular_items for user_id in ground_truth.keys()}\n", "    \n", "    metrics = calculate_metrics(recommendations, ground_truth, k)\n", "    print(\"Evaluation complete.\")\n", "    return metrics\n", "\n", "def recommend_item_item_and_evaluate(train_df, test_df, k=10, min_item_interactions=5, min_user_interactions=5, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):\n", "    \"\"\"\n", "    Trains an Item-Item Collaborative Filtering model and evaluates its performance.\n", "\n", "    This model recommends items that are similar to items a user has interacted\n", "    with in the past, based on co-occurrence patterns in the training data.\n", "\n", "    Args:\n", "        train_df (pd.DataFrame): The training dataset.\n", "        test_df (pd.DataFrame): The test dataset for evaluation.\n", "        k (int): The number of items to recommend.\n", "        min_item_interactions (int): Minimum number of interactions for an item to be kept.\n", "        min_user_interactions (int): Minimum number of interactions for a user to be kept.\n", "        prepare_ground_truth (function): A function to process the test_df into a ground truth dict.\n", "            Defaults to the `prepare_ground_truth` helper defined above.\n", "        calculate_metrics (function): A function to compute ranking metrics.\n", "            Defaults to the `calculate_metrics` helper defined above.\n", "\n", "    Returns:\n", "        dict: A dictionary containing the calculated evaluation metrics.\n", "    \"\"\"\n", "    print(f\"\\n--- Evaluating Item-Item CF Model (Top {k} items) ---\")\n", "    \n", "    # 1. 
Filter out infrequent users and items to reduce noise and computation\n", " item_counts = train_df['itemid'].value_counts()\n", " user_counts = train_df['visitorid'].value_counts()\n", " items_to_keep = item_counts[item_counts >= min_item_interactions].index\n", " users_to_keep = user_counts[user_counts >= min_user_interactions].index\n", " filtered_df = train_df[(train_df['itemid'].isin(items_to_keep)) & (train_df['visitorid'].isin(users_to_keep))].copy()\n", " print(f\"Filtered training data from {len(train_df)} to {len(filtered_df)} records.\")\n", "\n", " # 2. Create user-item interaction matrix and vocabulary mappings\n", " user_map = {uid: i for i, uid in enumerate(filtered_df['visitorid'].unique())}\n", " item_map = {iid: i for i, iid in enumerate(filtered_df['itemid'].unique())}\n", " inverse_item_map = {i: iid for iid, i in item_map.items()}\n", " user_indices = filtered_df['visitorid'].map(user_map)\n", " item_indices = filtered_df['itemid'].map(item_map)\n", " user_item_matrix = csr_matrix((np.ones(len(filtered_df)), (user_indices, item_indices)))\n", "\n", " # 3. Calculate the cosine similarity matrix between all items\n", " print(\"Calculating item similarity matrix...\")\n", " item_similarity_matrix = cosine_similarity(user_item_matrix.T, dense_output=False)\n", " print(\"Similarity matrix calculated.\")\n", "\n", " # 4. 
Generate recommendations and evaluate\n", "    ground_truth = prepare_ground_truth(test_df)\n", "    recommendations = {}\n", "    print(\"Generating recommendations for users in test set...\")\n", "    test_users = [u for u in ground_truth.keys() if u in user_map]\n", "    \n", "    for user_id in test_users:\n", "        user_index = user_map[user_id]\n", "        user_interactions_indices = user_item_matrix[user_index].indices\n", "        \n", "        if len(user_interactions_indices) > 0:\n", "            # Aggregate scores from items the user has interacted with\n", "            all_scores = np.asarray(item_similarity_matrix[user_interactions_indices].sum(axis=0)).flatten()\n", "            # Remove already interacted items from recommendations\n", "            all_scores[user_interactions_indices] = -1\n", "            top_indices = np.argsort(all_scores)[::-1][:k]\n", "            recs = [inverse_item_map[idx] for idx in top_indices if idx in inverse_item_map]\n", "            recommendations[user_id] = recs\n", "    \n", "    metrics = calculate_metrics(recommendations, ground_truth, k)\n", "    print(\"Evaluation complete.\")\n", "    return metrics\n", "\n", "def recommend_als_and_evaluate(train_df, test_df, k=10, min_item_interactions=5, min_user_interactions=5, \n", "                               factors=25, regularization=0.02, iterations=48, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):\n", "    \"\"\"\n", "    Trains an Alternating Least Squares (ALS) model and evaluates its performance.\n", "\n", "    This model uses matrix factorization to learn latent embeddings for users and\n", "    items from implicit feedback data. 
Default hyperparameters are set from a\n", " previous Optuna tuning process.\n", "\n", " Args:\n", " train_df (pd.DataFrame): The training dataset.\n", " test_df (pd.DataFrame): The test dataset for evaluation.\n", " k (int): The number of items to recommend.\n", " min_item_interactions (int): Minimum number of interactions for an item to be kept.\n", " min_user_interactions (int): Minimum number of interactions for a user to be kept.\n", " factors (int): The number of latent factors to compute.\n", " regularization (float): The regularization factor.\n", " iterations (int): The number of ALS iterations to run.\n", " prepare_ground_truth (function): A function to process the test_df into a ground truth dict.\n", " calculate_metrics (function): A function to compute ranking metrics.\n", "\n", " Returns:\n", " dict: A dictionary containing the calculated evaluation metrics.\n", " \"\"\"\n", " print(f\"\\n--- Evaluating ALS Model (Top {k} items) ---\")\n", " \n", " # 1. Filter data\n", " item_counts = train_df['itemid'].value_counts()\n", " user_counts = train_df['visitorid'].value_counts()\n", " items_to_keep = item_counts[item_counts >= min_item_interactions].index\n", " users_to_keep = user_counts[user_counts >= min_user_interactions].index\n", " filtered_df = train_df[(train_df['itemid'].isin(items_to_keep)) & (train_df['visitorid'].isin(users_to_keep))].copy()\n", " print(f\"Filtered training data from {len(train_df)} to {len(filtered_df)} records.\")\n", "\n", " # 2. 
Create mappings and confidence matrix\n", "    user_map = {uid: i for i, uid in enumerate(filtered_df['visitorid'].unique())}\n", "    item_map = {iid: i for i, iid in enumerate(filtered_df['itemid'].unique())}\n", "    inverse_item_map = {i: iid for iid, i in item_map.items()}\n", "    user_indices = filtered_df['visitorid'].map(user_map).astype(np.int32)\n", "    item_indices = filtered_df['itemid'].map(item_map).astype(np.int32)\n", "    \n", "    event_weights = {'view': 1, 'addtocart': 3, 'transaction': 5}\n", "    confidence = filtered_df['event'].map(event_weights).astype(np.float32)\n", "    user_item_matrix = csr_matrix((confidence, (user_indices, item_indices)))\n", "\n", "    # 3. Train the ALS model\n", "    print(\"Training ALS model...\")\n", "    als_model = implicit.als.AlternatingLeastSquares(factors=factors, regularization=regularization, iterations=iterations)\n", "    als_model.fit(user_item_matrix)\n", "    print(\"ALS model trained.\")\n", "\n", "    # 4. Generate recommendations and evaluate\n", "    ground_truth = prepare_ground_truth(test_df)\n", "    recommendations = {}\n", "    print(\"Generating recommendations for users in test set...\")\n", "    test_users_indices = [user_map[u] for u in ground_truth.keys() if u in user_map]\n", "    \n", "    if test_users_indices:\n", "        user_item_matrix_for_recs = user_item_matrix[test_users_indices]\n", "        ids, _ = als_model.recommend(test_users_indices, user_item_matrix_for_recs, N=k)\n", "        \n", "        # Build the reverse lookup once instead of scanning user_map for every user\n", "        inverse_user_map = {i: uid for uid, i in user_map.items()}\n", "        for i, user_index in enumerate(test_users_indices):\n", "            original_user_id = inverse_user_map[user_index]\n", "            recs = [inverse_item_map[item_idx] for item_idx in ids[i] if item_idx in inverse_item_map]\n", "            recommendations[original_user_id] = recs\n", "            \n", "    metrics = calculate_metrics(recommendations, ground_truth, k)\n", "    print(\"Evaluation complete.\")\n", "    return metrics" ] }, { "cell_type": "code", "execution_count": null, "id": "b978c458", "metadata": {}, "outputs": [], "source": [ "train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)\n", "if train_set is not None:\n", "    results = {}\n", "    full_train_set = pd.concat([train_set, validation_set])\n", "    \n", "    # Evaluate base models\n", "    print(\"\\n>>> Running evaluations on the VALIDATION set <<<\")\n", "    results['Popularity (Validation)'] = recommend_popular_items_and_evaluate(train_set, validation_set)\n", "    results['Item-Item CF (Validation)'] = recommend_item_item_and_evaluate(train_set, validation_set)\n", "    results['ALS (Validation)'] = recommend_als_and_evaluate(train_set, validation_set)\n", "    \n", "    print(\"\\n>>> Running final evaluations on the TEST set <<<\")\n", "    results['Popularity (Test)'] = recommend_popular_items_and_evaluate(full_train_set, test_set)\n", "    results['Item-Item CF 
(Test)'] = recommend_item_item_and_evaluate(full_train_set, test_set)\n", " results['ALS (Test)'] = recommend_als_and_evaluate(full_train_set, test_set)\n", " \n", " print(\"\\n--- Final Evaluation Results ---\")\n", " results_df = pd.DataFrame.from_dict(results, orient='index')\n", " print(results_df)\n", " print(\"--------------------------------\")" ] }, { "cell_type": "markdown", "id": "85d8f78c", "metadata": {}, "source": [ "## Use Optuna to find the best Hyperparameters for the ALS model" ] }, { "cell_type": "code", "execution_count": null, "id": "202be1f4", "metadata": {}, "outputs": [], "source": [ "import optuna\n", "\n", "def objective_als(trial, train_df, val_df):\n", " \"\"\"\n", " The objective function for Optuna to optimize.\n", " \"\"\"\n", " # 1. Define the hyperparameter search space\n", " params = {\n", " 'factors': trial.suggest_int('factors', 20, 200),\n", " 'regularization': trial.suggest_float('regularization', 1e-3, 1e-1, log=True),\n", " 'iterations': trial.suggest_int('iterations', 10, 50)\n", " }\n", " \n", " # 2. Run an evaluation with the suggested parameters\n", " metrics = recommend_als_and_evaluate(train_df, val_df, **params)\n", " \n", " # 3. 
Return the metric we want to maximize (precision)\n", " return metrics['mean_precision@k']\n", "\n", "def tune_als_hyperparameters(train_df, val_df, n_trials=25):\n", " \"\"\"\n", " Orchestrates the Optuna study to find the best hyperparameters for ALS.\n", " \"\"\"\n", " study = optuna.create_study(direction='maximize')\n", " study.optimize(lambda trial: objective_als(trial, train_df, val_df), n_trials=n_trials)\n", " \n", " print(\"\\n--- Optuna Study Complete ---\")\n", " print(f\"Number of finished trials: {len(study.trials)}\")\n", " print(\"Best trial:\")\n", " trial = study.best_trial\n", " print(f\" Value (Precision@10): {trial.value}\")\n", " print(\" Params: \")\n", " for key, value in trial.params.items():\n", " print(f\" {key}: {value}\")\n", " \n", " return trial.params\n" ] }, { "cell_type": "code", "execution_count": null, "id": "18d48b2e", "metadata": {}, "outputs": [], "source": [ "# 1. Prepare all data\n", "train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)\n", "\n", "\n", "# --- Hyperparameter Tuning Step ---\n", "print(\"\\n>>> 1. 
TUNING ALS Hyperparameters on the VALIDATION set <<<\")\n", "# You can increase n_trials for a more thorough search, e.g., to 50 or 100\n", "best_als_params = tune_als_hyperparameters(train_set, validation_set, n_trials=25) \n" ] }, { "cell_type": "markdown", "id": "d9bc9ef8", "metadata": {}, "source": [ "## Define train and evaluate the SASRec model" ] }, { "cell_type": "code", "execution_count": null, "id": "4a90d635", "metadata": {}, "outputs": [], "source": [ "class SASRec(pl.LightningModule):\n", " \"\"\"\n", " A PyTorch Lightning implementation of the SASRec model for sequential recommendation.\n", "\n", " SASRec (Self-Attentive Sequential Recommendation) uses a Transformer-based\n", " architecture to capture the sequential patterns in a user's interaction history\n", " to predict the next item they are likely to interact with.\n", "\n", " Attributes:\n", " save_hyperparameters: Automatically saves all constructor arguments as hyperparameters.\n", " item_embedding (nn.Embedding): Embedding layer for item IDs.\n", " positional_embedding (nn.Embedding): Embedding layer to encode the position of items in a sequence.\n", " transformer_encoder (nn.TransformerEncoder): The core self-attention module.\n", " fc (nn.Linear): Final fully connected layer to produce logits over the item vocabulary.\n", " loss_fn (nn.CrossEntropyLoss): The loss function used for training.\n", " \"\"\"\n", " def __init__(self, vocab_size, max_len, hidden_dim, num_heads, num_layers,\n", " dropout=0.2, learning_rate=1e-3, weight_decay=1e-6, warmup_steps=2000, max_steps=100000):\n", " \"\"\"\n", " Initializes the SASRec model layers and hyperparameters.\n", "\n", " Args:\n", " vocab_size (int): The total number of unique items in the dataset (+1 for padding).\n", " max_len (int): The maximum length of the input sequences.\n", " hidden_dim (int): The dimensionality of the item and positional embeddings.\n", " num_heads (int): The number of attention heads in the Transformer encoder.\n", " 
num_layers (int): The number of layers in the Transformer encoder.\n", " dropout (float): The dropout rate to be applied.\n", " learning_rate (float): The learning rate for the optimizer.\n", " weight_decay (float): The weight decay (L2 penalty) for the optimizer.\n", " warmup_steps (int): The number of linear warmup steps for the learning rate scheduler.\n", " max_steps (int): The total number of training steps for the learning rate scheduler's decay phase.\n", " \"\"\"\n", " super().__init__()\n", " # This saves all hyperparameters to self.hparams, making them accessible later\n", " self.save_hyperparameters()\n", "\n", " # Embedding layers\n", " self.item_embedding = nn.Embedding(vocab_size, hidden_dim, padding_idx=0)\n", " self.positional_embedding = nn.Embedding(max_len, hidden_dim)\n", " self.dropout = nn.Dropout(dropout)\n", "\n", " # Transformer Encoder\n", " encoder_layer = nn.TransformerEncoderLayer(\n", " d_model=hidden_dim, nhead=num_heads, dim_feedforward=hidden_dim * 4,\n", " dropout=dropout, batch_first=True, activation='gelu'\n", " )\n", " self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)\n", "\n", " # Output layer\n", " self.fc = nn.Linear(hidden_dim, vocab_size)\n", "\n", " # Loss function, ignoring the padding token\n", " self.loss_fn = nn.CrossEntropyLoss(ignore_index=0)\n", " \n", " # Lists to store outputs from validation and test steps\n", " self.validation_step_outputs = []\n", " self.test_step_outputs = []\n", "\n", " def forward(self, x):\n", " \"\"\"\n", " Defines the forward pass of the model.\n", "\n", " Args:\n", " x (torch.Tensor): A batch of input sequences of shape (batch_size, seq_len).\n", "\n", " Returns:\n", " torch.Tensor: The output logits of shape (batch_size, seq_len, vocab_size).\n", " \"\"\"\n", " seq_len = x.size(1)\n", " # Create positional indices (0, 1, 2, ..., seq_len-1)\n", " positions = torch.arange(seq_len, device=self.device).unsqueeze(0)\n", "\n", " # Create a causal mask to 
ensure the model doesn't look ahead in the sequence\n", " causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len, device=self.device)\n", "\n", " # Combine item and positional embeddings\n", " x = self.item_embedding(x) + self.positional_embedding(positions)\n", " x = self.dropout(x)\n", " \n", " # Pass through the Transformer encoder\n", " x = self.transformer_encoder(x, mask=causal_mask)\n", " \n", " # Get final logits\n", " logits = self.fc(x)\n", " return logits\n", "\n", " def training_step(self, batch, batch_idx):\n", " \"\"\"\n", " Performs a single training step.\n", "\n", " Args:\n", " batch (tuple): A tuple containing input sequences and target items.\n", " batch_idx (int): The index of the current batch.\n", "\n", " Returns:\n", " torch.Tensor: The calculated loss for the batch.\n", " \"\"\"\n", " inputs, targets = batch\n", " logits = self.forward(inputs)\n", "\n", " # We only care about the prediction for the very last item in the input sequence\n", " last_logits = logits[:, -1, :]\n", " \n", " # Calculate loss against the single target item\n", " loss = self.loss_fn(last_logits, targets.squeeze())\n", " \n", " self.log('train_loss', loss, prog_bar=True, on_step=True, on_epoch=True)\n", " return loss\n", "\n", " def validation_step(self, batch, batch_idx):\n", " \"\"\"\n", " Performs a single validation step.\n", " Calculates loss and stores predictions for metric computation at the end of the epoch.\n", " \"\"\"\n", " inputs, targets = batch\n", " logits = self.forward(inputs)\n", " last_item_logits = logits[:, -1, :]\n", " loss = self.loss_fn(last_item_logits, targets.squeeze())\n", " self.log('val_loss', loss, prog_bar=True, on_epoch=True)\n", "\n", " # Get top-10 predictions for metric calculation\n", " top_k_preds = torch.topk(last_item_logits, 10, dim=-1).indices\n", " self.validation_step_outputs.append({'preds': top_k_preds, 'targets': targets})\n", " return loss\n", "\n", " def on_validation_epoch_end(self):\n", " \"\"\"\n", " 
Calculates and logs ranking metrics at the end of the validation epoch.\n", " \"\"\"\n", " if not self.validation_step_outputs: return\n", "\n", " # Concatenate all predictions and targets from the epoch\n", " preds = torch.cat([x['preds'] for x in self.validation_step_outputs], dim=0)\n", " targets = torch.cat([x['targets'] for x in self.validation_step_outputs], dim=0)\n", "\n", " k = preds.size(1)\n", " # Check if the target is in the top-k predictions for each example\n", " hits_tensor = (preds == targets).any(dim=1)\n", " num_hits = hits_tensor.sum().item()\n", " num_targets = len(targets)\n", "\n", " if num_targets > 0:\n", " hit_rate = num_hits / num_targets\n", " recall = hit_rate # For next-item prediction, recall@k is the same as hit_rate@k\n", " precision = num_hits / (k * num_targets)\n", " else:\n", " hit_rate, recall, precision = 0.0, 0.0, 0.0\n", "\n", " self.log('val_hitrate@10', hit_rate, prog_bar=True)\n", " self.log('val_precision@10', precision, prog_bar=True)\n", " self.log('val_recall@10', recall, prog_bar=True)\n", "\n", " self.validation_step_outputs.clear() # Free up memory\n", "\n", " def test_step(self, batch, batch_idx):\n", " \"\"\"\n", " Performs a single test step.\n", " Mirrors the logic of the validation_step.\n", " \"\"\"\n", " inputs, targets = batch\n", " logits = self.forward(inputs)\n", " last_item_logits = logits[:, -1, :]\n", " loss = self.loss_fn(last_item_logits, targets.squeeze())\n", " self.log('test_loss', loss, prog_bar=True)\n", "\n", " top_k_preds = torch.topk(last_item_logits, 10, dim=-1).indices\n", " self.test_step_outputs.append({'preds': top_k_preds, 'targets': targets})\n", " return loss\n", "\n", " def on_test_epoch_end(self):\n", " \"\"\"\n", " Calculates and logs ranking metrics at the end of the test epoch.\n", " \"\"\"\n", " if not self.test_step_outputs: return\n", "\n", " preds = torch.cat([x['preds'] for x in self.test_step_outputs], dim=0)\n", " targets = torch.cat([x['targets'] for x in 
self.test_step_outputs], dim=0)\n", "\n", " k = preds.size(1)\n", " hits_tensor = (preds == targets).any(dim=1)\n", " num_hits = hits_tensor.sum().item()\n", " num_targets = len(targets)\n", "\n", " if num_targets > 0:\n", " hit_rate = num_hits / num_targets\n", " recall = hit_rate\n", " precision = num_hits / (k * num_targets)\n", " else:\n", " hit_rate, recall, precision = 0.0, 0.0, 0.0\n", "\n", " self.log('test_hitrate@10', hit_rate, prog_bar=True)\n", " self.log('test_precision@10', precision, prog_bar=True)\n", " self.log('test_recall@10', recall, prog_bar=True)\n", "\n", " self.test_step_outputs.clear() # Free up memory\n", "\n", " def configure_optimizers(self):\n", " \"\"\"\n", " Configures the optimizer and learning rate scheduler.\n", " \n", " Uses AdamW optimizer and a linear warmup followed by a cosine decay schedule,\n", " which is a standard practice for training Transformer models.\n", " \"\"\"\n", " optimizer = torch.optim.AdamW(\n", " self.parameters(),\n", " lr=self.hparams.learning_rate,\n", " weight_decay=self.hparams.weight_decay\n", " )\n", " \n", " # Learning rate scheduler: linear warmup and cosine decay\n", " def lr_lambda(current_step: int):\n", " warmup_steps = self.hparams.warmup_steps\n", " max_steps = self.hparams.max_steps\n", " if current_step < warmup_steps:\n", " return float(current_step) / float(max(1, warmup_steps))\n", " # Clamp progress to 1.0 so the rate stays at zero past max_steps instead of rising again\n", " progress = min(1.0, float(current_step - warmup_steps) / float(max(1, max_steps - warmup_steps)))\n", " return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))\n", "\n", " scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)\n", "\n", " return {\n", " \"optimizer\": optimizer,\n", " \"lr_scheduler\": {\n", " \"scheduler\": scheduler,\n", " \"interval\": \"step\", # Update the scheduler at every training step\n", " \"frequency\": 1\n", " }\n", " }" ] }, { "cell_type": "code", "execution_count": null, "id": "20bbc93a", "metadata": {}, "outputs": [], "source": [ "import pytorch_lightning as pl\n", "from 
pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor\n", "from pytorch_lightning.loggers import TensorBoardLogger\n", "import torch\n", "\n", "def train_and_eval_SASRec_model(train_set, validation_set, test_set, checkpoint_dir_path='checkpoints/',\n", " checkpoint_path=None, n_epochs=10, mode='train',\n", " batchsize=256, max_token_len=50, learning_rate=1e-3, hidden_dim=128,\n", " num_heads=2, num_layers=2, dropout=0.2, weight_decay=1e-6):\n", " \"\"\"\n", " Train or evaluate a SASRec sequential recommendation model using PyTorch Lightning.\n", "\n", " This function wraps the entire SASRec pipeline:\n", " - Initializes the SASRecDataModule (handles dataset preprocessing and dataloaders).\n", " - Builds the SASRec Transformer-based model.\n", " - Configures training callbacks (checkpointing, early stopping, LR monitoring).\n", " - Runs either training (`mode='train'`) or evaluation on the test set (`mode='test'`).\n", "\n", " Args\n", " ----------\n", " train_set : pd.DataFrame\n", " Training interactions dataset.\n", " validation_set : pd.DataFrame\n", " Validation dataset with the same structure as `train_set`.\n", " test_set : pd.DataFrame\n", " Test dataset with the same structure as `train_set`.\n", " checkpoint_dir_path : str, optional (default='checkpoints/')\n", " Directory to save model checkpoints.\n", " checkpoint_path : str or None, optional (default=None)\n", " Path to a checkpoint file for resuming training or loading a pretrained model for testing.\n", " n_epochs : int, optional (default=10)\n", " Number of training epochs.\n", " mode : {'train', 'test'}, optional (default='train')\n", " - `'train'`: trains the model on the training/validation data.\n", " - `'test'`: evaluates the model on the test set using a checkpoint.\n", " batchsize : int, optional (default=256)\n", " Batch size for training and evaluation.\n", " max_token_len : int, optional (default=50)\n", " Maximum sequence length per user (recent interactions kept).\n", " 
learning_rate : float, optional (default=1e-3)\n", " Learning rate for the AdamW optimizer.\n", " hidden_dim : int, optional (default=128)\n", " Dimensionality of item and positional embeddings.\n", " num_heads : int, optional (default=2)\n", " Number of attention heads in each Transformer encoder layer.\n", " num_layers : int, optional (default=2)\n", " Number of Transformer encoder layers.\n", " dropout : float, optional (default=0.2)\n", " Dropout probability applied in embeddings and Transformer layers.\n", " weight_decay : float, optional (default=1e-6)\n", " Weight decay regularization coefficient for AdamW.\n", " \"\"\"\n", " # --- 1. Initialize DataModule ---\n", " print(\"Initializing DataModule...\")\n", " datamodule = SASRecDataModule(\n", " train_df=train_set,\n", " val_df=validation_set,\n", " test_df=test_set,\n", " batch_size=batchsize,\n", " max_len=max_token_len\n", " )\n", " datamodule.setup()\n", "\n", " # --- 2. Initialize Model ---\n", " print(\"Initializing SASRec model...\")\n", " model = SASRec(\n", " vocab_size=datamodule.vocab_size,\n", " max_len=max_token_len,\n", " hidden_dim=hidden_dim,\n", " num_heads=num_heads,\n", " num_layers=num_layers,\n", " dropout=dropout,\n", " learning_rate=learning_rate,\n", " weight_decay=weight_decay\n", " )\n", "\n", " # --- 3. Configure Training Callbacks ---\n", " checkpoint_callback = ModelCheckpoint(\n", " dirpath=checkpoint_dir_path,\n", " filename=\"sasrec-{epoch:02d}-{val_hitrate@10:.4f}\",\n", " save_top_k=1,\n", " verbose=True,\n", " monitor=\"val_hitrate@10\",\n", " mode=\"max\"\n", " )\n", "\n", " early_stopping_callback = EarlyStopping(\n", " monitor=\"val_hitrate@10\", # stop if ranking metric stagnates\n", " patience=5,\n", " mode=\"max\"\n", " )\n", "\n", " lr_monitor = LearningRateMonitor(logging_interval=\"step\")\n", "\n", " logger = TensorBoardLogger(\"lightning_logs\", name=\"sasrec\")\n", "\n", " # --- 4. 
Initialize Trainer ---\n", " print(\"Initializing PyTorch Lightning Trainer...\")\n", " trainer = pl.Trainer(\n", " logger=logger,\n", " callbacks=[checkpoint_callback, early_stopping_callback, lr_monitor],\n", " max_epochs=n_epochs,\n", " accelerator='auto',\n", " devices=1,\n", " gradient_clip_val=1.0, # helps with exploding gradients\n", " )\n", "\n", " if mode == 'train':\n", " # --- 5. Start Training ---\n", " print(f\"Starting training for up to {n_epochs} epochs...\")\n", " trainer.fit(model, datamodule,\n", " ckpt_path=checkpoint_path\n", " )\n", "\n", " elif mode == 'test':\n", " # --- 6. Test on best checkpoint ---\n", " print(\"Evaluating on test set...\")\n", " trainer.test(model, datamodule,\n", " ckpt_path=checkpoint_path\n", " )\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5d4a2a7b", "metadata": {}, "outputs": [], "source": [ "# --- Configuration ---\n", "BATCH_SIZE = 256\n", "MAX_TOKEN_LEN = 50 # 50–100 is standard\n", "LEARNING_RATE = 1e-3\n", "HIDDEN_DIM = 128\n", "NUM_HEADS = 2\n", "NUM_LAYERS = 2\n", "DROPOUT = 0.2\n", "WEIGHT_DECAY = 1e-6\n", "N_EPOCHS = 50\n", "MODE = 'train' # 'train' or 'test'\n", "\n", "# Shared hyperparameters, passed to both calls so the test-time model matches the trained one\n", "sasrec_kwargs = dict(\n", " batchsize=BATCH_SIZE, max_token_len=MAX_TOKEN_LEN, learning_rate=LEARNING_RATE,\n", " hidden_dim=HIDDEN_DIM, num_heads=NUM_HEADS, num_layers=NUM_LAYERS,\n", " dropout=DROPOUT, weight_decay=WEIGHT_DECAY\n", ")\n", "\n", "# Train and evaluate SASRec model\n", "print(\"\\n>>> Training and evaluating SASRec model <<<\")\n", "train_and_eval_SASRec_model(train_set, validation_set, test_set, n_epochs=N_EPOCHS, mode='train', **sasrec_kwargs)\n", "\n", "print(\"\\n>>> Evaluating trained SASRec model on TEST set <<<\")\n", "# NOTE: set checkpoint_path to the best checkpoint saved during training;\n", "# with the default None, the freshly initialized weights are evaluated.\n", "train_and_eval_SASRec_model(train_set, validation_set, test_set, mode='test', **sasrec_kwargs)" ] }, { "cell_type": "markdown", "id": "468e0951", "metadata": {}, "source": [ "## Main function to run the complete Recommender System" ] }, { "cell_type": "code", "execution_count": null, "id": "8f810e9a", "metadata": {}, "outputs": [], "source": [ "def load_item_properties(data_folder='data/'):\n", " \"\"\"\n", " Loads item properties and creates a mapping from item ID to its category ID.\n", " Handles either a single properties file or two split parts.\n", " \n", " 
Args:\n", " data_folder (str): The path to the folder containing item property files.\n", "\n", " Returns:\n", " dict: A dictionary mapping {itemid: categoryid}.\n", " \"\"\"\n", " print(\"Loading item properties...\")\n", " try:\n", " # First, try to load the two separate parts and combine them.\n", " props_df_part1 = pd.read_csv(data_folder + 'item_properties_part1.csv')\n", " props_df_part2 = pd.read_csv(data_folder + 'item_properties_part2.csv')\n", " props_df = pd.concat([props_df_part1, props_df_part2], ignore_index=True)\n", " print(\"Successfully loaded and combined item_properties_part1.csv and item_properties_part2.csv.\")\n", "\n", " except FileNotFoundError:\n", " try:\n", " # If the parts are not found, try to load a single combined file.\n", " props_df = pd.read_csv(data_folder + 'item_properties.csv')\n", " print(\"Successfully loaded a single item_properties.csv.\")\n", " except FileNotFoundError:\n", " print(f\"Warning: No item properties files found. Cannot display category information.\")\n", " return {}\n", "\n", " category_df = props_df[props_df['property'] == 'categoryid'].copy()\n", " category_df['value'] = pd.to_numeric(category_df['value'], errors='coerce').astype('Int64')\n", " item_to_category_map = category_df.set_index('itemid')['value'].to_dict()\n", " print(\"Item to category mapping created successfully.\")\n", " return item_to_category_map\n", "\n", "def load_category_tree(data_folder='data/'):\n", " \"\"\"\n", " Loads the category tree to map categories to their parent categories.\n", "\n", " Args:\n", " data_folder (str): The path to the folder containing category_tree.csv.\n", "\n", " Returns:\n", " dict: A dictionary mapping {categoryid: parentid}.\n", " \"\"\"\n", " print(\"Loading category tree...\")\n", " try:\n", " tree_df = pd.read_csv(data_folder + 'category_tree.csv')\n", " category_parent_map = tree_df.set_index('categoryid')['parentid'].to_dict()\n", " print(\"Category tree loaded successfully.\")\n", " return 
category_parent_map\n", " except FileNotFoundError:\n", " print(\"Warning: 'category_tree.csv' not found. Cannot display parent category information.\")\n", " return {}\n", "\n", "def get_popular_items(train_df, k=10):\n", " \"\"\"\n", " Calculates the top-k most popular items based on transaction count.\n", " \"\"\"\n", " purchase_counts = train_df[train_df['event'] == 'transaction']['itemid'].value_counts()\n", " return purchase_counts.head(k).index.tolist()\n", "\n", "def show_user_recommendations(visitor_id, model, datamodule, popular_items, item_category_map, category_parent_map, k=10):\n", " \"\"\"\n", " Displays recommendations for a user, including category and parent category information.\n", " \"\"\"\n", " print(f\"\\n--- Recommendations for Visitor ID: {visitor_id} ---\")\n", " model.eval()\n", "\n", " def format_item_with_category(item_id):\n", " category_id = item_category_map.get(item_id, 'N/A')\n", " parent_id = category_parent_map.get(category_id, 'N/A') if category_id != 'N/A' else 'N/A'\n", " return f\"Item: {item_id} (Category: {category_id}, Parent: {parent_id})\"\n", "\n", " user_history_ids = datamodule.user_history.get(visitor_id)\n", "\n", " if user_history_ids is None:\n", " print(f\"User {visitor_id} not found in training history. 
Providing popularity-based recommendations.\")\n", " print(f\"\\nTop {k} Popular Items (Fallback):\")\n", " recs_with_cats = [format_item_with_category(item_id) for item_id in popular_items]\n", " print(recs_with_cats)\n", " print(\"-------------------------------------------------\")\n", " return\n", "\n", " history_with_cats = [format_item_with_category(item_id) for item_id in user_history_ids]\n", " print(f\"User's Historical Interactions:\")\n", " print(history_with_cats)\n", "\n", " history_indices = [datamodule.item_map[i] for i in user_history_ids if i in datamodule.item_map]\n", " if not history_indices:\n", " print(\"None of the user's historical items are in the model's vocabulary.\")\n", " return\n", "\n", " max_len = datamodule.max_len\n", " input_seq = history_indices[-max_len:]\n", " padded_input = np.zeros(max_len, dtype=np.int64)\n", " padded_input[-len(input_seq):] = input_seq\n", " \n", " input_tensor = torch.LongTensor(np.array([padded_input]))\n", " input_tensor = input_tensor.to(model.device)\n", "\n", " with torch.no_grad():\n", " logits = model(input_tensor)\n", " last_item_logits = logits[0, -1, :]\n", " top_indices = torch.topk(last_item_logits, k).indices.tolist()\n", "\n", " recommended_item_ids = [datamodule.inverse_item_map[idx] for idx in top_indices if idx in datamodule.inverse_item_map]\n", "\n", " print(f\"\\nTop {k} Recommended Items:\")\n", " recs_with_cats = [format_item_with_category(item_id) for item_id in recommended_item_ids]\n", " print(recs_with_cats)\n", " print(\"-------------------------------------------------\")\n" ] }, { "cell_type": "code", "execution_count": null, "id": "735d0f8d", "metadata": {}, "outputs": [], "source": [ "def main(checkpoint_path=\"checkpoints/sasrec-epoch=06-val_hitrate@10=0.3629.ckpt\", data_folder=\"data/\"):\n", " \"\"\"\n", " Main function to run the inference and qualitative analysis pipeline.\n", " \"\"\"\n", "\n", " device = torch.device(\"cuda\" if torch.cuda.is_available() else 
\"cpu\")\n", " print(f\"Using device: {device}\")\n", "\n", " print(\"Loading model from checkpoint...\")\n", " best_model = SASRec.load_from_checkpoint(checkpoint_path)\n", " best_model.to(device)\n", "\n", " print(\"Preparing data...\")\n", " train_set, validation_set, test_set = prepare_data(data_folder=data_folder)\n", " \n", " datamodule = SASRecDataModule(train_set, validation_set, test_set)\n", " datamodule.setup()\n", " \n", " item_category_map = load_item_properties(data_folder=data_folder)\n", " category_parent_map = load_category_tree(data_folder=data_folder)\n", " \n", " print(\"\\nCalculating popular items for cold-start users...\")\n", " popular_items_list = get_popular_items(train_set, k=10)\n", "\n", " users_in_train_history = set(datamodule.user_history.keys())\n", " users_in_test_set = set(datamodule.test_df['visitorid'].unique())\n", " valid_example_users = list(users_in_train_history.intersection(users_in_test_set))\n", "\n", " print(f\"\\nFound {len(valid_example_users)} users for qualitative analysis.\")\n", " \n", " for user_id in valid_example_users[:3]:\n", " show_user_recommendations(user_id, best_model, datamodule, popular_items_list, item_category_map, category_parent_map)\n", " \n", " new_user_id = -999\n", " show_user_recommendations(new_user_id, best_model, datamodule, popular_items_list, item_category_map, category_parent_map)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0e7ba5f2", "metadata": {}, "outputs": [], "source": [ "main()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }