DanielKiani committed on
Commit
38ae75d
·
0 Parent(s):

Initial commit of recommender system project

.gitattributes ADDED
@@ -0,0 +1,2 @@
1
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
2
+ *.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,18 @@
1
+ # Python
2
+ __pycache__/
3
+ *.pyc
4
+
5
+ # Virtual Environment
6
+ venv/
7
+ .venv/
8
+
9
+ # Data and Logs
10
+ data/
11
+ logs/
12
+ notebooks/data/
13
+ notebooks/logs/
14
+
15
+ # IDE files
16
+ .vscode/
17
+ .idea/
18
+
README.md ADDED
@@ -0,0 +1,182 @@
1
+ ![Recommender system banner](assets/banner.png)
2
+ [![Python](https://img.shields.io/badge/Python-3.10-blue?logo=python)](https://www.python.org/) [![PyTorch](https://img.shields.io/badge/PyTorch-2.7.1-EE4C2C?logo=pytorch)](https://pytorch.org/) ![Made with ML](https://img.shields.io/badge/Made%20with-ML-blueviolet?logo=openai) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
3
+
4
+ # 🚀 End-to-End Sequential Recommender System
5
+
6
+ This project implements and evaluates a series of recommender system models, culminating in a state-of-the-art **SASRec (Self-Attentive Sequential Recommendation)** model for Top-N next-item prediction. The system is trained on the [RetailRocket e-commerce dataset](https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset) and includes an interactive web demo built with Gradio.
7
+
8
+ ![Gradio app](assets/gradio.png)
9
+ You can try the Gradio app [here](https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews).
10
+
11
+ ---
12
+
13
+ ## 📑 Table of Contents
14
+
15
+ - [📖 Project Overview](#-project-overview)
16
+ - [✨ Key Features](#-key-features)
17
+ - [🧩 Models Implemented](#-models-implemented)
18
+ - [📊 Final Results](#-final-results)
19
+ - [🔍 Qualitative Analysis](#-qualitative-analysis)
20
+ - [🚧 Future Improvements](#-future-improvements)
21
+ - [📂 Project Structure](#-project-structure)
22
+ - [⚙️ Setup and Usage](#️-setup-and-usage)
23
+ - [🛠️ Technologies and Models Used](#️-technologies-and-models-used)
24
+
25
+ ---
26
+
27
+ ## 📖 Project Overview
28
+
29
+ The primary goal of this project is to predict the next item a user is likely to interact with based on their recent session history. This is a common and critical task in e-commerce known as Top-N sequential recommendation.
30
+
31
+ The project follows a structured approach:
32
+
33
+ 1. **Baseline Models**: Simple, non-sequential models to establish a performance baseline.
34
+ 2. **Hyperparameter Tuning**: Optuna is used to find the optimal configuration for ALS.
35
+ 3. **Advanced Sequential Model**: Implementation of **SASRec** with PyTorch Lightning.
36
+ 4. **Evaluation**: Offline evaluation using ranking metrics (Hit Rate, Precision, Recall @ 10).
37
+ 5. **Interactive Demo**: A Gradio web app for real-time personalized and cold-start recommendations.
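The offline evaluation step (Hit Rate, Precision, and Recall @ 10) can be sketched in plain Python. This is a simplified, self-contained version of the metric computation; the function name and signature here are illustrative, not the repo's actual API:

```python
def ranking_metrics(recommendations, ground_truth, k=10):
    """Mean Precision@k, Recall@k, and HitRate@k over users with ground truth.

    recommendations: {user_id: ranked list of item ids}
    ground_truth:    {user_id: set of held-out item ids}
    """
    precisions, recalls, hits = [], [], []
    for user, true_items in ground_truth.items():
        if not true_items:
            continue  # skip users with no held-out interactions
        recs = recommendations.get(user, [])[:k]
        n_hits = len(set(recs) & true_items)
        precisions.append(n_hits / k)
        recalls.append(n_hits / len(true_items))
        hits.append(1.0 if n_hits else 0.0)
    n = len(precisions)
    if n == 0:
        return {"precision@k": 0.0, "recall@k": 0.0, "hitrate@k": 0.0}
    return {
        "precision@k": sum(precisions) / n,
        "recall@k": sum(recalls) / n,
        "hitrate@k": sum(hits) / n,
    }
```

Each metric is averaged per-user: a "hit" means at least one held-out item appears in the top-k list.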
38
+
39
+ ---
40
+
41
+ ## ✨ Key Features
42
+
43
+ - 🔹 **Comprehensive Model Comparison**: From popularity to Transformer-based SASRec.
44
+ - 🔹 **Robust Evaluation**: Time-based data split for realistic performance measurement.
45
+ - 🔹 **Hyperparameter Optimization**: Automated with Optuna for ALS.
46
+ - 🔹 **Deep Learning with Attention**: Full PyTorch Lightning implementation of SASRec.
47
+ - 🔹 **Interactive Web Demo**: Live Gradio app for recommendations.
48
+ - 🔹 **Modular Codebase**: Clean, organized structure.
49
+
50
+ ---
51
+
52
+ ## 🧩 Models Implemented
53
+
54
+ | Model | Methodology | Key Characteristics |
55
+ | :--- | :--- | :--- |
56
+ | **Popularity** | Non-personalized | Recommends the most frequently purchased items across all users. |
57
+ | **Item-Item CF** | Collaborative Filtering | Recommends items similar to a user’s past interactions. |
58
+ | **ALS** | Matrix Factorization | Learns latent embeddings from implicit feedback, tuned with Optuna. |
59
+ | **SASRec** | Transformer (Self-Attention) | Sequential model capturing contextual user-item interactions. |
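The popularity baseline from the table is simple enough to sketch in a few lines; the function below is a hypothetical minimal version, not the repo's `models.py` implementation:

```python
from collections import Counter

def popularity_recommender(train_events, k=10):
    """Non-personalized baseline: count item interactions in the training
    data and recommend the same top-k list to every user."""
    counts = Counter(item for _, _, item in train_events)
    top_k = [item for item, _ in counts.most_common(k)]
    return lambda user_id: top_k  # same list regardless of the user
```

Despite ignoring the user entirely, this baseline beat the personalized Item-Item CF and ALS models on Hit Rate@10 in the results above, which is common on sparse implicit-feedback data.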
60
+
61
+ ---
62
+
63
+ ## 📊 Final Results
64
+
65
+ SASRec significantly outperformed all baselines, with a **~4.7x improvement in Hit Rate**.
66
+
67
+ | Model | Test Hit Rate@10 | Test Precision@10 | Test Recall@10 |
68
+ | :--- | :---: | :---: | :---: |
69
+ | Popularity | 0.0651 | 0.0065 | 0.0324 |
70
+ | Item-Item CF | 0.0021 | 0.0002 | 0.0011 |
71
+ | Tuned ALS | 0.0063 | 0.0006 | 0.0042 |
72
+ | **SASRec** | **0.3069** | **0.0307** | **0.3069** |
73
+
74
+ ---
75
+
76
+ ## 🔍 Qualitative Analysis
77
+
78
+ The SASRec model not only recommends previously viewed items but also discovers **new, contextually relevant items**.
79
+ For example, for a user browsing **Category 1279**, SASRec suggested new items from the same category — showing strong personalization and discovery.
80
+
81
+ ---
82
+
83
+ ## 🚧 Future Improvements
84
+
85
+ - 📦 **Incorporate Item Features** (e.g., from `item_properties.csv`).
86
+ - 🤖 **Explore Advanced Models**:
87
+ - BERT4Rec (bidirectional Transformers).
88
+ - Graph-based recommender systems.
89
+ - 🧪 **Online A/B Testing** for business impact.
90
+ - ⚡ **Scalability Enhancements**: Feature stores, inference servers (Triton), quantization, distillation.
91
+
92
+ ---
93
+
94
+ ## 📂 Project Structure
95
+
96
+ ```bash
97
+ ├── checkpoints/ # Saved PyTorch Lightning checkpoints
98
+ ├── data/ # RetailRocket dataset
99
+ ├── notebooks/ # EDA notebooks
100
+ ├── scripts/
101
+ │   ├── als_optuna_study.py   # Optuna tuning for ALS
102
+ │   ├── app.py                # Gradio web demo
103
+ │   ├── data_prepare.py       # Data loading & preprocessing
104
+ │   ├── main.py               # Entry point for demo
105
+ │   ├── models.py             # Model definitions
106
+ │   ├── train_and_eval.py     # Training & evaluation loop
107
+ │   └── utils.py              # Helper functions
108
+ ├── README.md
109
+ └── requirements.txt
110
+ ```
111
+
112
+ ---
113
+
114
+ ## ⚙️ Setup and Usage
115
+
116
+ Follow these steps to set up and run the project locally.
117
+
118
+ ### 1. Prerequisites
119
+
120
+ - Python 3.10.6+
121
+ - An NVIDIA GPU is recommended for training the SASRec model.
122
+
123
+ ### 2. Clone the Repository
124
+
125
+ ```bash
126
+ git clone <your-repo-url>
127
+ cd <your-repo-name>
128
+ ```
129
+
130
+ ### 3. Install Dependencies
131
+
132
+ ```bash
133
+ pip install -r requirements.txt
134
+ ```
135
+
136
+ ### 4. Download and Place Data
137
+
138
+ - Download the [RetailRocket e-commerce dataset](https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset).
139
+
140
+ Place the extracted CSV files in the `data/` folder, then run the preprocessing script:
141
+
142
+ ```bash
143
+ python data_prepare.py
144
+ ```
145
+
146
+ ### 5. Run the Full Evaluation
147
+
148
+ To train all models and see the final comparison table, run the training and evaluation script:
149
+
150
+ ```bash
151
+ python train_and_eval.py
152
+ ```
153
+
154
+ ### 6. Launch the Gradio Demo
155
+
156
+ ```bash
157
+ python main.py
158
+ ```
159
+
160
+ ---
161
+
162
+ ## 🛠️ Technologies and Models Used
163
+
164
+ This project leverages a range of modern data science and machine learning technologies to build a robust recommender system from the ground up.
165
+
166
+ ### 🏭 Models
167
+
168
+ - **Popularity Model**: A non-personalized baseline that recommends the most frequently purchased items.
169
+ - **Item-Item Collaborative Filtering**: A classical neighborhood-based model that recommends items based on co-occurrence patterns with a user's interaction history.
170
+ - **Alternating Least Squares (ALS)**: A powerful matrix factorization technique for implicit feedback, optimized with hyperparameter tuning.
171
+ - **SASRec (Self-Attentive Sequential Recommendation)**: A state-of-the-art sequential model based on the Transformer architecture, designed to capture the order and context of user interactions.
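Before a user history reaches the SASRec model it is turned into a fixed-length, left-padded input with the next item as target. A simplified sketch of that step (mirroring the dataset logic in the notebook; the function name is illustrative):

```python
def make_training_pair(sequence, cutoff, max_len=50, pad=0):
    """Build one SASRec-style training example: the items before `cutoff`
    (left-padded or truncated to `max_len`) form the input, and the item
    at `cutoff` is the next-item prediction target."""
    history = sequence[:cutoff][-max_len:]          # keep only the most recent items
    input_seq = [pad] * (max_len - len(history)) + history  # left-pad with token 0
    return input_seq, sequence[cutoff]
```

Index 0 is reserved as the padding token, which is why the item vocabulary in the notebook maps real items to indices starting at 1.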
172
+
173
+ ### 👩‍💻 Core Technologies & Libraries
174
+
175
+ - **Python 3.10**: The primary programming language for the project.
176
+ - **Pandas & NumPy**: For efficient data manipulation, preprocessing, and numerical operations.
177
+ - **Scikit-learn**: Used for calculating item similarity in the collaborative filtering model.
178
+ - **Implicit**: Provides the Alternating Least Squares (ALS) matrix factorization implementation for implicit-feedback data.
179
+ - **PyTorch & PyTorch Lightning**: The deep learning framework used to build, train, and evaluate the SASRec model in a structured and scalable way.
180
+ - **Optuna**: A hyperparameter optimization framework used to automatically find the best parameters for the ALS model.
181
+ - **Gradio**: A fast and simple framework used to build and deploy the interactive web demo.
182
+ - **TensorBoard**: For logging and visualizing model training metrics.
assets/banner.png ADDED

Git LFS Details

  • SHA256: 5d374ef2d90cdf1206fb3adde8ad7d0e355dfcd1f9f73c803162658041f4d480
  • Pointer size: 132 Bytes
  • Size of remote file: 2.04 MB
checkpoints/sasrec-epoch=06-val_hitrate@10=0.3629.ckpt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:af98f72c0c5f325aa0393753db2a7e5865336423b222ced1aeb7f164009d0e06
3
+ size 204423643
notebooks/lightning_logs/sasrec/version_0/events.out.tfevents.1757789891.DESKTOP-48K6QDS.24476.0 ADDED
Binary file (657 Bytes). View file
 
notebooks/lightning_logs/sasrec/version_0/hparams.yaml ADDED
@@ -0,0 +1,7 @@
1
+ vocab_size: 64705
2
+ max_len: 256
3
+ hidden_dim: 128
4
+ num_heads: 4
5
+ num_layers: 2
6
+ dropout: 0.1
7
+ learning_rate: 2.0e-05
notebooks/lightning_logs/sasrec/version_1/events.out.tfevents.1757790260.DESKTOP-48K6QDS.24476.1 ADDED
Binary file (657 Bytes). View file
 
notebooks/lightning_logs/sasrec/version_1/hparams.yaml ADDED
@@ -0,0 +1,7 @@
1
+ vocab_size: 64705
2
+ max_len: 256
3
+ hidden_dim: 128
4
+ num_heads: 4
5
+ num_layers: 2
6
+ dropout: 0.1
7
+ learning_rate: 2.0e-05
notebooks/lightning_logs/sasrec/version_2/events.out.tfevents.1757868923.DESKTOP-48K6QDS.22820.0 ADDED
Binary file (657 Bytes). View file
 
notebooks/lightning_logs/sasrec/version_2/hparams.yaml ADDED
@@ -0,0 +1,7 @@
1
+ vocab_size: 64705
2
+ max_len: 50
3
+ hidden_dim: 128
4
+ num_heads: 2
5
+ num_layers: 2
6
+ dropout: 0.2
7
+ learning_rate: 0.001
notebooks/lightning_logs/sasrec/version_3/events.out.tfevents.1757868963.DESKTOP-48K6QDS.22820.1 ADDED
Binary file (657 Bytes). View file
 
notebooks/lightning_logs/sasrec/version_3/hparams.yaml ADDED
@@ -0,0 +1,7 @@
1
+ vocab_size: 64705
2
+ max_len: 50
3
+ hidden_dim: 128
4
+ num_heads: 2
5
+ num_layers: 2
6
+ dropout: 0.2
7
+ learning_rate: 0.001
notebooks/reccomender.ipynb ADDED
@@ -0,0 +1,1419 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "ae798d43",
6
+ "metadata": {},
7
+ "source": [
8
+ "# 🚀 End-to-End Sequential Recommender System \n",
9
+ "\n",
10
+ "This project implements and evaluates a series of recommender system models, culminating in a state-of-the-art **SASRec (Self-Attentive Sequential Recommendation)** model for Top-N next-item prediction. The system is trained on the [RetailRocket e-commerce dataset](https://www.kaggle.com/datasets/retailrocket/ecommerce-dataset) and includes an interactive web demo built with Gradio. "
11
+ ]
12
+ },
13
+ {
14
+ "cell_type": "markdown",
15
+ "id": "338759e6",
16
+ "metadata": {},
17
+ "source": [
18
+ "## EDA"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": null,
24
+ "id": "dcc5a23b",
25
+ "metadata": {},
26
+ "outputs": [
27
+ {
28
+ "name": "stdout",
29
+ "output_type": "stream",
30
+ "text": [
31
+ "Loading events.csv...\n",
32
+ "Data Head:\n",
33
+ " timestamp visitorid event itemid transactionid\n",
34
+ "0 1433221332117 257597 view 355908 NaN\n",
35
+ "1 1433224214164 992329 view 248676 NaN\n",
36
+ "2 1433221999827 111016 view 318965 NaN\n",
37
+ "3 1433221955914 483717 view 253185 NaN\n",
38
+ "4 1433221337106 951259 view 367447 NaN\n",
39
+ "\n",
40
+ "Data Info:\n",
41
+ "<class 'pandas.core.frame.DataFrame'>\n",
42
+ "RangeIndex: 2756101 entries, 0 to 2756100\n",
43
+ "Data columns (total 5 columns):\n",
44
+ " # Column Dtype \n",
45
+ "--- ------ ----- \n",
46
+ " 0 timestamp int64 \n",
47
+ " 1 visitorid int64 \n",
48
+ " 2 event object \n",
49
+ " 3 itemid int64 \n",
50
+ " 4 transactionid float64\n",
51
+ "dtypes: float64(1), int64(3), object(1)\n",
52
+ "memory usage: 105.1+ MB\n",
53
+ "\n",
54
+ "Missing Values:\n",
55
+ "timestamp 0\n",
56
+ "visitorid 0\n",
57
+ "event 0\n",
58
+ "itemid 0\n",
59
+ "transactionid 2733644\n",
60
+ "dtype: int64\n"
61
+ ]
62
+ }
63
+ ],
64
+ "source": [
65
+ "import pandas as pd\n",
66
+ "import time\n",
67
+ "from datetime import datetime\n",
68
+ "\n",
69
+ "# Define the path to your data folder\n",
70
+ "DATA_FOLDER = 'data/'\n",
71
+ "\n",
72
+ "# Load the events data\n",
73
+ "print(\"Loading events.csv...\")\n",
74
+ "events_df = pd.read_csv(DATA_FOLDER + 'events.csv')\n",
75
+ "\n",
76
+ "# --- Initial Inspection ---\n",
77
+ "\n",
78
+ "# See the first few rows\n",
79
+ "print(\"Data Head:\")\n",
80
+ "print(events_df.head())\n",
81
+ "\n",
82
+ "# Get a summary of the dataframe (columns, data types, memory usage)\n",
83
+ "print(\"\\nData Info:\")\n",
84
+ "events_df.info()\n",
85
+ "\n",
86
+ "# Check for any missing values\n",
87
+ "print(\"\\nMissing Values:\")\n",
88
+ "print(events_df.isnull().sum())"
89
+ ]
90
+ },
91
+ {
92
+ "cell_type": "code",
93
+ "execution_count": 2,
94
+ "id": "dd89bf40",
95
+ "metadata": {},
96
+ "outputs": [
97
+ {
98
+ "name": "stdout",
99
+ "output_type": "stream",
100
+ "text": [
101
+ "\n",
102
+ "Data timeframe is from 2015-05-03 03:00:04.384000 to 2015-09-18 02:59:47.788000\n",
103
+ "\n",
104
+ "Event Counts:\n",
105
+ "event\n",
106
+ "view 2664312\n",
107
+ "addtocart 69332\n",
108
+ "transaction 22457\n",
109
+ "Name: count, dtype: int64\n",
110
+ "\n",
111
+ "Number of unique visitors: 1407580\n",
112
+ "Number of unique items: 235061\n"
113
+ ]
114
+ }
115
+ ],
116
+ "source": [
117
+ "# --- Data Cleaning and Understanding ---\n",
118
+ "\n",
119
+ "# 1. Convert timestamp to datetime\n",
120
+ "# The timestamp is in milliseconds, so we divide by 1000\n",
121
+ "events_df['timestamp_dt'] = pd.to_datetime(events_df['timestamp'], unit='ms')\n",
122
+ "print(f\"\\nData timeframe is from {events_df['timestamp_dt'].min()} to {events_df['timestamp_dt'].max()}\")\n",
123
+ "\n",
124
+ "\n",
125
+ "# 2. Analyze the distribution of event types\n",
126
+ "print(\"\\nEvent Counts:\")\n",
127
+ "event_counts = events_df['event'].value_counts()\n",
128
+ "print(event_counts)\n",
129
+ "\n",
130
+ "\n",
131
+ "# 3. Calculate number of unique users and items\n",
132
+ "n_users = events_df['visitorid'].nunique()\n",
133
+ "n_items = events_df['itemid'].nunique()\n",
134
+ "\n",
135
+ "print(f\"\\nNumber of unique visitors: {n_users}\")\n",
136
+ "print(f\"Number of unique items: {n_items}\")"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "markdown",
141
+ "id": "90fbfb19",
142
+ "metadata": {},
143
+ "source": [
144
+ "## Preparing the data"
145
+ ]
146
+ },
147
+ {
148
+ "cell_type": "code",
149
+ "execution_count": null,
150
+ "id": "8639638d",
151
+ "metadata": {},
152
+ "outputs": [],
153
+ "source": [
154
+ "import zipfile\n",
155
+ "import pandas as pd\n",
156
+ "from datetime import datetime, timedelta\n",
157
+ "import numpy as np\n",
158
+ "from scipy.sparse import csr_matrix\n",
159
+ "import math\n",
160
+ "import torch\n",
161
+ "import torch.nn as nn\n",
162
+ "from torch.utils.data import Dataset, DataLoader\n",
163
+ "import pytorch_lightning as pl\n",
164
+ "\n",
165
+ "def prepare_data(data_folder='data/', val_days=7, test_days=7):\n",
166
+ " \"\"\"\n",
167
+ " Loads, preprocesses, and splits the events data into train, validation, and test sets.\n",
168
+ " \n",
169
+ "    Args:\n",
170
+ " data_folder: str, path to the folder containing 'events.csv'\n",
171
+ " val_days: int, number of days for the validation set\n",
172
+ " test_days: int, number of days for the test set\n",
173
+ " \"\"\"\n",
174
+ " # --- Load Data ---\n",
175
+ " print(f\"Loading events.csv from folder: {data_folder}\")\n",
176
+ " try:\n",
177
+ " events_df = pd.read_csv(data_folder + 'events.csv')\n",
178
+ " print(\"Successfully loaded events.csv.\")\n",
179
+ " events_df['timestamp_dt'] = pd.to_datetime(events_df['timestamp'], unit='ms')\n",
180
+ " print(\"\\n--- Initial Data Summary ---\")\n",
181
+ " print(f\"Data shape: {events_df.shape}\")\n",
182
+ " print(f\"Full timeframe: {events_df['timestamp_dt'].min()} to {events_df['timestamp_dt'].max()}\")\n",
183
+ " print(\"----------------------------\\n\")\n",
184
+ " except FileNotFoundError:\n",
185
+ " print(f\"Error: 'events.csv' not found in '{data_folder}'. Please check the path.\")\n",
186
+ " return None, None, None\n",
187
+ "\n",
188
+ " # --- Split Data ---\n",
189
+ " sorted_df = events_df.sort_values('timestamp_dt').reset_index(drop=True)\n",
190
+ " print(f\"Splitting data: {test_days} days for test, {val_days} for validation.\")\n",
191
+ " end_time = sorted_df['timestamp_dt'].max()\n",
192
+ " test_start_time = end_time - timedelta(days=test_days)\n",
193
+ " val_start_time = test_start_time - timedelta(days=val_days)\n",
194
+ "\n",
195
+ " test_df = sorted_df[sorted_df['timestamp_dt'] >= test_start_time]\n",
196
+ " val_df = sorted_df[(sorted_df['timestamp_dt'] >= val_start_time) & (sorted_df['timestamp_dt'] < test_start_time)]\n",
197
+ " train_df = sorted_df[sorted_df['timestamp_dt'] < val_start_time]\n",
198
+ "\n",
199
+ " print(\"--- Data Splitting Summary ---\")\n",
200
+ " print(f\"Training set: {train_df.shape[0]:>8} records | from {train_df['timestamp_dt'].min()} to {train_df['timestamp_dt'].max()}\")\n",
201
+ " print(f\"Validation set: {val_df.shape[0]:>8} records | from {val_df['timestamp_dt'].min()} to {val_df['timestamp_dt'].max()}\")\n",
202
+ " print(f\"Test set: {test_df.shape[0]:>8} records | from {test_df['timestamp_dt'].min()} to {test_df['timestamp_dt'].max()}\")\n",
203
+ " print(\"------------------------------\")\n",
204
+ " \n",
205
+ " return train_df, val_df, test_df\n"
206
+ ]
207
+ },
208
+ {
209
+ "cell_type": "code",
210
+ "execution_count": null,
211
+ "id": "f99e4498",
212
+ "metadata": {},
213
+ "outputs": [],
214
+ "source": [
215
+ "DATA_PATH = \"data/\"  # trailing slash required: prepare_data concatenates 'events.csv'\n",
216
+ "\n",
217
+ "train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)"
218
+ ]
219
+ },
220
+ {
221
+ "cell_type": "code",
222
+ "execution_count": null,
223
+ "id": "96b137cb",
224
+ "metadata": {},
225
+ "outputs": [],
226
+ "source": [
227
+ "class SASRecDataset(Dataset):\n",
228
+ " \"\"\"\n",
229
+ " SASRec Dataset.\n",
230
+ " - Precomputes (sequence_id, cutoff_idx) pairs for O(1) __getitem__.\n",
231
+ " - Supports 'last' or 'all' target modes.\n",
232
+ " \"\"\"\n",
233
+ " def __init__(self, sequences, max_len, target_mode=\"last\"):\n",
234
+ " \"\"\"\n",
235
+ " Args:\n",
236
+ " sequences: list of user sequences (list of item IDs).\n",
237
+ " max_len: maximum sequence length (padding applied).\n",
238
+ " target_mode: 'last' (only last prediction) or 'all' (predict at every step).\n",
239
+ " \"\"\"\n",
240
+ " self.sequences = sequences\n",
241
+ " self.max_len = max_len\n",
242
+ " self.target_mode = target_mode\n",
243
+ "\n",
244
+ " # Build index once\n",
245
+ " self.index = []\n",
246
+ " for seq_id, seq in enumerate(sequences):\n",
247
+ " for i in range(1, len(seq)):\n",
248
+ " self.index.append((seq_id, i))\n",
249
+ "\n",
250
+ " def __len__(self):\n",
251
+ " return len(self.index)\n",
252
+ "\n",
253
+ " def __getitem__(self, idx):\n",
254
+ " seq_id, cutoff = self.index[idx]\n",
255
+ " seq = self.sequences[seq_id][:cutoff]\n",
256
+ "\n",
257
+ " # Truncate & pad\n",
258
+ " seq = seq[-self.max_len:]\n",
259
+ " pad_len = self.max_len - len(seq)\n",
260
+ "\n",
261
+ " input_seq = np.zeros(self.max_len, dtype=np.int64)\n",
262
+ " input_seq[pad_len:] = seq\n",
263
+ "\n",
264
+ " if self.target_mode == \"last\":\n",
265
+ " target = self.sequences[seq_id][cutoff]\n",
266
+ " return torch.LongTensor(input_seq), torch.LongTensor([target])\n",
267
+ "\n",
268
+ " elif self.target_mode == \"all\":\n",
269
+ " # Predict next item at each step\n",
270
+ " target_seq = self.sequences[seq_id][1:cutoff+1]\n",
271
+ " target_seq = target_seq[-self.max_len:]\n",
272
+ " target = np.zeros(self.max_len, dtype=np.int64)\n",
273
+ " target[-len(target_seq):] = target_seq\n",
274
+ " return torch.LongTensor(input_seq), torch.LongTensor(target)\n",
275
+ "\n",
276
+ "class SASRecDataModule(pl.LightningDataModule):\n",
277
+ " \"\"\"\n",
278
+ " PyTorch Lightning DataModule for preparing the RetailRocket dataset for the SASRec model.\n",
279
+ "\n",
280
+ " This class handles all aspects of data preparation, including:\n",
281
+ " - Filtering out infrequent users and items to reduce noise.\n",
282
+ " - Building a consistent item vocabulary.\n",
283
+ " - Converting user event histories into sequential data.\n",
284
+ " - Creating and providing `DataLoader` instances for training, validation, and testing.\n",
285
+ " \"\"\"\n",
286
+ " def __init__(self, train_df, val_df, test_df, min_item_interactions=5, \n",
287
+ " min_user_interactions=5, max_len=50, batch_size=256):\n",
288
+ " \"\"\"\n",
289
+ " Initializes the DataModule.\n",
290
+ "\n",
291
+ " Args:\n",
292
+ " train_df (pd.DataFrame): DataFrame for training.\n",
293
+ " val_df (pd.DataFrame): DataFrame for validation.\n",
294
+ " test_df (pd.DataFrame): DataFrame for testing.\n",
295
+ " min_item_interactions (int): Minimum number of interactions for an item to be kept.\n",
296
+ " min_user_interactions (int): Minimum number of interactions for a user to be kept.\n",
297
+ " max_len (int): The maximum length of a user sequence fed to the model.\n",
298
+ " batch_size (int): The batch size for the DataLoaders.\n",
299
+ " \"\"\"\n",
300
+ " super().__init__()\n",
301
+ " self.train_df = train_df\n",
302
+ " self.val_df = val_df\n",
303
+ " self.test_df = test_df\n",
304
+ " self.min_item_interactions = min_item_interactions\n",
305
+ " self.min_user_interactions = min_user_interactions\n",
306
+ " self.max_len = max_len\n",
307
+ " self.batch_size = batch_size\n",
308
+ "\n",
309
+ " self.item_map = None\n",
310
+ " self.inverse_item_map = None\n",
311
+ " self.vocab_size = 0\n",
312
+ " self.user_history = None\n",
313
+ "\n",
314
+ " def setup(self, stage=None):\n",
315
+ " \"\"\"\n",
316
+ " Prepares the data for training, validation, and testing.\n",
317
+ "\n",
318
+ " This method is called automatically by PyTorch Lightning. It performs the following steps:\n",
319
+ " 1. Determines filtering criteria (which users and items to keep) based on the training set only\n",
320
+ " to prevent data leakage.\n",
321
+ " 2. Applies these filters to the train, validation, and test sets.\n",
322
+ " 3. Builds an item vocabulary (mapping item IDs to integer indices) from the combined\n",
323
+ " training and validation sets to ensure consistency for model checkpointing.\n",
324
+ " 4. Converts the event logs into sequences of item indices for each user in each data split.\n",
325
+ " \"\"\"\n",
326
+ " item_counts = self.train_df['itemid'].value_counts()\n",
327
+ " user_counts = self.train_df['visitorid'].value_counts()\n",
328
+ " items_to_keep = item_counts[item_counts >= self.min_item_interactions].index\n",
329
+ " users_to_keep = user_counts[user_counts >= self.min_user_interactions].index\n",
330
+ "\n",
331
+ " self.filtered_train_df = self.train_df[\n",
332
+ " (self.train_df['itemid'].isin(items_to_keep)) & \n",
333
+ " (self.train_df['visitorid'].isin(users_to_keep))\n",
334
+ " ].copy()\n",
335
+ " self.filtered_val_df = self.val_df[\n",
336
+ " (self.val_df['itemid'].isin(items_to_keep)) & \n",
337
+ " (self.val_df['visitorid'].isin(users_to_keep))\n",
338
+ " ].copy()\n",
339
+ " self.filtered_test_df = self.test_df[\n",
340
+ " (self.test_df['itemid'].isin(items_to_keep)) & \n",
341
+ " (self.test_df['visitorid'].isin(users_to_keep))\n",
342
+ " ].copy()\n",
343
+ "\n",
344
+ " all_known_items_df = pd.concat([self.filtered_train_df, self.filtered_val_df])\n",
345
+ " unique_items = all_known_items_df['itemid'].unique()\n",
346
+ " self.item_map = {item_id: i + 1 for i, item_id in enumerate(unique_items)}\n",
347
+ " self.inverse_item_map = {i: item_id for item_id, i in self.item_map.items()}\n",
348
+ " self.vocab_size = len(self.item_map) + 1 # +1 for padding token 0\n",
349
+ "\n",
350
+ " self.user_history = self.filtered_train_df.groupby('visitorid')['itemid'].apply(list)\n",
351
+ " \n",
352
+ " self.train_sequences = self._create_sequences(self.filtered_train_df)\n",
353
+ " self.val_sequences = self._create_sequences(self.filtered_val_df)\n",
354
+ " self.test_sequences = self._create_sequences(self.filtered_test_df)\n",
355
+ "\n",
356
+ " def _create_sequences(self, df):\n",
357
+ " \"\"\"\n",
358
+ " Helper function to convert a DataFrame of events into user interaction sequences.\n",
359
+ " \n",
360
+ " Args:\n",
361
+ " df (pd.DataFrame): The input DataFrame to process.\n",
362
+ "\n",
363
+ " Returns:\n",
364
+ " list[list[int]]: A list of user sequences, where each sequence is a list of item indices.\n",
365
+ " \"\"\"\n",
366
+ " df_sorted = df.sort_values(['visitorid', 'timestamp_dt'])\n",
367
+ " sequences = df_sorted.groupby('visitorid')['itemid'].apply(\n",
368
+ " lambda x: [self.item_map[i] for i in x if i in self.item_map]\n",
369
+ " ).tolist()\n",
370
+ " return [s for s in sequences if len(s) > 1]\n",
371
+ "\n",
372
+ " def train_dataloader(self):\n",
373
+ " \"\"\"Creates the DataLoader for the training set.\"\"\"\n",
374
+ " dataset = SASRecDataset(self.train_sequences, self.max_len)\n",
375
+ " return DataLoader(dataset, batch_size=self.batch_size, shuffle=True, num_workers=0)\n",
376
+ "\n",
377
+ " def val_dataloader(self):\n",
378
+ " \"\"\"Creates the DataLoader for the validation set.\"\"\"\n",
379
+ " dataset = SASRecDataset(self.val_sequences, self.max_len)\n",
380
+ " return DataLoader(dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)\n",
381
+ " \n",
382
+ " def test_dataloader(self):\n",
383
+ " \"\"\"Creates the DataLoader for the test set.\"\"\"\n",
384
+ " dataset = SASRecDataset(self.test_sequences, self.max_len)\n",
385
+ " return DataLoader(dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)"
386
+ ]
387
+ },
388
+ {
389
+ "cell_type": "code",
390
+ "execution_count": null,
391
+ "id": "56bdc81c",
392
+ "metadata": {},
393
+ "outputs": [],
394
+ "source": [
395
+ "BATCH_SIZE = 256 \n",
396
+ "MAX_TOKEN_LEN = 50 # 50–100 is standard for SASRec\n",
397
+ "\n",
398
+ "# --- 1. Prepare the data into train, validation, and test sets ---\n",
399
+ "train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)\n",
400
+ "\n",
401
+ "# --- 2. Initialize DataModule ---\n",
402
+ "print(\"Initializing DataModule...\")\n",
403
+ "datamodule = SASRecDataModule(\n",
404
+ " train_df=train_set,\n",
405
+ " val_df=validation_set,\n",
406
+ " test_df=test_set,\n",
407
+ " batch_size=BATCH_SIZE,\n",
408
+ " max_len=MAX_TOKEN_LEN\n",
409
+ ")\n",
410
+ "datamodule.setup()"
411
+ ]
412
+ },
413
+ {
414
+ "cell_type": "markdown",
415
+ "id": "0529207a",
416
+ "metadata": {},
417
+ "source": [
418
+ "## Define, train, and evaluate the baseline models"
419
+ ]
420
+ },
421
+ {
422
+ "cell_type": "code",
423
+ "execution_count": null,
424
+ "id": "8d899a5a",
425
+ "metadata": {},
426
+ "outputs": [],
427
+ "source": [
428
+ "import pandas as pd\n",
429
+ "from datetime import datetime, timedelta\n",
430
+ "import numpy as np\n",
431
+ "from scipy.sparse import csr_matrix\n",
432
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
433
+ "import implicit\n",
434
+ "import torch\n",
435
+ "import torch.nn as nn\n",
436
+ "from torch.utils.data import Dataset, DataLoader\n",
437
+ "import pytorch_lightning as pl\n",
438
+ "\n",
439
+ "# --- 1. Evaluation Helper Functions ---\n",
440
+ "\n",
441
+ "def prepare_ground_truth(df, mode=\"purchase\", event_weights=None):\n",
442
+ " \"\"\"\n",
443
+ " Prepares ground truth dictionaries for evaluation.\n",
444
+ "\n",
445
+ " Parameters\n",
446
+ " ----------\n",
447
+ " df : pd.DataFrame\n",
448
+ " Test dataframe containing at least ['visitorid', 'itemid', 'event'].\n",
449
+ " mode : str, default=\"purchase\"\n",
450
+ " - \"purchase\" : Only use transactions as ground truth.\n",
451
+ " - \"all\" : Use all events. Optionally weight them.\n",
452
+ " event_weights : dict, optional\n",
453
+ " Example: {\"view\": 1, \"addtocart\": 3, \"transaction\": 5}.\n",
454
+ " Used only if mode == \"all\".\n",
455
+ "\n",
456
+ " Returns\n",
457
+ " -------\n",
458
+ " dict : {user_id: set of item_ids}\n",
459
+ " \"\"\"\n",
460
+ " if mode == \"purchase\":\n",
461
+ " df_filtered = df[df[\"event\"] == \"transaction\"]\n",
462
+ " ground_truth = df_filtered.groupby(\"visitorid\")[\"itemid\"].apply(set).to_dict()\n",
463
+ "\n",
464
+ " elif mode == \"all\":\n",
465
+ " if event_weights is None:\n",
466
+ " # Default: treat all events equally\n",
467
+ " ground_truth = df.groupby(\"visitorid\")[\"itemid\"].apply(set).to_dict()\n",
468
+ " else:\n",
469
+ " # Weighted ground truth (for more advanced eval)\n",
470
+ " ground_truth = {}\n",
471
+ " for uid, user_df in df.groupby(\"visitorid\"):\n",
472
+ " weighted_items = []\n",
473
+ " for _, row in user_df.iterrows():\n",
474
+ " weight = event_weights.get(row[\"event\"], 1)\n",
475
+ " weighted_items.extend([row[\"itemid\"]] * weight)\n",
476
+ " ground_truth[uid] = set(weighted_items)\n",
477
+ " else:\n",
478
+ " raise ValueError(\"mode must be 'purchase' or 'all'\")\n",
479
+ "\n",
480
+ " return ground_truth\n",
481
+ "\n",
482
+ "def calculate_metrics(recommendations_dict, ground_truth_dict, k):\n",
483
+ " \"\"\"\n",
484
+ " Calculates Precision@k, Recall@k, and HitRate@k.\n",
485
+ "\n",
486
+ "    Parameters\n",
487
+ " ----------\n",
488
+ " recommendations_dict : {user_id: [recommended_item_ids]}\n",
489
+ " ground_truth_dict : {user_id: set of ground truth item_ids}\n",
490
+ " k : int\n",
491
+ "\n",
492
+ " Returns\n",
493
+ " -------\n",
494
+ " dict with mean precision, recall, and hit rate\n",
495
+ " \"\"\"\n",
496
+ " all_precisions, all_recalls, all_hits = [], [], []\n",
497
+ "\n",
498
+ " for user_id, true_items in ground_truth_dict.items():\n",
499
+ " recs = recommendations_dict.get(user_id, [])[:k]\n",
500
+ " if not true_items:\n",
501
+ " continue\n",
502
+ " hits = len(set(recs) & true_items)\n",
503
+ "\n",
504
+ " precision = hits / k if k > 0 else 0\n",
505
+ " recall = hits / len(true_items)\n",
506
+ " hit_rate = 1.0 if hits > 0 else 0.0\n",
507
+ "\n",
508
+ " all_precisions.append(precision)\n",
509
+ " all_recalls.append(recall)\n",
510
+ " all_hits.append(hit_rate)\n",
511
+ "\n",
512
+ " if not all_precisions:\n",
513
+ " return {\"mean_precision@k\": 0, \"mean_recall@k\": 0, \"mean_hitrate@k\": 0}\n",
514
+ "\n",
515
+ " return {\n",
516
+ " \"mean_precision@k\": np.mean(all_precisions),\n",
517
+ " \"mean_recall@k\": np.mean(all_recalls),\n",
518
+ " \"mean_hitrate@k\": np.mean(all_hits)\n",
519
+ " }\n",
520
+ "\n",
521
+ "# --- 2. Model Functions (Popularity, Item-Item, ALS) ---\n",
522
+ "\n",
523
+ "def recommend_popular_items_and_evaluate(train_df, test_df, k=10, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):\n",
524
+ " \"\"\"\n",
525
+ " Trains a non-personalized Popularity model and evaluates its performance.\n",
526
+ "\n",
527
+ " This model recommends the top-k most frequently transacted items from the training\n",
528
+ " set to every user. It serves as a simple but strong baseline.\n",
529
+ "\n",
530
+ " Args:\n",
531
+ " train_df (pd.DataFrame): The training dataset.\n",
532
+ " test_df (pd.DataFrame): The test dataset for evaluation.\n",
533
+ " k (int): The number of items to recommend.\n",
534
+ " prepare_ground_truth (function): A function to process the test_df into a ground truth dict.\n",
535
+ " calculate_metrics (function): A function to compute ranking metrics.\n",
536
+ "\n",
537
+ " Returns:\n",
538
+ " dict: A dictionary containing the calculated evaluation metrics (e.g., precision, recall).\n",
539
+ " \"\"\"\n",
540
+ " print(f\"\\n--- Evaluating Popularity Model (Top {k} items) ---\")\n",
541
+ " \n",
542
+ " # 1. \"Train\" the model by finding the most popular items based on transactions\n",
543
+ " purchase_counts = train_df[train_df['event'] == 'transaction']['itemid'].value_counts()\n",
544
+ " popular_items = purchase_counts.head(k).index.tolist()\n",
545
+ " print(f\"Top {k} popular items identified from training data.\")\n",
546
+ "\n",
547
+ " # 2. Evaluate the model\n",
548
+ " ground_truth = prepare_ground_truth(test_df)\n",
549
+ " # Every user receives the same list of popular items\n",
550
+ " recommendations = {user_id: popular_items for user_id in ground_truth.keys()}\n",
551
+ " \n",
552
+ " metrics = calculate_metrics(recommendations, ground_truth, k)\n",
553
+ " print(\"Evaluation complete.\")\n",
554
+ " return metrics\n",
555
+ "\n",
556
+ "def recommend_item_item_and_evaluate(train_df, test_df, k=10, min_item_interactions=5, min_user_interactions=5, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):\n",
557
+ " \"\"\"\n",
558
+ " Trains an Item-Item Collaborative Filtering model and evaluates its performance.\n",
559
+ "\n",
560
+ " This model recommends items that are similar to items a user has interacted\n",
561
+ " with in the past, based on co-occurrence patterns in the training data.\n",
562
+ "\n",
563
+ " Args:\n",
564
+ " train_df (pd.DataFrame): The training dataset.\n",
565
+ " test_df (pd.DataFrame): The test dataset for evaluation.\n",
566
+ " k (int): The number of items to recommend.\n",
567
+ " min_item_interactions (int): Minimum number of interactions for an item to be kept.\n",
568
+ " min_user_interactions (int): Minimum number of interactions for a user to be kept.\n",
569
+ " prepare_ground_truth (function): A function to process the test_df into a ground truth dict.\n",
570
+ " calculate_metrics (function): A function to compute ranking metrics.\n",
571
+ "\n",
572
+ " Returns:\n",
573
+ " dict: A dictionary containing the calculated evaluation metrics.\n",
574
+ " \"\"\"\n",
575
+ " print(f\"\\n--- Evaluating Item-Item CF Model (Top {k} items) ---\")\n",
576
+ " \n",
577
+ " # 1. Filter out infrequent users and items to reduce noise and computation\n",
578
+ " item_counts = train_df['itemid'].value_counts()\n",
579
+ " user_counts = train_df['visitorid'].value_counts()\n",
580
+ " items_to_keep = item_counts[item_counts >= min_item_interactions].index\n",
581
+ " users_to_keep = user_counts[user_counts >= min_user_interactions].index\n",
582
+ " filtered_df = train_df[(train_df['itemid'].isin(items_to_keep)) & (train_df['visitorid'].isin(users_to_keep))].copy()\n",
583
+ " print(f\"Filtered training data from {len(train_df)} to {len(filtered_df)} records.\")\n",
584
+ "\n",
585
+ " # 2. Create user-item interaction matrix and vocabulary mappings\n",
586
+ " user_map = {uid: i for i, uid in enumerate(filtered_df['visitorid'].unique())}\n",
587
+ " item_map = {iid: i for i, iid in enumerate(filtered_df['itemid'].unique())}\n",
588
+ " inverse_item_map = {i: iid for iid, i in item_map.items()}\n",
589
+ " user_indices = filtered_df['visitorid'].map(user_map)\n",
590
+ " item_indices = filtered_df['itemid'].map(item_map)\n",
591
+ " user_item_matrix = csr_matrix((np.ones(len(filtered_df)), (user_indices, item_indices)))\n",
592
+ "\n",
593
+ " # 3. Calculate the cosine similarity matrix between all items\n",
594
+ " print(\"Calculating item similarity matrix...\")\n",
595
+ " item_similarity_matrix = cosine_similarity(user_item_matrix.T, dense_output=False)\n",
596
+ " print(\"Similarity matrix calculated.\")\n",
597
+ "\n",
598
+ " # 4. Generate recommendations and evaluate\n",
599
+ " ground_truth = prepare_ground_truth(test_df)\n",
600
+ " recommendations = {}\n",
601
+ " print(\"Generating recommendations for users in test set...\")\n",
602
+ " test_users = [u for u in ground_truth.keys() if u in user_map]\n",
603
+ " \n",
604
+ " for user_id in test_users:\n",
605
+ " user_index = user_map[user_id]\n",
606
+ " user_interactions_indices = user_item_matrix[user_index].indices\n",
607
+ " \n",
608
+ " if len(user_interactions_indices) > 0:\n",
609
+ " # Aggregate scores from items the user has interacted with\n",
610
+ " all_scores = np.asarray(item_similarity_matrix[user_interactions_indices].sum(axis=0)).flatten()\n",
611
+ " # Remove already interacted items from recommendations\n",
612
+ " all_scores[user_interactions_indices] = -1\n",
613
+ " top_indices = np.argsort(all_scores)[::-1][:k]\n",
614
+ " recs = [inverse_item_map[idx] for idx in top_indices if idx in inverse_item_map]\n",
615
+ " recommendations[user_id] = recs\n",
616
+ " \n",
617
+ " metrics = calculate_metrics(recommendations, ground_truth, k)\n",
618
+ " print(\"Evaluation complete.\")\n",
619
+ " return metrics\n",
620
+ "\n",
621
+ "def recommend_als_and_evaluate(train_df, test_df, k=10, min_item_interactions=5, min_user_interactions=5, \n",
622
+ "                               factors=25, regularization=0.02, iterations=48, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):\n",
623
+ " \"\"\"\n",
624
+ " Trains an Alternating Least Squares (ALS) model and evaluates its performance.\n",
625
+ "\n",
626
+ " This model uses matrix factorization to learn latent embeddings for users and\n",
627
+ " items from implicit feedback data. Default hyperparameters are set from a\n",
628
+ " previous Optuna tuning process.\n",
629
+ "\n",
630
+ " Args:\n",
631
+ " train_df (pd.DataFrame): The training dataset.\n",
632
+ " test_df (pd.DataFrame): The test dataset for evaluation.\n",
633
+ " k (int): The number of items to recommend.\n",
634
+ " min_item_interactions (int): Minimum number of interactions for an item to be kept.\n",
635
+ " min_user_interactions (int): Minimum number of interactions for a user to be kept.\n",
636
+ " factors (int): The number of latent factors to compute.\n",
637
+ " regularization (float): The regularization factor.\n",
638
+ " iterations (int): The number of ALS iterations to run.\n",
639
+ " prepare_ground_truth (function): A function to process the test_df into a ground truth dict.\n",
640
+ " calculate_metrics (function): A function to compute ranking metrics.\n",
641
+ "\n",
642
+ " Returns:\n",
643
+ " dict: A dictionary containing the calculated evaluation metrics.\n",
644
+ " \"\"\"\n",
645
+ " print(f\"\\n--- Evaluating ALS Model (Top {k} items) ---\")\n",
646
+ " \n",
647
+ " # 1. Filter data\n",
648
+ " item_counts = train_df['itemid'].value_counts()\n",
649
+ " user_counts = train_df['visitorid'].value_counts()\n",
650
+ " items_to_keep = item_counts[item_counts >= min_item_interactions].index\n",
651
+ " users_to_keep = user_counts[user_counts >= min_user_interactions].index\n",
652
+ " filtered_df = train_df[(train_df['itemid'].isin(items_to_keep)) & (train_df['visitorid'].isin(users_to_keep))].copy()\n",
653
+ " print(f\"Filtered training data from {len(train_df)} to {len(filtered_df)} records.\")\n",
654
+ "\n",
655
+ " # 2. Create mappings and confidence matrix\n",
656
+ " user_map = {uid: i for i, uid in enumerate(filtered_df['visitorid'].unique())}\n",
657
+ " item_map = {iid: i for i, iid in enumerate(filtered_df['itemid'].unique())}\n",
658
+ " inverse_item_map = {i: iid for iid, i in item_map.items()}\n",
659
+ " user_indices = filtered_df['visitorid'].map(user_map).astype(np.int32)\n",
660
+ " item_indices = filtered_df['itemid'].map(item_map).astype(np.int32)\n",
661
+ " \n",
662
+ " event_weights = {'view': 1, 'addtocart': 3, 'transaction': 5}\n",
663
+ " confidence = filtered_df['event'].map(event_weights).astype(np.float32)\n",
664
+ " user_item_matrix = csr_matrix((confidence, (user_indices, item_indices)))\n",
665
+ "\n",
666
+ " # 3. Train the ALS model\n",
667
+ " print(\"Training ALS model...\")\n",
668
+ " als_model = implicit.als.AlternatingLeastSquares(factors=factors, regularization=regularization, iterations=iterations)\n",
669
+ " als_model.fit(user_item_matrix)\n",
670
+ " print(\"ALS model trained.\")\n",
671
+ "\n",
672
+ " # 4. Generate recommendations and evaluate\n",
673
+ " ground_truth = prepare_ground_truth(test_df)\n",
674
+ " recommendations = {}\n",
675
+ " print(\"Generating recommendations for users in test set...\")\n",
676
+ " test_users_indices = [user_map[u] for u in ground_truth.keys() if u in user_map]\n",
677
+ " \n",
678
+ " if test_users_indices:\n",
679
+ " user_item_matrix_for_recs = user_item_matrix[test_users_indices]\n",
680
+ " ids, _ = als_model.recommend(test_users_indices, user_item_matrix_for_recs, N=k)\n",
681
+ " \n",
682
+ "        inverse_user_map = {i: uid for uid, i in user_map.items()}\n",
+ "        for i, user_index in enumerate(test_users_indices):\n",
+ "            # O(1) lookup instead of scanning user_map for every user\n",
+ "            original_user_id = inverse_user_map[user_index]\n",
+ "            recs = [inverse_item_map[item_idx] for item_idx in ids[i] if item_idx in inverse_item_map]\n",
+ "            recommendations[original_user_id] = recs\n",
686
+ " \n",
687
+ " metrics = calculate_metrics(recommendations, ground_truth, k)\n",
688
+ " print(\"Evaluation complete.\")\n",
689
+ "    return metrics\n"
712
+ ]
713
+ },
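As a sanity check on the metric definitions above, here is a self-contained re-implementation of the `calculate_metrics` helper run on a toy example (the users and items are invented for illustration, not taken from the RetailRocket data):

```python
import numpy as np

def calculate_metrics(recommendations_dict, ground_truth_dict, k):
    # Mirrors the notebook's helper: Precision@k, Recall@k, HitRate@k averaged over users
    precisions, recalls, hit_rates = [], [], []
    for user_id, true_items in ground_truth_dict.items():
        recs = recommendations_dict.get(user_id, [])[:k]
        if not true_items:
            continue
        hits = len(set(recs) & true_items)
        precisions.append(hits / k)
        recalls.append(hits / len(true_items))
        hit_rates.append(1.0 if hits > 0 else 0.0)
    return {
        "mean_precision@k": float(np.mean(precisions)),
        "mean_recall@k": float(np.mean(recalls)),
        "mean_hitrate@k": float(np.mean(hit_rates)),
    }

# Two users: u1 gets one hit in the top-2, u2 gets none
recs = {"u1": [10, 20], "u2": [30, 40]}
truth = {"u1": {20, 99}, "u2": {77}}
m = calculate_metrics(recs, truth, k=2)
print(m)  # precision = (0.5 + 0) / 2 = 0.25, hitrate = 0.5
```

Note that a hit counts once per user for the hit rate, while precision and recall are averaged per-user before taking the mean.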
714
+ {
715
+ "cell_type": "code",
716
+ "execution_count": null,
717
+ "id": "b978c458",
718
+ "metadata": {},
719
+ "outputs": [],
720
+ "source": [
721
+ "train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)\n",
722
+ "if train_set is not None:\n",
723
+ " results = {}\n",
724
+ " full_train_set = pd.concat([train_set, validation_set])\n",
725
+ " \n",
726
+ " # Evaluate base models\n",
727
+ " print(\"\\n>>> Running evaluations on the VALIDATION set <<<\")\n",
728
+ " results['Popularity (Validation)'] = recommend_popular_items_and_evaluate(train_set, validation_set)\n",
729
+ " results['Item-Item CF (Validation)'] = recommend_item_item_and_evaluate(train_set, validation_set)\n",
730
+ " results['ALS (Validation)'] = recommend_als_and_evaluate(train_set, validation_set)\n",
731
+ " \n",
732
+ " print(\"\\n>>> Running final evaluations on the TEST set <<<\")\n",
733
+ " results['Popularity (Test)'] = recommend_popular_items_and_evaluate(full_train_set, test_set)\n",
734
+ " results['Item-Item CF (Test)'] = recommend_item_item_and_evaluate(full_train_set, test_set)\n",
735
+ " results['ALS (Test)'] = recommend_als_and_evaluate(full_train_set, test_set)\n",
736
+ " \n",
737
+ " print(\"\\n--- Final Evaluation Results ---\")\n",
738
+ " results_df = pd.DataFrame.from_dict(results, orient='index')\n",
739
+ " print(results_df)\n",
740
+ " print(\"--------------------------------\")"
741
+ ]
742
+ },
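The core of the item-item scoring step (sum the similarity rows of the user's history, mask already-seen items, take the top-k) can be sketched with plain NumPy on a toy matrix — an illustration with invented data, not the notebook's sparse pipeline:

```python
import numpy as np

# Toy user-item matrix: 3 users x 4 items (implicit 0/1 interactions)
ui = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
], dtype=float)

# Item-item cosine similarity (dense equivalent of cosine_similarity(ui.T))
norms = np.linalg.norm(ui, axis=0)
sim = (ui.T @ ui) / np.outer(norms, norms)

user = 0
seen = np.flatnonzero(ui[user])      # items the user already interacted with
scores = sim[seen].sum(axis=0)       # aggregate similarity to the user's history
scores[seen] = -1                    # exclude already-seen items from the ranking
top_k = np.argsort(scores)[::-1][:2]
print(top_k.tolist())
```

For user 0 (who interacted with items 0 and 1), the only candidates left after masking are items 2 and 3.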
743
+ {
744
+ "cell_type": "markdown",
745
+ "id": "85d8f78c",
746
+ "metadata": {},
747
+ "source": [
748
+ "## Use Optuna to find the best hyperparameters for the ALS model"
749
+ ]
750
+ },
751
+ {
752
+ "cell_type": "code",
753
+ "execution_count": null,
754
+ "id": "202be1f4",
755
+ "metadata": {},
756
+ "outputs": [],
757
+ "source": [
758
+ "import optuna\n",
759
+ "\n",
760
+ "def objective_als(trial, train_df, val_df):\n",
761
+ " \"\"\"\n",
762
+ " The objective function for Optuna to optimize.\n",
763
+ " \"\"\"\n",
764
+ " # 1. Define the hyperparameter search space\n",
765
+ " params = {\n",
766
+ " 'factors': trial.suggest_int('factors', 20, 200),\n",
767
+ " 'regularization': trial.suggest_float('regularization', 1e-3, 1e-1, log=True),\n",
768
+ " 'iterations': trial.suggest_int('iterations', 10, 50)\n",
769
+ " }\n",
770
+ " \n",
771
+ " # 2. Run an evaluation with the suggested parameters\n",
772
+ " metrics = recommend_als_and_evaluate(train_df, val_df, **params)\n",
773
+ " \n",
774
+ " # 3. Return the metric we want to maximize (precision)\n",
775
+ " return metrics['mean_precision@k']\n",
776
+ "\n",
777
+ "def tune_als_hyperparameters(train_df, val_df, n_trials=25):\n",
778
+ " \"\"\"\n",
779
+ " Orchestrates the Optuna study to find the best hyperparameters for ALS.\n",
780
+ " \"\"\"\n",
781
+ " study = optuna.create_study(direction='maximize')\n",
782
+ " study.optimize(lambda trial: objective_als(trial, train_df, val_df), n_trials=n_trials)\n",
783
+ " \n",
784
+ " print(\"\\n--- Optuna Study Complete ---\")\n",
785
+ " print(f\"Number of finished trials: {len(study.trials)}\")\n",
786
+ " print(\"Best trial:\")\n",
787
+ " trial = study.best_trial\n",
788
+ " print(f\" Value (Precision@10): {trial.value}\")\n",
789
+ " print(\" Params: \")\n",
790
+ " for key, value in trial.params.items():\n",
791
+ " print(f\" {key}: {value}\")\n",
792
+ " \n",
793
+ " return trial.params\n"
794
+ ]
795
+ },
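For intuition, the loop Optuna automates is similar in spirit to the random search below — a stdlib-only sketch where the synthetic `objective` merely stands in for `recommend_als_and_evaluate`, and the parameter ranges mirror the search space defined above:

```python
import random

def objective(params):
    # Synthetic stand-in for the real evaluation: peaks near factors=100, reg=0.01
    return -((params["factors"] - 100) ** 2) / 1e4 - abs(params["regularization"] - 0.01)

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "factors": rng.randint(20, 200),
            "regularization": 10 ** rng.uniform(-3, -1),  # log-uniform, like suggest_float(log=True)
        }
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value

params, value = random_search()
print(params, value)
```

Optuna's TPE sampler improves on this by modeling which regions of the space look promising, but the interface (suggest parameters, score them, keep the best trial) is the same.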
796
+ {
797
+ "cell_type": "code",
798
+ "execution_count": null,
799
+ "id": "18d48b2e",
800
+ "metadata": {},
801
+ "outputs": [],
802
+ "source": [
803
+ "# 1. Prepare all data\n",
804
+ "train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)\n",
805
+ "\n",
806
+ "\n",
807
+ "# --- Hyperparameter Tuning Step ---\n",
808
+ "print(\"\\n>>> 1. TUNING ALS Hyperparameters on the VALIDATION set <<<\")\n",
809
+ "# You can increase n_trials for a more thorough search, e.g., to 50 or 100\n",
810
+ "best_als_params = tune_als_hyperparameters(train_set, validation_set, n_trials=25) \n"
811
+ ]
812
+ },
813
+ {
814
+ "cell_type": "markdown",
815
+ "id": "d9bc9ef8",
816
+ "metadata": {},
817
+ "source": [
818
+ "## Define, train, and evaluate the SASRec model"
819
+ ]
820
+ },
821
+ {
822
+ "cell_type": "code",
823
+ "execution_count": null,
824
+ "id": "4a90d635",
825
+ "metadata": {},
826
+ "outputs": [],
827
+ "source": [
828
+ "import math\n",
+ "\n",
+ "class SASRec(pl.LightningModule):\n",
829
+ " \"\"\"\n",
830
+ " A PyTorch Lightning implementation of the SASRec model for sequential recommendation.\n",
831
+ "\n",
832
+ " SASRec (Self-Attentive Sequential Recommendation) uses a Transformer-based\n",
833
+ " architecture to capture the sequential patterns in a user's interaction history\n",
834
+ " to predict the next item they are likely to interact with.\n",
835
+ "\n",
836
+ " Attributes:\n",
837
+ " save_hyperparameters: Automatically saves all constructor arguments as hyperparameters.\n",
838
+ " item_embedding (nn.Embedding): Embedding layer for item IDs.\n",
839
+ " positional_embedding (nn.Embedding): Embedding layer to encode the position of items in a sequence.\n",
840
+ " transformer_encoder (nn.TransformerEncoder): The core self-attention module.\n",
841
+ " fc (nn.Linear): Final fully connected layer to produce logits over the item vocabulary.\n",
842
+ " loss_fn (nn.CrossEntropyLoss): The loss function used for training.\n",
843
+ " \"\"\"\n",
844
+ " def __init__(self, vocab_size, max_len, hidden_dim, num_heads, num_layers,\n",
845
+ " dropout=0.2, learning_rate=1e-3, weight_decay=1e-6, warmup_steps=2000, max_steps=100000):\n",
846
+ " \"\"\"\n",
847
+ " Initializes the SASRec model layers and hyperparameters.\n",
848
+ "\n",
849
+ " Args:\n",
850
+ " vocab_size (int): The total number of unique items in the dataset (+1 for padding).\n",
851
+ " max_len (int): The maximum length of the input sequences.\n",
852
+ " hidden_dim (int): The dimensionality of the item and positional embeddings.\n",
853
+ " num_heads (int): The number of attention heads in the Transformer encoder.\n",
854
+ " num_layers (int): The number of layers in the Transformer encoder.\n",
855
+ " dropout (float): The dropout rate to be applied.\n",
856
+ " learning_rate (float): The learning rate for the optimizer.\n",
857
+ " weight_decay (float): The weight decay (L2 penalty) for the optimizer.\n",
858
+ " warmup_steps (int): The number of linear warmup steps for the learning rate scheduler.\n",
859
+ " max_steps (int): The total number of training steps for the learning rate scheduler's decay phase.\n",
860
+ " \"\"\"\n",
861
+ " super().__init__()\n",
862
+ " # This saves all hyperparameters to self.hparams, making them accessible later\n",
863
+ " self.save_hyperparameters()\n",
864
+ "\n",
865
+ " # Embedding layers\n",
866
+ " self.item_embedding = nn.Embedding(vocab_size, hidden_dim, padding_idx=0)\n",
867
+ " self.positional_embedding = nn.Embedding(max_len, hidden_dim)\n",
868
+ " self.dropout = nn.Dropout(dropout)\n",
869
+ "\n",
870
+ " # Transformer Encoder\n",
871
+ " encoder_layer = nn.TransformerEncoderLayer(\n",
872
+ " d_model=hidden_dim, nhead=num_heads, dim_feedforward=hidden_dim * 4,\n",
873
+ " dropout=dropout, batch_first=True, activation='gelu'\n",
874
+ " )\n",
875
+ " self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)\n",
876
+ "\n",
877
+ " # Output layer\n",
878
+ " self.fc = nn.Linear(hidden_dim, vocab_size)\n",
879
+ "\n",
880
+ " # Loss function, ignoring the padding token\n",
881
+ " self.loss_fn = nn.CrossEntropyLoss(ignore_index=0)\n",
882
+ " \n",
883
+ " # Lists to store outputs from validation and test steps\n",
884
+ " self.validation_step_outputs = []\n",
885
+ " self.test_step_outputs = []\n",
886
+ "\n",
887
+ " def forward(self, x):\n",
888
+ " \"\"\"\n",
889
+ " Defines the forward pass of the model.\n",
890
+ "\n",
891
+ " Args:\n",
892
+ " x (torch.Tensor): A batch of input sequences of shape (batch_size, seq_len).\n",
893
+ "\n",
894
+ " Returns:\n",
895
+ " torch.Tensor: The output logits of shape (batch_size, seq_len, vocab_size).\n",
896
+ " \"\"\"\n",
897
+ " seq_len = x.size(1)\n",
898
+ " # Create positional indices (0, 1, 2, ..., seq_len-1)\n",
899
+ " positions = torch.arange(seq_len, device=self.device).unsqueeze(0)\n",
900
+ "\n",
901
+ " # Create a causal mask to ensure the model doesn't look ahead in the sequence\n",
902
+ " causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len, device=self.device)\n",
903
+ "\n",
904
+ " # Combine item and positional embeddings\n",
905
+ " x = self.item_embedding(x) + self.positional_embedding(positions)\n",
906
+ " x = self.dropout(x)\n",
907
+ " \n",
908
+ " # Pass through the Transformer encoder\n",
909
+ " x = self.transformer_encoder(x, mask=causal_mask)\n",
910
+ " \n",
911
+ " # Get final logits\n",
912
+ " logits = self.fc(x)\n",
913
+ " return logits\n",
914
+ "\n",
915
+ " def training_step(self, batch, batch_idx):\n",
916
+ " \"\"\"\n",
917
+ " Performs a single training step.\n",
918
+ "\n",
919
+ " Args:\n",
920
+ " batch (tuple): A tuple containing input sequences and target items.\n",
921
+ " batch_idx (int): The index of the current batch.\n",
922
+ "\n",
923
+ " Returns:\n",
924
+ " torch.Tensor: The calculated loss for the batch.\n",
925
+ " \"\"\"\n",
926
+ " inputs, targets = batch\n",
927
+ " logits = self.forward(inputs)\n",
928
+ "\n",
929
+ " # We only care about the prediction for the very last item in the input sequence\n",
930
+ " last_logits = logits[:, -1, :]\n",
931
+ " \n",
932
+ " # Calculate loss against the single target item\n",
933
+ "        # squeeze(-1): keep the batch dim intact even when batch_size == 1\n",
+ "        loss = self.loss_fn(last_logits, targets.squeeze(-1))\n",
934
+ " \n",
935
+ " self.log('train_loss', loss, prog_bar=True, on_step=True, on_epoch=True)\n",
936
+ " return loss\n",
937
+ "\n",
938
+ " def validation_step(self, batch, batch_idx):\n",
939
+ " \"\"\"\n",
940
+ " Performs a single validation step.\n",
941
+ " Calculates loss and stores predictions for metric computation at the end of the epoch.\n",
942
+ " \"\"\"\n",
943
+ " inputs, targets = batch\n",
944
+ " logits = self.forward(inputs)\n",
945
+ " last_item_logits = logits[:, -1, :]\n",
946
+ "        loss = self.loss_fn(last_item_logits, targets.squeeze(-1))\n",
947
+ " self.log('val_loss', loss, prog_bar=True, on_epoch=True)\n",
948
+ "\n",
949
+ " # Get top-10 predictions for metric calculation\n",
950
+ " top_k_preds = torch.topk(last_item_logits, 10, dim=-1).indices\n",
951
+ " self.validation_step_outputs.append({'preds': top_k_preds, 'targets': targets})\n",
952
+ " return loss\n",
953
+ "\n",
954
+ " def on_validation_epoch_end(self):\n",
955
+ " \"\"\"\n",
956
+ " Calculates and logs ranking metrics at the end of the validation epoch.\n",
957
+ " \"\"\"\n",
958
+ " if not self.validation_step_outputs: return\n",
959
+ "\n",
960
+ " # Concatenate all predictions and targets from the epoch\n",
961
+ " preds = torch.cat([x['preds'] for x in self.validation_step_outputs], dim=0)\n",
962
+ " targets = torch.cat([x['targets'] for x in self.validation_step_outputs], dim=0)\n",
963
+ "\n",
964
+ " k = preds.size(1)\n",
965
+ " # Check if the target is in the top-k predictions for each example\n",
966
+ " hits_tensor = (preds == targets).any(dim=1)\n",
967
+ " num_hits = hits_tensor.sum().item()\n",
968
+ " num_targets = len(targets)\n",
969
+ "\n",
970
+ " if num_targets > 0:\n",
971
+ " hit_rate = num_hits / num_targets\n",
972
+ " recall = hit_rate # For next-item prediction, recall@k is the same as hit_rate@k\n",
973
+ " precision = num_hits / (k * num_targets)\n",
974
+ " else:\n",
975
+ " hit_rate, recall, precision = 0.0, 0.0, 0.0\n",
976
+ "\n",
977
+ " self.log('val_hitrate@10', hit_rate, prog_bar=True)\n",
978
+ " self.log('val_precision@10', precision, prog_bar=True)\n",
979
+ " self.log('val_recall@10', recall, prog_bar=True)\n",
980
+ "\n",
981
+ " self.validation_step_outputs.clear() # Free up memory\n",
982
+ "\n",
983
+ " def test_step(self, batch, batch_idx):\n",
984
+ " \"\"\"\n",
985
+ " Performs a single test step.\n",
986
+ " Mirrors the logic of the validation_step.\n",
987
+ " \"\"\"\n",
988
+ " inputs, targets = batch\n",
989
+ " logits = self.forward(inputs)\n",
990
+ " last_item_logits = logits[:, -1, :]\n",
991
+ "        loss = self.loss_fn(last_item_logits, targets.squeeze(-1))\n",
992
+ " self.log('test_loss', loss, prog_bar=True)\n",
993
+ "\n",
994
+ " top_k_preds = torch.topk(last_item_logits, 10, dim=-1).indices\n",
995
+ " self.test_step_outputs.append({'preds': top_k_preds, 'targets': targets})\n",
996
+ " return loss\n",
997
+ "\n",
998
+ " def on_test_epoch_end(self):\n",
999
+ " \"\"\"\n",
1000
+ " Calculates and logs ranking metrics at the end of the test epoch.\n",
1001
+ " \"\"\"\n",
1002
+ " if not self.test_step_outputs: return\n",
1003
+ "\n",
1004
+ " preds = torch.cat([x['preds'] for x in self.test_step_outputs], dim=0)\n",
1005
+ " targets = torch.cat([x['targets'] for x in self.test_step_outputs], dim=0)\n",
1006
+ "\n",
1007
+ " k = preds.size(1)\n",
1008
+ " hits_tensor = (preds == targets).any(dim=1)\n",
1009
+ " num_hits = hits_tensor.sum().item()\n",
1010
+ " num_targets = len(targets)\n",
1011
+ "\n",
1012
+ " if num_targets > 0:\n",
1013
+ " hit_rate = num_hits / num_targets\n",
1014
+ " recall = hit_rate\n",
1015
+ " precision = num_hits / (k * num_targets)\n",
1016
+ " else:\n",
1017
+ " hit_rate, recall, precision = 0.0, 0.0, 0.0\n",
1018
+ "\n",
1019
+ " self.log('test_hitrate@10', hit_rate, prog_bar=True)\n",
1020
+ " self.log('test_precision@10', precision, prog_bar=True)\n",
1021
+ " self.log('test_recall@10', recall, prog_bar=True)\n",
1022
+ "\n",
1023
+ " self.test_step_outputs.clear() # Free up memory\n",
1024
+ "\n",
1025
+ " def configure_optimizers(self):\n",
1026
+ " \"\"\"\n",
1027
+ " Configures the optimizer and learning rate scheduler.\n",
1028
+ " \n",
1029
+ " Uses AdamW optimizer and a linear warmup followed by a cosine decay schedule,\n",
1030
+ " which is a standard practice for training Transformer models.\n",
1031
+ " \"\"\"\n",
1032
+ " optimizer = torch.optim.AdamW(\n",
1033
+ " self.parameters(),\n",
1034
+ " lr=self.hparams.learning_rate,\n",
1035
+ " weight_decay=self.hparams.weight_decay\n",
1036
+ " )\n",
1037
+ " \n",
1038
+ " # Learning rate scheduler: linear warmup and cosine decay\n",
1039
+ " def lr_lambda(current_step: int):\n",
1040
+ " warmup_steps = self.hparams.warmup_steps\n",
1041
+ " max_steps = self.hparams.max_steps\n",
1042
+ " if current_step < warmup_steps:\n",
1043
+ " return float(current_step) / float(max(1, warmup_steps))\n",
1044
+ " progress = float(current_step - warmup_steps) / float(max(1, max_steps - warmup_steps))\n",
1045
+ " return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))\n",
1046
+ "\n",
1047
+ " scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)\n",
1048
+ "\n",
1049
+ " return {\n",
1050
+ " \"optimizer\": optimizer,\n",
1051
+ " \"lr_scheduler\": {\n",
1052
+ " \"scheduler\": scheduler,\n",
1053
+ " \"interval\": \"step\", # Update the scheduler at every training step\n",
1054
+ " \"frequency\": 1\n",
1055
+ " }\n",
1056
+ " }"
1057
+ ]
1058
+ },
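The warmup-plus-cosine schedule in `configure_optimizers` can be checked standalone. This reproduces the `lr_lambda` closure with its default `warmup_steps=2000` and `max_steps=100000`; the returned value is the multiplier applied to the base learning rate:

```python
import math

def lr_lambda(step, warmup_steps=2000, max_steps=100000):
    # Linear warmup to 1.0, then cosine decay to 0.0
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

print(lr_lambda(0))        # 0.0 (start of warmup)
print(lr_lambda(1000))     # 0.5 (halfway through warmup)
print(lr_lambda(2000))     # 1.0 (warmup finished, peak learning rate)
print(lr_lambda(100000))   # 0.0 (fully decayed)
```

Because the scheduler is configured with `"interval": "step"`, this multiplier is recomputed at every optimizer step rather than once per epoch.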
1059
+ {
1060
+ "cell_type": "code",
1061
+ "execution_count": null,
1062
+ "id": "20bbc93a",
1063
+ "metadata": {},
1064
+ "outputs": [],
1065
+ "source": [
1066
+ "import pytorch_lightning as pl\n",
1067
+ "from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor\n",
1068
+ "from pytorch_lightning.loggers import TensorBoardLogger\n",
1069
+ "import torch\n",
1070
+ "\n",
1071
+ "def train_and_eval_SASRec_model(train_set, validation_set, test_set, checkpoint_dir_path='checkpoints/',\n",
1072
+ " checkpoint_path=None, n_epochs=10, mode='train',\n",
1073
+ " batchsize=256, max_token_len=50, learning_rate=1e-3, hidden_dim=128,\n",
1074
+ " num_heads=2, num_layers=2, dropout=0.2, weight_decay=1e-6):\n",
1075
+ " \"\"\"\n",
1076
+ " Train or evaluate a SASRec sequential recommendation model using PyTorch Lightning.\n",
1077
+ "\n",
1078
+ " This function wraps the entire SASRec pipeline:\n",
1079
+ " - Initializes the SASRecDataModule (handles dataset preprocessing and dataloaders).\n",
1080
+ " - Builds the SASRec Transformer-based model.\n",
1081
+ " - Configures training callbacks (checkpointing, early stopping, LR monitoring).\n",
1082
+ " - Runs either training (`mode='train'`) or evaluation on the test set (`mode='test'`).\n",
1083
+ "\n",
1084
+ "    Parameters\n",
1085
+ " ----------\n",
1086
+ " train_set : pd.DataFrame\n",
1087
+ "        Training interactions dataset.\n",
1088
+ " validation_set : pd.DataFrame\n",
1089
+ " Validation dataset with the same structure as `train_set`.\n",
1090
+ " test_set : pd.DataFrame\n",
1091
+ " Test dataset with the same structure as `train_set`.\n",
1092
+ " checkpoint_dir_path : str, optional (default='checkpoints/')\n",
1093
+ " Directory to save model checkpoints.\n",
1094
+ " checkpoint_path : str or None, optional (default=None)\n",
1095
+ " Path to a checkpoint file for resuming training or loading a pretrained model for testing.\n",
1096
+ " n_epochs : int, optional (default=10)\n",
1097
+ " Number of training epochs.\n",
1098
+ " mode : {'train', 'test'}, optional (default='train')\n",
1099
+ " - `'train'`: trains the model on the training/validation data.\n",
1100
+ " - `'test'`: evaluates the model on the test set using a checkpoint.\n",
1101
+ " batchsize : int, optional (default=256)\n",
1102
+ " Batch size for training and evaluation.\n",
1103
+ " max_token_len : int, optional (default=50)\n",
1104
+ " Maximum sequence length per user (recent interactions kept).\n",
1105
+ " learning_rate : float, optional (default=1e-3)\n",
1106
+ " Learning rate for the AdamW optimizer.\n",
1107
+ " hidden_dim : int, optional (default=128)\n",
1108
+ " Dimensionality of item and positional embeddings.\n",
1109
+ " num_heads : int, optional (default=2)\n",
1110
+ " Number of attention heads in each Transformer encoder layer.\n",
1111
+ " num_layers : int, optional (default=2)\n",
1112
+ " Number of Transformer encoder layers.\n",
1113
+ " dropout : float, optional (default=0.2)\n",
1114
+ " Dropout probability applied in embeddings and Transformer layers.\n",
1115
+ " weight_decay : float, optional (default=1e-6)\n",
1116
+ " Weight decay regularization coefficient for AdamW.\n",
1117
+ " \"\"\"\n",
1118
+ " # --- 1. Initialize DataModule ---\n",
1119
+ " print(\"Initializing DataModule...\")\n",
1120
+ " datamodule = SASRecDataModule(\n",
1121
+ " train_df=train_set,\n",
1122
+ " val_df=validation_set,\n",
1123
+ " test_df=test_set,\n",
1124
+ " batch_size=batchsize,\n",
1125
+ " max_len=max_token_len\n",
1126
+ " )\n",
1127
+ " datamodule.setup()\n",
1128
+ "\n",
1129
+ " # --- 2. Initialize Model ---\n",
1130
+ " print(\"Initializing SASRec model...\")\n",
1131
+ " model = SASRec(\n",
1132
+ " vocab_size=datamodule.vocab_size,\n",
1133
+ " max_len=max_token_len,\n",
1134
+ " hidden_dim=hidden_dim,\n",
1135
+ " num_heads=num_heads,\n",
1136
+ " num_layers=num_layers,\n",
1137
+ " dropout=dropout,\n",
1138
+ " learning_rate=learning_rate,\n",
1139
+ " weight_decay=weight_decay\n",
1140
+ " )\n",
1141
+ "\n",
1142
+ " # --- 3. Configure Training Callbacks ---\n",
1143
+ " checkpoint_callback = ModelCheckpoint(\n",
1144
+ " dirpath=checkpoint_dir_path,\n",
1145
+ " filename=\"sasrec-{epoch:02d}-{val_hitrate@10:.4f}\",\n",
1146
+ " save_top_k=1,\n",
1147
+ " verbose=True,\n",
1148
+ " monitor=\"val_hitrate@10\",\n",
1149
+ " mode=\"max\"\n",
1150
+ " )\n",
1151
+ "\n",
1152
+ " early_stopping_callback = EarlyStopping(\n",
1153
+ " monitor=\"val_hitrate@10\", # stop if ranking metric stagnates\n",
1154
+ " patience=5,\n",
1155
+ " mode=\"max\"\n",
1156
+ " )\n",
1157
+ "\n",
1158
+ " lr_monitor = LearningRateMonitor(logging_interval=\"step\")\n",
1159
+ "\n",
1160
+ " logger = TensorBoardLogger(\"lightning_logs\", name=\"sasrec\")\n",
1161
+ "\n",
1162
+ " # --- 4. Initialize Trainer ---\n",
1163
+ " print(\"Initializing PyTorch Lightning Trainer...\")\n",
1164
+ " trainer = pl.Trainer(\n",
1165
+ " logger=logger,\n",
1166
+ " callbacks=[checkpoint_callback, early_stopping_callback, lr_monitor],\n",
1167
+ " max_epochs=n_epochs,\n",
1168
+ " accelerator='auto',\n",
1169
+ " devices=1,\n",
1170
+ " gradient_clip_val=1.0, # helps with exploding gradients\n",
1171
+ " )\n",
1172
+ "\n",
1173
+ " if mode == 'train':\n",
1174
+ " # --- 5. Start Training ---\n",
1175
+ " print(f\"Starting training for up to {n_epochs} epochs...\")\n",
1176
+ " trainer.fit(model, datamodule,\n",
1177
+ " ckpt_path=checkpoint_path\n",
1178
+ " )\n",
1179
+ "\n",
1180
+ " elif mode == 'test':\n",
1181
+ " # --- 6. Test on best checkpoint ---\n",
1182
+ " print(\"Evaluating on test set...\")\n",
1183
+ " trainer.test(model, datamodule,\n",
1184
+ " ckpt_path=checkpoint_path\n",
1185
+ " )\n"
1186
+ ]
1187
+ },
1188
+ {
1189
+ "cell_type": "code",
1190
+ "execution_count": null,
1191
+ "id": "5d4a2a7b",
1192
+ "metadata": {},
1193
+ "outputs": [],
1194
+ "source": [
1195
+ "# --- Configuration ---\n",
1196
+ "BATCH_SIZE = 256\n",
1197
+ "MAX_TOKEN_LEN = 50 # 50–100 is standard\n",
1198
+ "LEARNING_RATE = 1e-3\n",
1199
+ "HIDDEN_DIM = 128\n",
1200
+ "NUM_HEADS = 2\n",
1201
+ "NUM_LAYERS = 2\n",
1202
+ "DROPOUT = 0.2\n",
1203
+ "WEIGHT_DECAY = 1e-6\n",
1204
+ "N_EPOCHS = 50\n",
1205
+ "MODE = 'train' # 'train' or 'test'\n",
1206
+ "\n",
1207
+ "# Train and evaluate SASRec model\n",
1208
+ "print(\"\\n>>> Training and evaluating SASRec model <<<\")\n",
1209
+ "train_and_eval_SASRec_model(train_set, validation_set, test_set, batchsize=BATCH_SIZE, max_token_len=MAX_TOKEN_LEN, learning_rate=LEARNING_RATE, hidden_dim=HIDDEN_DIM, num_heads=NUM_HEADS, num_layers=NUM_LAYERS, dropout=DROPOUT, weight_decay=WEIGHT_DECAY, n_epochs=N_EPOCHS, mode=MODE)\n",
1210
+ "\n",
1211
+ "print(\"\\n>>> Evaluating trained SASRec model on TEST set <<<\")\n",
1212
+ "train_and_eval_SASRec_model(train_set, validation_set, test_set, mode='test')  # pass checkpoint_path= to evaluate a saved checkpoint"
1213
+ ]
1214
+ },
1215
+ {
1216
+ "cell_type": "markdown",
1217
+ "id": "468e0951",
1218
+ "metadata": {},
1219
+ "source": [
1220
+ "## Main function to run the complete Recommender System"
1221
+ ]
1222
+ },
1223
+ {
1224
+ "cell_type": "code",
1225
+ "execution_count": null,
1226
+ "id": "8f810e9a",
1227
+ "metadata": {},
1228
+ "outputs": [],
1229
+ "source": [
1230
+ "def load_item_properties(data_folder='data/'):\n",
1231
+ " \"\"\"\n",
1232
+ " Loads item properties and creates a mapping from item ID to its category ID.\n",
1233
+ " Handles both a single properties file or two split parts.\n",
1234
+ " \n",
1235
+ " Args:\n",
1236
+ " data_folder (str): The path to the folder containing item property files.\n",
1237
+ "\n",
1238
+ " Returns:\n",
1239
+ " dict: A dictionary mapping {itemid: categoryid}.\n",
1240
+ " \"\"\"\n",
1241
+ " print(\"Loading item properties...\")\n",
1242
+ " try:\n",
1243
+ " # First, try to load the two separate parts and combine them.\n",
1244
+ " props_df_part1 = pd.read_csv(data_folder + 'item_properties_part1.csv')\n",
1245
+ " props_df_part2 = pd.read_csv(data_folder + 'item_properties_part2.csv')\n",
1246
+ " props_df = pd.concat([props_df_part1, props_df_part2], ignore_index=True)\n",
1247
+ " print(\"Successfully loaded and combined item_properties_part1.csv and item_properties_part2.csv.\")\n",
1248
+ "\n",
1249
+ " except FileNotFoundError:\n",
1250
+ " try:\n",
1251
+ " # If the parts are not found, try to load a single combined file.\n",
1252
+ " props_df = pd.read_csv(data_folder + 'item_properties.csv')\n",
1253
+ " print(\"Successfully loaded a single item_properties.csv.\")\n",
1254
+ " except FileNotFoundError:\n",
1255
+ " print(\"Warning: No item properties files found. Cannot display category information.\")\n",
1256
+ " return {}\n",
1257
+ "\n",
1258
+ " category_df = props_df[props_df['property'] == 'categoryid'].copy()\n",
1259
+ " category_df['value'] = pd.to_numeric(category_df['value'], errors='coerce').astype('Int64')\n",
1260
+ " item_to_category_map = category_df.set_index('itemid')['value'].to_dict()\n",
1261
+ " print(\"Item to category mapping created successfully.\")\n",
1262
+ " return item_to_category_map\n",
1263
+ "\n",
1264
+ "def load_category_tree(data_folder='data/'):\n",
1265
+ " \"\"\"\n",
1266
+ " Loads the category tree to map categories to their parent categories.\n",
1267
+ "\n",
1268
+ " Args:\n",
1269
+ " data_folder (str): The path to the folder containing category_tree.csv.\n",
1270
+ "\n",
1271
+ " Returns:\n",
1272
+ " dict: A dictionary mapping {categoryid: parentid}.\n",
1273
+ " \"\"\"\n",
1274
+ " print(\"Loading category tree...\")\n",
1275
+ " try:\n",
1276
+ " tree_df = pd.read_csv(data_folder + 'category_tree.csv')\n",
1277
+ " category_parent_map = tree_df.set_index('categoryid')['parentid'].to_dict()\n",
1278
+ " print(\"Category tree loaded successfully.\")\n",
1279
+ " return category_parent_map\n",
1280
+ " except FileNotFoundError:\n",
1281
+ " print(\"Warning: 'category_tree.csv' not found. Cannot display parent category information.\")\n",
1282
+ " return {}\n",
1283
+ "\n",
1284
+ "def get_popular_items(train_df, k=10):\n",
1285
+ " \"\"\"\n",
1286
+ " Calculates the top-k most popular items based on transaction count.\n",
1287
+ " \"\"\"\n",
1288
+ " purchase_counts = train_df[train_df['event'] == 'transaction']['itemid'].value_counts()\n",
1289
+ " return purchase_counts.head(k).index.tolist()\n",
1290
+ "\n",
1291
+ "def show_user_recommendations(visitor_id, model, datamodule, popular_items, item_category_map, category_parent_map, k=10):\n",
1292
+ " \"\"\"\n",
1293
+ " Displays recommendations for a user, including category and parent category information.\n",
1294
+ " \"\"\"\n",
1295
+ " print(f\"\\n--- Recommendations for Visitor ID: {visitor_id} ---\")\n",
1296
+ " model.eval()\n",
1297
+ "\n",
1298
+ " def format_item_with_category(item_id):\n",
1299
+ " category_id = item_category_map.get(item_id, 'N/A')\n",
1300
+ " parent_id = category_parent_map.get(category_id, 'N/A') if category_id != 'N/A' else 'N/A'\n",
1301
+ " return f\"Item: {item_id} (Category: {category_id}, Parent: {parent_id})\"\n",
1302
+ "\n",
1303
+ " user_history_ids = datamodule.user_history.get(visitor_id)\n",
1304
+ "\n",
1305
+ " if user_history_ids is None:\n",
1306
+ " print(f\"User {visitor_id} not found in training history. Providing popularity-based recommendations.\")\n",
1307
+ " print(f\"\\nTop {k} Popular Items (Fallback):\")\n",
1308
+ " recs_with_cats = [format_item_with_category(item_id) for item_id in popular_items]\n",
1309
+ " print(recs_with_cats)\n",
1310
+ " print(\"-------------------------------------------------\")\n",
1311
+ " return\n",
1312
+ "\n",
1313
+ " history_with_cats = [format_item_with_category(item_id) for item_id in user_history_ids]\n",
1314
+ " print(f\"User's Historical Interactions:\")\n",
1315
+ " print(history_with_cats)\n",
1316
+ "\n",
1317
+ " history_indices = [datamodule.item_map[i] for i in user_history_ids if i in datamodule.item_map]\n",
1318
+ " if not history_indices:\n",
1319
+ " print(\"None of the user's historical items are in the model's vocabulary.\")\n",
1320
+ " return\n",
1321
+ "\n",
1322
+ " max_len = datamodule.max_len\n",
1323
+ " input_seq = history_indices[-max_len:]\n",
1324
+ " padded_input = np.zeros(max_len, dtype=np.int64)\n",
1325
+ " padded_input[-len(input_seq):] = input_seq\n",
1326
+ " \n",
1327
+ " input_tensor = torch.LongTensor(np.array([padded_input]))\n",
1328
+ " input_tensor = input_tensor.to(model.device)\n",
1329
+ "\n",
1330
+ " with torch.no_grad():\n",
1331
+ " logits = model(input_tensor)\n",
1332
+ " last_item_logits = logits[0, -1, :]\n",
1333
+ " top_indices = torch.topk(last_item_logits, k).indices.tolist()\n",
1334
+ "\n",
1335
+ " recommended_item_ids = [datamodule.inverse_item_map[idx] for idx in top_indices if idx in datamodule.inverse_item_map]\n",
1336
+ "\n",
1337
+ " print(f\"\\nTop {k} Recommended Items:\")\n",
1338
+ " recs_with_cats = [format_item_with_category(item_id) for item_id in recommended_item_ids]\n",
1339
+ " print(recs_with_cats)\n",
1340
+ " print(\"-------------------------------------------------\")\n"
1341
+ ]
1342
+ },
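The cell above falls back to `get_popular_items` for cold-start users: items are ranked purely by transaction count. A self-contained toy example of that fallback (the event log below is made up for illustration):

```python
import pandas as pd

# Hypothetical miniature event log in the RetailRocket schema
events = pd.DataFrame({
    'itemid': [1, 2, 2, 3, 3, 3, 1],
    'event':  ['view', 'transaction', 'transaction', 'transaction',
               'transaction', 'transaction', 'transaction'],
})

def get_popular_items(train_df, k=10):
    """Top-k items by number of 'transaction' events (views are ignored)."""
    purchase_counts = train_df[train_df['event'] == 'transaction']['itemid'].value_counts()
    return purchase_counts.head(k).index.tolist()

print(get_popular_items(events, k=2))  # item 3 has 3 purchases, item 2 has 2
```

Because `value_counts` sorts descending, the head of the series is already the popularity ranking.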
1343
+ {
1344
+ "cell_type": "code",
1345
+ "execution_count": null,
1346
+ "id": "735d0f8d",
1347
+ "metadata": {},
1348
+ "outputs": [],
1349
+ "source": [
1350
+ "def main(checkpoint_path=\"checkpoints/sasrec-epoch=06-val_hitrate@10=0.3629.ckpt\", data_folder=\"data/\"):\n",
1351
+ " \"\"\"\n",
1352
+ " Main function to run the inference and qualitative analysis pipeline.\n",
1353
+ " \"\"\"\n",
1354
+ "\n",
1355
+ " device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
1356
+ " print(f\"Using device: {device}\")\n",
1357
+ "\n",
1358
+ " print(\"Loading model from checkpoint...\")\n",
1359
+ " best_model = SASRec.load_from_checkpoint(checkpoint_path)\n",
1360
+ " best_model.to(device)\n",
1361
+ "\n",
1362
+ " print(\"Preparing data...\")\n",
1363
+ " train_set, validation_set, test_set = prepare_data(data_folder=data_folder)\n",
1364
+ " \n",
1365
+ " datamodule = SASRecDataModule(train_set, validation_set, test_set)\n",
1366
+ " datamodule.setup()\n",
1367
+ " \n",
1368
+ " item_category_map = load_item_properties(data_folder=data_folder)\n",
1369
+ " category_parent_map = load_category_tree(data_folder=data_folder)\n",
1370
+ " \n",
1371
+ " print(\"\\nCalculating popular items for cold-start users...\")\n",
1372
+ " popular_items_list = get_popular_items(train_set, k=10)\n",
1373
+ "\n",
1374
+ " users_in_train_history = set(datamodule.user_history.keys())\n",
1375
+ " users_in_test_set = set(datamodule.test_df['visitorid'].unique())\n",
1376
+ " valid_example_users = list(users_in_train_history.intersection(users_in_test_set))\n",
1377
+ "\n",
1378
+ " print(f\"\\nFound {len(valid_example_users)} users for qualitative analysis.\")\n",
1379
+ " \n",
1380
+ " for user_id in valid_example_users[:3]:\n",
1381
+ " show_user_recommendations(user_id, best_model, datamodule, popular_items_list, item_category_map, category_parent_map)\n",
1382
+ " \n",
1383
+ " new_user_id = -999\n",
1384
+ " show_user_recommendations(new_user_id, best_model, datamodule, popular_items_list, item_category_map, category_parent_map)\n"
1385
+ ]
1386
+ },
1387
+ {
1388
+ "cell_type": "code",
1389
+ "execution_count": null,
1390
+ "id": "0e7ba5f2",
1391
+ "metadata": {},
1392
+ "outputs": [],
1393
+ "source": [
1394
+ "main()"
1395
+ ]
1396
+ }
1397
+ ],
1398
+ "metadata": {
1399
+ "kernelspec": {
1400
+ "display_name": "Python 3",
1401
+ "language": "python",
1402
+ "name": "python3"
1403
+ },
1404
+ "language_info": {
1405
+ "codemirror_mode": {
1406
+ "name": "ipython",
1407
+ "version": 3
1408
+ },
1409
+ "file_extension": ".py",
1410
+ "mimetype": "text/x-python",
1411
+ "name": "python",
1412
+ "nbconvert_exporter": "python",
1413
+ "pygments_lexer": "ipython3",
1414
+ "version": "3.10.6"
1415
+ }
1416
+ },
1417
+ "nbformat": 4,
1418
+ "nbformat_minor": 5
1419
+ }
requirements.txt ADDED
Binary file (304 Bytes). View file
 
scripts/als_optuna_study.py ADDED
@@ -0,0 +1,53 @@
1
+ import optuna
2
+ import numpy as np
3
+ import math
4
+ from scipy.sparse import csr_matrix
5
+ from sklearn.metrics.pairwise import cosine_similarity
6
+ import implicit
7
+ from utils import prepare_ground_truth, calculate_metrics
8
+ from models import recommend_als_and_evaluate
9
+ from data_prepare import prepare_data
10
+
11
+ def objective_als(trial, train_df, val_df):
12
+ """
13
+ The objective function for Optuna to optimize.
14
+ """
15
+ # 1. Define the hyperparameter search space
16
+ params = {
17
+ 'factors': trial.suggest_int('factors', 20, 200),
18
+ 'regularization': trial.suggest_float('regularization', 1e-3, 1e-1, log=True),
19
+ 'iterations': trial.suggest_int('iterations', 10, 50)
20
+ }
21
+
22
+ # 2. Run an evaluation with the suggested parameters
23
+ metrics = recommend_als_and_evaluate(train_df, val_df, **params)
24
+
25
+ # 3. Return the metric we want to maximize (precision)
26
+ return metrics['mean_precision@k']
27
+
28
+ def tune_als_hyperparameters(train_df, val_df, n_trials=25):
29
+ """
30
+ Orchestrates the Optuna study to find the best hyperparameters for ALS.
31
+ """
32
+ study = optuna.create_study(direction='maximize')
33
+ study.optimize(lambda trial: objective_als(trial, train_df, val_df), n_trials=n_trials)
34
+
35
+ print("\n--- Optuna Study Complete ---")
36
+ print(f"Number of finished trials: {len(study.trials)}")
37
+ print("Best trial:")
38
+ trial = study.best_trial
39
+ print(f" Value (Precision@10): {trial.value}")
40
+ print(" Params: ")
41
+ for key, value in trial.params.items():
42
+ print(f" {key}: {value}")
43
+
44
+ return trial.params
45
+
46
+ if __name__ == "__main__":
47
+ # 1. Prepare all data
48
+ train_set, validation_set, test_set = prepare_data()
49
+
50
+ # --- Hyperparameter Tuning Step ---
51
+ print("\n>>> 1. TUNING ALS Hyperparameters on the VALIDATION set <<<")
52
+
53
+ best_als_params = tune_als_hyperparameters(train_set, validation_set, n_trials=25)
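The study above maximizes `mean_precision@k` by repeatedly sampling hyperparameters from the search space and keeping the best trial. Stripped of Optuna, that sample–evaluate–keep-best loop can be sketched as plain random search; the `objective` below is a toy stand-in, not the real `recommend_als_and_evaluate` (which needs the dataset):

```python
import math
import random

random.seed(0)

def objective(params):
    # Toy stand-in for recommend_als_and_evaluate: scores best near
    # factors=128 and regularization=1e-2 (both peak values are arbitrary).
    return (-abs(params['factors'] - 128) / 200.0
            - abs(math.log10(params['regularization']) + 2.0))

def random_search(n_trials=50):
    best_score, best_params = -math.inf, None
    for _ in range(n_trials):
        params = {
            'factors': random.randint(20, 200),
            'regularization': 10 ** random.uniform(-3, -1),  # log-uniform, like suggest_float(..., log=True)
            'iterations': random.randint(10, 50),
        }
        score = objective(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

best_score, best_params = random_search()
print(best_params)
```

Optuna's default TPE sampler improves on this by modeling which regions of the space look promising, but the overall trial loop is the same.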
scripts/app.py ADDED
@@ -0,0 +1,162 @@
1
+ import gradio as gr
2
+ import torch
3
+ import numpy as np
4
+ import pandas as pd
5
+ from datetime import datetime, timedelta
6
+ from models import SASRec
7
+ from data_prepare import SASRecDataset, SASRecDataModule, prepare_data
8
+ from utils import load_item_properties, load_category_tree, get_popular_items
9
+
10
+ # --- Global variables to hold loaded artifacts ---
11
+ # This prevents reloading the model and data on every prediction.
12
+ MODEL = None
13
+ DATAMODULE = None
14
+ ITEM_CATEGORY_MAP = None
15
+ CATEGORY_PARENT_MAP = None
16
+ POPULAR_ITEMS = None
17
+
18
+ # --- Data Loading and Preparation Functions ---
19
+
20
+ def load_artifacts():
21
+ """
22
+ Loads all necessary artifacts (model, data, mappings) into global variables.
23
+ This function is called only once when the app starts.
24
+ """
25
+ global MODEL, DATAMODULE, ITEM_CATEGORY_MAP, CATEGORY_PARENT_MAP, POPULAR_ITEMS
26
+
27
+ print("--- Loading all artifacts for the Gradio app ---")
28
+
29
+ # HF-FRIENDLY: Path is relative, assuming the checkpoint is in the root of the Space repo.
30
+ CHECKPOINT_PATH = "sasrec-epoch=05-val_hitrate@10=0.3614.ckpt"
31
+ DATA_FOLDER = "data/"
32
+
33
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
34
+ print(f"Using device: {device}")
35
+
36
+ print(f"Loading model from checkpoint: {CHECKPOINT_PATH}...")
37
+ MODEL = SASRec.load_from_checkpoint(CHECKPOINT_PATH)
38
+ MODEL.to(device)
39
+ MODEL.eval()
40
+
41
+ print("Preparing data...")
42
+ train_set, validation_set, test_set = prepare_data(data_folder=DATA_FOLDER)
43
+
44
+ DATAMODULE = SASRecDataModule(train_set, validation_set, test_set)
45
+ DATAMODULE.setup()
46
+
47
+ print("Loading item and category maps...")
48
+ ITEM_CATEGORY_MAP = load_item_properties(data_folder=DATA_FOLDER)
49
+ CATEGORY_PARENT_MAP = load_category_tree(data_folder=DATA_FOLDER)
50
+
51
+ print("Calculating popular items for cold-start users...")
52
+ POPULAR_ITEMS = get_popular_items(train_set, k=10)
53
+
54
+ print("--- Artifacts loaded successfully. Ready to serve recommendations. ---")
55
+
56
+ def get_recommendations(visitor_id_str):
57
+ """
58
+ The main prediction function for the Gradio interface.
59
+
60
+ Takes a visitor ID string, gets recommendations, and formats them for display.
61
+ """
62
+ try:
63
+ visitor_id = int(visitor_id_str)
64
+ except (ValueError, TypeError):
65
+ return pd.DataFrame(), pd.DataFrame(), "Please enter a valid numerical Visitor ID."
66
+
67
+ user_history_ids = DATAMODULE.user_history.get(visitor_id)
68
+
69
+ def format_to_df(item_list):
70
+ data = []
71
+ for rank, item_id in enumerate(item_list, 1):
72
+ category_id = ITEM_CATEGORY_MAP.get(item_id, 'N/A')
73
+ parent_id = CATEGORY_PARENT_MAP.get(category_id, 'N/A') if pd.notna(category_id) else 'N/A'
74
+ data.append([rank, item_id, category_id, parent_id])
75
+ return pd.DataFrame(data, columns=['Rank', 'Item ID', 'Category ID', 'Parent ID'])
76
+
77
+ # --- Cold-Start User (Fallback to Popularity) ---
78
+ if user_history_ids is None:
79
+ history_df = pd.DataFrame(columns=['Rank', 'Item ID', 'Category ID', 'Parent ID'])
80
+ recs_df = format_to_df(POPULAR_ITEMS)
81
+ message = f"User {visitor_id} is new. Showing Top 10 popular items as a fallback."
82
+ return history_df, recs_df, message
83
+
84
+ # --- Existing User (Use SASRec Model) ---
85
+ history_df = format_to_df(user_history_ids)
86
+
87
+ history_indices = [DATAMODULE.item_map[i] for i in user_history_ids if i in DATAMODULE.item_map]
88
+
89
+ if not history_indices:
90
+ message = "None of this user's historical items are in the model's vocabulary."
91
+ return history_df, pd.DataFrame(), message
92
+
93
+ max_len = DATAMODULE.max_len
94
+ input_seq = history_indices[-max_len:]
95
+ padded_input = np.zeros(max_len, dtype=np.int64)
96
+ padded_input[-len(input_seq):] = input_seq
97
+
98
+ input_tensor = torch.LongTensor(np.array([padded_input]))
99
+ input_tensor = input_tensor.to(MODEL.device)
100
+
101
+ with torch.no_grad():
102
+ logits = MODEL(input_tensor)
103
+ last_item_logits = logits[0, -1, :]
104
+ top_indices = torch.topk(last_item_logits, 10).indices.tolist()
105
+
106
+ recommended_item_ids = [DATAMODULE.inverse_item_map[idx] for idx in top_indices if idx in DATAMODULE.inverse_item_map]
107
+ recs_df = format_to_df(recommended_item_ids)
108
+ message = f"Showing personalized SASRec recommendations for user {visitor_id}."
109
+
110
+ return history_df, recs_df, message
111
+
112
+ # --- Main Execution Block ---
113
+ if __name__ == "__main__":
114
+ # Load all artifacts once at startup
115
+ load_artifacts()
116
+
117
+ # Find some valid example users to show in the UI
118
+ users_in_train_history = set(DATAMODULE.user_history.keys())
119
+ users_in_test_set = set(DATAMODULE.test_df['visitorid'].unique())
120
+ valid_example_users = list(users_in_train_history.intersection(users_in_test_set))
121
+
122
+ # Convert numpy types to standard Python int for Gradio compatibility
123
+ example_list = [int(u) for u in valid_example_users[:4]] + [-999]
124
+
125
+ # Create and launch the Gradio interface
126
+ with gr.Blocks(theme=gr.themes.Soft(), title="SASRec Recommender") as iface:
127
+ gr.Markdown(
128
+ """
129
+ # SASRec Sequential Recommender System
130
+ An interactive demo of a state-of-the-art recommender system trained on the RetailRocket dataset.
131
+ """
132
+ )
133
+ with gr.Row():
134
+ with gr.Column(scale=1):
135
+ visitor_id_input = gr.Number(
136
+ label="Enter Visitor ID",
137
+ info="Enter a user's numerical ID to get recommendations."
138
+ )
139
+ submit_button = gr.Button("Get Recommendations", variant="primary")
140
+ gr.Examples(
141
+ examples=example_list,
142
+ inputs=visitor_id_input,
143
+ label="Example User IDs (Click to try)"
144
+ )
145
+
146
+ with gr.Column(scale=3):
147
+ status_message = gr.Textbox(label="Status", interactive=False)
148
+ with gr.Tabs():
149
+ with gr.TabItem("Top 10 Recommendations"):
150
+ recs_output = gr.DataFrame(label="Recommended Items")
151
+ with gr.TabItem("User's Recent History"):
152
+ history_output = gr.DataFrame(label="Interaction History")
153
+
154
+ submit_button.click(
155
+ fn=get_recommendations,
156
+ inputs=visitor_id_input,
157
+ outputs=[history_output, recs_output, status_message]
158
+ )
159
+
160
+ # For local testing, this creates a shareable link.
161
+ # On Hugging Face Spaces, this is not strictly necessary but doesn't hurt.
162
+ iface.launch(share=True)
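`format_to_df` above enriches each recommended item with category metadata before display. A self-contained sketch of that formatting with hypothetical ID maps (in the app, the real maps come from `load_item_properties` and `load_category_tree`):

```python
import pandas as pd

ITEM_CATEGORY_MAP = {101: 7, 102: 7, 103: 9}  # hypothetical itemid -> categoryid
CATEGORY_PARENT_MAP = {7: 1, 9: 2}            # hypothetical categoryid -> parentid

def format_to_df(item_list):
    """Rank items and attach category/parent IDs, falling back to 'N/A' for unknown items."""
    rows = []
    for rank, item_id in enumerate(item_list, 1):
        category_id = ITEM_CATEGORY_MAP.get(item_id, 'N/A')
        parent_id = CATEGORY_PARENT_MAP.get(category_id, 'N/A')
        rows.append([rank, item_id, category_id, parent_id])
    return pd.DataFrame(rows, columns=['Rank', 'Item ID', 'Category ID', 'Parent ID'])

df = format_to_df([102, 103, 999])  # 999 is unseen, so both category columns fall back to 'N/A'
print(df)
```

Returning a DataFrame (rather than strings) is what lets Gradio render the result directly in a `gr.DataFrame` component.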
scripts/data_prepare.py ADDED
@@ -0,0 +1,253 @@
1
+ import zipfile
2
+ import pandas as pd
3
+ from datetime import datetime, timedelta
4
+ import numpy as np
5
+ from scipy.sparse import csr_matrix
6
+ import math
7
+ import torch
8
+ import torch.nn as nn
9
+ from torch.utils.data import Dataset, DataLoader
10
+ import pytorch_lightning as pl
11
+
12
+ def extract_ziped_data(ziped_data_path: str, extract_path : str):
13
+ """Extracts the contents of a zip file to a specified directory.
14
+
15
+ Args:
16
+ ziped_data_path: str, path to the zip file
17
+ extract_path: str, path to the directory where contents will be extracted
18
+ """
19
+ # Extract the archive into the directory given by `extract_path`
21
+
22
+ # Open the zip file in read mode
23
+ with zipfile.ZipFile(ziped_data_path, 'r') as zip_ref:
24
+ # Extract all the contents into the specified directory
25
+ zip_ref.extractall(extract_path)
26
+
27
+ print(f"'{ziped_data_path}' has been extracted to '{extract_path}'")
28
+
29
+ def prepare_data(data_folder='data/', val_days=7, test_days=7):
30
+ """
31
+ Loads, preprocesses, and splits the events data into train, validation, and test sets.
32
+
33
+ args:
34
+ data_folder: str, path to the folder containing 'events.csv'
35
+ val_days: int, number of days for the validation set
36
+ test_days: int, number of days for the test set
37
+ """
38
+ # --- Load Data ---
39
+ print(f"Loading events.csv from folder: {data_folder}")
40
+ try:
41
+ events_df = pd.read_csv(data_folder + 'events.csv')
42
+ print("Successfully loaded events.csv.")
43
+ events_df['timestamp_dt'] = pd.to_datetime(events_df['timestamp'], unit='ms')
44
+ print("\n--- Initial Data Summary ---")
45
+ print(f"Data shape: {events_df.shape}")
46
+ print(f"Full timeframe: {events_df['timestamp_dt'].min()} to {events_df['timestamp_dt'].max()}")
47
+ print("----------------------------\n")
48
+ except FileNotFoundError:
49
+ print(f"Error: 'events.csv' not found in '{data_folder}'. Please check the path.")
50
+ return None, None, None
51
+
52
+ # --- Split Data ---
53
+ sorted_df = events_df.sort_values('timestamp_dt').reset_index(drop=True)
54
+ print(f"Splitting data: {test_days} days for test, {val_days} days for validation.")
55
+ end_time = sorted_df['timestamp_dt'].max()
56
+ test_start_time = end_time - timedelta(days=test_days)
57
+ val_start_time = test_start_time - timedelta(days=val_days)
58
+
59
+ test_df = sorted_df[sorted_df['timestamp_dt'] >= test_start_time]
60
+ val_df = sorted_df[(sorted_df['timestamp_dt'] >= val_start_time) & (sorted_df['timestamp_dt'] < test_start_time)]
61
+ train_df = sorted_df[sorted_df['timestamp_dt'] < val_start_time]
62
+
63
+ print("--- Data Splitting Summary ---")
64
+ print(f"Training set: {train_df.shape[0]:>8} records | from {train_df['timestamp_dt'].min()} to {train_df['timestamp_dt'].max()}")
65
+ print(f"Validation set: {val_df.shape[0]:>8} records | from {val_df['timestamp_dt'].min()} to {val_df['timestamp_dt'].max()}")
66
+ print(f"Test set: {test_df.shape[0]:>8} records | from {test_df['timestamp_dt'].min()} to {test_df['timestamp_dt'].max()}")
67
+ print("------------------------------")
68
+
69
+ return train_df, val_df, test_df
70
+
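`prepare_data` splits chronologically: the last `test_days` of events form the test set, the `val_days` before them the validation set, and everything earlier the training set. The cutoff arithmetic can be sketched on a synthetic timeline (30 one-a-day events, dates made up):

```python
import pandas as pd
from datetime import timedelta

# Synthetic 30-day event log, one event per day
df = pd.DataFrame({'timestamp_dt': pd.date_range('2015-08-01', periods=30, freq='D')})

end_time = df['timestamp_dt'].max()
test_start = end_time - timedelta(days=7)
val_start = test_start - timedelta(days=7)

test_df = df[df['timestamp_dt'] >= test_start]
val_df = df[(df['timestamp_dt'] >= val_start) & (df['timestamp_dt'] < test_start)]
train_df = df[df['timestamp_dt'] < val_start]

print(len(train_df), len(val_df), len(test_df))
```

A time-based split like this mirrors deployment (train on the past, evaluate on the future) and avoids the leakage a random split would introduce.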
71
+ class SASRecDataset(Dataset):
72
+ """
73
+ SASRec Dataset.
74
+ - Precomputes (sequence_id, cutoff_idx) pairs for O(1) __getitem__.
75
+ - Supports 'last' or 'all' target modes.
76
+ """
77
+ def __init__(self, sequences, max_len, target_mode="last"):
78
+ """
79
+ Args:
80
+ sequences: list of user sequences (list of item IDs).
81
+ max_len: maximum sequence length (padding applied).
82
+ target_mode: 'last' (only last prediction) or 'all' (predict at every step).
83
+ """
84
+ self.sequences = sequences
85
+ self.max_len = max_len
86
+ self.target_mode = target_mode
87
+
88
+ # Build index once
89
+ self.index = []
90
+ for seq_id, seq in enumerate(sequences):
91
+ for i in range(1, len(seq)):
92
+ self.index.append((seq_id, i))
93
+
94
+ def __len__(self):
95
+ return len(self.index)
96
+
97
+ def __getitem__(self, idx):
98
+ seq_id, cutoff = self.index[idx]
99
+ seq = self.sequences[seq_id][:cutoff]
100
+
101
+ # Truncate & pad
102
+ seq = seq[-self.max_len:]
103
+ pad_len = self.max_len - len(seq)
104
+
105
+ input_seq = np.zeros(self.max_len, dtype=np.int64)
106
+ input_seq[pad_len:] = seq
107
+
108
+ if self.target_mode == "last":
109
+ target = self.sequences[seq_id][cutoff]
110
+ return torch.LongTensor(input_seq), torch.LongTensor([target])
111
+
112
+ elif self.target_mode == "all":
113
+ # Predict next item at each step
114
+ target_seq = self.sequences[seq_id][1:cutoff+1]
115
+ target_seq = target_seq[-self.max_len:]
116
+ target = np.zeros(self.max_len, dtype=np.int64)
117
+ target[-len(target_seq):] = target_seq
118
+ return torch.LongTensor(input_seq), torch.LongTensor(target)
119
+
120
+ class SASRecDataModule(pl.LightningDataModule):
121
+ """
122
+ PyTorch Lightning DataModule for preparing the RetailRocket dataset for the SASRec model.
123
+
124
+ This class handles all aspects of data preparation, including:
125
+ - Filtering out infrequent users and items to reduce noise.
126
+ - Building a consistent item vocabulary.
127
+ - Converting user event histories into sequential data.
128
+ - Creating and providing `DataLoader` instances for training, validation, and testing.
129
+ """
130
+ def __init__(self, train_df, val_df, test_df, min_item_interactions=5,
131
+ min_user_interactions=5, max_len=50, batch_size=256):
132
+ """
133
+ Initializes the DataModule.
134
+
135
+ Args:
136
+ train_df (pd.DataFrame): DataFrame for training.
137
+ val_df (pd.DataFrame): DataFrame for validation.
138
+ test_df (pd.DataFrame): DataFrame for testing.
139
+ min_item_interactions (int): Minimum number of interactions for an item to be kept.
140
+ min_user_interactions (int): Minimum number of interactions for a user to be kept.
141
+ max_len (int): The maximum length of a user sequence fed to the model.
142
+ batch_size (int): The batch size for the DataLoaders.
143
+ """
144
+ super().__init__()
145
+ self.train_df = train_df
146
+ self.val_df = val_df
147
+ self.test_df = test_df
148
+ self.min_item_interactions = min_item_interactions
149
+ self.min_user_interactions = min_user_interactions
150
+ self.max_len = max_len
151
+ self.batch_size = batch_size
152
+
153
+ self.item_map = None
154
+ self.inverse_item_map = None
155
+ self.vocab_size = 0
156
+ self.user_history = None
157
+
158
+ def setup(self, stage=None):
159
+ """
160
+ Prepares the data for training, validation, and testing.
161
+
162
+ This method is called automatically by PyTorch Lightning. It performs the following steps:
163
+ 1. Determines filtering criteria (which users and items to keep) based on the training set only
164
+ to prevent data leakage.
165
+ 2. Applies these filters to the train, validation, and test sets.
166
+ 3. Builds an item vocabulary (mapping item IDs to integer indices) from the combined
167
+ training and validation sets to ensure consistency for model checkpointing.
168
+ 4. Converts the event logs into sequences of item indices for each user in each data split.
169
+ """
170
+ item_counts = self.train_df['itemid'].value_counts()
171
+ user_counts = self.train_df['visitorid'].value_counts()
172
+ items_to_keep = item_counts[item_counts >= self.min_item_interactions].index
173
+ users_to_keep = user_counts[user_counts >= self.min_user_interactions].index
174
+
175
+ self.filtered_train_df = self.train_df[
176
+ (self.train_df['itemid'].isin(items_to_keep)) &
177
+ (self.train_df['visitorid'].isin(users_to_keep))
178
+ ].copy()
179
+ self.filtered_val_df = self.val_df[
180
+ (self.val_df['itemid'].isin(items_to_keep)) &
181
+ (self.val_df['visitorid'].isin(users_to_keep))
182
+ ].copy()
183
+         self.filtered_test_df = self.test_df[
+             (self.test_df['itemid'].isin(items_to_keep)) &
+             (self.test_df['visitorid'].isin(users_to_keep))
+         ].copy()
+
+         all_known_items_df = pd.concat([self.filtered_train_df, self.filtered_val_df])
+         unique_items = all_known_items_df['itemid'].unique()
+         self.item_map = {item_id: i + 1 for i, item_id in enumerate(unique_items)}
+         self.inverse_item_map = {i: item_id for item_id, i in self.item_map.items()}
+         self.vocab_size = len(self.item_map) + 1  # +1 for padding token 0
+
+         self.user_history = self.filtered_train_df.groupby('visitorid')['itemid'].apply(list)
+
+         self.train_sequences = self._create_sequences(self.filtered_train_df)
+         self.val_sequences = self._create_sequences(self.filtered_val_df)
+         self.test_sequences = self._create_sequences(self.filtered_test_df)
+
+     def _create_sequences(self, df):
+         """
+         Helper function to convert a DataFrame of events into user interaction sequences.
+
+         Args:
+             df (pd.DataFrame): The input DataFrame to process.
+
+         Returns:
+             list[list[int]]: A list of user sequences, where each sequence is a list of item indices.
+         """
+         df_sorted = df.sort_values(['visitorid', 'timestamp_dt'])
+         sequences = df_sorted.groupby('visitorid')['itemid'].apply(
+             lambda x: [self.item_map[i] for i in x if i in self.item_map]
+         ).tolist()
+         return [s for s in sequences if len(s) > 1]
+
+     def train_dataloader(self):
+         """Creates the DataLoader for the training set."""
+         dataset = SASRecDataset(self.train_sequences, self.max_len)
+         return DataLoader(dataset, batch_size=self.batch_size, shuffle=True, num_workers=0)
+
+     def val_dataloader(self):
+         """Creates the DataLoader for the validation set."""
+         dataset = SASRecDataset(self.val_sequences, self.max_len)
+         return DataLoader(dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)
+
+     def test_dataloader(self):
+         """Creates the DataLoader for the test set."""
+         dataset = SASRecDataset(self.test_sequences, self.max_len)
+         return DataLoader(dataset, batch_size=self.batch_size, shuffle=False, num_workers=0)
+
+ if __name__ == "__main__":
+
+     # --- Configuration ---
+     DATA_PATH = "data"
+     ZIPED_DATA_PATH = "data/archive.zip"  # change to your zip file path
+     BATCH_SIZE = 256
+     MAX_TOKEN_LEN = 50  # 50–100 is standard for SASRec
+
+     # extract_ziped_data(ZIPED_DATA_PATH, DATA_PATH)  # uncomment this line if you want to extract the data
+
+     # --- 1. Prepare the data into train, validation, and test sets ---
+     train_set, validation_set, test_set = prepare_data(data_folder=DATA_PATH)
+
+     # --- 2. Initialize DataModule ---
+     print("Initializing DataModule...")
+     datamodule = SASRecDataModule(
+         train_df=train_set,
+         val_df=validation_set,
+         test_df=test_set,
+         batch_size=BATCH_SIZE,
+         max_len=MAX_TOKEN_LEN
+     )
+     datamodule.setup()
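The DataModule above sorts each user's events chronologically, maps raw item IDs to contiguous indices (0 reserved for padding), and drops single-item users; the `SASRecDataset` defined earlier in this file then truncates and left-pads each sequence to `max_len`. A minimal pure-Python sketch of that windowing, where the helper names `build_sequences` and `pad_left` are illustrative only, not part of the project:

```python
def build_sequences(events, item_map):
    """Group (user, item) events into per-user index sequences, keeping users with >1 known item."""
    by_user = {}
    for user, item in events:  # events assumed pre-sorted by (user, timestamp)
        if item in item_map:   # unknown items are skipped, mirroring the item_map filter above
            by_user.setdefault(user, []).append(item_map[item])
    return [seq for seq in by_user.values() if len(seq) > 1]

def pad_left(seq, max_len, pad=0):
    """Keep the most recent max_len items and left-pad with the padding token 0."""
    seq = seq[-max_len:]
    return [pad] * (max_len - len(seq)) + seq

events = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a"), ("u3", "zzz")]
item_map = {"a": 1, "b": 2, "c": 3}  # index 0 is reserved for padding
print(build_sequences(events, item_map))  # [[1, 2, 3]]  (u2 has one item, u3's item is unknown)
print(pad_left([1, 2, 3], 5))             # [0, 0, 1, 2, 3]
```

Left-padding keeps the most recent interactions adjacent to the prediction position, which is why the model reads only the last time step's logits during training and evaluation.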
scripts/main.py ADDED
@@ -0,0 +1,46 @@
+ import pandas as pd
+ import numpy as np
+ import torch
+ import datetime
+ from models import SASRec
+ from utils import prepare_ground_truth, calculate_metrics, load_item_properties, load_category_tree, get_popular_items, show_user_recommendations
+ from data_prepare import prepare_data, SASRecDataset, SASRecDataModule
+
+ def main(checkpoint_path="checkpoints/sasrec-epoch=06-val_hitrate@10=0.3629.ckpt", data_folder="data/"):
+     """
+     Main function to run the inference and qualitative analysis pipeline.
+     """
+
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     print(f"Using device: {device}")
+
+     print("Loading model from checkpoint...")
+     best_model = SASRec.load_from_checkpoint(checkpoint_path)
+     best_model.to(device)
+
+     print("Preparing data...")
+     train_set, validation_set, test_set = prepare_data(data_folder=data_folder)
+
+     datamodule = SASRecDataModule(train_set, validation_set, test_set)
+     datamodule.setup()
+
+     item_category_map = load_item_properties(data_folder=data_folder)
+     category_parent_map = load_category_tree(data_folder=data_folder)
+
+     print("\nCalculating popular items for cold-start users...")
+     popular_items_list = get_popular_items(train_set, k=10)
+
+     users_in_train_history = set(datamodule.user_history.keys())
+     users_in_test_set = set(datamodule.test_df['visitorid'].unique())
+     valid_example_users = list(users_in_train_history.intersection(users_in_test_set))
+
+     print(f"\nFound {len(valid_example_users)} users for qualitative analysis.")
+
+     # Qualitative check: personalized recommendations for a few known users
+     for user_id in valid_example_users[:3]:
+         show_user_recommendations(user_id, best_model, datamodule, popular_items_list, item_category_map, category_parent_map)
+
+     # Cold-start check: an unseen user ID should fall back to popular items
+     new_user_id = -999
+     show_user_recommendations(new_user_id, best_model, datamodule, popular_items_list, item_category_map, category_parent_map)
+
+ if __name__ == "__main__":
+     main()
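The `main()` pipeline above serves personalized recommendations to users with training history and falls back to the popularity list for cold-start users (the `new_user_id = -999` check). The dispatch logic can be sketched in miniature; `get_popular_items_simple` and `recommend` are toy stand-ins (the real SASRec forward pass is replaced here by a trivial filter over the popularity list):

```python
from collections import Counter

def get_popular_items_simple(interactions, k=3):
    """Most frequent items across all interactions (non-personalized baseline)."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]

def recommend(user_id, user_history, popular_items):
    """Personalized path for known users; popularity fallback for cold-start users."""
    if user_id not in user_history:
        return popular_items  # cold start: no history to feed a sequential model
    seen = set(user_history[user_id])
    return [i for i in popular_items if i not in seen]  # stand-in for the model's ranking

interactions = [("u1", "a"), ("u2", "a"), ("u2", "b"), ("u3", "b"), ("u3", "a")]
popular = get_popular_items_simple(interactions, k=2)
print(popular)                                   # ['a', 'b']
print(recommend(-999, {"u1": ["a"]}, popular))   # ['a', 'b']  (fallback path)
print(recommend("u1", {"u1": ["a"]}, popular))   # ['b']       (seen item filtered)
```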
scripts/models.py ADDED
@@ -0,0 +1,408 @@
+ import pandas as pd
+ import numpy as np
+ import math
+ import torch
+ import torch.nn as nn
+ import pytorch_lightning as pl
+ from scipy.sparse import csr_matrix
+ from sklearn.metrics.pairwise import cosine_similarity
+ import implicit
+ from utils import prepare_ground_truth, calculate_metrics
+
+ def recommend_popular_items_and_evaluate(train_df, test_df, k=10, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):
+     """
+     Trains a non-personalized Popularity model and evaluates its performance.
+
+     This model recommends the top-k most frequently transacted items from the training
+     set to every user. It serves as a simple but strong baseline.
+
+     Args:
+         train_df (pd.DataFrame): The training dataset.
+         test_df (pd.DataFrame): The test dataset for evaluation.
+         k (int): The number of items to recommend.
+         prepare_ground_truth (function): Processes test_df into a ground-truth dict (defaults to the utils implementation).
+         calculate_metrics (function): Computes ranking metrics (defaults to the utils implementation).
+
+     Returns:
+         dict: A dictionary containing the calculated evaluation metrics (e.g., precision, recall).
+     """
+     print(f"\n--- Evaluating Popularity Model (Top {k} items) ---")
+
+     # 1. "Train" the model by finding the most popular items based on transactions
+     purchase_counts = train_df[train_df['event'] == 'transaction']['itemid'].value_counts()
+     popular_items = purchase_counts.head(k).index.tolist()
+     print(f"Top {k} popular items identified from training data.")
+
+     # 2. Evaluate the model
+     ground_truth = prepare_ground_truth(test_df)
+     # Every user receives the same list of popular items
+     recommendations = {user_id: popular_items for user_id in ground_truth.keys()}
+
+     metrics = calculate_metrics(recommendations, ground_truth, k)
+     print("Evaluation complete.")
+     return metrics
+
+ def recommend_item_item_and_evaluate(train_df, test_df, k=10, min_item_interactions=5, min_user_interactions=5, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):
+     """
+     Trains an Item-Item Collaborative Filtering model and evaluates its performance.
+
+     This model recommends items that are similar to items a user has interacted
+     with in the past, based on co-occurrence patterns in the training data.
+
+     Args:
+         train_df (pd.DataFrame): The training dataset.
+         test_df (pd.DataFrame): The test dataset for evaluation.
+         k (int): The number of items to recommend.
+         min_item_interactions (int): Minimum number of interactions for an item to be kept.
+         min_user_interactions (int): Minimum number of interactions for a user to be kept.
+         prepare_ground_truth (function): Processes test_df into a ground-truth dict (defaults to the utils implementation).
+         calculate_metrics (function): Computes ranking metrics (defaults to the utils implementation).
+
+     Returns:
+         dict: A dictionary containing the calculated evaluation metrics.
+     """
+     print(f"\n--- Evaluating Item-Item CF Model (Top {k} items) ---")
+
+     # 1. Filter out infrequent users and items to reduce noise and computation
+     item_counts = train_df['itemid'].value_counts()
+     user_counts = train_df['visitorid'].value_counts()
+     items_to_keep = item_counts[item_counts >= min_item_interactions].index
+     users_to_keep = user_counts[user_counts >= min_user_interactions].index
+     filtered_df = train_df[(train_df['itemid'].isin(items_to_keep)) & (train_df['visitorid'].isin(users_to_keep))].copy()
+     print(f"Filtered training data from {len(train_df)} to {len(filtered_df)} records.")
+
+     # 2. Create user-item interaction matrix and vocabulary mappings
+     user_map = {uid: i for i, uid in enumerate(filtered_df['visitorid'].unique())}
+     item_map = {iid: i for i, iid in enumerate(filtered_df['itemid'].unique())}
+     inverse_item_map = {i: iid for iid, i in item_map.items()}
+     user_indices = filtered_df['visitorid'].map(user_map)
+     item_indices = filtered_df['itemid'].map(item_map)
+     user_item_matrix = csr_matrix((np.ones(len(filtered_df)), (user_indices, item_indices)))
+
+     # 3. Calculate the cosine similarity matrix between all items
+     print("Calculating item similarity matrix...")
+     item_similarity_matrix = cosine_similarity(user_item_matrix.T, dense_output=False)
+     print("Similarity matrix calculated.")
+
+     # 4. Generate recommendations and evaluate
+     ground_truth = prepare_ground_truth(test_df)
+     recommendations = {}
+     print("Generating recommendations for users in test set...")
+     test_users = [u for u in ground_truth.keys() if u in user_map]
+
+     for user_id in test_users:
+         user_index = user_map[user_id]
+         user_interactions_indices = user_item_matrix[user_index].indices
+
+         if len(user_interactions_indices) > 0:
+             # Aggregate scores from items the user has interacted with
+             all_scores = np.asarray(item_similarity_matrix[user_interactions_indices].sum(axis=0)).flatten()
+             # Remove already interacted items from recommendations
+             all_scores[user_interactions_indices] = -1
+             top_indices = np.argsort(all_scores)[::-1][:k]
+             recs = [inverse_item_map[idx] for idx in top_indices if idx in inverse_item_map]
+             recommendations[user_id] = recs
+
+     metrics = calculate_metrics(recommendations, ground_truth, k)
+     print("Evaluation complete.")
+     return metrics
+
+ def recommend_als_and_evaluate(train_df, test_df, k=10, min_item_interactions=5, min_user_interactions=5,
+                                factors=25, regularization=0.02, iterations=48, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics):
+     """
+     Trains an Alternating Least Squares (ALS) model and evaluates its performance.
+
+     This model uses matrix factorization to learn latent embeddings for users and
+     items from implicit feedback data. Default hyperparameters are set from a
+     previous Optuna tuning process.
+
+     Args:
+         train_df (pd.DataFrame): The training dataset.
+         test_df (pd.DataFrame): The test dataset for evaluation.
+         k (int): The number of items to recommend.
+         min_item_interactions (int): Minimum number of interactions for an item to be kept.
+         min_user_interactions (int): Minimum number of interactions for a user to be kept.
+         factors (int): The number of latent factors to compute.
+         regularization (float): The regularization factor.
+         iterations (int): The number of ALS iterations to run.
+         prepare_ground_truth (function): Processes test_df into a ground-truth dict (defaults to the utils implementation).
+         calculate_metrics (function): Computes ranking metrics (defaults to the utils implementation).
+
+     Returns:
+         dict: A dictionary containing the calculated evaluation metrics.
+     """
+     print(f"\n--- Evaluating ALS Model (Top {k} items) ---")
+
+     # 1. Filter data
+     item_counts = train_df['itemid'].value_counts()
+     user_counts = train_df['visitorid'].value_counts()
+     items_to_keep = item_counts[item_counts >= min_item_interactions].index
+     users_to_keep = user_counts[user_counts >= min_user_interactions].index
+     filtered_df = train_df[(train_df['itemid'].isin(items_to_keep)) & (train_df['visitorid'].isin(users_to_keep))].copy()
+     print(f"Filtered training data from {len(train_df)} to {len(filtered_df)} records.")
+
+     # 2. Create mappings and confidence matrix
+     user_map = {uid: i for i, uid in enumerate(filtered_df['visitorid'].unique())}
+     item_map = {iid: i for i, iid in enumerate(filtered_df['itemid'].unique())}
+     inverse_item_map = {i: iid for iid, i in item_map.items()}
+     inverse_user_map = {i: uid for uid, i in user_map.items()}
+     user_indices = filtered_df['visitorid'].map(user_map).astype(np.int32)
+     item_indices = filtered_df['itemid'].map(item_map).astype(np.int32)
+
+     event_weights = {'view': 1, 'addtocart': 3, 'transaction': 5}
+     confidence = filtered_df['event'].map(event_weights).astype(np.float32)
+     user_item_matrix = csr_matrix((confidence, (user_indices, item_indices)))
+
+     # 3. Train the ALS model
+     print("Training ALS model...")
+     als_model = implicit.als.AlternatingLeastSquares(factors=factors, regularization=regularization, iterations=iterations)
+     als_model.fit(user_item_matrix)
+     print("ALS model trained.")
+
+     # 4. Generate recommendations and evaluate
+     ground_truth = prepare_ground_truth(test_df)
+     recommendations = {}
+     print("Generating recommendations for users in test set...")
+     test_users_indices = [user_map[u] for u in ground_truth.keys() if u in user_map]
+
+     if test_users_indices:
+         user_item_matrix_for_recs = user_item_matrix[test_users_indices]
+         ids, _ = als_model.recommend(test_users_indices, user_item_matrix_for_recs, N=k)
+
+         for i, user_index in enumerate(test_users_indices):
+             original_user_id = inverse_user_map[user_index]  # O(1) lookup instead of scanning user_map twice
+             recs = [inverse_item_map[item_idx] for item_idx in ids[i] if item_idx in inverse_item_map]
+             recommendations[original_user_id] = recs
+
+     metrics = calculate_metrics(recommendations, ground_truth, k)
+     print("Evaluation complete.")
+     return metrics
+
+ class SASRec(pl.LightningModule):
+     """
+     A PyTorch Lightning implementation of the SASRec model for sequential recommendation.
+
+     SASRec (Self-Attentive Sequential Recommendation) uses a Transformer-based
+     architecture to capture the sequential patterns in a user's interaction history
+     to predict the next item they are likely to interact with.
+
+     Attributes:
+         hparams: All constructor arguments, saved automatically via save_hyperparameters().
+         item_embedding (nn.Embedding): Embedding layer for item IDs.
+         positional_embedding (nn.Embedding): Embedding layer to encode the position of items in a sequence.
+         transformer_encoder (nn.TransformerEncoder): The core self-attention module.
+         fc (nn.Linear): Final fully connected layer to produce logits over the item vocabulary.
+         loss_fn (nn.CrossEntropyLoss): The loss function used for training.
+     """
+     def __init__(self, vocab_size, max_len, hidden_dim, num_heads, num_layers,
+                  dropout=0.2, learning_rate=1e-3, weight_decay=1e-6, warmup_steps=2000, max_steps=100000):
+         """
+         Initializes the SASRec model layers and hyperparameters.
+
+         Args:
+             vocab_size (int): The total number of unique items in the dataset (+1 for padding).
+             max_len (int): The maximum length of the input sequences.
+             hidden_dim (int): The dimensionality of the item and positional embeddings.
+             num_heads (int): The number of attention heads in the Transformer encoder.
+             num_layers (int): The number of layers in the Transformer encoder.
+             dropout (float): The dropout rate to be applied.
+             learning_rate (float): The learning rate for the optimizer.
+             weight_decay (float): The weight decay (L2 penalty) for the optimizer.
+             warmup_steps (int): The number of linear warmup steps for the learning rate scheduler.
+             max_steps (int): The total number of training steps for the learning rate scheduler's decay phase.
+         """
+         super().__init__()
+         # This saves all hyperparameters to self.hparams, making them accessible later
+         self.save_hyperparameters()
+
+         # Embedding layers
+         self.item_embedding = nn.Embedding(vocab_size, hidden_dim, padding_idx=0)
+         self.positional_embedding = nn.Embedding(max_len, hidden_dim)
+         self.dropout = nn.Dropout(dropout)
+
+         # Transformer Encoder
+         encoder_layer = nn.TransformerEncoderLayer(
+             d_model=hidden_dim, nhead=num_heads, dim_feedforward=hidden_dim * 4,
+             dropout=dropout, batch_first=True, activation='gelu'
+         )
+         self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
+
+         # Output layer
+         self.fc = nn.Linear(hidden_dim, vocab_size)
+
+         # Loss function, ignoring the padding token
+         self.loss_fn = nn.CrossEntropyLoss(ignore_index=0)
+
+         # Lists to store outputs from validation and test steps
+         self.validation_step_outputs = []
+         self.test_step_outputs = []
+
+     def forward(self, x):
+         """
+         Defines the forward pass of the model.
+
+         Args:
+             x (torch.Tensor): A batch of input sequences of shape (batch_size, seq_len).
+
+         Returns:
+             torch.Tensor: The output logits of shape (batch_size, seq_len, vocab_size).
+         """
+         seq_len = x.size(1)
+         # Create positional indices (0, 1, 2, ..., seq_len-1)
+         positions = torch.arange(seq_len, device=self.device).unsqueeze(0)
+
+         # Create a causal mask to ensure the model doesn't look ahead in the sequence
+         causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len, device=self.device)
+
+         # Combine item and positional embeddings
+         x = self.item_embedding(x) + self.positional_embedding(positions)
+         x = self.dropout(x)
+
+         # Pass through the Transformer encoder
+         x = self.transformer_encoder(x, mask=causal_mask)
+
+         # Get final logits
+         logits = self.fc(x)
+         return logits
+
+     def training_step(self, batch, batch_idx):
+         """
+         Performs a single training step.
+
+         Args:
+             batch (tuple): A tuple containing input sequences and target items.
+             batch_idx (int): The index of the current batch.
+
+         Returns:
+             torch.Tensor: The calculated loss for the batch.
+         """
+         inputs, targets = batch
+         logits = self.forward(inputs)
+
+         # We only care about the prediction for the very last item in the input sequence
+         last_logits = logits[:, -1, :]
+
+         # Calculate loss against the single target item
+         loss = self.loss_fn(last_logits, targets.squeeze())
+
+         self.log('train_loss', loss, prog_bar=True, on_step=True, on_epoch=True)
+         return loss
+
+     def validation_step(self, batch, batch_idx):
+         """
+         Performs a single validation step.
+         Calculates loss and stores predictions for metric computation at the end of the epoch.
+         """
+         inputs, targets = batch
+         logits = self.forward(inputs)
+         last_item_logits = logits[:, -1, :]
+         loss = self.loss_fn(last_item_logits, targets.squeeze())
+         self.log('val_loss', loss, prog_bar=True, on_epoch=True)
+
+         # Get top-10 predictions for metric calculation
+         top_k_preds = torch.topk(last_item_logits, 10, dim=-1).indices
+         self.validation_step_outputs.append({'preds': top_k_preds, 'targets': targets})
+         return loss
+
+     def on_validation_epoch_end(self):
+         """
+         Calculates and logs ranking metrics at the end of the validation epoch.
+         """
+         if not self.validation_step_outputs:
+             return
+
+         # Concatenate all predictions and targets from the epoch
+         preds = torch.cat([x['preds'] for x in self.validation_step_outputs], dim=0)
+         targets = torch.cat([x['targets'] for x in self.validation_step_outputs], dim=0)
+
+         k = preds.size(1)
+         # Check if the target is in the top-k predictions for each example
+         hits_tensor = (preds == targets).any(dim=1)
+         num_hits = hits_tensor.sum().item()
+         num_targets = len(targets)
+
+         if num_targets > 0:
+             hit_rate = num_hits / num_targets
+             recall = hit_rate  # For next-item prediction, recall@k is the same as hit_rate@k
+             precision = num_hits / (k * num_targets)
+         else:
+             hit_rate, recall, precision = 0.0, 0.0, 0.0
+
+         self.log('val_hitrate@10', hit_rate, prog_bar=True)
+         self.log('val_precision@10', precision, prog_bar=True)
+         self.log('val_recall@10', recall, prog_bar=True)
+
+         self.validation_step_outputs.clear()  # Free up memory
+
+     def test_step(self, batch, batch_idx):
+         """
+         Performs a single test step.
+         Mirrors the logic of the validation_step.
+         """
+         inputs, targets = batch
+         logits = self.forward(inputs)
+         last_item_logits = logits[:, -1, :]
+         loss = self.loss_fn(last_item_logits, targets.squeeze())
+         self.log('test_loss', loss, prog_bar=True)
+
+         top_k_preds = torch.topk(last_item_logits, 10, dim=-1).indices
+         self.test_step_outputs.append({'preds': top_k_preds, 'targets': targets})
+         return loss
+
+     def on_test_epoch_end(self):
+         """
+         Calculates and logs ranking metrics at the end of the test epoch.
+         """
+         if not self.test_step_outputs:
+             return
+
+         preds = torch.cat([x['preds'] for x in self.test_step_outputs], dim=0)
+         targets = torch.cat([x['targets'] for x in self.test_step_outputs], dim=0)
+
+         k = preds.size(1)
+         hits_tensor = (preds == targets).any(dim=1)
+         num_hits = hits_tensor.sum().item()
+         num_targets = len(targets)
+
+         if num_targets > 0:
+             hit_rate = num_hits / num_targets
+             recall = hit_rate
+             precision = num_hits / (k * num_targets)
+         else:
+             hit_rate, recall, precision = 0.0, 0.0, 0.0
+
+         self.log('test_hitrate@10', hit_rate, prog_bar=True)
+         self.log('test_precision@10', precision, prog_bar=True)
+         self.log('test_recall@10', recall, prog_bar=True)
+
+         self.test_step_outputs.clear()  # Free up memory
+
+     def configure_optimizers(self):
+         """
+         Configures the optimizer and learning rate scheduler.
+
+         Uses AdamW optimizer and a linear warmup followed by a cosine decay schedule,
+         which is a standard practice for training Transformer models.
+         """
+         optimizer = torch.optim.AdamW(
+             self.parameters(),
+             lr=self.hparams.learning_rate,
+             weight_decay=self.hparams.weight_decay
+         )
+
+         # Learning rate scheduler: linear warmup and cosine decay
+         def lr_lambda(current_step: int):
+             warmup_steps = self.hparams.warmup_steps
+             max_steps = self.hparams.max_steps
+             if current_step < warmup_steps:
+                 return float(current_step) / float(max(1, warmup_steps))
+             progress = float(current_step - warmup_steps) / float(max(1, max_steps - warmup_steps))
+             return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
+
+         scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
+
+         return {
+             "optimizer": optimizer,
+             "lr_scheduler": {
+                 "scheduler": scheduler,
+                 "interval": "step",  # Update the scheduler at every training step
+                 "frequency": 1
+             }
+         }
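The schedule in `configure_optimizers` is a linear warmup from 0 to the base learning rate, followed by a cosine decay to 0. The same multiplier function can be checked in isolation (pure Python, using the model's default `warmup_steps=2000` and `max_steps=100000`):

```python
import math

def lr_lambda(current_step, warmup_steps=2000, max_steps=100000):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if current_step < warmup_steps:
        return current_step / max(1, warmup_steps)  # linear warmup: 0 -> 1
    progress = (current_step - warmup_steps) / max(1, max_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))  # cosine decay: 1 -> 0

print(lr_lambda(0))       # 0.0  (start of warmup)
print(lr_lambda(2000))    # 1.0  (warmup done, full base LR)
print(lr_lambda(100000))  # 0.0  (fully decayed)
```

Because the schedule is stepped with `"interval": "step"`, the multiplier advances every optimizer step rather than once per epoch, which is what makes the 2000-step warmup meaningful.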
scripts/train_and_eval.py ADDED
@@ -0,0 +1,172 @@
+ import pandas as pd
+ import numpy as np
+ import pytorch_lightning as pl
+ from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor
+ from pytorch_lightning.loggers import TensorBoardLogger
+ import torch
+
+ from utils import prepare_ground_truth, calculate_metrics
+ from data_prepare import prepare_data, SASRecDataset, SASRecDataModule
+ from models import recommend_popular_items_and_evaluate, recommend_item_item_and_evaluate, recommend_als_and_evaluate, SASRec
+
+
+ def train_and_eval_SASRec_model(train_set, validation_set, test_set, checkpoint_dir_path='checkpoints/',
+                                 checkpoint_path=None, n_epochs=10, mode='train',
+                                 batchsize=256, max_token_len=50, learning_rate=1e-3, hidden_dim=128,
+                                 num_heads=2, num_layers=2, dropout=0.2, weight_decay=1e-6):
+     """
+     Train or evaluate a SASRec sequential recommendation model using PyTorch Lightning.
+
+     This function wraps the entire SASRec pipeline:
+     - Initializes the SASRecDataModule (handles dataset preprocessing and dataloaders).
+     - Builds the SASRec Transformer-based model.
+     - Configures training callbacks (checkpointing, early stopping, LR monitoring).
+     - Runs either training (`mode='train'`) or evaluation on the test set (`mode='test'`).
+
+     Parameters
+     ----------
+     train_set : pd.DataFrame
+         Training interactions dataset.
+     validation_set : pd.DataFrame
+         Validation dataset with the same structure as `train_set`.
+     test_set : pd.DataFrame
+         Test dataset with the same structure as `train_set`.
+     checkpoint_dir_path : str, optional (default='checkpoints/')
+         Directory to save model checkpoints.
+     checkpoint_path : str or None, optional (default=None)
+         Path to a checkpoint file for resuming training or loading a pretrained model for testing.
+     n_epochs : int, optional (default=10)
+         Number of training epochs.
+     mode : {'train', 'test'}, optional (default='train')
+         - 'train': trains the model on the training/validation data.
+         - 'test': evaluates the model on the test set using a checkpoint.
+     batchsize : int, optional (default=256)
+         Batch size for training and evaluation.
+     max_token_len : int, optional (default=50)
+         Maximum sequence length per user (most recent interactions kept).
+     learning_rate : float, optional (default=1e-3)
+         Learning rate for the AdamW optimizer.
+     hidden_dim : int, optional (default=128)
+         Dimensionality of item and positional embeddings.
+     num_heads : int, optional (default=2)
+         Number of attention heads in each Transformer encoder layer.
+     num_layers : int, optional (default=2)
+         Number of Transformer encoder layers.
+     dropout : float, optional (default=0.2)
+         Dropout probability applied in embeddings and Transformer layers.
+     weight_decay : float, optional (default=1e-6)
+         Weight decay regularization coefficient for AdamW.
+     """
+     # --- 1. Initialize DataModule ---
+     print("Initializing DataModule...")
+     datamodule = SASRecDataModule(
+         train_df=train_set,
+         val_df=validation_set,
+         test_df=test_set,
+         batch_size=batchsize,
+         max_len=max_token_len
+     )
+     datamodule.setup()
+
+     # --- 2. Initialize Model ---
+     print("Initializing SASRec model...")
+     model = SASRec(
+         vocab_size=datamodule.vocab_size,
+         max_len=max_token_len,
+         hidden_dim=hidden_dim,
+         num_heads=num_heads,
+         num_layers=num_layers,
+         dropout=dropout,
+         learning_rate=learning_rate,
+         weight_decay=weight_decay
+     )
+
+     # --- 3. Configure Training Callbacks ---
+     checkpoint_callback = ModelCheckpoint(
+         dirpath=checkpoint_dir_path,
+         filename="sasrec-{epoch:02d}-{val_hitrate@10:.4f}",
+         save_top_k=1,
+         verbose=True,
+         monitor="val_hitrate@10",
+         mode="max"
+     )
+
+     early_stopping_callback = EarlyStopping(
+         monitor="val_hitrate@10",  # stop if the ranking metric stagnates
+         patience=5,
+         mode="max"
+     )
+
+     lr_monitor = LearningRateMonitor(logging_interval="step")
+
+     logger = TensorBoardLogger("lightning_logs", name="sasrec")
+
+     # --- 4. Initialize Trainer ---
+     print("Initializing PyTorch Lightning Trainer...")
+     trainer = pl.Trainer(
+         logger=logger,
+         callbacks=[checkpoint_callback, early_stopping_callback, lr_monitor],
+         max_epochs=n_epochs,
+         accelerator='auto',
+         devices=1,
+         gradient_clip_val=1.0,  # helps with exploding gradients
+     )
+
+     if mode == 'train':
+         # --- 5. Start Training ---
+         print(f"Starting training for up to {n_epochs} epochs...")
+         trainer.fit(model, datamodule, ckpt_path=checkpoint_path)
+
+     elif mode == 'test':
+         # --- 6. Test on best checkpoint ---
+         print("Evaluating on test set...")
+         trainer.test(model, datamodule, ckpt_path=checkpoint_path)
+
+ # --- Main Execution Block ---
+ if __name__ == "__main__":
+
+     # --- Configuration ---
+     BATCH_SIZE = 256
+     MAX_TOKEN_LEN = 50  # 50–100 is standard
+     LEARNING_RATE = 1e-3
+     HIDDEN_DIM = 128
+     NUM_HEADS = 2
+     NUM_LAYERS = 2
+     DROPOUT = 0.2
+     WEIGHT_DECAY = 1e-6
+     N_EPOCHS = 50
+     CHECKPOINT_SAVE_PATH = 'checkpoints/'
+     CHECKPOINT_LOAD_PATH = None  # or specify a path to a checkpoint file
+     MODE = 'train'  # 'train' or 'test'
+
+     train_set, validation_set, test_set = prepare_data(data_folder='data/')
+     if train_set is not None:
+         results = {}
+         full_train_set = pd.concat([train_set, validation_set])
+
+         # Evaluate classical baselines, passing the shared metric helpers explicitly
+         print("\n>>> Running evaluations on the VALIDATION set <<<")
+         results['Popularity (Validation)'] = recommend_popular_items_and_evaluate(train_set, validation_set, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics)
+         results['Item-Item CF (Validation)'] = recommend_item_item_and_evaluate(train_set, validation_set, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics)
+         results['ALS (Validation)'] = recommend_als_and_evaluate(train_set, validation_set, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics)
+
+         print("\n>>> Running final evaluations on the TEST set <<<")
+         results['Popularity (Test)'] = recommend_popular_items_and_evaluate(full_train_set, test_set, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics)
+         results['Item-Item CF (Test)'] = recommend_item_item_and_evaluate(full_train_set, test_set, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics)
+         results['ALS (Test)'] = recommend_als_and_evaluate(full_train_set, test_set, prepare_ground_truth=prepare_ground_truth, calculate_metrics=calculate_metrics)
+
+         print("\n--- Final Evaluation Results ---")
+         results_df = pd.DataFrame.from_dict(results, orient='index')
+         print(results_df)
+         print("--------------------------------")
+
+         # Train and evaluate SASRec model
+         print("\n>>> Training and evaluating SASRec model <<<")
+         train_and_eval_SASRec_model(train_set, validation_set, test_set, n_epochs=10, mode='train')
+
+         print("\n>>> Evaluating trained SASRec model on TEST set <<<")
+         train_and_eval_SASRec_model(train_set, validation_set, test_set, mode='test')
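All the baselines and SASRec are compared with the same top-N ranking metrics (implemented in `scripts/utils.py` as `calculate_metrics`). The per-user arithmetic can be sketched as follows; `metrics_at_k` is an illustrative single-user version, not the project function:

```python
def metrics_at_k(recommended, relevant, k):
    """Precision@k, Recall@k, and HitRate@k for a single user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))  # recommended items that are actually relevant
    return {
        "precision": hits / k,                               # fraction of the top-k that hit
        "recall": hits / len(relevant) if relevant else 0.0, # fraction of relevant items found
        "hitrate": 1.0 if hits > 0 else 0.0,                 # did we hit at least once?
    }

m = metrics_at_k(["a", "b", "c", "d"], {"b", "x"}, k=4)
print(m)  # {'precision': 0.25, 'recall': 0.5, 'hitrate': 1.0}
```

Averaging these per-user values over all evaluated users gives the `mean_precision@k`, `mean_recall@k`, and `mean_hitrate@k` figures reported in the results table.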
scripts/utils.py ADDED
@@ -0,0 +1,196 @@
1
+ import pandas as pd
2
+ import numpy as np
3
+ import torch
4
+ import datetime
5
+
6
+ def calculate_metrics(recommendations_dict, ground_truth_dict, k):
7
+ """
8
+ Calculates Precision@k, Recall@k, and HitRate@k.
9
+
10
+ args:
11
+ ----------
12
+ recommendations_dict : {user_id: [recommended_item_ids]}
13
+ ground_truth_dict : {user_id: set of ground truth item_ids}
14
+ k : int
15
+
16
+ Returns
17
+ -------
18
+ dict with mean precision, recall, and hit rate
19
+ """
20
+ all_precisions, all_recalls, all_hits = [], [], []
21
+
22
+ for user_id, true_items in ground_truth_dict.items():
23
+ recs = recommendations_dict.get(user_id, [])[:k]
24
+ if not true_items:
25
+ continue
26
+ hits = len(set(recs) & true_items)
27
+
28
+ precision = hits / k if k > 0 else 0
29
+ recall = hits / len(true_items)
30
+ hit_rate = 1.0 if hits > 0 else 0.0
31
+
32
+ all_precisions.append(precision)
33
+ all_recalls.append(recall)
34
+ all_hits.append(hit_rate)
35
+
36
+ if not all_precisions:
37
+ return {"mean_precision@k": 0, "mean_recall@k": 0, "mean_hitrate@k": 0}
38
+
39
+ return {
40
+ "mean_precision@k": np.mean(all_precisions),
41
+ "mean_recall@k": np.mean(all_recalls),
42
+ "mean_hitrate@k": np.mean(all_hits)
43
+ }
44
+
45
+ def prepare_ground_truth(df, mode="purchase", event_weights=None):
46
+     """
47
+     Prepares ground truth dictionaries for evaluation.
48
+
49
+     Parameters
50
+     ----------
51
+     df : pd.DataFrame
52
+         Test dataframe containing at least ['visitorid', 'itemid', 'event'].
53
+     mode : str, default="purchase"
54
+         - "purchase" : Only use transactions as ground truth.
55
+         - "all" : Use all events. Optionally weight them.
56
+     event_weights : dict, optional
57
+         Example: {"view": 1, "addtocart": 3, "transaction": 5}.
58
+         Used only if mode == "all".
59
+
60
+     Returns
61
+     -------
62
+     dict : {user_id: set of item_ids}
63
+     """
64
+     if mode == "purchase":
65
+         df_filtered = df[df["event"] == "transaction"]
66
+         ground_truth = df_filtered.groupby("visitorid")["itemid"].apply(set).to_dict()
67
+
68
+     elif mode == "all":
69
+         if event_weights is None:
70
+             # Default: treat all events equally
71
+             ground_truth = df.groupby("visitorid")["itemid"].apply(set).to_dict()
72
+         else:
73
+             # Weighted ground truth: a set cannot carry counts, so weights act as an inclusion filter -- only positive-weight events are kept.
74
+             ground_truth = {}
75
+             for uid, user_df in df.groupby("visitorid"):
76
+                 kept_items = set()
77
+                 for _, row in user_df.iterrows():
78
+                     if event_weights.get(row["event"], 1) > 0:
79
+                         kept_items.add(row["itemid"])
80
+                 ground_truth[uid] = kept_items
81
+     else:
82
+         raise ValueError("mode must be 'purchase' or 'all'")
83
+
84
+     return ground_truth
85
+
86
+ def load_item_properties(data_folder='data/'):
87
+     """
88
+     Loads item properties and creates a mapping from item ID to its category ID.
89
+     Handles either a single properties file or two split parts.
90
+
91
+     Args:
92
+         data_folder (str): The path to the folder containing item property files.
93
+
94
+     Returns:
95
+         dict: A dictionary mapping {itemid: categoryid}.
96
+     """
97
+     print("Loading item properties...")
98
+     try:
99
+         # First, try to load the two separate parts and combine them.
100
+         props_df_part1 = pd.read_csv(data_folder + 'item_properties_part1.csv')
101
+         props_df_part2 = pd.read_csv(data_folder + 'item_properties_part2.csv')
102
+         props_df = pd.concat([props_df_part1, props_df_part2], ignore_index=True)
103
+         print("Successfully loaded and combined item_properties_part1.csv and item_properties_part2.csv.")
104
+
105
+     except FileNotFoundError:
106
+         try:
107
+             # If the parts are not found, try to load a single combined file.
108
+             props_df = pd.read_csv(data_folder + 'item_properties.csv')
109
+             print("Successfully loaded a single item_properties.csv.")
110
+         except FileNotFoundError:
111
+             print("Warning: No item properties files found. Cannot display category information.")
112
+             return {}
113
+
114
+     category_df = props_df[props_df['property'] == 'categoryid'].copy()
115
+     category_df['value'] = pd.to_numeric(category_df['value'], errors='coerce').astype('Int64')
116
+     item_to_category_map = category_df.set_index('itemid')['value'].to_dict()
117
+     print("Item to category mapping created successfully.")
118
+     return item_to_category_map
119
+
120
+ def load_category_tree(data_folder='data/'):
121
+     """
122
+     Loads the category tree to map categories to their parent categories.
123
+
124
+     Args:
125
+         data_folder (str): The path to the folder containing category_tree.csv.
126
+
127
+     Returns:
128
+         dict: A dictionary mapping {categoryid: parentid}.
129
+     """
130
+     print("Loading category tree...")
131
+     try:
132
+         tree_df = pd.read_csv(data_folder + 'category_tree.csv')
133
+         category_parent_map = tree_df.set_index('categoryid')['parentid'].to_dict()
134
+         print("Category tree loaded successfully.")
135
+         return category_parent_map
136
+     except FileNotFoundError:
137
+         print("Warning: 'category_tree.csv' not found. Cannot display parent category information.")
138
+         return {}
139
+
140
+ def get_popular_items(train_df, k=10):
141
+     """
142
+     Calculates the top-k most popular items based on transaction count.
143
+     """
144
+     purchase_counts = train_df[train_df['event'] == 'transaction']['itemid'].value_counts()
145
+     return purchase_counts.head(k).index.tolist()
146
+
147
+ def show_user_recommendations(visitor_id, model, datamodule, popular_items, item_category_map, category_parent_map, k=10):
148
+     """
149
+     Displays recommendations for a user, including category and parent category information.
150
+     """
151
+     print(f"\n--- Recommendations for Visitor ID: {visitor_id} ---")
152
+     model.eval()
153
+
154
+     def format_item_with_category(item_id):
155
+         category_id = item_category_map.get(item_id, 'N/A')
156
+         parent_id = category_parent_map.get(category_id, 'N/A') if category_id != 'N/A' else 'N/A'
157
+         return f"Item: {item_id} (Category: {category_id}, Parent: {parent_id})"
158
+
159
+     user_history_ids = datamodule.user_history.get(visitor_id)
160
+
161
+     if user_history_ids is None:
162
+         print(f"User {visitor_id} not found in training history. Providing popularity-based recommendations.")
163
+         print(f"\nTop {k} Popular Items (Fallback):")
164
+         recs_with_cats = [format_item_with_category(item_id) for item_id in popular_items]
165
+         print(recs_with_cats)
166
+         print("-------------------------------------------------")
167
+         return
168
+
169
+     history_with_cats = [format_item_with_category(item_id) for item_id in user_history_ids]
170
+     print("User's Historical Interactions:")
171
+     print(history_with_cats)
172
+
173
+     history_indices = [datamodule.item_map[i] for i in user_history_ids if i in datamodule.item_map]
174
+     if not history_indices:
175
+         print("None of the user's historical items are in the model's vocabulary.")
176
+         return
177
+
178
+     max_len = datamodule.max_len
179
+     input_seq = history_indices[-max_len:]
180
+     padded_input = np.zeros(max_len, dtype=np.int64)
181
+     padded_input[-len(input_seq):] = input_seq
182
+
183
+     input_tensor = torch.LongTensor(np.array([padded_input]))
184
+     input_tensor = input_tensor.to(model.device)
185
+
186
+     with torch.no_grad():
187
+         logits = model(input_tensor)
188
+         last_item_logits = logits[0, -1, :]
189
+         top_indices = torch.topk(last_item_logits, k).indices.tolist()
190
+
191
+     recommended_item_ids = [datamodule.inverse_item_map[idx] for idx in top_indices if idx in datamodule.inverse_item_map]
192
+
193
+     print(f"\nTop {k} Recommended Items:")
194
+     recs_with_cats = [format_item_with_category(item_id) for item_id in recommended_item_ids]
195
+     print(recs_with_cats)
196
+     print("-------------------------------------------------")
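A minimal, self-contained usage sketch of `calculate_metrics` on toy user/item IDs (not taken from the RetailRocket data); the function body is repeated here so the snippet runs on its own:

```python
import numpy as np

def calculate_metrics(recommendations_dict, ground_truth_dict, k):
    # Copied from the module above so this sketch runs standalone.
    all_precisions, all_recalls, all_hits = [], [], []
    for user_id, true_items in ground_truth_dict.items():
        recs = recommendations_dict.get(user_id, [])[:k]
        if not true_items:
            continue
        hits = len(set(recs) & true_items)
        all_precisions.append(hits / k if k > 0 else 0)
        all_recalls.append(hits / len(true_items))
        all_hits.append(1.0 if hits > 0 else 0.0)
    if not all_precisions:
        return {"mean_precision@k": 0, "mean_recall@k": 0, "mean_hitrate@k": 0}
    return {
        "mean_precision@k": np.mean(all_precisions),
        "mean_recall@k": np.mean(all_recalls),
        "mean_hitrate@k": np.mean(all_hits),
    }

# Toy example: user 1 gets one of their two true items in the top 3; user 2 gets none.
recommendations = {1: [10, 20, 30], 2: [50, 60, 70]}
ground_truth = {1: {20, 40}, 2: {99}}
metrics = calculate_metrics(recommendations, ground_truth, k=3)
print(metrics)  # precision = (1/3 + 0)/2, recall = (1/2 + 0)/2 = 0.25, hit rate = 0.5
```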