| """ | |
| Title: A Transformer-based recommendation system | |
| Author: [Khalid Salama](https://www.linkedin.com/in/khalid-salama-24403144/) | |
| Date created: 2020/12/30 | |
| Last modified: 2025/01/27 | |
| Description: Rating rate prediction using the Behavior Sequence Transformer (BST) model on the Movielens. | |
| Accelerator: GPU | |
| Made backend-agnostic by: [Humbulani Ndou](https://github.com/Humbulani1234) | |
| """ | |
| """ | |
| ## Introduction | |
| This example demonstrates the [Behavior Sequence Transformer (BST)](https://arxiv.org/abs/1905.06874) | |
| model, by Qiwei Chen et al., using the [Movielens dataset](https://grouplens.org/datasets/movielens/). | |
| The BST model leverages the sequential behaviour of the users in watching and rating movies, | |
| as well as user profile and movie features, to predict the rating of the user to a target movie. | |
| More precisely, the BST model aims to predict the rating of a target movie by accepting | |
| the following inputs: | |
| 1. A fixed-length *sequence* of `movie_ids` watched by a user. | |
| 2. A fixed-length *sequence* of the `ratings` for the movies watched by a user. | |
| 3. A *set* of user features, including `user_id`, `sex`, `occupation`, and `age_group`. | |
| 4. A *set* of `genres` for each movie in the input sequence and the target movie. | |
| 5. A `target_movie_id` for which to predict the rating. | |
| This example modifies the original BST model in the following ways: | |
| 1. We incorporate the movie features (genres) into the processing of the embedding of each | |
| movie of the input sequence and the target movie, rather than treating them as "other features" | |
| outside the transformer layer. | |
| 2. We utilize the ratings of movies in the input sequence, along with the their positions | |
| in the sequence, to update them before feeding them into the self-attention layer. | |
| Note that this example should be run with TensorFlow 2.4 or higher. | |
| """ | |
| """ | |
| ## The dataset | |
| We use the [1M version of the Movielens dataset](https://grouplens.org/datasets/movielens/1m/). | |
| The dataset includes around 1 million ratings from 6000 users on 4000 movies, | |
| along with some user features, movie genres. In addition, the timestamp of each user-movie | |
| rating is provided, which allows creating sequences of movie ratings for each user, | |
| as expected by the BST model. | |
| """ | |
| """ | |
| ## Setup | |
| """ | |
| import os | |
| os.environ["KERAS_BACKEND"] = "jax" # or torch, or tensorflow | |
| import math | |
| from zipfile import ZipFile | |
| from urllib.request import urlretrieve | |
| import numpy as np | |
| import pandas as pd | |
| import keras | |
| from keras import layers, ops | |
| from keras.layers import StringLookup | |
| """ | |
| ## Prepare the data | |
| ### Download and prepare the DataFrames | |
| First, let's download the movielens data. | |
| The downloaded folder will contain three data files: `users.dat`, `movies.dat`, | |
| and `ratings.dat`. | |
| """ | |
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip", "movielens.zip")
ZipFile("movielens.zip", "r").extractall()

"""
Then, we load the data into pandas DataFrames with their proper column names.
"""
users = pd.read_csv(
    "ml-1m/users.dat",
    sep="::",
    names=["user_id", "sex", "age_group", "occupation", "zip_code"],
    encoding="ISO-8859-1",
    engine="python",
)

ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    names=["user_id", "movie_id", "rating", "unix_timestamp"],
    encoding="ISO-8859-1",
    engine="python",
)

movies = pd.read_csv(
    "ml-1m/movies.dat",
    sep="::",
    names=["movie_id", "title", "genres"],
    encoding="ISO-8859-1",
    engine="python",
)
| """ | |
| Here, we do some simple data processing to fix the data types of the columns. | |
| """ | |
| users["user_id"] = users["user_id"].apply(lambda x: f"user_{x}") | |
| users["age_group"] = users["age_group"].apply(lambda x: f"group_{x}") | |
| users["occupation"] = users["occupation"].apply(lambda x: f"occupation_{x}") | |
| movies["movie_id"] = movies["movie_id"].apply(lambda x: f"movie_{x}") | |
| ratings["movie_id"] = ratings["movie_id"].apply(lambda x: f"movie_{x}") | |
| ratings["user_id"] = ratings["user_id"].apply(lambda x: f"user_{x}") | |
| ratings["rating"] = ratings["rating"].apply(lambda x: float(x)) | |
| """ | |
| Each movie has multiple genres. We split them into separate columns in the `movies` | |
| DataFrame. | |
| """ | |
| genres = ["Action", "Adventure", "Animation", "Children's", "Comedy", "Crime"] | |
| genres += ["Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical"] | |
| genres += ["Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"] | |
| for genre in genres: | |
| movies[genre] = movies["genres"].apply( | |
| lambda values: int(genre in values.split("|")) | |
| ) | |
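
"""
To quickly sanity-check the multi-hot encoding (optional), you can inspect a few rows of
the newly added genre columns:
"""

# Print the first few movies with their multi-hot genre indicators.
print(movies[["movie_id"] + genres].head())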
| """ | |
| ### Transform the movie ratings data into sequences | |
| First, let's sort the the ratings data using the `unix_timestamp`, and then group the | |
| `movie_id` values and the `rating` values by `user_id`. | |
| The output DataFrame will have a record for each `user_id`, with two ordered lists | |
| (sorted by rating datetime): the movies they have rated, and their ratings of these movies. | |
| """ | |
ratings_group = ratings.sort_values(by=["unix_timestamp"]).groupby("user_id")

ratings_data = pd.DataFrame(
    data={
        "user_id": list(ratings_group.groups.keys()),
        "movie_ids": list(ratings_group.movie_id.apply(list)),
        "ratings": list(ratings_group.rating.apply(list)),
        "timestamps": list(ratings_group.unix_timestamp.apply(list)),
    }
)
| """ | |
| Now, let's split the `movie_ids` list into a set of sequences of a fixed length. | |
| We do the same for the `ratings`. Set the `sequence_length` variable to change the length | |
| of the input sequence to the model. You can also change the `step_size` to control the | |
| number of sequences to generate for each user. | |
| """ | |
sequence_length = 4
step_size = 2


def create_sequences(values, window_size, step_size):
    sequences = []
    start_index = 0
    while True:
        end_index = start_index + window_size
        seq = values[start_index:end_index]
        if len(seq) < window_size:
            seq = values[-window_size:]
            if len(seq) == window_size:
                sequences.append(seq)
            break
        sequences.append(seq)
        start_index += step_size
    return sequences
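
"""
As a quick illustration (not part of the original pipeline), here is what the windowing
produces for a toy list of seven items with `window_size=4` and `step_size=2`:
"""

# The last window is padded backwards from the end so that it is always full-length.
print(create_sequences(list(range(1, 8)), window_size=4, step_size=2))
# Expected output: [[1, 2, 3, 4], [3, 4, 5, 6], [4, 5, 6, 7]]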
ratings_data.movie_ids = ratings_data.movie_ids.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)
ratings_data.ratings = ratings_data.ratings.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)
del ratings_data["timestamps"]
| """ | |
| After that, we process the output to have each sequence in a separate records in | |
| the DataFrame. In addition, we join the user features with the ratings data. | |
| """ | |
ratings_data_movies = ratings_data[["user_id", "movie_ids"]].explode(
    "movie_ids", ignore_index=True
)
ratings_data_rating = ratings_data[["ratings"]].explode("ratings", ignore_index=True)
ratings_data_transformed = pd.concat([ratings_data_movies, ratings_data_rating], axis=1)
ratings_data_transformed = ratings_data_transformed.join(
    users.set_index("user_id"), on="user_id"
)
ratings_data_transformed.movie_ids = ratings_data_transformed.movie_ids.apply(
    lambda x: ",".join(x)
)
ratings_data_transformed.ratings = ratings_data_transformed.ratings.apply(
    lambda x: ",".join([str(v) for v in x])
)

del ratings_data_transformed["zip_code"]

ratings_data_transformed.rename(
    columns={"movie_ids": "sequence_movie_ids", "ratings": "sequence_ratings"},
    inplace=True,
)
| """ | |
| With `sequence_length` of 4 and `step_size` of 2, we end up with 498,623 sequences. | |
| Finally, we split the data into training and testing splits, with 85% and 15% of | |
| the instances, respectively, and store them to CSV files. | |
| """ | |
random_selection = np.random.rand(len(ratings_data_transformed.index)) <= 0.85
train_data = ratings_data_transformed[random_selection]
test_data = ratings_data_transformed[~random_selection]

train_data.to_csv("train_data.csv", index=False, sep="|", header=False)
test_data.to_csv("test_data.csv", index=False, sep="|", header=False)
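
"""
As an optional sanity check (the exact counts vary with the random split), you can print
the number of sequences that ended up in each split:
"""

print(f"Train sequences: {len(train_data)}, test sequences: {len(test_data)}")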
| """ | |
| ## Define metadata | |
| """ | |
| CSV_HEADER = list(ratings_data_transformed.columns) | |
| CATEGORICAL_FEATURES_WITH_VOCABULARY = { | |
| "user_id": list(users.user_id.unique()), | |
| "movie_id": list(movies.movie_id.unique()), | |
| "sex": list(users.sex.unique()), | |
| "age_group": list(users.age_group.unique()), | |
| "occupation": list(users.occupation.unique()), | |
| } | |
| USER_FEATURES = ["sex", "age_group", "occupation"] | |
| MOVIE_FEATURES = ["genres"] | |
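
"""
Optionally, you can print the vocabulary sizes that will drive the embedding dimensions
used below:
"""

# Each embedding dimension below is the square root of the corresponding vocabulary size.
for feature_name, vocabulary in CATEGORICAL_FEATURES_WITH_VOCABULARY.items():
    print(feature_name, len(vocabulary))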
| """ | |
| ## Encode input features | |
| The `encode_input_features` function works as follows: | |
| 1. Each categorical user feature is encoded using `layers.Embedding`, with embedding | |
| dimension equals to the square root of the vocabulary size of the feature. | |
| The embeddings of these features are concatenated to form a single input tensor. | |
| 2. Each movie in the movie sequence and the target movie is encoded `layers.Embedding`, | |
| where the dimension size is the square root of the number of movies. | |
| 3. A multi-hot genres vector for each movie is concatenated with its embedding vector, | |
| and processed using a non-linear `layers.Dense` to output a vector of the same movie | |
| embedding dimensions. | |
| 4. A positional embedding is added to each movie embedding in the sequence, and then | |
| multiplied by its rating from the ratings sequence. | |
| 5. The target movie embedding is concatenated to the sequence movie embeddings, producing | |
| a tensor with the shape of `[batch size, sequence length, embedding size]`, as expected | |
| by the attention layer for the transformer architecture. | |
| 6. The method returns a tuple of two elements: `encoded_transformer_features` and | |
| `encoded_other_features`. | |
| """ | |
# Required for tf.data.Dataset
import tensorflow as tf


def get_dataset_from_csv(csv_file_path, batch_size, shuffle=True):

    def process(features):
        movie_ids_string = features["sequence_movie_ids"]
        sequence_movie_ids = tf.strings.split(movie_ids_string, ",").to_tensor()

        # The last movie id in the sequence is the target movie.
        features["target_movie_id"] = sequence_movie_ids[:, -1]
        features["sequence_movie_ids"] = sequence_movie_ids[:, :-1]

        # Sequence ratings
        ratings_string = features["sequence_ratings"]
        sequence_ratings = tf.strings.to_number(
            tf.strings.split(ratings_string, ","), tf.dtypes.float32
        ).to_tensor()

        # The last rating in the sequence is the target for the model to predict.
        target = sequence_ratings[:, -1]
        features["sequence_ratings"] = sequence_ratings[:, :-1]

        def encoding_helper(feature_name):
            # These are target_movie_id and sequence_movie_ids; they share the same
            # vocabulary as movie_id.
            if feature_name not in CATEGORICAL_FEATURES_WITH_VOCABULARY:
                vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY["movie_id"]
                index_lookup = StringLookup(
                    vocabulary=vocabulary, mask_token=None, num_oov_indices=0
                )
                # Convert the string input values into integer indices.
                value_index = index_lookup(features[feature_name])
                features[feature_name] = value_index
            else:
                # movie_id is not part of the features, hence not processed. It was mainly
                # required for its vocabulary above.
                if feature_name == "movie_id":
                    pass
                else:
                    vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
                    index_lookup = StringLookup(
                        vocabulary=vocabulary, mask_token=None, num_oov_indices=0
                    )
                    # Convert the string input values into integer indices.
                    value_index = index_lookup(features[feature_name])
                    features[feature_name] = value_index

        # Encode the user features.
        for feature_name in CATEGORICAL_FEATURES_WITH_VOCABULARY:
            encoding_helper(feature_name)
        # Encode the target_movie_id.
        encoding_helper("target_movie_id")
        # Encode the sequence movie_ids.
        encoding_helper("sequence_movie_ids")
        return dict(features), target

    dataset = tf.data.experimental.make_csv_dataset(
        csv_file_path,
        batch_size=batch_size,
        column_names=CSV_HEADER,
        num_epochs=1,
        header=False,
        field_delim="|",
        shuffle=shuffle,
    ).map(process)
    return dataset
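
"""
As a quick, optional check of the input pipeline, you can pull a single small batch and
look at the feature shapes (the exact values depend on your random split):
"""

example_ds = get_dataset_from_csv("train_data.csv", batch_size=2)
for example_features, example_target in example_ds.take(1):
    # Each feature is a batched tensor; the target is the last rating in each sequence.
    print({name: value.shape for name, value in example_features.items()})
    print("target shape:", example_target.shape)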
def encode_input_features(
    inputs,
    include_user_id,
    include_user_features,
    include_movie_features,
):
    encoded_transformer_features = []
    encoded_other_features = []

    other_feature_names = []
    if include_user_id:
        other_feature_names.append("user_id")
    if include_user_features:
        other_feature_names.extend(USER_FEATURES)

    ## Encode user features
    for feature_name in other_feature_names:
        vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
        # Compute embedding dimensions
        embedding_dims = int(math.sqrt(len(vocabulary)))
        # Create an embedding layer with the specified dimensions.
        embedding_encoder = layers.Embedding(
            input_dim=len(vocabulary),
            output_dim=embedding_dims,
            name=f"{feature_name}_embedding",
        )
        # Convert the index values to embedding representations.
        encoded_other_features.append(embedding_encoder(inputs[feature_name]))

    ## Create a single embedding vector for the user features
    if len(encoded_other_features) > 1:
        encoded_other_features = layers.concatenate(encoded_other_features)
    elif len(encoded_other_features) == 1:
        encoded_other_features = encoded_other_features[0]
    else:
        encoded_other_features = None

    ## Create a movie embedding encoder
    movie_vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY["movie_id"]
    movie_embedding_dims = int(math.sqrt(len(movie_vocabulary)))
    # Create an embedding layer with the specified dimensions.
    movie_embedding_encoder = layers.Embedding(
        input_dim=len(movie_vocabulary),
        output_dim=movie_embedding_dims,
        name="movie_embedding",
    )
    # Create a vector lookup for movie genres.
    genre_vectors = movies[genres].to_numpy()
    movie_genres_lookup = layers.Embedding(
        input_dim=genre_vectors.shape[0],
        output_dim=genre_vectors.shape[1],
        embeddings_initializer=keras.initializers.Constant(genre_vectors),
        trainable=False,
        name="genres_vector",
    )
    # Create a processing layer for genres.
    movie_embedding_processor = layers.Dense(
        units=movie_embedding_dims,
        activation="relu",
        name="process_movie_embedding_with_genres",
    )

    ## Define a function to encode a given movie id.
    def encode_movie(movie_id):
        # Convert the string input values into integer indices.
        movie_embedding = movie_embedding_encoder(movie_id)
        encoded_movie = movie_embedding
        if include_movie_features:
            movie_genres_vector = movie_genres_lookup(movie_id)
            encoded_movie = movie_embedding_processor(
                layers.concatenate([movie_embedding, movie_genres_vector])
            )
        return encoded_movie

    ## Encoding target_movie_id
    target_movie_id = inputs["target_movie_id"]
    encoded_target_movie = encode_movie(target_movie_id)

    ## Encoding sequence movie_ids.
    sequence_movies_ids = inputs["sequence_movie_ids"]
    encoded_sequence_movies = encode_movie(sequence_movies_ids)
    # Create positional embedding.
    position_embedding_encoder = layers.Embedding(
        input_dim=sequence_length,
        output_dim=movie_embedding_dims,
        name="position_embedding",
    )
    positions = ops.arange(start=0, stop=sequence_length - 1, step=1)
    encoded_positions = position_embedding_encoder(positions)
    # Retrieve sequence ratings to incorporate them into the encoding of the movie.
    sequence_ratings = inputs["sequence_ratings"]
    sequence_ratings = ops.expand_dims(sequence_ratings, -1)
    # Add the positional encoding to the movie encodings and multiply them by rating.
    encoded_sequence_movies_with_position_and_rating = layers.Multiply()(
        [(encoded_sequence_movies + encoded_positions), sequence_ratings]
    )

    # Construct the transformer inputs.
    for i in range(sequence_length - 1):
        feature = encoded_sequence_movies_with_position_and_rating[:, i, ...]
        feature = ops.expand_dims(feature, 1)
        encoded_transformer_features.append(feature)
    encoded_transformer_features.append(encoded_target_movie)
    encoded_transformer_features = layers.concatenate(
        encoded_transformer_features, axis=1
    )
    return encoded_transformer_features, encoded_other_features
| """ | |
| ## Create model inputs | |
| """ | |
def create_model_inputs():
    return {
        "user_id": keras.Input(name="user_id", shape=(1,), dtype="int32"),
        "sequence_movie_ids": keras.Input(
            name="sequence_movie_ids", shape=(sequence_length - 1,), dtype="int32"
        ),
        "target_movie_id": keras.Input(
            name="target_movie_id", shape=(1,), dtype="int32"
        ),
        "sequence_ratings": keras.Input(
            name="sequence_ratings", shape=(sequence_length - 1,), dtype="float32"
        ),
        "sex": keras.Input(name="sex", shape=(1,), dtype="int32"),
        "age_group": keras.Input(name="age_group", shape=(1,), dtype="int32"),
        "occupation": keras.Input(name="occupation", shape=(1,), dtype="int32"),
    }
| """ | |
| ## Create a BST model | |
| """ | |
include_user_id = False
include_user_features = False
include_movie_features = False

hidden_units = [256, 128]
dropout_rate = 0.1
num_heads = 3


def create_model():
    inputs = create_model_inputs()
    transformer_features, other_features = encode_input_features(
        inputs, include_user_id, include_user_features, include_movie_features
    )
    # Create a multi-headed attention layer.
    attention_output = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=transformer_features.shape[2], dropout=dropout_rate
    )(transformer_features, transformer_features)

    # Transformer block.
    attention_output = layers.Dropout(dropout_rate)(attention_output)
    x1 = layers.Add()([transformer_features, attention_output])
    x1 = layers.LayerNormalization()(x1)
    x2 = layers.LeakyReLU()(x1)
    x2 = layers.Dense(units=x2.shape[-1])(x2)
    x2 = layers.Dropout(dropout_rate)(x2)
    transformer_features = layers.Add()([x1, x2])
    transformer_features = layers.LayerNormalization()(transformer_features)
    features = layers.Flatten()(transformer_features)

    # Include the other features, if any.
    if other_features is not None:
        features = layers.concatenate(
            [features, layers.Reshape([other_features.shape[-1]])(other_features)]
        )

    # Fully-connected layers.
    for num_units in hidden_units:
        features = layers.Dense(num_units)(features)
        features = layers.BatchNormalization()(features)
        features = layers.LeakyReLU()(features)
        features = layers.Dropout(dropout_rate)(features)

    outputs = layers.Dense(units=1)(features)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


model = create_model()
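
"""
Optionally, you can inspect the assembled architecture before training:
"""

model.summary()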
| """ | |
| ## Run training and evaluation experiment | |
| """ | |
| # Compile the model. | |
| model.compile( | |
| optimizer=keras.optimizers.Adagrad(learning_rate=0.01), | |
| loss=keras.losses.MeanSquaredError(), | |
| metrics=[keras.metrics.MeanAbsoluteError()], | |
| ) | |
# Read the training data.
train_dataset = get_dataset_from_csv("train_data.csv", batch_size=265, shuffle=True)

# Fit the model with the training data.
model.fit(train_dataset, epochs=2)

# Read the test data.
test_dataset = get_dataset_from_csv("test_data.csv", batch_size=265)

# Evaluate the model on the test data.
_, mae = model.evaluate(test_dataset, verbose=0)
print(f"Test MAE: {round(mae, 3)}")
| """ | |
| You should achieve a Mean Absolute Error (MAE) at or around 0.7 on the test data. | |
| """ | |
| """ | |
| ## Conclusion | |
| The BST model uses the Transformer layer in its architecture to capture the sequential signals underlying | |
| users’ behavior sequences for recommendation. | |
| You can try training this model with different configurations, for example, by increasing | |
| the input sequence length and training the model for a larger number of epochs. In addition, | |
| you can try including other features like movie release year and customer | |
| zipcode, and including cross features like sex X genre. | |
| """ | |