	"""
	Title: Sentence embeddings using Siamese RoBERTa-networks
	Author: [Mohammed Abu El-Nasr](https://github.com/abuelnasr0)
	Date created: 2023/07/14
	Last modified: 2023/07/14
	Description: Fine-tune a RoBERTa model to generate sentence embeddings using KerasHub.
	Accelerator: GPU
	"""

	"""
	## Introduction

	BERT and RoBERTa can be used for semantic textual similarity tasks, where two sentences
	are passed to the model and the network predicts whether they are similar or not. But
	what if we have a large collection of sentences and want to find the most similar pairs
	in that collection? That takes n*(n-1)/2 inference computations, where n is the number
	of sentences in the collection. For example, if n = 10,000, that is about 50 million
	computations, which takes roughly 65 hours on a V100 GPU.

	A common way to avoid this overhead is to pass each sentence to the model once, then
	average the model's output, or take the output of the first token (the [CLS] token), and
	use the result as a [sentence embedding](https://en.wikipedia.org/wiki/Sentence_embedding).
	A vector similarity measure such as cosine similarity or Manhattan / Euclidean distance
	then finds close (semantically similar) sentences. That reduces the time to find the
	most similar pairs in a collection of 10,000 sentences from 65 hours to about 5
	seconds!

	If we use RoBERTa directly, that will yield rather bad sentence embeddings. But if we
	fine-tune RoBERTa using a Siamese network, that will generate semantically meaningful
	sentence embeddings. This will enable RoBERTa to be used for new tasks. These tasks
	include:

	- Large-scale semantic similarity comparison.
	- Clustering.
	- Information retrieval via semantic search.

	In this example, we will show how to fine-tune a RoBERTa model using a Siamese network
	such that it will be able to produce semantically meaningful sentence embeddings and use
	them in a semantic search and clustering example.
	This method of fine-tuning was introduced in
	[Sentence-BERT](https://arxiv.org/abs/1908.10084).
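
As a rough, framework-free illustration of the two costs described above, the sketch
below counts the pairs a pairwise model would have to score, then finds the closest pair
by comparing precomputed embeddings. The helper names and the 3-dimensional toy vectors
are hypothetical and chosen only for this illustration; real embeddings would come from
the encoder built later in this example.

```python
import math


def num_pairs(n):
    # Number of unordered sentence pairs in a collection of n sentences.
    return n * (n - 1) // 2


def normalize(vec):
    # Scale a vector to unit length so that a dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar_pair(embeddings):
    # Compare precomputed embeddings pairwise; no model inference happens here.
    unit = [normalize(e) for e in embeddings]
    best_score, best_pair = -2.0, None
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            score = sum(a * b for a, b in zip(unit[i], unit[j]))
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair, best_score


# Hypothetical toy embeddings, for illustration only.
toy_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
pair, score = most_similar_pair(toy_embeddings)

print(num_pairs(10000))  # 49995000
print(pair)  # (0, 1)
```

The pairwise loop is still quadratic, but each step is a cheap vector operation rather
than a forward pass through the model.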
	"""

	"""
	## Setup

	Let's install and import the libraries we need. We'll be using the KerasHub library in
	this example.

	We will also enable [mixed precision](https://www.tensorflow.org/guide/mixed_precision)
	training. This will help us reduce the training time.
	"""

	"""shell
	pip install -q --upgrade keras-hub
	pip install -q --upgrade keras # Upgrade to Keras 3.
	"""

	import os

	os.environ["KERAS_BACKEND"] = "tensorflow"

	import keras
	import keras_hub
	import tensorflow as tf
	import tensorflow_datasets as tfds
	import sklearn.cluster as cluster

	keras.mixed_precision.set_global_policy("mixed_float16")

	"""
	## Fine-tune the model using siamese networks

	A [Siamese network](https://en.wikipedia.org/wiki/Siamese_neural_network) is a neural
	network architecture that contains two or more subnetworks. The subnetworks share the
	same weights. It is used to generate feature vectors for each input and then compare them
	for similarity.

	For our example, the subnetwork will be a RoBERTa model that has a pooling layer on top
	of it to produce the embeddings of the input sentences. These embeddings will then be
	compared to each other to learn to produce semantically meaningful embeddings.

	The Sentence-BERT paper compares mean, max, and CLS pooling and reports the best results
	with mean pooling, so we will use it in our examples.
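
To make the three pooling strategies concrete, here is a minimal plain-Python sketch.
The function names and the toy token vectors are hypothetical; the model in this example
uses `keras.layers.GlobalAveragePooling1D` with a padding mask rather than these helpers.

```python
def mean_pool(token_vectors, mask):
    # Average only the real tokens (mask == 1); padded positions are skipped.
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    return [sum(column) / len(kept) for column in zip(*kept)]


def max_pool(token_vectors, mask):
    # Take the largest value of each dimension over the real tokens.
    kept = [vec for vec, m in zip(token_vectors, mask) if m]
    return [max(column) for column in zip(*kept)]


def cls_pool(token_vectors):
    # Use the vector of the first token as the sentence embedding.
    return list(token_vectors[0])


# Hypothetical token vectors for one sentence; the last position is padding.
token_vectors = [[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]]
padding_mask = [1, 1, 0]

print(mean_pool(token_vectors, padding_mask))  # [2.0, 2.0]
print(max_pool(token_vectors, padding_mask))  # [3.0, 4.0]
print(cls_pool(token_vectors))  # [1.0, 4.0]
```

Without the mask, the padded vector `[9.0, 9.0]` would distort both the mean and the max.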
	"""

	"""
	### Fine-tune using the regression objective function

	For building the siamese network with the regression objective function, the siamese
	network is asked to predict the cosine similarity between the embeddings of the two input
	sentences.

	Cosine similarity is the cosine of the angle between the sentence embeddings. If the
	cosine similarity is close to 1, the angle between the embeddings is small; hence, they
	are semantically similar.
	"""

	"""
	#### Load the dataset

	We will use the STSB dataset to fine-tune the model for the regression objective. STSB
	consists of a collection of sentence pairs that are labelled in the range [0, 5]. 0
	indicates the least semantic similarity between the two sentences, and 5 indicates the
	most semantic similarity between the two sentences.

	The range of the cosine similarity is [-1, 1] and it's the output of the siamese network,
	but the range of the labels in the dataset is [0, 5]. We need to unify the range between
	the cosine similarity and the dataset labels, so while preparing the dataset, we will
	divide the labels by 2.5 and subtract 1.
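
The rescaling can be checked with a few values. This is a small sketch; the function
names are hypothetical, and the inverse mapping is an addition that is only useful if you
want to report predictions on the original STSB scale.

```python
def to_cosine_range(label):
    # Map an STSB label in [0, 5] to the cosine similarity range [-1, 1].
    return label / 2.5 - 1


def to_label_range(score):
    # Inverse mapping: a cosine similarity in [-1, 1] back to the [0, 5] scale.
    return (score + 1) * 2.5


print([to_cosine_range(x) for x in (0, 2.5, 5)])  # [-1.0, 0.0, 1.0]
print([to_label_range(s) for s in (-1.0, 0.0, 1.0)])  # [0.0, 2.5, 5.0]
```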
	"""

	TRAIN_BATCH_SIZE = 6
	VALIDATION_BATCH_SIZE = 8

	TRAIN_NUM_BATCHES = 300
	VALIDATION_NUM_BATCHES = 40

	AUTOTUNE = tf.data.AUTOTUNE


	def change_range(x):
	    # Map STSB labels from [0, 5] to the cosine similarity range [-1, 1].
	    return (x / 2.5) - 1


	def prepare_dataset(dataset, num_batches, batch_size):
	    # Pair the two sentences as the input and the rescaled label as the target.
	    dataset = dataset.map(
	        lambda z: (
	            [z["sentence1"], z["sentence2"]],
	            [tf.cast(change_range(z["label"]), tf.float32)],
	        ),
	        num_parallel_calls=AUTOTUNE,
	    )
	    dataset = dataset.batch(batch_size)
	    dataset = dataset.take(num_batches)
	    dataset = dataset.prefetch(AUTOTUNE)
	    return dataset


	stsb_ds = tfds.load(
	    "glue/stsb",
	)
	stsb_train, stsb_valid = stsb_ds["train"], stsb_ds["validation"]

	stsb_train = prepare_dataset(stsb_train, TRAIN_NUM_BATCHES, TRAIN_BATCH_SIZE)
	stsb_valid = prepare_dataset(stsb_valid, VALIDATION_NUM_BATCHES, VALIDATION_BATCH_SIZE)

	"""
	Let's look at a few sentence pairs from the dataset and their similarity scores.
	"""

	for x, y in stsb_train:
	    for i, example in enumerate(x):
	        print(f"sentence 1 : {example[0]} ")
	        print(f"sentence 2 : {example[1]} ")
	        print(f"similarity : {y[i]} \n")
	    break

	"""
	#### Build the encoder model

	Now, we'll build the encoder model that will produce the sentence embeddings. It consists
	of:

	- A preprocessor layer to tokenize and generate padding masks for the sentences.
	- A backbone model that will generate the contextual representation of each token in the
	sentence.
	- A mean pooling layer to produce the embeddings. We will use `keras.layers.GlobalAveragePooling1D`
	to apply the mean pooling to the backbone outputs. We will pass the padding mask to the
	layer to exclude padded tokens from being averaged.
	- A normalization layer to normalize the embeddings as we are using the cosine similarity.
	"""

	preprocessor = keras_hub.models.RobertaPreprocessor.from_preset("roberta_base_en")
	backbone = keras_hub.models.RobertaBackbone.from_preset("roberta_base_en")
	inputs = keras.Input(shape=(1,), dtype="string", name="sentence")
	x = preprocessor(inputs)
	h = backbone(x)
	embedding = keras.layers.GlobalAveragePooling1D(name="pooling_layer")(
	    h, x["padding_mask"]
	)
	n_embedding = keras.layers.UnitNormalization(axis=1)(embedding)
	roberta_normal_encoder = keras.Model(inputs=inputs, outputs=n_embedding)

	roberta_normal_encoder.summary()

	"""
	#### Build the Siamese network with the regression objective function

	As described above, a Siamese network has two or more subnetworks, so this Siamese
	model needs two encoders. Instead of building two encoders, we use a single encoder and
	pass both sentences through it. That gives us two paths that produce embeddings while
	sharing the same weights.

	After passing the two sentences through the encoder and getting the normalized
	embeddings, we take the dot product of the two normalized embeddings to get the cosine
	similarity between the two sentences.
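
The reason a dot product is enough can be seen in a small plain-Python sketch: once two
vectors are scaled to unit length, their dot product is their cosine similarity. The
helper names and toy vectors are hypothetical and stand in for the encoder outputs.

```python
import math


def unit(vec):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pairwise_cosine(batch_u, batch_v):
    # One score per (u_i, v_i) pair: the dot product of the two unit vectors.
    return [
        sum(a * b for a, b in zip(unit(u), unit(v)))
        for u, v in zip(batch_u, batch_v)
    ]


# Hypothetical embeddings for a batch of two sentence pairs.
batch_u = [[3.0, 4.0], [1.0, 0.0]]
batch_v = [[6.0, 8.0], [0.0, 2.0]]
scores = pairwise_cosine(batch_u, batch_v)
print(scores)  # approximately [1.0, 0.0]
```

The first pair points in the same direction (score near 1), and the second pair is
orthogonal (score 0), regardless of the vectors' lengths.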
	"""


	class RegressionSiamese(keras.Model):
	    def __init__(self, encoder, **kwargs):
	        inputs = keras.Input(shape=(2,), dtype="string", name="sentences")
	        sen1, sen2 = keras.ops.split(inputs, 2, axis=1)
	        # The same encoder (shared weights) embeds both sentences.
	        u = encoder(sen1)
	        v = encoder(sen2)
	        # Row-wise dot product of the unit-length embeddings: one cosine
	        # similarity per sentence pair, with shape (batch, 1) like the labels.
	        cosine_similarity_scores = keras.ops.sum(u * v, axis=1, keepdims=True)

	        super().__init__(
	            inputs=inputs,
	            outputs=cosine_similarity_scores,
	            **kwargs,
	        )

	        self.encoder = encoder

	    def get_encoder(self):
	        return self.encoder


	"""
	#### Fit the model

	Let's try this example before training and compare it to the output after training.
	"""

	sentences = [
	    "Today is a very sunny day.",
	    "I am hungry, I will get my meal.",
	    "The dog is eating his food.",
	]
	query = ["The dog is enjoying his meal."]

	encoder = roberta_normal_encoder

	sentence_embeddings = encoder(tf.constant(sentences))
	query_embedding = encoder(tf.constant(query))

	cosine_similarity_scores = tf.matmul(query_embedding, tf.transpose(sentence_embeddings))
	for i, sim in enumerate(cosine_similarity_scores[0]):
	    print(f"cosine similarity score between sentence {i+1} and the query = {sim} ")

	"""
	For training, we will use `MeanSquaredError()` as the loss function and the `Adam()`
	optimizer with a learning rate of 2e-5.
	"""

	roberta_regression_siamese = RegressionSiamese(roberta_normal_encoder)

	roberta_regression_siamese.compile(
	    loss=keras.losses.MeanSquaredError(),
	    optimizer=keras.optimizers.Adam(2e-5),
	    jit_compile=False,
	)

	roberta_regression_siamese.fit(stsb_train, validation_data=stsb_valid, epochs=1)

	"""
	Let's try the model after training. We should see a clear difference in the output,
	which means that the fine-tuned model is capable of producing semantically meaningful
	embeddings: semantically similar sentences have a small angle between them, and
	semantically dissimilar sentences have a large angle between them.
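
Ranking sentences by their similarity to a query is the core of semantic search. The
sketch below shows that ranking step in plain Python. The function name is hypothetical,
and the scores and sentence labels are placeholders, not outputs of the model in this
example.

```python
def rank_by_similarity(scores, sentences, k=2):
    # Pair each score with its sentence and sort from most to least similar.
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


# Placeholder values for illustration only.
placeholder_scores = [0.10, 0.35, 0.90]
placeholder_sentences = ["sentence A", "sentence B", "sentence C"]

top = rank_by_similarity(placeholder_scores, placeholder_sentences)
print(top)  # [(0.9, 'sentence C'), (0.35, 'sentence B')]
```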
	"""

	sentences = [
	    "Today is a very sunny day.",
	    "I am hungry, I will get my meal.",
	    "The dog is eating his food.",
	]
	# Same query as before training, so the two sets of scores are comparable.
	query = ["The dog is enjoying his meal."]

	encoder = roberta_regression_siamese.get_encoder()

	sentence_embeddings = encoder(tf.constant(sentences))
	query_embedding = encoder(tf.constant(query))

	cosine_similarities = tf.matmul(query_embedding, tf.transpose(sentence_embeddings))
	for i, sim in enumerate(cosine_similarities[0]):
	    print(f"cosine similarity between sentence {i+1} and the query = {sim} ")

	"""
	### Fine-tune using the triplet objective function

	For the Siamese network with the triplet objective function, three sentences are passed
	to the Siamese network: an anchor, a positive, and a negative sentence. The anchor and
	positive sentences are semantically similar, and the anchor and negative sentences are
	semantically dissimilar. The objective is to minimize the distance between the anchor
	sentence and the positive sentence, and to maximize the distance between the anchor
	sentence and the negative sentence.
	"""

	"""
	#### Load the dataset

	We will use the Wikipedia-sections-triplets dataset for fine-tuning. This dataset
	consists of sentences derived from Wikipedia. Each example is a triplet of sentences:
	anchor, positive, and negative. The anchor and positive are derived from the same
	section; the anchor and negative are derived from different sections.

	This dataset has 1.8 million training triplets and 220,000 test triplets. In this
	example, we will only use 1,200 triplets for training (200 batches of 6) and 450 for
	testing (75 batches of 6).
	"""

	"""shell
	wget https://sbert.net/datasets/wikipedia-sections-triplets.zip -q
	unzip wikipedia-sections-triplets.zip -d wikipedia-sections-triplets
	"""

	NUM_TRAIN_BATCHES = 200
	NUM_TEST_BATCHES = 75
	AUTOTUNE = tf.data.AUTOTUNE


	def prepare_wiki_data(dataset, num_batches):
	    # The label is a dummy 0: the triplet loss ignores `y_true`.
	    dataset = dataset.map(
	        lambda z: ((z["Sentence1"], z["Sentence2"], z["Sentence3"]), 0)
	    )
	    dataset = dataset.batch(6)
	    dataset = dataset.take(num_batches)
	    dataset = dataset.prefetch(AUTOTUNE)
	    return dataset


	wiki_train = tf.data.experimental.make_csv_dataset(
	    "wikipedia-sections-triplets/train.csv",
	    batch_size=1,
	    num_epochs=1,
	)
	wiki_test = tf.data.experimental.make_csv_dataset(
	    "wikipedia-sections-triplets/test.csv",
	    batch_size=1,
	    num_epochs=1,
	)

	wiki_train = prepare_wiki_data(wiki_train, NUM_TRAIN_BATCHES)
	wiki_test = prepare_wiki_data(wiki_test, NUM_TEST_BATCHES)

	"""
	#### Build the encoder model

	For this encoder model, we will use RoBERTa with mean pooling and we will not normalize
	the output embeddings. The encoder model consists of:

	- A preprocessor layer to tokenize and generate padding masks for the sentences.
	- A backbone model that will generate the contextual representation of each token in the
	sentence.
	- A mean pooling layer to produce the embeddings.
	"""

	preprocessor = keras_hub.models.RobertaPreprocessor.from_preset("roberta_base_en")
	backbone = keras_hub.models.RobertaBackbone.from_preset("roberta_base_en")
	# Named `inputs` to avoid shadowing the built-in `input` function.
	inputs = keras.Input(shape=(1,), dtype="string", name="sentence")

	x = preprocessor(inputs)
	h = backbone(x)
	embedding = keras.layers.GlobalAveragePooling1D(name="pooling_layer")(
	    h, x["padding_mask"]
	)

	roberta_encoder = keras.Model(inputs=inputs, outputs=embedding)


	roberta_encoder.summary()

	"""
	#### Build the Siamese network with the triplet objective function

	For the Siamese network with the triplet objective function, we will build the model with
	an encoder, and we will pass the three sentences through that encoder. We will get an
	embedding for each sentence, and we will calculate the `positive_dist` and
	`negative_dist` that will be passed to the loss function described below.
	"""


	class TripletSiamese(keras.Model):
	    def __init__(self, encoder, **kwargs):
	        anchor = keras.Input(shape=(1,), dtype="string")
	        positive = keras.Input(shape=(1,), dtype="string")
	        negative = keras.Input(shape=(1,), dtype="string")

	        # The same encoder (shared weights) embeds all three sentences.
	        ea = encoder(anchor)
	        ep = encoder(positive)
	        en = encoder(negative)

	        # Euclidean distances anchor-positive and anchor-negative.
	        positive_dist = keras.ops.sum(keras.ops.square(ea - ep), axis=1)
	        negative_dist = keras.ops.sum(keras.ops.square(ea - en), axis=1)

	        positive_dist = keras.ops.sqrt(positive_dist)
	        negative_dist = keras.ops.sqrt(negative_dist)

	        output = keras.ops.stack([positive_dist, negative_dist], axis=0)

	        super().__init__(inputs=[anchor, positive, negative], outputs=output, **kwargs)

	        self.encoder = encoder

	    def get_encoder(self):
	        return self.encoder


	"""
	We will use a custom loss function for the triplet objective. The loss function will
	receive the distance between the anchor and the positive embeddings `positive_dist`,
	and the distance between the anchor and the negative embeddings `negative_dist`,
	where they are stacked together in `y_pred`.

	We will use `positive_dist` and `negative_dist` to compute the loss such that
	`negative_dist` is larger than `positive_dist` by at least a specific margin.
	Mathematically, we will minimize this loss function:
	`max(positive_dist - negative_dist + margin, 0)`.

	There is no `y_true` used in this loss function. Note that we set the labels in the
	dataset to zero, but they will not be used.
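
A short worked example of this formula, in plain Python, may help. The function names
and the toy vectors are hypothetical; the Keras loss used in this example operates on
tensors, not lists.

```python
import math


def euclidean(a, b):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(positive_dists, negative_dists, margin=1.0):
    # max(positive_dist - negative_dist + margin, 0), averaged over the batch.
    losses = [
        max(p - n + margin, 0.0) for p, n in zip(positive_dists, negative_dists)
    ]
    return sum(losses) / len(losses)


# Hypothetical 2-dimensional embeddings for one triplet.
anchor, positive, negative = [0.0, 0.0], [3.0, 4.0], [6.0, 8.0]
pos = euclidean(anchor, positive)  # 5.0
neg = euclidean(anchor, negative)  # 10.0

print(triplet_loss([pos], [neg]))  # 0.0, the margin is already satisfied
print(triplet_loss([0.5, 2.0], [3.0, 2.5]))  # 0.25
```

In the second call, only the second triplet contributes: its negative is just 0.5
farther away than its positive, which is less than the margin of 1.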
	"""


	class TripletLoss(keras.losses.Loss):
	    def __init__(self, margin=1, **kwargs):
	        super().__init__(**kwargs)
	        self.margin = margin

	    def call(self, y_true, y_pred):
	        # `y_pred` stacks the two distances along axis 0; `y_true` is unused.
	        positive_dist, negative_dist = tf.unstack(y_pred, axis=0)

	        losses = keras.ops.relu(positive_dist - negative_dist + self.margin)
	        return keras.ops.mean(losses, axis=0)


	"""
	#### Fit the model

	For training, we will use the custom `TripletLoss()` loss function and the `Adam()`
	optimizer with a learning rate of 2e-5.
	"""

	roberta_triplet_siamese = TripletSiamese(roberta_encoder)

	roberta_triplet_siamese.compile(
	    loss=TripletLoss(),
	    optimizer=keras.optimizers.Adam(2e-5),
	    jit_compile=False,
	)

	roberta_triplet_siamese.fit(wiki_train, validation_data=wiki_test, epochs=1)

	"""
	Let's try this model in a clustering example. Here are 6 questions: the first 3 are
	about learning English, and the last 3 are about working online. Let's see if the
	embeddings produced by our encoder will cluster them correctly.
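
When reading the result, keep in mind that k-means cluster ids are arbitrary: the first
group may be labelled 0 or 1. The plain-Python sketch below is one way to check a
labelling against the expected grouping regardless of which id each cluster received.
The function name and the example labellings are hypothetical.

```python
def same_grouping(labels_a, labels_b):
    # Two labelings describe the same clustering if every pair of items shares a
    # cluster in one labeling exactly when it shares a cluster in the other.
    n = len(labels_a)
    return all(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


expected = [0, 0, 0, 1, 1, 1]

print(same_grouping([1, 1, 1, 0, 0, 0], expected))  # True, ids are just swapped
print(same_grouping([0, 0, 1, 1, 1, 1], expected))  # False, third item misplaced
```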
	"""

	questions = [
	    "What should I do to improve my English writing?",
	    "How to be good at speaking English?",
	    "How can I improve my English?",
	    "How to earn money online?",
	    "How do I earn money online?",
	    "How to work and earn money through internet?",
	]

	encoder = roberta_triplet_siamese.get_encoder()
	embeddings = encoder(tf.constant(questions))
	kmeans = cluster.KMeans(n_clusters=2, random_state=0, n_init="auto").fit(embeddings)

	for i, label in enumerate(kmeans.labels_):
	    print(f"sentence ({questions[i]}) belongs to cluster {label}")