Spaces:

MCP-1st-Birthday
/

ML-Starter

Running

App Files Files Community

ML-Starter / knowledge_base /nlp /pretrained_word_embeddings.py

emreatilgan

feat: Initialize mcp_server with embedding and loader modules

9ce984a 19 days ago

raw

history blame contribute delete

8.51 kB

	"""
	Title: Using pre-trained word embeddings
	Author: [fchollet](https://twitter.com/fchollet)
	Date created: 2020/05/05
	Last modified: 2020/05/05
	Description: Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings.
	Accelerator: GPU
	"""

	"""
	## Setup
	"""

	import os

	# Only the TensorFlow backend supports string inputs.
	os.environ["KERAS_BACKEND"] = "tensorflow"

	import pathlib
	import numpy as np
	import tensorflow.data as tf_data
	import keras
	from keras import layers

	"""
	## Introduction

	In this example, we show how to train a text classification model that uses pre-trained
	word embeddings.

	We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages
	belonging to 20 different topic categories.

	For the pre-trained word embeddings, we'll use
	[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).
	"""

	"""
	## Download the Newsgroup20 data
	"""

	data_path = keras.utils.get_file(
	"news20.tar.gz",
	"http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
	untar=True,
	)

	"""
	## Let's take a look at the data
	"""

	data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
	dirnames = os.listdir(data_dir)
	print("Number of directories:", len(dirnames))
	print("Directory names:", dirnames)

	fnames = os.listdir(data_dir / "comp.graphics")
	print("Number of files in comp.graphics:", len(fnames))
	print("Some example filenames:", fnames[:5])

	"""
	Here's a example of what one file contains:
	"""

	print(open(data_dir / "comp.graphics" / "38987").read())

	"""
	As you can see, there are header lines that are leaking the file's category, either
	explicitly (the first line is literally the category name), or implicitly, e.g. via the
	`Organization` filed. Let's get rid of the headers:
	"""

	samples = []
	labels = []
	class_names = []
	class_index = 0
	for dirname in sorted(os.listdir(data_dir)):
	class_names.append(dirname)
	dirpath = data_dir / dirname
	fnames = os.listdir(dirpath)
	print("Processing %s, %d files found" % (dirname, len(fnames)))
	for fname in fnames:
	fpath = dirpath / fname
	f = open(fpath, encoding="latin-1")
	content = f.read()
	lines = content.split("\n")
	lines = lines[10:]
	content = "\n".join(lines)
	samples.append(content)
	labels.append(class_index)
	class_index += 1

	print("Classes:", class_names)
	print("Number of samples:", len(samples))

	"""
	There's actually one category that doesn't have the expected number of files, but the
	difference is small enough that the problem remains a balanced classification problem.
	"""

	"""
	## Shuffle and split the data into training & validation sets
	"""

	# Shuffle the data
	seed = 1337
	rng = np.random.RandomState(seed)
	rng.shuffle(samples)
	rng = np.random.RandomState(seed)
	rng.shuffle(labels)

	# Extract a training & validation split
	validation_split = 0.2
	num_validation_samples = int(validation_split * len(samples))
	train_samples = samples[:-num_validation_samples]
	val_samples = samples[-num_validation_samples:]
	train_labels = labels[:-num_validation_samples]
	val_labels = labels[-num_validation_samples:]

	"""
	## Create a vocabulary index

	Let's use the `TextVectorization` to index the vocabulary found in the dataset.
	Later, we'll use the same layer instance to vectorize the samples.

	Our layer will only consider the top 20,000 words, and will truncate or pad sequences to
	be actually 200 tokens long.
	"""

	vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
	text_ds = tf_data.Dataset.from_tensor_slices(train_samples).batch(128)
	vectorizer.adapt(text_ds)

	"""
	You can retrieve the computed vocabulary used via `vectorizer.get_vocabulary()`. Let's
	print the top 5 words:
	"""

	vectorizer.get_vocabulary()[:5]

	"""
	Let's vectorize a test sentence:
	"""

	output = vectorizer([["the cat sat on the mat"]])
	output.numpy()[0, :6]

	"""
	As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first
	word in the vocabulary? That's because index 0 is reserved for padding and index 1 is
	reserved for "out of vocabulary" tokens.

	Here's a dict mapping words to their indices:
	"""

	voc = vectorizer.get_vocabulary()
	word_index = dict(zip(voc, range(len(voc))))

	"""
	As you can see, we obtain the same encoding as above for our test sentence:
	"""

	test = ["the", "cat", "sat", "on", "the", "mat"]
	[word_index[w] for w in test]

	"""
	## Load pre-trained word embeddings
	"""

	"""
	Let's download pre-trained GloVe embeddings (a 822M zip file).

	You'll need to run the following commands:
	"""

	"""shell
	wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
	unzip -q glove.6B.zip
	"""

	"""
	The archive contains text-encoded vectors of various sizes: 50-dimensional,
	100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones.

	Let's make a dict mapping words (strings) to their NumPy vector representation:
	"""

	path_to_glove_file = "glove.6B.100d.txt"

	embeddings_index = {}
	with open(path_to_glove_file) as f:
	for line in f:
	word, coefs = line.split(maxsplit=1)
	coefs = np.fromstring(coefs, "f", sep=" ")
	embeddings_index[word] = coefs

	print("Found %s word vectors." % len(embeddings_index))

	"""
	Now, let's prepare a corresponding embedding matrix that we can use in a Keras
	`Embedding` layer. It's a simple NumPy matrix where entry at index `i` is the pre-trained
	vector for the word of index `i` in our `vectorizer`'s vocabulary.
	"""

	num_tokens = len(voc) + 2
	embedding_dim = 100
	hits = 0
	misses = 0

	# Prepare embedding matrix
	embedding_matrix = np.zeros((num_tokens, embedding_dim))
	for word, i in word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
	# Words not found in embedding index will be all-zeros.
	# This includes the representation for "padding" and "OOV"
	embedding_matrix[i] = embedding_vector
	hits += 1
	else:
	misses += 1
	print("Converted %d words (%d misses)" % (hits, misses))


	"""
	Next, we load the pre-trained word embeddings matrix into an `Embedding` layer.

	Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to
	update them during training).
	"""

	from keras.layers import Embedding

	embedding_layer = Embedding(
	num_tokens,
	embedding_dim,
	trainable=False,
	)
	embedding_layer.build((1,))
	embedding_layer.set_weights([embedding_matrix])

	"""
	## Build the model

	A simple 1D convnet with global max pooling and a classifier at the end.
	"""

	int_sequences_input = keras.Input(shape=(None,), dtype="int32")
	embedded_sequences = embedding_layer(int_sequences_input)
	x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
	x = layers.MaxPooling1D(5)(x)
	x = layers.Conv1D(128, 5, activation="relu")(x)
	x = layers.MaxPooling1D(5)(x)
	x = layers.Conv1D(128, 5, activation="relu")(x)
	x = layers.GlobalMaxPooling1D()(x)
	x = layers.Dense(128, activation="relu")(x)
	x = layers.Dropout(0.5)(x)
	preds = layers.Dense(len(class_names), activation="softmax")(x)
	model = keras.Model(int_sequences_input, preds)
	model.summary()

	"""
	## Train the model

	First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays
	are right-padded.
	"""

	x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
	x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

	y_train = np.array(train_labels)
	y_val = np.array(val_labels)

	"""
	We use categorical crossentropy as our loss since we're doing softmax classification.
	Moreover, we use `sparse_categorical_crossentropy` since our labels are integers.
	"""

	model.compile(
	loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
	)
	model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

	"""
	## Export an end-to-end model

	Now, we may want to export a `Model` object that takes as input a string of arbitrary
	length, rather than a sequence of indices. It would make the model much more portable,
	since you wouldn't have to worry about the input preprocessing pipeline.

	Our `vectorizer` is actually a Keras layer, so it's simple:
	"""

	string_input = keras.Input(shape=(1,), dtype="string")
	x = vectorizer(string_input)
	preds = model(x)
	end_to_end_model = keras.Model(string_input, preds)

	probabilities = end_to_end_model(
	keras.ops.convert_to_tensor(
	[["this message is about computer graphics and 3D modeling"]]
	)
	)

	print(class_names[np.argmax(probabilities[0])])