| """ | |
| Title: Using pre-trained word embeddings | |
| Author: [fchollet](https://twitter.com/fchollet) | |
| Date created: 2020/05/05 | |
| Last modified: 2020/05/05 | |
| Description: Text classification on the Newsgroup20 dataset using pre-trained GloVe word embeddings. | |
| Accelerator: GPU | |
| """ | |
| """ | |
| ## Setup | |
| """ | |
| import os | |
| # Only the TensorFlow backend supports string inputs. | |
| os.environ["KERAS_BACKEND"] = "tensorflow" | |
| import pathlib | |
| import numpy as np | |
| import tensorflow.data as tf_data | |
| import keras | |
| from keras import layers | |
| """ | |
| ## Introduction | |
| In this example, we show how to train a text classification model that uses pre-trained | |
| word embeddings. | |
| We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages | |
| belonging to 20 different topic categories. | |
| For the pre-trained word embeddings, we'll use | |
| [GloVe embeddings](http://nlp.stanford.edu/projects/glove/). | |
| """ | |
| """ | |
| ## Download the Newsgroup20 data | |
| """ | |
| data_path = keras.utils.get_file( | |
| "news20.tar.gz", | |
| "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz", | |
| untar=True, | |
| ) | |
| """ | |
| ## Let's take a look at the data | |
| """ | |
| data_dir = pathlib.Path(data_path).parent / "20_newsgroup" | |
| dirnames = os.listdir(data_dir) | |
| print("Number of directories:", len(dirnames)) | |
| print("Directory names:", dirnames) | |
| fnames = os.listdir(data_dir / "comp.graphics") | |
| print("Number of files in comp.graphics:", len(fnames)) | |
| print("Some example filenames:", fnames[:5]) | |
| """ | |
| Here's a example of what one file contains: | |
| """ | |
| print(open(data_dir / "comp.graphics" / "38987").read()) | |
| """ | |
| As you can see, there are header lines that are leaking the file's category, either | |
| explicitly (the first line is literally the category name), or implicitly, e.g. via the | |
| `Organization` filed. Let's get rid of the headers: | |
| """ | |
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        f = open(fpath, encoding="latin-1")
        content = f.read()
        lines = content.split("\n")
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))
| """ | |
| There's actually one category that doesn't have the expected number of files, but the | |
| difference is small enough that the problem remains a balanced classification problem. | |
| """ | |
| """ | |
| ## Shuffle and split the data into training & validation sets | |
| """ | |
| # Shuffle the data | |
| seed = 1337 | |
| rng = np.random.RandomState(seed) | |
| rng.shuffle(samples) | |
| rng = np.random.RandomState(seed) | |
| rng.shuffle(labels) | |
| # Extract a training & validation split | |
| validation_split = 0.2 | |
| num_validation_samples = int(validation_split * len(samples)) | |
| train_samples = samples[:-num_validation_samples] | |
| val_samples = samples[-num_validation_samples:] | |
| train_labels = labels[:-num_validation_samples] | |
| val_labels = labels[-num_validation_samples:] | |
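
"""
Note that `samples` and `labels` are shuffled with two RNGs seeded identically, so both
lists end up reordered by the same permutation. An equivalent sketch (an optional
alternative, not part of the original walkthrough) shuffles a single array of indices and
applies it to both lists, which makes the alignment explicit:
"""

# Optional alternative sketch: shuffle one permutation of indices and apply it to
# both lists. New names are used so the data prepared above is left untouched.
perm = np.random.RandomState(seed).permutation(len(samples))
samples_alt = [samples[i] for i in perm]
labels_alt = [labels[i] for i in perm]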
| """ | |
| ## Create a vocabulary index | |
| Let's use the `TextVectorization` to index the vocabulary found in the dataset. | |
| Later, we'll use the same layer instance to vectorize the samples. | |
| Our layer will only consider the top 20,000 words, and will truncate or pad sequences to | |
| be actually 200 tokens long. | |
| """ | |
| vectorizer = layers.TextVectorization(max_tokens=20000, output_sequence_length=200) | |
| text_ds = tf_data.Dataset.from_tensor_slices(train_samples).batch(128) | |
| vectorizer.adapt(text_ds) | |
| """ | |
| You can retrieve the computed vocabulary used via `vectorizer.get_vocabulary()`. Let's | |
| print the top 5 words: | |
| """ | |
| vectorizer.get_vocabulary()[:5] | |
| """ | |
| Let's vectorize a test sentence: | |
| """ | |
| output = vectorizer([["the cat sat on the mat"]]) | |
| output.numpy()[0, :6] | |
| """ | |
| As you can see, "the" gets represented as "2". Why not 0, given that "the" was the first | |
| word in the vocabulary? That's because index 0 is reserved for padding and index 1 is | |
| reserved for "out of vocabulary" tokens. | |
| Here's a dict mapping words to their indices: | |
| """ | |
| voc = vectorizer.get_vocabulary() | |
| word_index = dict(zip(voc, range(len(voc)))) | |
| """ | |
| As you can see, we obtain the same encoding as above for our test sentence: | |
| """ | |
| test = ["the", "cat", "sat", "on", "the", "mat"] | |
| [word_index[w] for w in test] | |
| """ | |
| ## Load pre-trained word embeddings | |
| """ | |
| """ | |
| Let's download pre-trained GloVe embeddings (a 822M zip file). | |
| You'll need to run the following commands: | |
| """ | |
| """shell | |
| wget https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip | |
| unzip -q glove.6B.zip | |
| """ | |
| """ | |
| The archive contains text-encoded vectors of various sizes: 50-dimensional, | |
| 100-dimensional, 200-dimensional, 300-dimensional. We'll use the 100D ones. | |
| Let's make a dict mapping words (strings) to their NumPy vector representation: | |
| """ | |
| path_to_glove_file = "glove.6B.100d.txt" | |
| embeddings_index = {} | |
| with open(path_to_glove_file) as f: | |
| for line in f: | |
| word, coefs = line.split(maxsplit=1) | |
| coefs = np.fromstring(coefs, "f", sep=" ") | |
| embeddings_index[word] = coefs | |
| print("Found %s word vectors." % len(embeddings_index)) | |
| """ | |
| Now, let's prepare a corresponding embedding matrix that we can use in a Keras | |
| `Embedding` layer. It's a simple NumPy matrix where entry at index `i` is the pre-trained | |
| vector for the word of index `i` in our `vectorizer`'s vocabulary. | |
| """ | |
| num_tokens = len(voc) + 2 | |
| embedding_dim = 100 | |
| hits = 0 | |
| misses = 0 | |
| # Prepare embedding matrix | |
| embedding_matrix = np.zeros((num_tokens, embedding_dim)) | |
| for word, i in word_index.items(): | |
| embedding_vector = embeddings_index.get(word) | |
| if embedding_vector is not None: | |
| # Words not found in embedding index will be all-zeros. | |
| # This includes the representation for "padding" and "OOV" | |
| embedding_matrix[i] = embedding_vector | |
| hits += 1 | |
| else: | |
| misses += 1 | |
| print("Converted %d words (%d misses)" % (hits, misses)) | |
| """ | |
| Next, we load the pre-trained word embeddings matrix into an `Embedding` layer. | |
| Note that we set `trainable=False` so as to keep the embeddings fixed (we don't want to | |
| update them during training). | |
| """ | |
| from keras.layers import Embedding | |
| embedding_layer = Embedding( | |
| num_tokens, | |
| embedding_dim, | |
| trainable=False, | |
| ) | |
| embedding_layer.build((1,)) | |
| embedding_layer.set_weights([embedding_matrix]) | |
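
"""
Alternatively (an equivalent sketch, not required here), the pre-trained matrix can be
passed through a constant initializer when the layer is constructed, which avoids the
explicit `build()` and `set_weights()` calls:
"""

# Equivalent sketch: initialize the embedding weights from the pre-trained matrix.
alt_embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)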
| """ | |
| ## Build the model | |
| A simple 1D convnet with global max pooling and a classifier at the end. | |
| """ | |
| int_sequences_input = keras.Input(shape=(None,), dtype="int32") | |
| embedded_sequences = embedding_layer(int_sequences_input) | |
| x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences) | |
| x = layers.MaxPooling1D(5)(x) | |
| x = layers.Conv1D(128, 5, activation="relu")(x) | |
| x = layers.MaxPooling1D(5)(x) | |
| x = layers.Conv1D(128, 5, activation="relu")(x) | |
| x = layers.GlobalMaxPooling1D()(x) | |
| x = layers.Dense(128, activation="relu")(x) | |
| x = layers.Dropout(0.5)(x) | |
| preds = layers.Dense(len(class_names), activation="softmax")(x) | |
| model = keras.Model(int_sequences_input, preds) | |
| model.summary() | |
| """ | |
| ## Train the model | |
| First, convert our list-of-strings data to NumPy arrays of integer indices. The arrays | |
| are right-padded. | |
| """ | |
| x_train = vectorizer(np.array([[s] for s in train_samples])).numpy() | |
| x_val = vectorizer(np.array([[s] for s in val_samples])).numpy() | |
| y_train = np.array(train_labels) | |
| y_val = np.array(val_labels) | |
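
"""
Each row is now a fixed-length sequence of 200 token indices, with zeros used for
right-padding whenever a sample is shorter than the output sequence length (a quick
optional check):
"""

# Optional check: confirm the shape of the vectorized data.
print("x_train shape:", x_train.shape)  # (number of training samples, 200)
print("x_val shape:", x_val.shape)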
| """ | |
| We use categorical crossentropy as our loss since we're doing softmax classification. | |
| Moreover, we use `sparse_categorical_crossentropy` since our labels are integers. | |
| """ | |
| model.compile( | |
| loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"] | |
| ) | |
| model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val)) | |
| """ | |
| ## Export an end-to-end model | |
| Now, we may want to export a `Model` object that takes as input a string of arbitrary | |
| length, rather than a sequence of indices. It would make the model much more portable, | |
| since you wouldn't have to worry about the input preprocessing pipeline. | |
| Our `vectorizer` is actually a Keras layer, so it's simple: | |
| """ | |
| string_input = keras.Input(shape=(1,), dtype="string") | |
| x = vectorizer(string_input) | |
| preds = model(x) | |
| end_to_end_model = keras.Model(string_input, preds) | |
| probabilities = end_to_end_model( | |
| keras.ops.convert_to_tensor( | |
| [["this message is about computer graphics and 3D modeling"]] | |
| ) | |
| ) | |
| print(class_names[np.argmax(probabilities[0])]) | |
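
"""
Since the exported model accepts raw strings, running inference on new data is just a
call on a batch of string tensors (a minimal usage sketch, with made-up example
sentences):
"""

# Minimal usage sketch: classify a small batch of raw strings end to end.
examples = keras.ops.convert_to_tensor(
    [
        ["my graphics card cannot render 3D textures"],
        ["the playoffs start next week and the team looks strong"],
    ]
)
batch_probabilities = end_to_end_model(examples)
for probs in batch_probabilities:
    print(class_names[np.argmax(probs)])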