| """ | |
| Title: Text Classification using FNet | |
| Author: [Abheesht Sharma](https://github.com/abheesht17/) | |
| Date created: 2022/06/01 | |
| Last modified: 2022/12/21 | |
| Description: Text Classification on the IMDb Dataset using `keras_hub.layers.FNetEncoder` layer. | |
| Accelerator: GPU | |
| """ | |
| """ | |
| ## Introduction | |
| In this example, we will demonstrate the ability of FNet to achieve comparable | |
| results with a vanilla Transformer model on the text classification task. | |
| We will be using the IMDb dataset, which is a | |
| collection of movie reviews labelled either positive or negative (sentiment | |
| analysis). | |
| To build the tokenizer, model, etc., we will use components from | |
| [KerasHub](https://github.com/keras-team/keras-hub). KerasHub makes life easier | |
| for people who want to build NLP pipelines! :) | |
| ### Model | |
| Transformer-based language models (LMs) such as BERT, RoBERTa, XLNet, etc. have | |
| demonstrated the effectiveness of the self-attention mechanism for computing | |
| rich embeddings for input text. However, the self-attention mechanism is an | |
| expensive operation, with a time complexity of `O(n^2)`, where `n` is the number | |
| of tokens in the input. Hence, there has been an effort to reduce the time | |
| complexity of the self-attention mechanism and improve performance without | |
| sacrificing the quality of results. | |
| In 2020, a paper titled | |
| [FNet: Mixing Tokens with Fourier Transforms](https://arxiv.org/abs/2105.03824) | |
| replaced the self-attention layer in BERT with a simple Fourier Transform layer | |
| for "token mixing". This resulted in comparable accuracy and a speed-up during | |
| training. In particular, a couple of points from the paper stand out: | |
| * The authors claim that FNet is 80% faster than BERT on GPUs and 70% faster on | |
| TPUs. The reason for this speed-up is two-fold: a) the Fourier Transform layer | |
| is unparametrized, it does not have any parameters, and b) the authors use Fast | |
| Fourier Transform (FFT); this reduces the time complexity from `O(n^2)` | |
| (in the case of self-attention) to `O(n log n)`. | |
| * FNet manages to achieve 92-97% of the accuracy of BERT on the GLUE benchmark. | |
| """ | |
| """ | |
| ## Setup | |
| Before we start with the implementation, let's import all the necessary packages. | |
| """ | |
| """shell | |
| pip install -q --upgrade keras-hub | |
| pip install -q --upgrade keras # Upgrade to Keras 3. | |
| """ | |
import keras_hub
import keras
import tensorflow as tf
import os

keras.utils.set_random_seed(42)
| """ | |
| Let's also define our hyperparameters. | |
| """ | |
BATCH_SIZE = 64
EPOCHS = 3
MAX_SEQUENCE_LENGTH = 512
VOCAB_SIZE = 15000
EMBED_DIM = 128
INTERMEDIATE_DIM = 512
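"""
To get a rough feel for the asymptotic difference mentioned in the Introduction,
we can compare `n^2` with `n log n` for our maximum sequence length. This is a
back-of-the-envelope operation count only, not a wall-clock prediction.
"""

import math

# Per-layer "mixing" cost for a sequence of MAX_SEQUENCE_LENGTH tokens.
print("n^2      :", MAX_SEQUENCE_LENGTH**2)
print("n log2 n :", int(MAX_SEQUENCE_LENGTH * math.log2(MAX_SEQUENCE_LENGTH)))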
| """ | |
| ## Loading the dataset | |
| First, let's download the IMDB dataset and extract it. | |
| """ | |
| """shell | |
| wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz | |
| tar -xzf aclImdb_v1.tar.gz | |
| """ | |
| """ | |
| Samples are present in the form of text files. Let's inspect the structure of | |
| the directory. | |
| """ | |
print(os.listdir("./aclImdb"))
print(os.listdir("./aclImdb/train"))
print(os.listdir("./aclImdb/test"))
| """ | |
| The directory contains two sub-directories: `train` and `test`. Each subdirectory | |
| in turn contains two folders: `pos` and `neg` for positive and negative reviews, | |
| respectively. Before we load the dataset, let's delete the `./aclImdb/train/unsup` | |
| folder since it has unlabelled samples. | |
| """ | |
| """shell | |
| rm -rf aclImdb/train/unsup | |
| """ | |
| """ | |
| We'll use the `keras.utils.text_dataset_from_directory` utility to generate | |
| our labelled `tf.data.Dataset` dataset from text files. | |
| """ | |
train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=BATCH_SIZE,
    validation_split=0.2,
    subset="training",
    seed=42,
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=BATCH_SIZE,
    validation_split=0.2,
    subset="validation",
    seed=42,
)
test_ds = keras.utils.text_dataset_from_directory("aclImdb/test", batch_size=BATCH_SIZE)
| """ | |
| We will now convert the text to lowercase. | |
| """ | |
train_ds = train_ds.map(lambda x, y: (tf.strings.lower(x), y))
val_ds = val_ds.map(lambda x, y: (tf.strings.lower(x), y))
test_ds = test_ds.map(lambda x, y: (tf.strings.lower(x), y))
| """ | |
| Let's print a few samples. | |
| """ | |
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(text_batch.numpy()[i])
        print(label_batch.numpy()[i])
| """ | |
| ### Tokenizing the data | |
| We'll be using the `keras_hub.tokenizers.WordPieceTokenizer` layer to tokenize | |
| the text. `keras_hub.tokenizers.WordPieceTokenizer` takes a WordPiece vocabulary | |
| and has functions for tokenizing the text, and detokenizing sequences of tokens. | |
| Before we define the tokenizer, we first need to train it on the dataset | |
| we have. The WordPiece tokenization algorithm is a subword tokenization algorithm; | |
| training it on a corpus gives us a vocabulary of subwords. A subword tokenizer | |
| is a compromise between word tokenizers (word tokenizers need very large | |
| vocabularies for good coverage of input words), and character tokenizers | |
| (characters don't really encode meaning like words do). Luckily, KerasHub | |
| makes it very simple to train WordPiece on a corpus with the | |
| `keras_hub.tokenizers.compute_word_piece_vocabulary` utility. | |
| Note: The official implementation of FNet uses the SentencePiece Tokenizer. | |
| """ | |
def train_word_piece(ds, vocab_size, reserved_tokens):
    # Keep only the text samples (drop the labels) before computing the vocabulary.
    word_piece_ds = ds.unbatch().map(lambda x, y: x)
    vocab = keras_hub.tokenizers.compute_word_piece_vocabulary(
        word_piece_ds.batch(1000).prefetch(2),
        vocabulary_size=vocab_size,
        reserved_tokens=reserved_tokens,
    )
    return vocab
| """ | |
| Every vocabulary has a few special, reserved tokens. We have two such tokens: | |
| - `"[PAD]"` - Padding token. Padding tokens are appended to the input sequence length | |
| when the input sequence length is shorter than the maximum sequence length. | |
| - `"[UNK]"` - Unknown token. | |
| """ | |
reserved_tokens = ["[PAD]", "[UNK]"]

vocab = train_word_piece(train_ds, VOCAB_SIZE, reserved_tokens)
| """ | |
| Let's see some tokens! | |
| """ | |
| print("Tokens: ", vocab[100:110]) | |
| """ | |
| Now, let's define the tokenizer. We will configure the tokenizer with the | |
| the vocabularies trained above. We will define a maximum sequence length so that | |
| all sequences are padded to the same length, if the length of the sequence is | |
| less than the specified sequence length. Otherwise, the sequence is truncated. | |
| """ | |
tokenizer = keras_hub.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    lowercase=False,
    sequence_length=MAX_SEQUENCE_LENGTH,
)
| """ | |
| Let's try and tokenize a sample from our dataset! To verify whether the text has | |
| been tokenized correctly, we can also detokenize the list of tokens back to the | |
| original text. | |
| """ | |
input_sentence_ex = train_ds.take(1).get_single_element()[0][0]
input_tokens_ex = tokenizer(input_sentence_ex)

print("Sentence: ", input_sentence_ex)
print("Tokens: ", input_tokens_ex)
print("Recovered text after detokenizing: ", tokenizer.detokenize(input_tokens_ex))
| """ | |
| ## Formatting the dataset | |
| Next, we'll format our datasets in the form that will be fed to the models. We | |
| need to tokenize the text. | |
| """ | |
def format_dataset(sentence, label):
    # Tokenize the raw text and wrap it in the dictionary format the model expects.
    sentence = tokenizer(sentence)
    return ({"input_ids": sentence}, label)


def make_dataset(dataset):
    dataset = dataset.map(format_dataset, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.shuffle(512).prefetch(16).cache()


train_ds = make_dataset(train_ds)
val_ds = make_dataset(val_ds)
test_ds = make_dataset(test_ds)
| """ | |
| ## Building the model | |
| Now, let's move on to the exciting part - defining our model! | |
| We first need an embedding layer, i.e., a layer that maps every token in the input | |
| sequence to a vector. This embedding layer can be initialised randomly. We also | |
| need a positional embedding layer which encodes the word order in the sequence. | |
| The convention is to add, i.e., sum, these two embeddings. KerasHub has a | |
| `keras_hub.layers.TokenAndPositionEmbedding ` layer which does all of the above | |
| steps for us. | |
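
Purely as an illustration, the short sketch below builds roughly the same thing by
hand; it is an assumption about what the layer computes, not the KerasHub source.
"""

# Hypothetical re-creation of token + position embeddings, for illustration only.
toy_ids = tf.constant([[5, 11, 2, 0]])
toy_token_emb = keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(toy_ids)
toy_positions = tf.range(start=0, limit=tf.shape(toy_ids)[-1])
toy_pos_emb = keras.layers.Embedding(MAX_SEQUENCE_LENGTH, EMBED_DIM)(toy_positions)
print("Summed embedding shape:", (toy_token_emb + toy_pos_emb).shape)

"""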
Our FNet classification model consists of three `keras_hub.layers.FNetEncoder`
layers with a `keras.layers.Dense` layer on top.

Note: For FNet, masking the padding tokens has a minimal effect on results. In the
official implementation, the padding tokens are not masked.
"""
input_ids = keras.Input(shape=(None,), dtype="int64", name="input_ids")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(input_ids)

x = keras_hub.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)
x = keras_hub.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)
x = keras_hub.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)

x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(0.1)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

fnet_classifier = keras.Model(input_ids, outputs, name="fnet_classifier")
| """ | |
| ## Training our model | |
| We'll use accuracy to monitor training progress on the validation data. Let's | |
| train our model for 3 epochs. | |
| """ | |
fnet_classifier.summary()
fnet_classifier.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
fnet_classifier.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
| """ | |
| We obtain a train accuracy of around 92% and a validation accuracy of around | |
| 85%. Moreover, for 3 epochs, it takes around 86 seconds to train the model | |
| (on Colab with a 16 GB Tesla T4 GPU). | |
| Let's calculate the test accuracy. | |
| """ | |
fnet_classifier.evaluate(test_ds, batch_size=BATCH_SIZE)
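"""
As a quick, purely illustrative sanity check, we can also score a raw review by
preprocessing it the same way the training data was preprocessed (lowercasing,
then tokenizing). The exact probability will vary from run to run.
"""

sample_review = tf.constant(["this movie was a wonderful surprise, i loved every minute of it"])
sample_ids = tokenizer(tf.strings.lower(sample_review))
print(
    "Probability of a positive review:",
    fnet_classifier.predict({"input_ids": sample_ids})[0][0],
)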
| """ | |
| ## Comparison with Transformer model | |
| Let's compare our FNet Classifier model with a Transformer Classifier model. We | |
| keep all the parameters/hyperparameters the same. For example, we use three | |
| `TransformerEncoder` layers. | |
| We set the number of heads to 2. | |
| """ | |
NUM_HEADS = 2

input_ids = keras.Input(shape=(None,), dtype="int64", name="input_ids")

x = keras_hub.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(input_ids)

x = keras_hub.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
x = keras_hub.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
x = keras_hub.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)

x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(0.1)(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

transformer_classifier = keras.Model(input_ids, outputs, name="transformer_classifier")

transformer_classifier.summary()
transformer_classifier.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
transformer_classifier.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
| """ | |
| We obtain a train accuracy of around 94% and a validation accuracy of around | |
| 86.5%. It takes around 146 seconds to train the model (on Colab with a 16 GB Tesla | |
| T4 GPU). | |
| Let's calculate the test accuracy. | |
| """ | |
transformer_classifier.evaluate(test_ds, batch_size=BATCH_SIZE)
| """ | |
| Let's make a table and compare the two models. We can see that FNet | |
| significantly speeds up our run time (1.7x), with only a small sacrifice in | |
| overall accuracy (drop of 0.75%). | |
| | | **FNet Classifier** | **Transformer Classifier** | | |
| |:-----------------------:|:-------------------:|:--------------------------:| | |
| | **Training Time** | 86 seconds | 146 seconds | | |
| | **Train Accuracy** | 92.34% | 93.85% | | |
| | **Validation Accuracy** | 85.21% | 86.42% | | |
| | **Test Accuracy** | 83.94% | 84.69% | | |
| | **#Params** | 2,321,921 | 2,520,065 | | |
| """ | |