## Homework 1 (Part I) – Transformer-Based Sentiment Analysis on IMDB

### 1. Introduction

This project implements a Transformer-based neural network for binary sentiment analysis on movie reviews. Given a raw text review from the IMDB dataset, the model predicts whether the review expresses a positive or negative sentiment. Sentiment analysis is a fundamental task in natural language processing (NLP) with applications in opinion mining, recommendation systems, and social media monitoring, where understanding users’ attitudes at scale is essential.

The Transformer architecture is well suited to this task because it relies on self-attention rather than recurrent connections. Self-attention allows the model to capture long-range dependencies between words regardless of their distance in the sequence and to focus on sentiment-bearing tokens (for example, “not”, “excellent”, “boring”) even when they appear far apart. This makes Transformers both more effective and more parallelizable during training than traditional RNN or LSTM models for document-level classification.

### 2. Dataset Description

The model is trained and validated on the IMDB movie reviews dataset, loaded via the HuggingFace `datasets` library using the `"imdb"` configuration. This dataset corresponds to the Large Movie Review Dataset introduced by Maas et al. (2011), which contains 50,000 labeled movie reviews (25,000 train and 25,000 test) plus additional unlabeled reviews. The original dataset is available from the Stanford AI Lab at `http://ai.stanford.edu/~amaas/data/sentiment/`. Each labeled review is annotated with a binary label: 0 for negative and 1 for positive sentiment.

In this implementation, I use the labeled training split exposed by the HuggingFace wrapper and perform an additional 80/20 train/validation split with stratification over the labels. This results in approximately 20,000 reviews for training and 5,000 reviews for validation, preserving the original class balance. The official test split and the unlabeled reviews from the Large Movie Review Dataset are not used in this assignment. The labels are integers taking values in {0, 1}, corresponding to negative and positive sentiment respectively.

The raw text consists of HTML-formatted movie reviews. As a minimal preprocessing step, HTML line breaks (`<br />`) are replaced with spaces and all non-alphanumeric characters (except whitespace) are filtered out. No additional filtering or subsetting (such as length-based pruning) is performed; instead, sequence length is controlled later during token-to-index mapping and padding.

When using this dataset, the canonical citation requested by the authors is:

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

### 3. Data Preprocessing

Preprocessing is implemented in three main stages: tokenization, vocabulary construction, and sequence padding.

Tokenization uses a simple regular expression–based tokenizer. The text is converted to lowercase, HTML line breaks are replaced by spaces, and all characters that are not letters, digits, or whitespace are removed. The cleaned string is then split on whitespace to obtain a list of tokens. This approach yields a plain word-level tokenization that is easy to interpret and sufficient for this assignment.
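The steps above can be sketched as follows (a minimal version consistent with the description; the function name matches `tokenize` in `c1.py`, but details may differ):

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase and replace HTML line breaks with spaces.
    text = text.lower().replace("<br />", " ")
    # Strip everything except letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Split on whitespace to obtain word-level tokens.
    return text.split()
```

For example, `tokenize("Not good!<br />Really boring...")` yields `["not", "good", "really", "boring"]`.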

Vocabulary construction builds a word-to-index mapping from the training texts only, which avoids information leakage from the validation set. A `Counter` is used to accumulate token frequencies over all training reviews. Two special tokens are reserved: `<PAD>` with index 0 and `<UNK>` with index 1. The remaining slots (up to `MAX_VOCAB = 5000`) are filled with the most frequent words in the training corpus. Any word outside this vocabulary is mapped to the `<UNK>` index at inference time.
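A minimal sketch of this vocabulary construction (assuming the reserved-index convention described above; the exact signature in `c1.py` may differ):

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved special-token indices

def build_vocab(token_lists, max_vocab=5000):
    # Count token frequencies over the training corpus only,
    # to avoid leaking information from the validation set.
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<PAD>": PAD, "<UNK>": UNK}
    # Fill the remaining slots with the most frequent words.
    for word, _ in counts.most_common(max_vocab - len(vocab)):
        vocab[word] = len(vocab)
    return vocab
```

Any word absent from the resulting mapping is later looked up as `<UNK>`.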

For sequence padding and truncation, each review is converted to a fixed-length sequence of token IDs with `MAX_LEN = 64`. Tokens are mapped to their integer indices via the vocabulary; if a token is not present, it is mapped to `<UNK>`. If a sequence is shorter than 64 tokens, it is padded on the right with `<PAD>` tokens until length 64. If it is longer, it is truncated to the first 64 tokens. This produces a dense matrix of shape (number_of_examples, 64), which is used as input to the model.
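The padding/truncation step reduces to a few lines (a sketch under the conventions above; the hypothetical helper name `encode` is illustrative):

```python
def encode(tokens, vocab, max_len=64, pad=0, unk=1):
    # Map tokens to indices, truncating to the first max_len tokens;
    # unknown words fall back to the <UNK> index.
    ids = [vocab.get(t, unk) for t in tokens[:max_len]]
    # Right-pad with <PAD> up to the fixed length.
    return ids + [pad] * (max_len - len(ids))
```

Stacking the encoded reviews produces the dense `(number_of_examples, 64)` input matrix.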

### 4. Model Architecture

At a high level, the model is a Transformer encoder stacked on top of learned token embeddings and sinusoidal positional encodings, followed by a global pooling operation and a linear classification head. It is a pure encoder architecture tailored for sequence-level classification rather than sequence-to-sequence generation.

First, input token IDs are passed through an embedding layer that maps each vocabulary index to a continuous vector of dimension `d_model = 128`. These token embeddings capture distributional semantics learned during training. Since embeddings alone do not encode the order of tokens, the model adds positional encodings. A standard sinusoidal positional encoding is precomputed for positions up to a maximum length and added elementwise to the token embeddings. This provides the model with information about the absolute position of each token in the sequence.
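The standard sinusoidal encoding can be precomputed as a table; a NumPy sketch of the usual formulation (the in-code `PositionalEncoding` module wraps the same table as a buffer):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

The table is added elementwise to the token embeddings, so each position contributes a distinct, fixed pattern.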

The resulting sequence of encoded token representations is processed by a stack of Transformer encoder layers. In this implementation, there are `NUM_LAYERS = 2` such layers, each consisting of a multi-head self-attention sublayer followed by a position-wise feed-forward network, both wrapped in residual connections and layer normalization.

Within each encoder block, multi-head self-attention projects the input sequence into queries, keys, and values of dimension `d_k = d_model / num_heads`, where `num_heads = 8`. For each head, attention weights are computed as scaled dot products between queries and keys, divided by the square root of `d_k` to ensure stable gradients. A softmax is then applied to obtain a probability distribution over positions, and a weighted sum of the value vectors is taken. The outputs of all heads are concatenated and passed through a final linear projection back to `d_model`. Self-attention lets each token representation attend to all other tokens in the sequence, enabling the model to capture context such as negation and long-distance sentiment cues.
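The core attention computation for a single head can be written compactly; a NumPy sketch of scaled dot-product attention (the `MultiHeadAttention` module applies this per head after the Q/K/V projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over key positions.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Each row of `weights` is a probability distribution over positions, so every query's output is a convex combination of the value vectors.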

The output of the attention sublayer is added back to the original input via a residual connection and normalized using layer normalization. A feed-forward sublayer then applies a two-layer MLP with hidden dimension `d_ff = 256` and ReLU activation, followed again by residual addition and layer normalization. This combination allows the model to perform both contextual mixing (via attention) and local non-linear transformations (via the feed-forward network).

After the final encoder layer, the model applies global average pooling over the sequence dimension, computing the mean of the hidden states across all time steps. This yields a single `d_model`-dimensional vector that summarizes the entire review. A dropout layer with dropout probability 0.1 is applied during training to reduce overfitting. Finally, a linear classification head maps this pooled representation to `num_classes = 2` logits, corresponding to negative and positive sentiment. The predicted class is obtained via an argmax over these logits.
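The pooling and classification step amounts to a mean followed by one affine map; a NumPy sketch (dropout omitted, as it is inactive at inference; `classify` is a hypothetical helper, not a function in `c1.py`):

```python
import numpy as np

def classify(hidden, W, b):
    # hidden: (seq_len, d_model) encoder output for one review.
    pooled = hidden.mean(axis=0)    # global average pooling over time steps
    logits = pooled @ W + b         # linear head -> (num_classes,) logits
    return int(np.argmax(logits))   # 0 = negative, 1 = positive
```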

In summary, self-attention in this model works by dynamically weighting each token’s contribution to every other token’s representation based on learned similarity scores. Tokens with stronger semantic or syntactic relevance to a given position receive higher attention weights, enabling the model to focus on the most informative parts of the review for sentiment prediction.

### 5. Training Pipeline

The training loop is implemented around a standard supervised learning pipeline for PyTorch models. The loss function used is cross-entropy loss (`nn.CrossEntropyLoss`), which is appropriate for multi-class (here, binary) classification with mutually exclusive classes. It directly optimizes the log-likelihood of the correct label given the model’s logits.

Optimization is performed using the Adam optimizer (`torch.optim.Adam`) with an initial learning rate of `LR = 0.001`. Adam is chosen for its robustness to gradient scaling and its ability to adaptively tune learning rates per parameter, which is beneficial when training Transformer-style models. To encourage better convergence, a learning rate scheduler (`StepLR`) is applied: the learning rate is multiplied by `gamma = 0.5` every `step_size = 2` epochs. This gradually reduces the learning rate as training progresses, helping the model to fine-tune around a local optimum.
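Under `StepLR` semantics, the learning rate after a given number of completed epochs follows a simple closed form; a sketch of the resulting schedule for the settings above:

```python
def step_lr(initial_lr, epoch, step_size=2, gamma=0.5):
    # StepLR multiplies the LR by gamma once every step_size epochs.
    return initial_lr * gamma ** (epoch // step_size)

# LR used during each of the 5 training epochs:
schedule = [step_lr(0.001, e) for e in range(5)]
```

For `LR = 0.001`, this gives 0.001 for epochs 1–2, 0.0005 for epochs 3–4, and 0.00025 for epoch 5.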

Batches are constructed using `DataLoader` objects with a batch size of `BATCH_SIZE = 32`. The training loader shuffles the data at each epoch, while the validation loader does not. The model is trained for `EPOCHS = 5` full passes over the training set. Dropout with probability 0.1 is applied to the embeddings and intermediate representations, providing regularization by randomly masking components during training. No explicit early stopping is implemented in code; instead, performance is monitored on the validation set after each epoch via evaluation metrics. If desired, early stopping could be implemented externally by tracking the best validation F1 score and halting when it stops improving.
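Such external early stopping could look like the following sketch; `early_stop_epochs` is a hypothetical helper, not part of the codebase, tracking the best validation F1 with a patience window:

```python
def early_stop_epochs(f1_per_epoch, patience=2):
    # Stop once validation F1 has failed to improve for
    # `patience` consecutive epochs; return epochs actually run.
    best, since_best = float("-inf"), 0
    for epoch, f1 in enumerate(f1_per_epoch):
        if f1 > best:
            best, since_best = f1, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch + 1
    return len(f1_per_epoch)
```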

### 6. Evaluation Metrics

The evaluation function computes four metrics on the validation (or test) data: accuracy, precision, recall, and F1-score. Predictions are obtained by taking the argmax over the model’s logits for each example.

Accuracy measures the overall proportion of correctly classified reviews and is a natural baseline metric for binary sentiment classification. However, accuracy alone can be misleading if the dataset is imbalanced or if different error types have different costs. Therefore, precision, recall, and F1-score (with binary averaging) are also reported. Precision captures the fraction of predicted positive reviews that are truly positive, recall captures the fraction of true positive reviews that are correctly identified, and the F1-score is the harmonic mean of precision and recall. For sentiment analysis where both false positives and false negatives are important, F1-score provides a more informative single-number summary than accuracy alone.
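In the implementation these metrics come from scikit-learn, but their definitions for binary averaging reduce to a few lines; a self-contained sketch:

```python
def binary_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```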

### 7. Experimental Results (Updated with Model-Size Study)

The training script now explicitly explores model complexity by training three Transformer sizes:

- `small`: `d_model=64`, `num_heads=4`, `num_layers=1`, `d_ff=128`
- `medium`: `d_model=128`, `num_heads=8`, `num_layers=2`, `d_ff=256`
- `large`: `d_model=256`, `num_heads=8`, `num_layers=4`, `d_ff=512`

For each size, the code reports training loss and validation metrics (accuracy, precision, recall, F1-score). It also computes trainable parameter counts and prints a final comparison table so performance can be compared against model complexity in a single run.
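The parameter counts can be approximated analytically (a rough sketch assuming bias terms in all projections and standard LayerNorm parameters; the script's reported counts come from summing actual tensor sizes, so exact values may differ slightly):

```python
def encoder_block_params(d_model, d_ff):
    # Q, K, V, and output projections: 4 * (d_model^2 + d_model)
    attn = 4 * (d_model * d_model + d_model)
    # Two-layer position-wise feed-forward network with biases.
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Two LayerNorms, each with scale and shift vectors.
    norms = 2 * 2 * d_model
    return attn + ffn + norms

def model_params(vocab_size, d_model, num_layers, d_ff, num_classes=2):
    embed = vocab_size * d_model
    head = d_model * num_classes + num_classes
    return embed + num_layers * encoder_block_params(d_model, d_ff) + head

sizes = {
    "small":  model_params(5000, 64, 1, 128),
    "medium": model_params(5000, 128, 2, 256),
    "large":  model_params(5000, 256, 4, 512),
}
```

As expected, the estimate grows monotonically from `small` to `large`, with the embedding table dominating at this vocabulary size.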

In addition to console logging, the script saves a human-readable report:

- `assignment_text/saved_model/transformer_imdb_experiment_report.md`

This report includes run metadata, per-size architecture details, checkpoint paths, parameter counts, and metric summaries, making the experiment easy to share and inspect.

### 8. Model Analysis and Error Inspection

The main strength of this model is its ability to exploit contextual information across the entire review via self-attention. Unlike bag-of-words or simple n-gram models, it can represent interactions between distant tokens, such as handling negation (“not good”) or nuanced phrases that influence sentiment only in context. Additionally, the use of positional encoding preserves word order, which is crucial for distinguishing, for example, “not only good but great” from superficially similar but differently ordered phrases.

However, several weaknesses are apparent. First, the tokenizer is very simple and does not handle subword units, morphology, or out-of-vocabulary words in a sophisticated way. Rare words are collapsed into the `<UNK>` token, which can obscure important sentiment cues. Second, reviews are truncated to 64 tokens; for very long reviews, important information appearing later in the text may be discarded, potentially harming performance. Third, the model capacity (two encoder layers with `d_model = 128`) is modest compared to modern large-scale Transformers and may limit performance on more challenging or nuanced reviews.

From an overfitting perspective, validation metrics track training performance reasonably well, and the use of dropout and a relatively small model size helps keep overfitting under control. That said, with only 5 epochs of training, there is also a risk of underfitting: the model might not fully exploit the available training data. Adding more epochs with careful learning rate scheduling or early stopping based on validation F1 could further improve performance.

Potential improvements include increasing the maximum sequence length to capture more of each review, adopting a more advanced tokenizer such as Byte-Pair Encoding (BPE) or WordPiece to better handle rare words and subwords, and scaling up the model (more layers, larger `d_model`, larger `d_ff`) if computational resources allow. Pretraining on a large unlabeled corpus and then fine-tuning on IMDB would also likely yield significant gains, following the standard transfer learning paradigm for NLP.

For concrete examples, error-analysis instances are stored in:

- `assignment_text/documentation/error_analysis.json`

Based on the current file (10 sampled misclassified reviews), the dominant failure mode is clear: all listed mistakes are **false positives** where the true label is negative (`0`) but the model predicts positive (`1`). This indicates the model is comparatively weaker at recognizing subtle or mixed negative sentiment.

Common patterns observed in these errors include:

- **Mixed sentiment with mild praise + stronger criticism**: e.g., "worth a rental" or "semi-alright action" appears early, but overall judgment is negative.
- **Sarcasm and irony**: phrases such as "only thing good..." or "top of the garbage list" are difficult for the model to interpret reliably.
- **Contrastive structure**: reviews that begin with positive setup and then pivot to criticism are often classified as positive.
- **Soft negative wording**: "average", "passable", "not really worth watching again" can be semantically negative but lexically less explicit.
- **Calibration issue on borderline negatives**: several wrong predictions have moderate confidence for the positive class, suggesting uncertain decision boundaries around nuanced negative reviews.

This error profile suggests improvements should prioritize better handling of nuanced negative language (especially sarcasm and contrastive discourse), for example via richer tokenization (subword methods), longer context windows, and additional hard-negative examples during training.

### 9. Conclusion

This implementation demonstrates that a relatively small Transformer encoder can perform effective sentiment analysis on the IMDB dataset using only word-level tokenization and simple preprocessing. Through self-attention and positional encoding, the model captures long-range dependencies and focuses on sentiment-relevant parts of each review, achieving strong validation performance compared with simple baselines.

From this assignment, the key takeaways are that Transformer-based models are flexible and powerful for text classification tasks and that even a compact architecture benefits from core Transformer ideas such as multi-head self-attention, residual connections, and normalization. At the same time, overall performance is sensitive to choices in tokenization, sequence length, and model capacity, highlighting the importance of careful design and experimentation when applying Transformers to NLP problems.

### 10. Reproducibility, Code Organization, and Key Implementation Details

The project is organized around core scripts and a directory for saved artifacts:

- `assignment_text/code/c1.py`: End-to-end training/evaluation script with multi-size experiments (`small`, `medium`, `large`). It saves per-size checkpoints and a markdown experiment report.
- `assignment_text/code/c1_analysis.py`: Evaluation and qualitative analysis script for loading a trained checkpoint and inspecting misclassifications. It supports `--model_size` to select `small`, `medium`, or `large`.
- `assignment_text/saved_model/transformer_imdb_small.pt`, `assignment_text/saved_model/transformer_imdb_medium.pt`, `assignment_text/saved_model/transformer_imdb_large.pt`: Size-specific checkpoints.
- `assignment_text/saved_model/transformer_imdb_experiment_report.md`: Experiment summary report generated after training.
- `assignment_text/saved_model/transformer_imdb.pt`: Compatibility summary file containing best-model metadata and all results.

Within `c1.py`, the data preprocessing pipeline is implemented by `tokenize`, `build_vocab`, and `preprocess_data`. The `tokenize` function normalizes and splits raw review text into lowercase word tokens, `build_vocab` constructs a frequency-based mapping from tokens to integer indices using only the training texts (reserving indices 0 and 1 for padding and unknown tokens), and `preprocess_data` converts each review into a fixed-length sequence of token IDs via truncation and padding. The `IMDBDataset` class wraps these sequences and labels into a PyTorch Dataset, and `DataLoader` objects provide mini-batches for training and validation.

The model components are built in a modular fashion. `PositionalEncoding` implements sinusoidal position encodings that are added to token embeddings, `MultiHeadAttention` performs scaled dot-product self-attention over the sequence across multiple heads, and `TransformerEncoderBlock` combines multi-head attention, a position-wise feed-forward network, residual connections, and layer normalization into a single encoder layer. The `TransformerClassifier` class composes an embedding layer, positional encoding, a stack of encoder blocks, global average pooling over the sequence dimension, and a final linear layer that outputs logits for the two sentiment classes.

The training and evaluation logic is encapsulated in `train_model` and `evaluate_model`. `train_model` iterates over the training DataLoader for a specified number of epochs, computes the cross-entropy loss, performs backpropagation and optimizer updates (Adam with an initial learning rate of 0.001), applies a StepLR scheduler that halves the learning rate every two epochs, and reports validation accuracy, precision, recall, and F1-score at the end of each epoch. `evaluate_model` runs the model in evaluation mode on a given DataLoader and aggregates predictions and labels to compute the same metrics. The function `load_imdb_texts` serves as a thin wrapper around the HuggingFace `datasets.load_dataset` API and clearly documents that the underlying data originates from the Large Movie Review Dataset (Maas et al., 2011).

To reproduce the experiments, the main dependencies are PyTorch, scikit-learn, NumPy, and the HuggingFace `datasets` library. The script is designed to use a CUDA GPU if available, otherwise it falls back to CPU.

Global hyperparameters are defined near the end of `c1.py`:

- `MAX_VOCAB = 5000` (maximum vocabulary size)
- `MAX_LEN = 64` (maximum sequence length in tokens)
- `BATCH_SIZE = 32` (mini-batch size)
- `EPOCHS = 5` (number of training epochs)
- `LR = 0.001` (initial learning rate for Adam)

Model-size-specific hyperparameters are defined in `MODEL_SIZES`:

- `small`: `d_model=64`, `num_heads=4`, `num_layers=1`, `d_ff=128`
- `medium`: `d_model=128`, `num_heads=8`, `num_layers=2`, `d_ff=256`
- `large`: `d_model=256`, `num_heads=8`, `num_layers=4`, `d_ff=512`

Running `c1.py` end-to-end will (1) download and preprocess the IMDB training split, (2) train all three model sizes on the training portion, (3) evaluate each model on the validation portion, and (4) save checkpoints plus an experiment report to the `assignment_text/saved_model` directory. Running `c1_analysis.py` with `--model_size` then allows targeted evaluation and qualitative error analysis for the selected size. With these components documented, the experiment is fully reproducible and can be extended for further study.