# BERT + Mean Pooling + MLP for RTE (EEE 486/586 Assignment - Part 2)

This model is a fine-tuned version of `bert-base-uncased` on the RTE (Recognizing Textual Entailment) task from the GLUE benchmark. It was developed as part of the EEE 486/586 Statistical Foundations of Natural Language Processing course assignment (Part 2).
## Model Architecture
This model explores an alternative to the standard BertForSequenceClassification architecture:
- **BERT Encoder:** Uses the standard `bert-base-uncased` model to obtain token embeddings (`last_hidden_state`).
- **Mean Pooling:** Instead of using the `[CLS]` token's pooler output, it calculates the mean of the `last_hidden_state` across all non-padding tokens (using the attention mask) to get a single sequence representation vector.
- **MLP Classifier Head:** The mean-pooled representation is passed through dropout and then a multi-layer perceptron (MLP) head for classification. The MLP structure was determined by the hyperparameter search (`hidden_size_multiplier=4`).
- The final layer outputs logits for the 2 classes (entailment / not_entailment).
**Note:** Because this uses a custom architecture (`BertMeanPoolClassifier`), it cannot be loaded directly using `AutoModelForSequenceClassification.from_pretrained()`. You need the model's class definition (provided in the assignment code/report) and then load the `state_dict` (`pytorch_model.bin`) into an instance of that class.
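The exact class definition ships with the assignment code; the sketch below shows how such a `BertMeanPoolClassifier` and the `state_dict` loading step might look. Layer names and constructor arguments here are illustrative assumptions and may not match the released checkpoint's keys exactly. The sketch uses a tiny randomly initialised `BertConfig` so it runs without downloading weights; the real model wraps `BertModel.from_pretrained("bert-base-uncased")`.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

class BertMeanPoolClassifier(nn.Module):
    """Mean-pools BERT token embeddings and classifies with an MLP head."""

    def __init__(self, bert, dropout=0.4, hidden_size_multiplier=4, num_labels=2):
        super().__init__()
        self.bert = bert
        h = bert.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        # MLP head widened by the hidden_size_multiplier found in the search
        self.mlp = nn.Sequential(
            nn.Linear(h, h * hidden_size_multiplier),
            nn.ReLU(),
            nn.Linear(h * hidden_size_multiplier, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Zero out padding positions, then average over the sequence dimension
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.mlp(self.dropout(pooled))

# Tiny config so this sketch runs standalone; swap in the pretrained backbone
# (BertModel.from_pretrained("bert-base-uncased")) to match the real model.
config = BertConfig(hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64,
                    vocab_size=100, max_position_embeddings=64)
model = BertMeanPoolClassifier(BertModel(config))

# Restoring the released weights would then look like:
# model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))

ids = torch.randint(0, 100, (2, 10))
mask = torch.ones(2, 10, dtype=torch.long)
logits = model(ids, mask)
print(logits.shape)  # torch.Size([2, 2]) — one logit pair per sequence
```

The `clamp(min=1e-9)` guards against division by zero for an all-padding row; dividing by the mask sum (rather than the sequence length) is what makes the pooling ignore padding tokens.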
## Performance
The model was trained using hyperparameters found via Optuna. The final training run (5 epochs with early stopping based on validation accuracy) achieved the following:
- Best Validation Accuracy: 0.6931 (achieved at Epoch 3)
- Final Validation Accuracy (Epoch 5): 0.6823
- Final Validation Loss (Epoch 5): 1.4258
- Final Training Loss (Epoch 5): 0.0797
The model fit the training data strongly but began overfitting after epoch 3, as indicated by the rising validation loss alongside the falling training loss. The best checkpoint by validation accuracy was saved.
## Best Hyperparameters (from Optuna)
| Hyperparameter | Value |
|---|---|
| Learning Rate | 3.518e-05 |
| Max Sequence Length | 128 |
| Dropout Rate (Classifier) | 0.4 |
| Batch Size | 16 |
| Hidden Size Multiplier | 4 |
| Epochs (Optuna Best Trial) | 3 |
## Intended Use & Limitations
This model is intended for the RTE task as part of the specific course assignment. Due to its custom architecture, direct loading via `AutoModelForSequenceClassification` is not supported.