---
language: en
license: mit
datasets:
- glue/rte
tags:
- text-classification
- glue
- bert
- recognizing-textual-entailment
- assignment
- mean-pooling
metrics:
- accuracy
---

# BERT + Mean Pooling + MLP for RTE (EEE 486/586 Assignment - Part 2)

This model is a fine-tuned version of `bert-base-uncased` on the RTE (Recognizing Textual Entailment) task from the GLUE benchmark. It was developed as part of the EEE 486/586 Statistical Foundations of Natural Language Processing course assignment (Part 2).

## Model Architecture

This model explores an alternative to the standard `BertForSequenceClassification` architecture:

- Uses the standard `bert-base-uncased` model to obtain token embeddings (`last_hidden_state`).
- **Mean Pooling:** Instead of using the [CLS] token's pooler output, it averages the `last_hidden_state` over all non-padding tokens (using the attention mask) to obtain a single sequence representation vector.
- **MLP Classifier Head:** The mean-pooled representation is passed through dropout and then a multi-layer perceptron (MLP) head for classification. The MLP structure was determined by the hyperparameter search (`hidden_size_multiplier=4`).
- The final layer outputs logits for the 2 classes (entailment / not_entailment).

**Note:** Because this uses a custom architecture (`BertMeanPoolClassifier`), it cannot be loaded directly with `AutoModelForSequenceClassification.from_pretrained()`. You need the model's class definition (provided in the assignment code/report), then load the `state_dict` (`pytorch_model.bin`) into an instance of that class.

## Performance

The model was trained using hyperparameters found via Optuna.
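The architecture described above could be sketched roughly as below. This is a reconstruction, not the assignment's actual class: the exact MLP depth, activation function, and default sizes are assumptions, while `hidden_size_multiplier=4` and the mean-pooling-over-attention-mask scheme come from this card.

```python
import torch
import torch.nn as nn


class BertMeanPoolClassifier(nn.Module):
    """Sketch of the custom head: BERT encoder -> mean pooling -> MLP.

    `bert` is expected to be an encoder such as
    BertModel.from_pretrained("bert-base-uncased"); the MLP layout here
    (one hidden layer with GELU) is an assumption.
    """

    def __init__(self, bert, hidden_size=768, hidden_size_multiplier=4,
                 dropout=0.4, num_labels=2):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(dropout)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size * hidden_size_multiplier),
            nn.GELU(),
            nn.Linear(hidden_size * hidden_size_multiplier, num_labels),
        )

    @staticmethod
    def mean_pool(last_hidden_state, attention_mask):
        # Zero out padding positions, then average over real tokens only.
        mask = attention_mask.unsqueeze(-1).float()
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.mean_pool(out.last_hidden_state, attention_mask)
        return self.mlp(self.dropout(pooled))


# The pooling step alone, on dummy activations (no BERT download needed):
dummy_hidden = torch.randn(2, 5, 768)
dummy_mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = BertMeanPoolClassifier.mean_pool(dummy_hidden, dummy_mask)
print(pooled.shape)  # one 768-dim vector per sequence
```

Dividing by the per-sequence token count (rather than the padded length) is what makes the pooled vector independent of how much padding each batch happens to contain.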
The final training run (5 epochs with early stopping based on validation accuracy) achieved the following:

- **Best Validation Accuracy:** **0.6931** (achieved at Epoch 3)
- Final Validation Accuracy (Epoch 5): 0.6823
- Final Validation Loss (Epoch 5): 1.4258
- Final Training Loss (Epoch 5): 0.0797

The model fit the training data strongly but showed signs of overfitting after epoch 3, as indicated by the rising validation loss. The best checkpoint by validation accuracy was saved.

## Best Hyperparameters (from Optuna)

| Hyperparameter             | Value     |
|----------------------------|-----------|
| Learning Rate              | 3.518e-05 |
| Max Sequence Length        | 128       |
| Dropout Rate (Classifier)  | 0.4       |
| Batch Size                 | 16        |
| Hidden Size Multiplier     | 4         |
| Epochs (Optuna Best Trial) | 3         |

## Intended Use & Limitations

This model is intended for the RTE task as part of the specific course assignment. Due to its custom architecture, direct loading via `AutoModelForSequenceClassification` is not supported.
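Since `from_pretrained()` will not work, the weights have to be loaded with PyTorch's generic `state_dict` pattern: rebuild the architecture from its class definition, then load `pytorch_model.bin` into it. A minimal sketch of that pattern follows, using a tiny stand-in module; in practice you would instantiate the real `BertMeanPoolClassifier` from the assignment code instead.

```python
import torch
import torch.nn as nn


# Stand-in for the custom classifier; in practice, instantiate
# BertMeanPoolClassifier from the assignment's class definition.
class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(x)


model = TinyClassifier()
torch.save(model.state_dict(), "pytorch_model.bin")

# Later / elsewhere: rebuild the same architecture, then load the weights.
restored = TinyClassifier()
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
restored.load_state_dict(state_dict)
restored.eval()  # disable dropout for inference
```

`load_state_dict` checks that parameter names and shapes match, so the reconstructed class must use the same layer names and sizes as the one that produced the checkpoint.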