BERT + Mean Pooling + MLP for RTE (EEE 486/586 Assignment - Part 2)

This model is a fine-tuned version of bert-base-uncased on the RTE (Recognizing Textual Entailment) task from the GLUE benchmark. It was developed as part of the EEE 486/586 Statistical Foundations of Natural Language Processing course assignment (Part 2).

Model Architecture

This model explores an alternative to the standard BertForSequenceClassification architecture:

  • Uses the standard bert-base-uncased model to obtain token embeddings (last_hidden_state).
  • Mean Pooling: Instead of using the [CLS] token's pooler output, it calculates the mean of the last_hidden_state across all non-padding tokens (using the attention mask) to get a single sequence representation vector.
  • MLP Classifier Head: The mean-pooled representation is passed through dropout and then a multi-layer perceptron (MLP) head for classification. The MLP structure was determined by the hyperparameter search (hidden_size_multiplier=4).
  • The final layer outputs logits for the 2 classes (entailment/not_entailment).

Note: Because this uses a custom architecture (BertMeanPoolClassifier), it cannot be loaded directly using AutoModelForSequenceClassification.from_pretrained(). You need the model's class definition (provided in the assignment code/report) and then load the state_dict (pytorch_model.bin) into an instance of that class.
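The architecture described above can be sketched in PyTorch. This is a minimal reconstruction, not the assignment's actual code: the activation function, the number of MLP layers, and the class internals are assumptions; only the mean pooling over non-padding tokens, dropout=0.4, hidden_size_multiplier=4, and the 2-class output come from this card.

```python
import torch
import torch.nn as nn
from transformers import BertModel

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (B, 1)
    return summed / counts

class BertMeanPoolClassifier(nn.Module):
    # Sketch of the custom architecture; the exact MLP layout is an
    # assumption, not the assignment's code.
    def __init__(self, num_labels=2, dropout=0.4, hidden_size_multiplier=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size  # 768 for bert-base
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden * hidden_size_multiplier),
            nn.GELU(),
            nn.Linear(hidden * hidden_size_multiplier, num_labels),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = mean_pool(outputs.last_hidden_state, attention_mask)
        return self.classifier(self.dropout(pooled))  # logits, shape (B, 2)
```

To load the released weights, instantiate the class (ideally the actual class definition from the assignment code/report) and call `model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))`.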

Performance

The model was trained using hyperparameters found via Optuna. The final training run (5 epochs, keeping the checkpoint with the best validation accuracy) achieved the following:

  • Best Validation Accuracy: 0.6931 (achieved at Epoch 3)
  • Final Validation Accuracy (Epoch 5): 0.6823
  • Final Validation Loss (Epoch 5): 1.4258
  • Final Training Loss (Epoch 5): 0.0797

The model fit the training data closely (training loss 0.0797 by epoch 5) but began overfitting after epoch 3, as the rising validation loss indicates. The checkpoint with the best validation accuracy was the one saved.

Best Hyperparameters (from Optuna)

  • Learning Rate: 3.518e-05
  • Max Sequence Length: 128
  • Dropout Rate (Classifier): 0.4
  • Batch Size: 16
  • Hidden Size Multiplier: 4
  • Epochs (Optuna Best Trial): 3

Intended Use & Limitations

This model is intended for the RTE sentence-pair task in the context of this course assignment and has not been evaluated on other datasets; the gap between training and validation loss suggests limited generalization. Due to its custom architecture, direct loading via AutoModelForSequenceClassification is not supported.

