minBERT — Finetuning BERT with Contrastive Learning for Sentiment Analysis

minBERT is a minimal re-implementation of BERT (Bidirectional Encoder Representations from Transformers), built from scratch in PyTorch and used to study how contrastive learning (SimCSE) improves the quality of sentence embeddings for sentiment analysis.

The idea: instead of relying on task-specific architectures, we take the contextual embedding of the [CLS] token produced by minBERT, attach a single classification layer on top, and compare how well the model performs when its embeddings are refined with contrastive pre-finetuning versus used as-is.

Model architecture

minBERT follows the original BERT-base design and loads pretrained bert-base-uncased weights:

Tokenization — a WordPiece tokenizer converts raw text into token ids, adds special tokens ([CLS] at the start, [PAD] for batching, [UNK] for out-of-vocabulary words), and produces the attention mask.
Embedding layer — each token is mapped to the sum of its token embedding and position embedding (plus a segment embedding placeholder), followed by LayerNorm and dropout. Hidden size is 768.
Encoder — a stack of 12 Transformer layers; each layer applies multi-head self-attention, an add & norm step, a feed-forward network with GELU activation, and another add & norm step.
Pooler — the final hidden state of the [CLS] token is passed through a linear layer with tanh activation to obtain the sentence representation.
Classifier head — dropout followed by a single linear layer over the pooled [CLS] embedding, producing the predicted sentiment label.
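
The classifier head described above can be sketched as follows. This is a minimal illustration, not the repository's code: the class name SentimentHead, the dropout probability of 0.1, and the random tensor standing in for the pooled [CLS] output are all assumptions.

```python
import torch
import torch.nn as nn


class SentimentHead(nn.Module):
    """Dropout + one linear layer over a pooled [CLS] embedding (hypothetical name)."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 5, dropout_prob: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout_prob)  # dropout_prob is an assumed value
        self.linear = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_cls: torch.Tensor) -> torch.Tensor:
        # pooled_cls: (batch, hidden_size) -> logits: (batch, num_labels)
        return self.linear(self.dropout(pooled_cls))


torch.manual_seed(0)
head = SentimentHead()
head.eval()  # disable dropout for a deterministic forward pass

pooled = torch.randn(4, 768)       # stand-in for the encoder's pooled [CLS] output
logits = head(pooled)              # one score per sentiment class
predicted = logits.argmax(dim=-1)  # predicted label per sentence
```

With num_labels=5 this matches the 5-class SST setup; a binary dataset such as CFIMDB would use num_labels=2.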

The AdamW optimizer (with first/second moment estimates, bias correction, decoupled weight decay) is used for all training runs.
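
To make the pieces of the update concrete, here is a sketch of one AdamW step for a single scalar parameter. It follows the standard formulation (decay applied to the parameter directly rather than added to the gradient); the function name adamw_step and all hyperparameter defaults are assumptions, and the repository's optimizer may order the steps differently.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v being initialised at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive step plus decoupled weight decay, both applied to the old parameter.
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v


new_param, new_m, new_v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly lr regardless of the gradient's magnitude.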

SimCSE contrastive finetuning

minBERT's embeddings can be refined with the SimCSE framework. Two variants are implemented:

Unsupervised SimCSE — a positive pair is the same sentence passed through the encoder twice with different dropout masks; other sentences in the batch serve as negatives. Trained on ~115k samples from Amazon Polarity.
Supervised SimCSE — positive/negative pairs come from NLI entailment/contradiction triplets (anchor, positive, hard negative). Trained on ~265k samples.
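
The dropout-based positive pair used by the unsupervised variant can be illustrated with a toy encoder. This sketch only demonstrates the mechanism; toy_encoder is a hypothetical stand-in for minBERT, and the layer sizes and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in encoder: only the dropout behaviour matters for this illustration.
toy_encoder = nn.Sequential(nn.Linear(32, 32), nn.Tanh(), nn.Dropout(p=0.1))
toy_encoder.train()  # dropout must be active, otherwise both views are identical

batch = torch.randn(4, 32)    # stand-in for a batch of 4 encoded sentences
view_a = toy_encoder(batch)   # first pass, first dropout mask
view_b = toy_encoder(batch)   # second pass, different dropout mask

# view_a[i] and view_b[i] form a positive pair; view_b[j] with j != i acts as a negative.
```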

Both use a contrastive cross-entropy loss over cosine similarities and are evaluated each epoch on STS-B via Spearman correlation (best checkpoint kept). The finetuned weights are included in minbert-model/ (unsup-cse-bert.pth, sup-cse-bert.pth).
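
A contrastive cross-entropy loss over cosine similarities might look like the sketch below, which handles both variants: in-batch negatives only (unsupervised) and in-batch negatives plus hard negatives (supervised). The function name simcse_loss and the temperature of 0.05 are assumptions; the value used in this repository is not stated here.

```python
import torch
import torch.nn.functional as F


def simcse_loss(anchor, positive, hard_negative=None, temperature=0.05):
    """Cross-entropy over cosine similarities with in-batch negatives."""
    # (batch, batch) matrix: similarity of every anchor to every positive.
    sim = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=-1)
    if hard_negative is not None:
        # Supervised case: append similarities to the hard negatives as extra columns.
        neg_sim = F.cosine_similarity(anchor.unsqueeze(1), hard_negative.unsqueeze(0), dim=-1)
        sim = torch.cat([sim, neg_sim], dim=1)
    # The correct "class" for anchor i is column i, its own positive.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(sim / temperature, labels)


torch.manual_seed(0)
anchors = torch.randn(8, 768)
positives = anchors + 0.1 * torch.randn(8, 768)  # stand-in for a second dropout view
negatives = torch.randn(8, 768)                  # stand-in for contradiction sentences

loss_unsup = simcse_loss(anchors, positives)
loss_sup = simcse_loss(anchors, positives, negatives)
```

Treating the similarity matrix as logits lets a standard cross-entropy call do the work: each row is a classification problem whose correct answer is the matching positive.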

Workflows

Everything runs through run.py:

1. Contrastive finetuning (SimCSE)

python run.py --task finetune --model unsup   # unsupervised SimCSE on Amazon Polarity
python run.py --task finetune --model sup     # supervised SimCSE on NLI

2. Sentiment classifier training & evaluation

python run.py --task train --model base  --dataset sst    --train-mode last-linear
python run.py --task train --model sup   --dataset cfimdb --train-mode full-model
python run.py --task train --model unsup --dataset sst    --train-mode full-model

--model: which encoder to start from — base (pretrained BERT), sup / unsup (SimCSE-finetuned).
--dataset: sst (Stanford Sentiment Treebank, 5-class) or cfimdb (binary movie reviews).
--train-mode: last-linear freezes BERT and trains only the classifier head; full-model finetunes everything.
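
The difference between the two train modes comes down to which parameters receive gradients. A minimal sketch, assuming a hypothetical helper configure_train_mode and small linear layers standing in for the real encoder and classifier head:

```python
import torch.nn as nn


def configure_train_mode(encoder: nn.Module, train_mode: str) -> None:
    """Freeze the encoder for 'last-linear'; leave it trainable for 'full-model'."""
    if train_mode not in ("last-linear", "full-model"):
        raise ValueError(f"unknown train mode: {train_mode}")
    trainable = train_mode == "full-model"
    for param in encoder.parameters():
        param.requires_grad = trainable


# Tiny stand-in modules; the real encoder is the 12-layer minBERT.
encoder = nn.Linear(768, 768)
classifier = nn.Linear(768, 5)

configure_train_mode(encoder, "last-linear")

# Only parameters that still require gradients would be handed to the optimizer.
trainable_params = [
    p for module in (encoder, classifier) for p in module.parameters() if p.requires_grad
]
```

In last-linear mode only the classifier's weight and bias remain trainable, which is why that mode isolates the quality of the frozen embeddings.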

Each run trains for 10 epochs, reports accuracy and macro-F1 on the dev set, and keeps the best model.
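
For reference, accuracy and macro-F1 can be computed as in the sketch below. This is a plain-Python illustration with made-up labels, not the evaluation code from run.py; macro-F1 here is the unweighted mean of per-class F1 over every label that appears in either list.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels for a 3-class toy example.
gold = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

Because macro-F1 weights every class equally, it penalises a model that ignores rare classes more than accuracy does, which matters for a 5-class dataset such as SST.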

Results

After contrastive finetuning, each of the three encoders (the pretrained baseline, unsupervised SimCSE, and supervised SimCSE) was used to train a sentiment classifier on SST and CFIMDB in both train modes.

Model	SST (last-linear)	SST (full-model)	CFIMDB (last-linear)	CFIMDB (full-model)
Pretrained minBERT	0.393	0.524	0.804	0.963
minBERT + SimCSE (unsup)	0.423	0.516	0.833	0.963
minBERT + SimCSE (sup)	0.464	0.523	0.931	0.971

The dev-set accuracy shows:

last-linear: with the encoder frozen (only the classifier head is trained, not the 12 BERT layers), supervised SimCSE raises accuracy over the baseline from 0.804 to 0.931 on CFIMDB and from 0.393 to 0.464 on SST. Unsupervised SimCSE also improves on both datasets (0.833 and 0.423), landing between the baseline and supervised SimCSE.
full-model: supervised SimCSE scores highest on CFIMDB (0.971 vs. 0.963 for the other two encoders), while all three encoders are roughly on par on SST (0.516 to 0.524). Full finetuning largely washes out the differences in starting embeddings.
