---
license: apache-2.0
datasets:
- nyu-mll/glue
- stanfordnlp/sst2
base_model:
- google-bert/bert-base-uncased
tags:
- sentiment-analysis
- text-classification
- transformers
- pytorch
- bert
- sst2
- glue
pipeline_tag: text-classification
---

# BERT-base-uncased fine-tuned on SST-2 (GLUE)

This repository contains a `bert-base-uncased` model fine-tuned for **binary sentiment classification** on the [GLUE/SST-2](https://huggingface.co/datasets/glue/viewer/sst2) dataset.
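
A minimal inference sketch using the `pipeline` API. The repo id below is a placeholder, not the published model name; substitute this repository's actual id.

```python
from transformers import pipeline

# "your-username/bert-sst2" is a placeholder; use this repository's actual id.
classifier = pipeline("text-classification", model="your-username/bert-sst2")

print(classifier("A touching and beautifully acted film."))
# e.g. [{'label': 'LABEL_1', 'score': 0.99}] -- LABEL_1 = positive unless
# id2label was configured during training.
```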

## Model summary

- **Task**: sentiment analysis (binary classification)
- **Labels**: negative (`0`), positive (`1`)
- **Base model**: `bert-base-uncased`
- **Library**: Transformers (`Trainer` API)
- **Note**: In the training notebook, the model was fine-tuned on a small subset (640 train / 640 validation) for demonstration purposes. For production use, fine-tune on the full dataset and validate thoroughly.

## Intended uses

### ✅ Supported

- Quick demos of sentiment classification on English sentences
- Educational examples of fine-tuning with `Trainer`
- Baseline experiments on SST-2-like sentiment data

### ⚠️ Not recommended

- High-stakes or safety-critical decisions (medical, legal, hiring, etc.)
- Domains significantly different from SST-2 (e.g., clinical notes, financial news) without further fine-tuning
- Non-English text (the model and training data are English-focused)

## Limitations and biases

- **Dataset bias**: SST-2 reflects the sentiment distribution and language patterns of movie reviews; performance may degrade on other domains.
- **Small fine-tuning subset**: results reported here come from a 640-example subset and are not representative of the full SST-2 benchmark.
- **Short-text behavior**: very short, ambiguous, or sarcastic statements can be misclassified.
- **Offensive/toxic content**: the model may output confident predictions on harmful text; it does not provide safety filtering.
## Training data |
|
|
|
|
|
Fine-tuning used the GLUE benchmark dataset configuration **SST-2** (Stanford Sentiment Treebank v2 as used in GLUE). |
|
|
|
|
|
- **Dataset**: `glue`, config `sst2` |
|
|
- **Text field**: `sentence` |
|
|
- **Label field**: `label` (`0`/`1`) |
|
|
|
|
|
In the provided Colab: |
|
|
- `train`: selected `range(640)` |
|
|
- `validation`: selected `range(640)` |
|
|
- `test`: predictions generated without labels (GLUE test split) |
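
A sketch of how those subsets can be loaded with 🤗 Datasets, mirroring the `range(640)` selection described above.

```python
from datasets import load_dataset

# Load the SST-2 configuration of GLUE and take the first 640 examples
# of the train and validation splits, as in the Colab.
raw = load_dataset("glue", "sst2")
train_ds = raw["train"].select(range(640))
eval_ds = raw["validation"].select(range(640))

print(train_ds[0])  # {'sentence': '...', 'label': 0, 'idx': 0}
```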

## Training procedure

### Preprocessing

- Tokenizer: `AutoTokenizer.from_pretrained("bert-base-uncased")`
- Truncation enabled (`truncation=True`)
- Dynamic padding via `DataCollatorWithPadding` (see the sketch after this list)
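
A self-contained preprocessing sketch matching the steps above; the `tokenize` helper name is illustrative, not taken from the notebook.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate to the model's max length; padding is deferred to the collator.
    return tokenizer(batch["sentence"], truncation=True)

dataset = load_dataset("glue", "sst2")
tokenized = dataset.map(tokenize, batched=True)

# Pads each batch to its longest sequence instead of a fixed global length.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```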

### Hyperparameters (from the Colab)

- `epochs`: 3
- `learning_rate`: 2e-5
- `batch_size`: 16 (per device)
- `weight_decay`: 0.01
- `evaluation`: each epoch
- `checkpointing`: each epoch
- `best model selection`: highest validation accuracy
- `logging`: disabled (`report_to="none"`)

These map onto `TrainingArguments` as sketched below.
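
A sketch of the corresponding `Trainer` setup, continuing from the preprocessing block above. Argument names vary slightly across `transformers` versions (noted in comments), and `output_dir` is a placeholder.

```python
import numpy as np
import evaluate
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

args = TrainingArguments(
    output_dir="bert-sst2",              # placeholder
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    eval_strategy="epoch",               # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].select(range(640)),
    eval_dataset=tokenized["validation"].select(range(640)),
    tokenizer=tokenizer,                 # "processing_class" on newer versions
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
```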

## Results (validation)

On the 640-example validation subset:

- **Accuracy**: 0.8625
- **Loss**: 0.3392

> *(Optional: add confusion matrix, F1, etc. if available)*
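
If you want to fill in those extra metrics, a sketch reusing `trainer` and `tokenized` from the training block above:

```python
import numpy as np
import evaluate

# Predict on the same 640-example validation subset used during training.
preds = trainer.predict(tokenized["validation"].select(range(640)))
pred_labels = np.argmax(preds.predictions, axis=-1)

f1 = evaluate.load("f1")
print(f1.compute(predictions=pred_labels, references=preds.label_ids))
```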