---
language: en
license: apache-2.0
datasets:
- imdb
metrics:
- accuracy
base_model: distilbert-base-uncased
pipeline_tag: text-classification
tags:
- sentiment-analysis
- imdb
- text-classification
- distilbert
library_name: transformers
---
|
|
|
|
|
## Model Description |
|
|
`grojeda/distilbert-sentiment-imdb` fine-tunes `distilbert-base-uncased` for binary sentiment classification on the IMDB Large Movie Review dataset. Reviews are tokenized with the base model's WordPiece tokenizer, truncated to 384 tokens, and padded dynamically per batch via `DataCollatorWithPadding`. Training uses Hugging Face Transformers (PyTorch); the resulting weights are released under Apache 2.0, matching the base checkpoint, and use of the IMDB data remains subject to the dataset's own terms.
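The preprocessing just described can be reproduced in a few lines. The following is a minimal sketch assuming the standard `datasets` and `transformers` APIs; variable names are illustrative, not the exact training script:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Load the IMDB dataset and the base WordPiece tokenizer.
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(batch):
    # Truncate to 384 tokens; padding is deferred to the collator.
    return tokenizer(batch["text"], truncation=True, max_length=384)

tokenized = dataset.map(preprocess, batched=True, remove_columns=["text"])

# Pad each batch dynamically to its longest sequence.
collator = DataCollatorWithPadding(tokenizer=tokenizer)
```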
|
|
|
|
|
## Intended Uses & Limitations |
|
|
- Recommended for English sentiment analysis of movie reviews and similar user-generated review text, benchmarking compact BERT-family encoders, and serving as a starting point for domain adaptation.
|
|
- Not suitable for multilingual sentiment, sarcasm detection, fine-grained emotion tagging, or high-stakes moderation without human review. |
|
|
- Known limitations include potential degradation on inputs longer than 384 tokens, sensitivity to domain shifts (legal/medical jargon), and lack of calibration for imbalanced datasets. |
|
|
- Ethical considerations: the model may reproduce societal biases present in IMDB reviews; avoid using outputs for demographic inference or automated enforcement without auditing. |
|
|
|
|
|
## Training Details |
|
|
- **Dataset:** `imdb` from Hugging Face Datasets. The 25k training split was partitioned 90/10 (seed 42) into train (≈22.5k) and validation (≈2.5k). The official 25k test split remained untouched until evaluation. |
|
|
- **Hyperparameters:** AdamW optimizer, learning rate 2e-5 with the Trainer's default linear decay schedule, weight decay 0.01, max length 384, per-device batch sizes 16 (train) / 32 (eval), no gradient accumulation, dropout at DistilBERT defaults. A configuration sketch follows this list.
|
|
- **Schedule:** 2 epochs (~2,814 optimizer steps). Evaluation and checkpointing occur at each epoch with best-checkpoint reloading enabled. |
|
|
- **Compute:** Trained on a single consumer GPU (RTX 3050 4GB). Environment: Transformers 4.57.3 and PyTorch 2.9.1. |
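The settings above map directly onto `TrainingArguments`. The sketch below continues from the preprocessing example in the Model Description (reusing `tokenized` and `collator`); the output directory and the metric hook are illustrative, not the exact training script:

```python
import numpy as np
import evaluate
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Reproduce the 90/10 partition of the 25k training split with seed 42.
splits = tokenized["train"].train_test_split(test_size=0.1, seed=42)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1),
                            references=labels)

args = TrainingArguments(
    output_dir="distilbert-sentiment-imdb",  # illustrative path
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    eval_strategy="epoch",        # evaluate at each epoch ...
    save_strategy="epoch",        # ... and checkpoint alongside it
    load_best_model_at_end=True,  # reload the best checkpoint when done
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],  # ~22.5k examples
    eval_dataset=splits["test"],    # ~2.5k validation examples
    data_collator=collator,
    compute_metrics=compute_metrics,
)
trainer.train()
```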
|
|
|
|
|
## Evaluation Results |
|
|
```json
{
  "task": {
    "type": "text-classification",
    "name": "Sentiment Analysis"
  },
  "dataset": {
    "name": "imdb",
    "type": "imdb",
    "split": "test"
  },
  "metrics": [
    { "type": "accuracy", "name": "Accuracy", "value": 0.9256 },
    { "type": "loss", "name": "CrossEntropyLoss", "value": 0.2435 }
  ]
}
```
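These figures can be recomputed against the held-out test split. This is a minimal sketch using the `evaluate` library; the batch size of 32 mirrors the evaluation setting, and device placement is omitted for brevity:

```python
import torch
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "grojeda/distilbert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

test = load_dataset("imdb", split="test")
accuracy = evaluate.load("accuracy")

preds = []
for i in range(0, len(test), 32):
    batch = test[i : i + 32]
    inputs = tokenizer(
        batch["text"], truncation=True, max_length=384,
        padding=True, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    preds.extend(logits.argmax(dim=-1).tolist())

print(accuracy.compute(predictions=preds, references=test["label"]))
```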
|
|
|
|
|
## How to Use |
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "grojeda/distilbert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for inference

text = "The pacing drags, but the performances are heartfelt."
# Truncate to 384 tokens to match the training configuration.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=384)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

label_id = int(probs.argmax())
print(model.config.id2label[label_id], float(probs[0, label_id]))
```
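For quick experiments, the `pipeline` helper bundles tokenization, inference, and softmax into one call; label names in the output follow `model.config.id2label`:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="grojeda/distilbert-sentiment-imdb",
    truncation=True,
    max_length=384,
)
# Returns [{'label': ..., 'score': ...}] per input string.
print(classifier("A surprisingly tender film with a clumsy third act."))
```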
|
|
|
|
|
## Limitations & Biases |
|
|
- Domain bias toward movie reviews; expect weaker performance on other domains without fine-tuning. |
|
|
- Only English data was used; multilingual inputs are unsupported. |
|
|
- Dataset balance (50/50) can lead to overconfidence when deployed on skewed class distributions. |
|
|
- User-generated content may embed stereotypes or offensive language that the model can echo. |
|
|
|
|
|
## Citation |
|
|
```
@article{sanh2019distilbert,
  title   = {DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter},
  author  = {Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal = {NeurIPS EMC2 Workshop},
  year    = {2019}
}

@inproceedings{maas2011learning,
  title     = {Learning Word Vectors for Sentiment Analysis},
  author    = {Andrew L. Maas and Raymond E. Daly and Peter T. Pham and Dan Huang and Andrew Y. Ng and Christopher Potts},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  year      = {2011}
}
```