---
language: en
license: apache-2.0
datasets:
- imdb
metrics:
- accuracy
base_model: distilbert-base-uncased
pipeline_tag: text-classification
tags:
- sentiment-analysis
- imdb
- text-classification
- distilbert
library_name: transformers
---
## Model Description
`grojeda/distilbert-sentiment-imdb` fine-tunes `distilbert-base-uncased` for binary sentiment classification on the IMDB Large Movie Review dataset. Reviews are tokenized with the base model's WordPiece tokenizer, truncated to 384 tokens, and dynamically padded via `DataCollatorWithPadding`. Training uses Hugging Face Transformers (PyTorch); the resulting weights inherit the Apache 2.0 terms of the base checkpoint and the constraints of the dataset license.
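The preprocessing described above can be sketched as follows; this is a minimal, illustrative snippet assuming the standard `datasets` and `transformers` APIs rather than the exact training script.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Illustrative preprocessing sketch: truncate to 384 tokens, pad dynamically per batch.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # No padding here; DataCollatorWithPadding pads each batch to its longest member.
    return tokenizer(batch["text"], truncation=True, max_length=384)

imdb = load_dataset("imdb")
tokenized = imdb.map(tokenize, batched=True)
collator = DataCollatorWithPadding(tokenizer=tokenizer)
```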
## Intended Uses & Limitations
- Recommended for English sentiment analysis of movie reviews and similar review-style text, for benchmarking compact BERT-family encoders, and as a starting point for domain adaptation.
- Not suitable for multilingual sentiment, sarcasm detection, fine-grained emotion tagging, or high-stakes moderation without human review.
- Known limitations include potential degradation on inputs longer than 384 tokens, sensitivity to domain shifts (legal/medical jargon), and lack of calibration for imbalanced datasets.
- Ethical considerations: the model may reproduce societal biases present in IMDB reviews; avoid using outputs for demographic inference or automated enforcement without auditing.
## Training Details
- **Dataset:** `imdb` from Hugging Face Datasets. The 25k training split was partitioned 90/10 (seed 42) into train (≈22.5k) and validation (≈2.5k). The official 25k test split remained untouched until evaluation.
- **Hyperparameters:** learning rate 2e-5 (default scheduler), weight decay 0.01, AdamW optimizer, max length 384, per-device batch sizes 16 (train) / 32 (eval), no gradient accumulation, dropout per DistilBERT defaults; a condensed setup sketch follows this list.
- **Schedule:** 2 epochs (~2,814 optimizer steps). Evaluation and checkpointing occur at each epoch with best-checkpoint reloading enabled.
- **Compute:** Trained on a single consumer GPU (RTX 3050 4GB). Environment: Transformers 4.57.3 and PyTorch 2.9.1.
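For reference, a setup consistent with the bullets above might look like the sketch below. Argument names follow the Transformers `Trainer` API; the `output_dir` and helper names are placeholders, and the exact training script may differ.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=384)

# 90/10 split of the official 25k training set (seed 42); the 25k test set stays untouched.
splits = load_dataset("imdb", split="train").train_test_split(test_size=0.1, seed=42)
train_ds = splits["train"].map(tokenize, batched=True)
val_ds = splits["test"].map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="distilbert-sentiment-imdb",  # placeholder path
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    eval_strategy="epoch",        # evaluate at the end of each epoch
    save_strategy="epoch",        # checkpoint at the end of each epoch
    load_best_model_at_end=True,  # reload the best checkpoint after training
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
```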
## Evaluation Results
```json
{
  "task": {
    "type": "text-classification",
    "name": "Sentiment Analysis"
  },
  "dataset": {
    "name": "imdb",
    "type": "imdb",
    "split": "test"
  },
  "metrics": [
    {
      "type": "accuracy",
      "name": "Accuracy",
      "value": 0.9256
    },
    {
      "type": "loss",
      "name": "CrossEntropyLoss",
      "value": 0.2435
    }
  ]
}
```
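A score like the accuracy above could be re-computed along the following lines, using the `evaluate` library on the official test split; this is an illustrative sketch, not the exact evaluation script.

```python
import evaluate
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "grojeda/distilbert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

accuracy = evaluate.load("accuracy")
test = load_dataset("imdb", split="test")

# Small batched loop; adjust the batch size to available memory.
for i in range(0, len(test), 32):
    batch = test[i : i + 32]
    inputs = tokenizer(batch["text"], truncation=True, max_length=384,
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        preds = model(**inputs).logits.argmax(dim=-1)
    accuracy.add_batch(predictions=preds, references=batch["label"])

print(accuracy.compute())
```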
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "grojeda/distilbert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "The pacing drags, but the performances are heartfelt."
# Match the training setup: truncate inputs longer than 384 tokens.
inputs = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

label_id = int(probs.argmax())
print(model.config.id2label[label_id], float(probs[0, label_id]))
```
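For quick experiments, the same checkpoint can also be loaded through the high-level `pipeline` API; the snippet below is an equivalent alternative to the manual loop above.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="grojeda/distilbert-sentiment-imdb")
print(classifier("The pacing drags, but the performances are heartfelt."))
# -> [{'label': ..., 'score': ...}]  (label names follow the checkpoint's id2label mapping)
```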
## Limitations & Biases
- Domain bias toward movie reviews; expect weaker performance on other domains without fine-tuning.
- Only English data was used; multilingual inputs are unsupported.
- Dataset balance (50/50) can lead to overconfidence when deployed on skewed class distributions.
- User-generated content may embed stereotypes or offensive language that the model can echo.
## Citation
```
@article{sanh2019distilbert,
  title   = {DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter},
  author  = {Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal = {NeurIPS EMC2 Workshop},
  year    = {2019}
}
@inproceedings{maas2011learning,
  title     = {Learning Word Vectors for Sentiment Analysis},
  author    = {Andrew L. Maas and Raymond E. Daly and Peter T. Pham and Dan Huang and Andrew Y. Ng and Christopher Potts},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  year      = {2011}
}
```