---
language: en
license: apache-2.0
datasets:
- imdb
metrics:
- accuracy
base_model: distilbert-base-uncased
pipeline_tag: text-classification
tags:
- sentiment-analysis
- imdb
- text-classification
- distilbert
library_name: transformers
---
## Model Description
`grojeda/distilbert-sentiment-imdb` fine-tunes `distilbert-base-uncased` for binary sentiment classification on the IMDB Large Movie Review dataset. Reviews are tokenized with the base model's WordPiece tokenizer, truncated to 384 tokens, and dynamically padded via `DataCollatorWithPadding`. Training uses Hugging Face Transformers (PyTorch); the resulting weights inherit the Apache 2.0 terms of the base checkpoint and the constraints of the dataset license.
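The preprocessing described above can be sketched as follows; this is a minimal, illustrative snippet assuming the standard `datasets` and `transformers` APIs rather than the exact training script.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# Illustrative preprocessing sketch: truncate to 384 tokens, pad dynamically per batch.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # No padding here; DataCollatorWithPadding pads each batch to its longest member.
    return tokenizer(batch["text"], truncation=True, max_length=384)

imdb = load_dataset("imdb")
tokenized = imdb.map(tokenize, batched=True)
collator = DataCollatorWithPadding(tokenizer=tokenizer)
```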
## Intended Uses & Limitations
- Recommended for English sentiment analysis of movie reviews and similar review-style text, for benchmarking compact BERT-family encoders, and as a starting point for domain adaptation.
- Not suitable for multilingual sentiment, sarcasm detection, fine-grained emotion tagging, or high-stakes moderation without human review.
- Known limitations include potential degradation on inputs longer than 384 tokens, sensitivity to domain shifts (legal/medical jargon), and lack of calibration for imbalanced datasets.
- Ethical considerations: the model may reproduce societal biases present in IMDB reviews; avoid using outputs for demographic inference or automated enforcement without auditing.
## Training Details
- **Dataset:** `imdb` from Hugging Face Datasets. The 25k training split was partitioned 90/10 (seed 42) into train (≈22.5k) and validation (≈2.5k). The official 25k test split remained untouched until evaluation.
- **Hyperparameters:** learning rate 2e-5 (default scheduler), weight decay 0.01, AdamW optimizer, max length 384, per-device batch sizes 16 (train) / 32 (eval), no gradient accumulation, dropout per DistilBERT defaults; a condensed setup sketch follows this list.
- **Schedule:** 2 epochs (~2,814 optimizer steps). Evaluation and checkpointing occur at each epoch with best-checkpoint reloading enabled.
- **Compute:** Trained on a single consumer GPU (RTX 3050 4GB). Environment: Transformers 4.57.3 and PyTorch 2.9.1.
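For reference, a setup consistent with the bullets above might look like the sketch below. Argument names follow the Transformers `Trainer` API; the `output_dir` and helper names are placeholders, and the exact training script may differ.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=384)

# 90/10 split of the official 25k training set (seed 42); the 25k test set stays untouched.
splits = load_dataset("imdb", split="train").train_test_split(test_size=0.1, seed=42)
train_ds = splits["train"].map(tokenize, batched=True)
val_ds = splits["test"].map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="distilbert-sentiment-imdb",  # placeholder path
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    eval_strategy="epoch",        # evaluate at the end of each epoch
    save_strategy="epoch",        # checkpoint at the end of each epoch
    load_best_model_at_end=True,  # reload the best checkpoint after training
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
```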
## Evaluation Results
```json
{
  "task": {
    "type": "text-classification",
    "name": "Sentiment Analysis"
  },
  "dataset": {
    "name": "imdb",
    "type": "imdb",
    "split": "test"
  },
  "metrics": [
    {
      "type": "accuracy",
      "name": "Accuracy",
      "value": 0.9256
    },
    {
      "type": "loss",
      "name": "CrossEntropyLoss",
      "value": 0.2435
    }
  ]
}
```
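A score like the accuracy above could be re-computed along the following lines, using the `evaluate` library on the official test split; this is an illustrative sketch, not the exact evaluation script.

```python
import evaluate
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "grojeda/distilbert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

accuracy = evaluate.load("accuracy")
test = load_dataset("imdb", split="test")

# Small batched loop; adjust the batch size to available memory.
for i in range(0, len(test), 32):
    batch = test[i : i + 32]
    inputs = tokenizer(batch["text"], truncation=True, max_length=384,
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        preds = model(**inputs).logits.argmax(dim=-1)
    accuracy.add_batch(predictions=preds, references=batch["label"])

print(accuracy.compute())
```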
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "grojeda/distilbert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "The pacing drags, but the performances are heartfelt."
# Match the training setup: truncate inputs longer than 384 tokens.
inputs = tokenizer(text, truncation=True, max_length=384, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

label_id = int(probs.argmax())
print(model.config.id2label[label_id], float(probs[0, label_id]))
```
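For quick experiments, the same checkpoint can also be loaded through the high-level `pipeline` API; the snippet below is an equivalent alternative to the manual loop above.

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="grojeda/distilbert-sentiment-imdb")
print(classifier("The pacing drags, but the performances are heartfelt."))
# -> [{'label': ..., 'score': ...}]  (label names follow the checkpoint's id2label mapping)
```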
## Limitations & Biases
- Domain bias toward movie reviews; expect weaker performance on other domains without fine-tuning.
- Only English data was used; multilingual inputs are unsupported.
- Dataset balance (50/50) can lead to overconfidence when deployed on skewed class distributions.
- User-generated content may embed stereotypes or offensive language that the model can echo.
## Citation
```
@article{sanh2019distilbert,
  title   = {DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter},
  author  = {Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal = {NeurIPS EMC2 Workshop},
  year    = {2019}
}
@inproceedings{maas2011learning,
  title     = {Learning Word Vectors for Sentiment Analysis},
  author    = {Andrew L. Maas and Raymond E. Daly and Peter T. Pham and Dan Huang and Andrew Y. Ng and Christopher Potts},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  year      = {2011}
}
```