---
language: en
license: apache-2.0
datasets:
- imdb
metrics:
- accuracy
base_model: distilbert-base-uncased
pipeline_tag: text-classification
tags:
- sentiment-analysis
- imdb
- text-classification
- distilbert
library_name: transformers
---

## Model Description

`grojeda/distilbert-sentiment-imdb` fine-tunes `distilbert-base-uncased` for binary sentiment classification on the IMDB Large Movie Review dataset. Reviews are tokenized with the base WordPiece tokenizer, truncated to 384 tokens, and dynamically padded via `DataCollatorWithPadding`. Training uses Hugging Face Transformers (PyTorch); the resulting weights are released under Apache 2.0, matching the base checkpoint, and remain subject to the IMDB dataset's license terms.

## Intended Uses & Limitations

- Recommended for English sentiment analysis of movie reviews and similar review-style text, benchmarking compact BERT-family encoders, and serving as a starting point for domain adaptation.
- Not suitable for multilingual sentiment, sarcasm detection, fine-grained emotion tagging, or high-stakes moderation without human review.
- Known limitations include potential degradation on inputs longer than 384 tokens, sensitivity to domain shift (e.g. legal or medical jargon), and lack of probability calibration when class distributions are imbalanced.
- Ethical considerations: the model may reproduce societal biases present in IMDB reviews; avoid using outputs for demographic inference or automated enforcement without auditing.

## Training Details

- **Dataset:** `imdb` from Hugging Face Datasets. The 25k training split was partitioned 90/10 (seed 42) into train (≈22.5k) and validation (≈2.5k) sets. The official 25k test split remained untouched until final evaluation.
- **Hyperparameters:** learning rate 2e-5 (default scheduler), weight decay 0.01, AdamW optimizer, max length 384, per-device batch sizes 16 (train) / 32 (eval), no gradient accumulation, dropout per DistilBERT defaults.
- **Schedule:** 2 epochs (~2,814 optimizer steps). Evaluation and checkpointing occur at each epoch with best-checkpoint reloading enabled.
- **Compute:** Trained on a single consumer GPU (RTX 3050, 4 GB). Environment: Transformers 4.57.3 and PyTorch 2.9.1.

A hedged reproduction sketch built from these settings appears below, after the Limitations & Biases section.

## Evaluation Results

```json
{
  "task": {
    "type": "text-classification",
    "name": "Sentiment Analysis"
  },
  "dataset": {
    "name": "imdb",
    "type": "imdb",
    "split": "test"
  },
  "metrics": [
    { "type": "accuracy", "name": "Accuracy", "value": 0.9256 },
    { "type": "loss", "name": "CrossEntropyLoss", "value": 0.2435 }
  ]
}
```

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "grojeda/distilbert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "The pacing drags, but the performances are heartfelt."
# Truncate to the 384-token limit used during fine-tuning.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=384)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
label_id = int(probs.argmax())
print(model.config.id2label[label_id], float(probs[0, label_id]))
```

## Limitations & Biases

- Domain bias toward movie reviews; expect weaker performance on other domains without fine-tuning.
- Only English data was used; multilingual inputs are unsupported.
- Dataset balance (50/50) can lead to overconfidence when deployed on skewed class distributions.
- User-generated content may embed stereotypes or offensive language that the model can echo.
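## Reproduction Sketch

The exact training script is not published with this card, so the snippet below is a minimal sketch assembled from the Training Details above: the 90/10 split with seed 42, 384-token truncation, `DataCollatorWithPadding`, and the listed optimizer settings. Names such as `output_dir`, the use of the `evaluate` library for accuracy, and `metric_for_best_model="accuracy"` are assumptions, not a record of the original run.

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

base_model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base_model)

# 90/10 train/validation split carved from the official 25k IMDB train set (seed 42).
raw = load_dataset("imdb")
split = raw["train"].train_test_split(test_size=0.1, seed=42)

def tokenize(batch):
    # Truncate to 384 tokens; padding is applied dynamically by the collator at batch time.
    return tokenizer(batch["text"], truncation=True, max_length=384)

train_ds = split["train"].map(tokenize, batched=True)
val_ds = split["test"].map(tokenize, batched=True)

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return accuracy.compute(predictions=np.argmax(logits, axis=-1), references=labels)

model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

args = TrainingArguments(
    output_dir="distilbert-sentiment-imdb",  # assumed name, not necessarily the original
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    eval_strategy="epoch",        # `evaluation_strategy` on older Transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # matches the best-checkpoint reloading described above
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
```

With a per-device batch size of 16 and roughly 22.5k training examples, two epochs come out to about 2,814 optimizer steps, consistent with the schedule listed above.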
## Citation

```bibtex
@article{sanh2019distilbert,
  title   = {DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter},
  author  = {Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal = {NeurIPS EMC2 Workshop},
  year    = {2019}
}

@inproceedings{maas2011learning,
  title     = {Learning Word Vectors for Sentiment Analysis},
  author    = {Andrew L. Maas and Raymond E. Daly and Peter T. Pham and Dan Huang and Andrew Y. Ng and Christopher Potts},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  year      = {2011}
}
```