How to use grojeda/distilbert-sentiment-imdb with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="grojeda/distilbert-sentiment-imdb")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("grojeda/distilbert-sentiment-imdb")
model = AutoModelForSequenceClassification.from_pretrained("grojeda/distilbert-sentiment-imdb")

grojeda/distilbert-sentiment-imdb fine-tunes distilbert-base-uncased for binary sentiment classification on the IMDB Large Movie Review dataset. Reviews are tokenized with the base model's WordPiece tokenizer, truncated to 384 tokens, and dynamically padded per batch via DataCollatorWithPadding. Training uses Hugging Face Transformers (PyTorch). The resulting weights inherit the Apache 2.0 terms of the base checkpoint and remain subject to the dataset's license constraints.
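To make the truncation and dynamic padding step concrete, here is a minimal pure-Python sketch of the idea. It is not the implementation of DataCollatorWithPadding; the function name `pad_batch` is hypothetical, the pad id of 0 is an assumption based on the distilbert-base-uncased vocabulary, and the truncation is simplified (a real tokenizer also keeps the closing special token in place).

```python
from typing import Dict, List

MAX_LENGTH = 384  # truncation length stated in the model card
PAD_ID = 0        # assumption: [PAD] has id 0 in the base vocabulary


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Truncate each sequence, then pad to the longest sequence in this batch."""
    # Simplified truncation: just cut off everything past MAX_LENGTH.
    truncated = [seq[:MAX_LENGTH] for seq in batch]
    # Dynamic padding: the target length depends on the batch, not on a global maximum.
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A batch of short reviews is padded only to its own longest member (5 tokens here).
short_batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])

# A batch containing an over-long review is capped at MAX_LENGTH.
long_batch = pad_batch([[101, 2023, 102], list(range(1000, 1500))])
```

Padding per batch rather than to a fixed length means short batches carry fewer pad tokens, which is the usual motivation for a padding collator.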
The training data is the imdb dataset from Hugging Face Datasets. The 25k-example training split was partitioned 90/10 (seed 42) into train (≈22.5k) and validation (≈2.5k) subsets. The official 25k-example test split remained untouched until evaluation.

{
  "task": {
    "type": "text-classification",
    "name": "Sentiment Analysis"
  },
  "dataset": {
    "name": "imdb",
    "type": "imdb",
    "split": "test"
  },
  "metrics": [
    {
      "type": "accuracy",
      "name": "Accuracy",
      "value": 0.9256
    },
    {
      "type": "loss",
      "name": "CrossEntropyLoss",
      "value": 0.2435
    }
  ]
}
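
The two metrics reported above can be derived from per-example logits and gold labels. The sketch below shows one way to do that in plain Python. The card does not say how the loss was aggregated, so averaging over examples is an assumption here, and the function names and toy logits are purely illustrative, not the author's evaluation code or results.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one example's class logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def evaluate(all_logits: Sequence[Sequence[float]], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, mean cross-entropy) for a set of examples."""
    correct = 0
    total_nll = 0.0
    for logits, label in zip(all_logits, labels):
        # Cross-entropy for one example is the negative log-probability of the gold class.
        total_nll -= log_softmax(logits)[label]
        # The prediction is the class with the largest logit.
        pred = max(range(len(logits)), key=lambda i: logits[i])
        correct += int(pred == label)
    n = len(labels)
    return correct / n, total_nll / n  # assumption: loss is averaged over examples


# Toy logits for three reviews (illustrative values only).
toy_logits = [[2.0, -1.0], [0.5, 1.5], [1.0, 0.0]]
toy_labels = [0, 1, 1]
accuracy, loss = evaluate(toy_logits, toy_labels)
```

A model that outputs equal logits for both classes would score a cross-entropy of ln 2 ≈ 0.693 under this definition, which gives a rough reference point for reading the reported loss.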
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_id = "grojeda/distilbert-sentiment-imdb"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
text = "The pacing drags, but the performances are heartfelt."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=384)  # match the 384-token training truncation
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
label_id = int(probs.argmax())
print(model.config.id2label[label_id], float(probs[0, label_id]))
@article{sanh2019distilbert,
  title = {DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter},
  author = {Victor Sanh and Lysandre Debut and Julien Chaumond and Thomas Wolf},
  journal = {NeurIPS EMC2 Workshop},
  year = {2019}
}
@inproceedings{maas2011learning,
  title = {Learning Word Vectors for Sentiment Analysis},
  author = {Andrew L. Maas and Raymond E. Daly and Peter T. Pham and Dan Huang and Andrew Y. Ng and Christopher Potts},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  year = {2011}
}
Base model: distilbert/distilbert-base-uncased