|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: distilbert-base-uncased |
|
|
tags: |
|
|
- sentiment-analysis |
|
|
- text-classification |
|
|
- pytorch |
|
|
- distilbert |
|
|
- fine-tuned |
|
|
datasets: |
|
|
- imdb |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-classification |
|
|
widget: |
|
|
- text: "This movie is absolutely amazing! I loved every minute of it." |
|
|
example_title: "Positive Example" |
|
|
- text: "Terrible film, complete waste of time and money." |
|
|
example_title: "Negative Example" |
|
|
--- |
|
|
|
|
|
# DistilBERT Sentiment Analysis Model |
|
|
|
|
|
This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for **2-class sentiment analysis** (Positive, Negative) on movie reviews. |
|
|
|
|
|
## 🎯 Model Description
|
|
|
|
|
- **Model Type:** Text Classification |
|
|
- **Base Architecture:** DistilBERT (Distilled BERT) |
|
|
- **Language:** English |
|
|
- **Task:** Sentiment Analysis |
|
|
- **Classes:** 2 (Negative, Positive) |
|
|
- **Parameters:** ~66M |
|
|
- **Model Size:** ~250MB |
|
|
|
|
|
## 🚀 Quick Start
|
|
|
|
|
### Using Transformers Pipeline |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Load the model |
|
|
classifier = pipeline(
    "sentiment-analysis",
    model="your-username/sentiment-analysis-distilbert",
)
|
|
|
|
|
# Single prediction |
|
|
result = classifier("This movie is fantastic!") |
|
|
print(result) |
|
|
# Output: [{'label': 'POSITIVE', 'score': 0.9987}] |
|
|
|
|
|
# Batch prediction |
|
|
texts = [
    "Amazing cinematography and great acting!",
    "Boring and predictable storyline.",
    "It was an okay movie, nothing extraordinary.",
]
|
|
results = classifier(texts) |
|
|
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (Confidence: {result['score']:.3f})")
|
|
``` |
|
|
|
|
|
### Using AutoModel and AutoTokenizer |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
import torch |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_name = "your-username/sentiment-analysis-distilbert" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForSequenceClassification.from_pretrained(model_name) |
|
|
|
|
|
# Prepare input |
|
|
text = "This movie exceeded my expectations!" |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True) |
|
|
|
|
|
# Get prediction |
|
|
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
|
|
|
|
|
# Get predicted class |
|
|
predicted_class = torch.argmax(predictions, dim=-1).item() |
|
|
confidence = predictions[0][predicted_class].item() |
|
|
|
|
|
labels = ["NEGATIVE", "POSITIVE"]
|
|
print(f"Sentiment: {labels[predicted_class]} (Confidence: {confidence:.3f})") |
|
|
``` |
|
|
|
|
|
## 📊 Training Details
|
|
|
|
|
### Dataset |
|
|
- **Source:** IMDB Movie Reviews Dataset |
|
|
- **Training Samples:** 5,000 (balanced: 2,500 per class)
|
|
- **Evaluation Samples:** 1,000 |
|
|
- **Data Split:** 80% train, 20% validation |
|
|
- **Preprocessing:** Tokenization with DistilBERT tokenizer, max length 256 |
|
|
|
|
|
### Training Configuration |
|
|
- **Base Model:** `distilbert-base-uncased` |
|
|
- **Training Framework:** PyTorch + Transformers |
|
|
- **Optimizer:** AdamW |
|
|
- **Learning Rate:** 2e-5 |
|
|
- **Batch Size:** 8 |
|
|
- **Epochs:** 3 |
|
|
- **Warmup Steps:** 100 |
|
|
- **Weight Decay:** 0.01 |
|
|
- **Max Sequence Length:** 256 tokens |
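The learning-rate schedule implied by the settings above (linear warmup to 2e-5 over 100 steps, then linear decay, as in Transformers' `get_linear_schedule_with_warmup`) can be sketched in plain Python. The total step count of 1,875 is an assumption derived from 5,000 samples / batch size 8 × 3 epochs:

```python
def linear_warmup_decay_lr(step, base_lr=2e-5, warmup_steps=100, total_steps=1875):
    """Linear warmup to base_lr, then linear decay to 0 (the Transformers default)."""
    if step < warmup_steps:
        # ramp up from 0 to base_lr over the warmup period
        return base_lr * step / warmup_steps
    # decay linearly from base_lr at warmup_steps down to 0 at total_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

The warmup phase avoids large, destabilizing updates while the freshly initialized classification head is still essentially random.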
|
|
|
|
|
### Hardware |
|
|
- **Platform:** Google Colab |
|
|
- **GPU:** Tesla T4 (15GB VRAM) |
|
|
- **Training Time:** ~30-45 minutes |
|
|
|
|
|
## 📈 Performance
|
|
|
|
|
| Metric | Score | |
|
|
|--------|-------| |
|
|
| Training Accuracy | ~95% | |
|
|
| Validation Accuracy | ~93% | |
|
|
| Training Loss | 0.12 | |
|
|
| Validation Loss | 0.18 | |
|
|
|
|
|
### Class Distribution |
|
|
- **Negative:** 50% (2,500 samples)

- **Positive:** 50% (2,500 samples)
|
|
|
|
|
## 🎯 Intended Use
|
|
|
|
|
### Primary Use Cases |
|
|
- **Movie Review Analysis:** Classify sentiment of movie reviews |
|
|
- **Product Review Sentiment:** Analyze customer feedback |
|
|
- **Social Media Monitoring:** Track sentiment in posts and comments |
|
|
- **Content Moderation:** Identify negative sentiment in user-generated content |
|
|
|
|
|
### Suitable Domains |
|
|
- Entertainment and media reviews |
|
|
- E-commerce product feedback |
|
|
- Social media posts |
|
|
- Customer service interactions |
|
|
- General English text sentiment analysis |
|
|
|
|
|
## ⚠️ Limitations and Biases
|
|
|
|
|
### Known Limitations |
|
|
- **Domain Specificity:** Primarily trained on movie reviews, may not generalize well to other domains |
|
|
- **Language:** English only, no multilingual support |
|
|
- **Context Length:** Limited to 256 tokens, longer texts are truncated |
|
|
- **Cultural Bias:** May reflect biases present in IMDB dataset |
|
|
|
|
|
### Potential Biases |
|
|
- **Genre Bias:** May perform differently across movie genres |
|
|
- **Temporal Bias:** Training data may reflect sentiment patterns from specific time periods |
|
|
- **Demographic Bias:** May not equally represent all demographic groups' sentiment expressions |
|
|
|
|
|
### Not Recommended For |
|
|
- Non-English text |
|
|
- Highly specialized domains (medical, legal, technical) |
|
|
- Real-time critical applications |
|
|
- Texts longer than 256 tokens without preprocessing |
|
|
- Sarcasm or irony detection |
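For texts longer than 256 tokens, one common workaround (not built into this model, just a sketch) is to split the token sequence into overlapping windows, classify each window separately, and average the per-window scores:

```python
def chunk_tokens(tokens, max_len=254, stride=128):
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len is 254 rather than 256 to leave room for [CLS] and [SEP].
    """
    chunks = []
    for start in range(0, max(len(tokens), 1), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
    return chunks

def aggregate_scores(chunk_scores):
    """Average per-chunk positive-class probabilities into one document score."""
    return sum(chunk_scores) / len(chunk_scores)
```

The overlap (stride smaller than the window) keeps sentences that straddle a boundary visible to at least one window; averaging is the simplest aggregation, but max-pooling the scores is a reasonable alternative for detecting strongly negative passages.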
|
|
|
|
|
## 🔧 Technical Specifications
|
|
|
|
|
### Model Architecture |
|
|
``` |
|
|
DistilBERT Base |
|
|
├── Transformer Layers: 6
├── Hidden Size: 768
├── Attention Heads: 12
├── Intermediate Size: 3072
└── Classification Head: Linear(768 → 2)
|
|
``` |
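The ~66M parameter figure can be sanity-checked from these dimensions (vocab size 30,522 and 512 position embeddings are the standard `distilbert-base-uncased` values; this back-of-the-envelope count ignores layer norms and the small classification head):

```python
vocab, pos, hidden, ffn, layers = 30522, 512, 768, 3072, 6

# word embedding table + position embeddings (DistilBERT has no token-type embeddings)
embeddings = vocab * hidden + pos * hidden

# per layer: Q/K/V/output projections plus the two feed-forward matrices, with biases
attention = 4 * (hidden * hidden + hidden)
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
per_layer = attention + feed_forward

total = embeddings + layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # roughly 66M, matching the figure above
```

Note that the embedding table alone accounts for over a third of the parameters, which is why distillation focuses on reducing the number of transformer layers.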
|
|
|
|
|
### Input Format |
|
|
- **Text Encoding:** UTF-8 |
|
|
- **Tokenization:** WordPiece |
|
|
- **Special Tokens:** [CLS], [SEP] |
|
|
- **Max Length:** 256 tokens |
|
|
- **Padding:** Right padding with [PAD] tokens |
|
|
|
|
|
### Output Format |
|
|
```python |
|
|
{
    'label': 'POSITIVE',  # One of: NEGATIVE, POSITIVE
    'score': 0.9987       # Confidence score (0-1)
}
|
|
``` |
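The `score` is a softmax over the model's two output logits. Mapping raw logits to this dictionary can be sketched in plain Python (the helper name is ours, not part of the model):

```python
import math

def logits_to_output(logits, labels=("NEGATIVE", "POSITIVE")):
    """Convert a pair of raw class logits into the pipeline-style output dict."""
    # numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    probs = [e / sum(exps) for e in exps]
    idx = probs.index(max(probs))
    return {"label": labels[idx], "score": round(probs[idx], 4)}
```

With two classes the softmax reduces to a sigmoid of the logit difference, so the score reflects how far apart the two logits are rather than their absolute values.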
|
|
|
|
|
## 📝 Citation
|
|
|
|
|
If you use this model in your research or applications, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{sentiment-analysis-distilbert,
  title={Fine-tuned DistilBERT for Sentiment Analysis},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/sentiment-analysis-distilbert}
}
|
|
``` |
|
|
|
|
|
## 📄 License
|
|
|
|
|
This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details. |
|
|
|
|
|
## 🤝 Contributing
|
|
|
|
|
Issues and pull requests are welcome! Please feel free to: |
|
|
- Report bugs or issues |
|
|
- Suggest improvements |
|
|
- Share your use cases |
|
|
- Contribute to documentation |
|
|
|
|
|
## 🙏 Acknowledgments
|
|
|
|
|
- **Hugging Face** for the Transformers library and model hosting |
|
|
- **Google Research** for the original BERT model and **Hugging Face** for DistilBERT
|
|
- **Stanford AI Lab** for the IMDB dataset |
|
|
- **Google Colab** for providing free GPU resources for training |
|
|
|
|
|
--- |
|
|
|
|
|
*This model was created as part of a sentiment analysis fine-tuning project using modern NLP techniques and best practices.* |