---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- sentiment-analysis
- text-classification
- pytorch
- distilbert
- fine-tuned
datasets:
- imdb
language:
- en
pipeline_tag: text-classification
widget:
- text: "This movie is absolutely amazing! I loved every minute of it."
example_title: "Positive Example"
- text: "Terrible film, complete waste of time and money."
example_title: "Negative Example"
---
# DistilBERT Sentiment Analysis Model
This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for **2-class sentiment analysis** (Positive, Negative) on movie reviews.
## Model Description
- **Model Type:** Text Classification
- **Base Architecture:** DistilBERT (Distilled BERT)
- **Language:** English
- **Task:** Sentiment Analysis
- **Classes:** 2 (Negative, Positive)
- **Parameters:** ~66M
- **Model Size:** ~250MB
## Quick Start
### Using Transformers Pipeline
```python
from transformers import pipeline
# Load the model
classifier = pipeline(
    "sentiment-analysis",
    model="your-username/sentiment-analysis-distilbert",
)
# Single prediction
result = classifier("This movie is fantastic!")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9987}]
# Batch prediction
texts = [
"Amazing cinematography and great acting!",
"Boring and predictable storyline.",
"It was an okay movie, nothing extraordinary."
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (Confidence: {result['score']:.3f})")
```
### Using AutoModel and AutoTokenizer
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "your-username/sentiment-analysis-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Prepare input
text = "This movie exceeded my expectations!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Get predicted class
predicted_class = torch.argmax(predictions, dim=-1).item()
confidence = predictions[0][predicted_class].item()
labels = ["NEGATIVE", "POSITIVE"]
print(f"Sentiment: {labels[predicted_class]} (Confidence: {confidence:.3f})")
```
## Training Details
### Dataset
- **Source:** IMDB Movie Reviews Dataset
- **Training Samples:** 5,000 (balanced: 2,500 per class)
- **Evaluation Samples:** 1,000
- **Data Split:** 5,000 train / 1,000 validation (≈83% / 17%)
- **Preprocessing:** Tokenization with DistilBERT tokenizer, max length 256
### Training Configuration
- **Base Model:** `distilbert-base-uncased`
- **Training Framework:** PyTorch + Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 8
- **Epochs:** 3
- **Warmup Steps:** 100
- **Weight Decay:** 0.01
- **Max Sequence Length:** 256 tokens
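With AdamW and 100 warmup steps, the schedule implied by these settings is linear warmup followed by linear decay (the Transformers `Trainer` default when no scheduler type is specified). A minimal pure-Python sketch of that shape, where `total_steps` is derived from this card's numbers (5,000 samples / batch size 8 × 3 epochs ≈ 1,875 steps) and the function name is illustrative:

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=100, total_steps=1875):
    """Learning rate under linear warmup then linear decay to zero.

    Hyperparameters mirror this model card: lr 2e-5, 100 warmup steps,
    ~1,875 total steps (5,000 samples / batch size 8 * 3 epochs).
    """
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr back down to 0 by the final step.
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

print(lr_at_step(50))    # mid-warmup
print(lr_at_step(100))   # peak learning rate
```

The short warmup (about 5% of training) helps stabilize the first AdamW updates on the freshly initialized classification head.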
### Hardware
- **Platform:** Google Colab
- **GPU:** Tesla T4 (15GB VRAM)
- **Training Time:** ~30-45 minutes
## Performance
| Metric | Score |
|--------|-------|
| Training Accuracy | ~95% |
| Validation Accuracy | ~93% |
| Training Loss | 0.12 |
| Validation Loss | 0.18 |
### Class Distribution
- **Negative:** 50% (2,500 samples)
- **Positive:** 50% (2,500 samples)
## Intended Use
### Primary Use Cases
- **Movie Review Analysis:** Classify sentiment of movie reviews
- **Product Review Sentiment:** Analyze customer feedback
- **Social Media Monitoring:** Track sentiment in posts and comments
- **Content Moderation:** Identify negative sentiment in user-generated content
### Suitable Domains
- Entertainment and media reviews
- E-commerce product feedback
- Social media posts
- Customer service interactions
- General English text sentiment analysis
## Limitations and Biases
### Known Limitations
- **Domain Specificity:** Primarily trained on movie reviews, may not generalize well to other domains
- **Language:** English only, no multilingual support
- **Context Length:** Limited to 256 tokens, longer texts are truncated
- **Cultural Bias:** May reflect biases present in IMDB dataset
### Potential Biases
- **Genre Bias:** May perform differently across movie genres
- **Temporal Bias:** Training data may reflect sentiment patterns from specific time periods
- **Demographic Bias:** May not equally represent all demographic groups' sentiment expressions
### Not Recommended For
- Non-English text
- Highly specialized domains (medical, legal, technical)
- Real-time critical applications
- Texts longer than 256 tokens without preprocessing
- Sarcasm or irony detection
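One hedged workaround for the 256-token limit is to split a long input into overlapping chunks, classify each chunk, and aggregate the scores. A minimal sketch, using whitespace-separated words as a rough proxy for WordPiece tokens (the function name and parameter values are illustrative, not part of this model):

```python
def chunk_text(text, max_words=200, overlap=50):
    """Split text into overlapping word chunks for separate classification.

    max_words=200 leaves headroom below the 256-token limit, since
    WordPiece often splits one word into several subword tokens.
    Assumes overlap < max_words so the loop always advances.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks, start = [], 0
    step = max_words - overlap
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += step
    return chunks
```

Each chunk can then be passed to the classifier and the per-chunk scores averaged (or the majority label taken) to produce one prediction for the full text.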
## Technical Specifications
### Model Architecture
```
DistilBERT Base
├── Transformer Layers: 6
├── Hidden Size: 768
├── Attention Heads: 12
├── Intermediate Size: 3072
└── Classification Head: Linear(768 → 2)
```
### Input Format
- **Text Encoding:** UTF-8
- **Tokenization:** WordPiece
- **Special Tokens:** [CLS], [SEP]
- **Max Length:** 256 tokens
- **Padding:** Right padding with [PAD] tokens
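The right-padding and truncation behavior above can be illustrated without the tokenizer itself. A minimal sketch (the token ids are illustrative, though `0` is DistilBERT's actual `[PAD]` id; the function name is an assumption, not a library API):

```python
PAD_ID = 0      # DistilBERT's [PAD] token id
MAX_LEN = 256   # this model's maximum sequence length

def pad_and_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Right-pad (or truncate) token ids to max_len.

    Returns the padded ids plus the attention mask the model uses to
    ignore [PAD] positions (1 = real token, 0 = padding).
    """
    ids = token_ids[:max_len]          # truncate anything past max_len
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    return ids + [pad_id] * pad, mask + [0] * pad
```

This is what `tokenizer(text, truncation=True, padding="max_length")` does under the hood for a single sequence.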
### Output Format
```python
{
'label': 'POSITIVE', # One of: NEGATIVE, POSITIVE
'score': 0.9987 # Confidence score (0-1)
}
```
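The `label`/`score` pair comes from a softmax over the model's two output logits followed by an argmax. A minimal pure-Python sketch of that mapping (the logit values and function name are illustrative):

```python
import math

LABELS = ["NEGATIVE", "POSITIVE"]  # index order matches the model's logits

def logits_to_output(logits):
    """Convert raw two-class logits into the pipeline-style output dict."""
    shifted = [x - max(logits) for x in logits]   # subtract max for stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    probs = [e / total for e in exps]             # softmax probabilities
    best = probs.index(max(probs))                # argmax
    return {"label": LABELS[best], "score": round(probs[best], 4)}

print(logits_to_output([-1.2, 3.4]))
```

The `score` is therefore a softmax probability, not a calibrated confidence; values near 0.5 indicate the model is close to undecided between the two classes.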
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{sentiment-analysis-distilbert,
title={Fine-tuned DistilBERT for Sentiment Analysis},
author={Your Name},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/your-username/sentiment-analysis-distilbert}
}
```
## License
This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.
## Contributing
Issues and pull requests are welcome! Please feel free to:
- Report bugs or issues
- Suggest improvements
- Share your use cases
- Contribute to documentation
## Acknowledgments
- **Hugging Face** for the Transformers library and model hosting
- **Google Research** for the original BERT and DistilBERT models
- **Stanford AI Lab** for the IMDB dataset
- **Google Colab** for providing free GPU resources for training
---
*This model was created as part of a sentiment analysis fine-tuning project using modern NLP techniques and best practices.*