---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- sentiment-analysis
- text-classification
- pytorch
- distilbert
- fine-tuned
datasets:
- imdb
language:
- en
pipeline_tag: text-classification
widget:
- text: "This movie is absolutely amazing! I loved every minute of it."
  example_title: "Positive Example"
- text: "Terrible film, complete waste of time and money."
  example_title: "Negative Example"
---

# DistilBERT Sentiment Analysis Model

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for **2-class sentiment analysis** (Positive, Negative) on movie reviews.

## 🎯 Model Description

- **Model Type:** Text Classification
- **Base Architecture:** DistilBERT (Distilled BERT)
- **Language:** English
- **Task:** Sentiment Analysis
- **Classes:** 2 (Negative, Positive)
- **Parameters:** ~66M
- **Model Size:** ~250MB

## 🚀 Quick Start

### Using Transformers Pipeline

```python
from transformers import pipeline

# Load the model
classifier = pipeline("sentiment-analysis", 
                     model="your-username/sentiment-analysis-distilbert")

# Single prediction
result = classifier("This movie is fantastic!")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9987}]

# Batch prediction
texts = [
    "Amazing cinematography and great acting!",
    "Boring and predictable storyline.",
    "It was an okay movie, nothing extraordinary."
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (Confidence: {result['score']:.3f})")
```

### Using AutoModel and AutoTokenizer

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "your-username/sentiment-analysis-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input
text = "This movie exceeded my expectations!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
# Get predicted class
predicted_class = torch.argmax(predictions, dim=-1).item()
confidence = predictions[0][predicted_class].item()

labels = ["NEGATIVE", "POSITIVE"]
print(f"Sentiment: {labels[predicted_class]} (Confidence: {confidence:.3f})")
```

## 📊 Training Details

### Dataset
- **Source:** IMDB Movie Reviews Dataset
- **Training Samples:** 5,000 (balanced: 2,500 per class)
- **Evaluation Samples:** 1,000
- **Data Split:** 80% train, 20% validation
- **Preprocessing:** Tokenization with DistilBERT tokenizer, max length 256

### Training Configuration
- **Base Model:** `distilbert-base-uncased`
- **Training Framework:** PyTorch + Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 8
- **Epochs:** 3
- **Warmup Steps:** 100
- **Weight Decay:** 0.01
- **Max Sequence Length:** 256 tokens
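
From the hyperparameters above, the optimizer-step counts can be derived as a quick sanity check (assuming no gradient accumulation):

```python
import math

# Values from the training configuration above
num_samples = 5_000
batch_size = 8
epochs = 3
warmup_steps = 100

steps_per_epoch = math.ceil(num_samples / batch_size)  # 625
total_steps = steps_per_epoch * epochs                 # 1875
warmup_fraction = warmup_steps / total_steps           # ~5.3% of training

print(steps_per_epoch, total_steps, f"{warmup_fraction:.1%}")
```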

### Hardware
- **Platform:** Google Colab
- **GPU:** Tesla T4 (15GB VRAM)
- **Training Time:** ~30-45 minutes

## 📈 Performance

| Metric | Score |
|--------|-------|
| Training Accuracy | ~95% |
| Validation Accuracy | ~93% |
| Training Loss | 0.12 |
| Validation Loss | 0.18 |

### Class Distribution
- **Negative:** 50% (2,500 samples)
- **Positive:** 50% (2,500 samples)

## 🎯 Intended Use

### Primary Use Cases
- **Movie Review Analysis:** Classify sentiment of movie reviews
- **Product Review Sentiment:** Analyze customer feedback
- **Social Media Monitoring:** Track sentiment in posts and comments
- **Content Moderation:** Identify negative sentiment in user-generated content

### Suitable Domains
- Entertainment and media reviews
- E-commerce product feedback
- Social media posts
- Customer service interactions
- General English text sentiment analysis

## ⚠️ Limitations and Biases

### Known Limitations
- **Domain Specificity:** Primarily trained on movie reviews, may not generalize well to other domains
- **Language:** English only, no multilingual support
- **Context Length:** Limited to 256 tokens, longer texts are truncated
- **Cultural Bias:** May reflect biases present in IMDB dataset

### Potential Biases
- **Genre Bias:** May perform differently across movie genres
- **Temporal Bias:** Training data may reflect sentiment patterns from specific time periods
- **Demographic Bias:** May not equally represent all demographic groups' sentiment expressions

### Not Recommended For
- Non-English text
- Highly specialized domains (medical, legal, technical)
- Real-time critical applications
- Texts longer than 256 tokens without preprocessing
- Sarcasm or irony detection
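
For texts longer than 256 tokens, one common workaround is to split the token sequence into overlapping windows, classify each window, and aggregate the scores. A minimal sketch of the windowing step (pure Python; the `window` and `stride` values are illustrative, and real use would operate on tokenizer output):

```python
def chunk_tokens(token_ids, window=256, stride=128):
    """Split a token-id list into overlapping windows of at most `window` tokens."""
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

# 600 tokens -> windows starting at offsets 0, 128, 256, 384
chunks = chunk_tokens(list(range(600)))
print([len(c) for c in chunks])  # [256, 256, 256, 216]
```

Each chunk can then be passed through the classifier and the per-chunk scores averaged (or max-pooled, depending on the application).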

## 🔧 Technical Specifications

### Model Architecture
```
DistilBERT Base
├── Transformer Layers: 6
├── Hidden Size: 768
├── Attention Heads: 12
├── Intermediate Size: 3072
└── Classification Head: Linear(768 → 2)
```

### Input Format
- **Text Encoding:** UTF-8
- **Tokenization:** WordPiece
- **Special Tokens:** [CLS], [SEP]
- **Max Length:** 256 tokens
- **Padding:** Right padding with [PAD] tokens
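
The input layout described above can be illustrated schematically (token strings are shown for readability; the real tokenizer maps them to integer ids, and `max_length` is shortened here for the example):

```python
def format_input(tokens, max_length=8, pad_token="[PAD]"):
    """Illustrate the [CLS] ... [SEP] layout with right padding described above."""
    seq = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    attention_mask = [1] * len(seq) + [0] * (max_length - len(seq))
    seq = seq + [pad_token] * (max_length - len(seq))
    return seq, attention_mask

seq, mask = format_input(["great", "movie", "!"])
print(seq)   # ['[CLS]', 'great', 'movie', '!', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```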

### Output Format
```python
{
    'label': 'POSITIVE',  # One of: NEGATIVE, POSITIVE
    'score': 0.9987       # Confidence score (0-1)
}
```
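
Under the hood, this dict is produced by applying softmax to the two raw logits and taking the argmax. A self-contained sketch of that mapping (the logit values are made up for illustration):

```python
import math

LABELS = ["NEGATIVE", "POSITIVE"]

def logits_to_prediction(logits):
    """Convert raw classifier logits to the pipeline-style label/score dict."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[idx], "score": probs[idx]}

print(logits_to_prediction([-2.1, 4.3]))  # a strongly positive example
```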

## πŸ“ Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{sentiment-analysis-distilbert,
  title={Fine-tuned DistilBERT for Sentiment Analysis},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/sentiment-analysis-distilbert}
}
```

## 📄 License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

## 🤝 Contributing

Issues and pull requests are welcome! Please feel free to:
- Report bugs or issues
- Suggest improvements
- Share your use cases
- Contribute to documentation

## πŸ™ Acknowledgments

- **Hugging Face** for the Transformers library and model hosting
- **Google Research** for the original BERT and DistilBERT models
- **Stanford AI Lab** for the IMDB dataset
- **Google Colab** for providing free GPU resources for training

---

*This model was created as part of a sentiment analysis fine-tuning project using modern NLP techniques and best practices.*