---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
- sentiment-analysis
- text-classification
- pytorch
- distilbert
- fine-tuned
datasets:
- imdb
language:
- en
pipeline_tag: text-classification
widget:
- text: "This movie is absolutely amazing! I loved every minute of it."
  example_title: "Positive Example"
- text: "Terrible film, complete waste of time and money."
  example_title: "Negative Example"
---

# DistilBERT Sentiment Analysis Model

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for **2-class sentiment analysis** (Positive, Negative) on movie reviews.

## 🎯 Model Description

- **Model Type:** Text Classification
- **Base Architecture:** DistilBERT (Distilled BERT)
- **Language:** English
- **Task:** Sentiment Analysis
- **Classes:** 2 (Negative, Positive)
- **Parameters:** ~66M
- **Model Size:** ~250MB

## 🚀 Quick Start

### Using Transformers Pipeline

```python
from transformers import pipeline

# Load the model
classifier = pipeline("sentiment-analysis", 
                     model="your-username/sentiment-analysis-distilbert")

# Single prediction
result = classifier("This movie is fantastic!")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9987}]

# Batch prediction
texts = [
    "Amazing cinematography and great acting!",
    "Boring and predictable storyline.",
    "It was an okay movie, nothing extraordinary."
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (Confidence: {result['score']:.3f})")
```

### Using AutoModel and AutoTokenizer

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "your-username/sentiment-analysis-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input
text = "This movie exceeded my expectations!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
# Get predicted class
predicted_class = torch.argmax(predictions, dim=-1).item()
confidence = predictions[0][predicted_class].item()

labels = ["NEGATIVE", "POSITIVE"]
print(f"Sentiment: {labels[predicted_class]} (Confidence: {confidence:.3f})")
```

## 📊 Training Details

### Dataset
- **Source:** IMDB Movie Reviews Dataset
- **Training Samples:** 5,000 (balanced: 2,500 per class)
- **Evaluation Samples:** 1,000
- **Data Split:** 80% train, 20% validation
- **Preprocessing:** Tokenization with DistilBERT tokenizer, max length 256

### Training Configuration
- **Base Model:** `distilbert-base-uncased`
- **Training Framework:** PyTorch + Transformers
- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 8
- **Epochs:** 3
- **Warmup Steps:** 100
- **Weight Decay:** 0.01
- **Max Sequence Length:** 256 tokens
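
From the hyperparameters above, the optimizer-step counts can be derived as a quick sanity check (assuming no gradient accumulation):

```python
import math

# Values from the training configuration above
num_samples = 5_000
batch_size = 8
epochs = 3
warmup_steps = 100

steps_per_epoch = math.ceil(num_samples / batch_size)  # 625
total_steps = steps_per_epoch * epochs                 # 1875
warmup_fraction = warmup_steps / total_steps           # ~5.3% of training

print(steps_per_epoch, total_steps, f"{warmup_fraction:.1%}")
```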

### Hardware
- **Platform:** Google Colab
- **GPU:** Tesla T4 (15GB VRAM)
- **Training Time:** ~30-45 minutes

## 📈 Performance

| Metric | Score |
|--------|-------|
| Training Accuracy | ~95% |
| Validation Accuracy | ~93% |
| Training Loss | 0.12 |
| Validation Loss | 0.18 |

### Class Distribution
- **Negative:** 50% (2,500 samples)
- **Positive:** 50% (2,500 samples)

## 🎯 Intended Use

### Primary Use Cases
- **Movie Review Analysis:** Classify sentiment of movie reviews
- **Product Review Sentiment:** Analyze customer feedback
- **Social Media Monitoring:** Track sentiment in posts and comments
- **Content Moderation:** Identify negative sentiment in user-generated content

### Suitable Domains
- Entertainment and media reviews
- E-commerce product feedback
- Social media posts
- Customer service interactions
- General English text sentiment analysis

## ⚠️ Limitations and Biases

### Known Limitations
- **Domain Specificity:** Primarily trained on movie reviews, may not generalize well to other domains
- **Language:** English only, no multilingual support
- **Context Length:** Limited to 256 tokens, longer texts are truncated
- **Cultural Bias:** May reflect biases present in IMDB dataset

### Potential Biases
- **Genre Bias:** May perform differently across movie genres
- **Temporal Bias:** Training data may reflect sentiment patterns from specific time periods
- **Demographic Bias:** May not equally represent all demographic groups' sentiment expressions

### Not Recommended For
- Non-English text
- Highly specialized domains (medical, legal, technical)
- Real-time critical applications
- Texts longer than 256 tokens without preprocessing
- Sarcasm or irony detection
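
For texts longer than 256 tokens, one common workaround is to split the token sequence into overlapping windows, classify each window, and aggregate the scores. A minimal sketch of the windowing step (pure Python; the `window` and `stride` values are illustrative, and real use would operate on tokenizer output):

```python
def chunk_tokens(token_ids, window=256, stride=128):
    """Split a token-id list into overlapping windows of at most `window` tokens."""
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
    return chunks

# 600 tokens -> windows starting at offsets 0, 128, 256, 384
chunks = chunk_tokens(list(range(600)))
print([len(c) for c in chunks])  # [256, 256, 256, 216]
```

Each chunk can then be passed through the classifier and the per-chunk scores averaged (or max-pooled, depending on the application).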

## 🔧 Technical Specifications

### Model Architecture
```
DistilBERT Base
├── Transformer Layers: 6
├── Hidden Size: 768
├── Attention Heads: 12
├── Intermediate Size: 3072
└── Classification Head: Linear(768 → 2)
```

### Input Format
- **Text Encoding:** UTF-8
- **Tokenization:** WordPiece
- **Special Tokens:** [CLS], [SEP]
- **Max Length:** 256 tokens
- **Padding:** Right padding with [PAD] tokens
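
The input layout described above can be illustrated schematically (token strings are shown for readability; the real tokenizer maps them to integer ids, and `max_length` is shortened here for the example):

```python
def format_input(tokens, max_length=8, pad_token="[PAD]"):
    """Illustrate the [CLS] ... [SEP] layout with right padding described above."""
    seq = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    attention_mask = [1] * len(seq) + [0] * (max_length - len(seq))
    seq = seq + [pad_token] * (max_length - len(seq))
    return seq, attention_mask

seq, mask = format_input(["great", "movie", "!"])
print(seq)   # ['[CLS]', 'great', 'movie', '!', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```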

### Output Format
```python
{
    'label': 'POSITIVE',  # One of: NEGATIVE, POSITIVE
    'score': 0.9987       # Confidence score (0-1)
}
```
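
Under the hood, this dict is produced by applying softmax to the two raw logits and taking the argmax. A self-contained sketch of that mapping (the logit values are made up for illustration):

```python
import math

LABELS = ["NEGATIVE", "POSITIVE"]

def logits_to_prediction(logits):
    """Convert raw classifier logits to the pipeline-style label/score dict."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[idx], "score": probs[idx]}

print(logits_to_prediction([-2.1, 4.3]))  # a strongly positive example
```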

## πŸ“ Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{sentiment-analysis-distilbert,
  title={Fine-tuned DistilBERT for Sentiment Analysis},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/sentiment-analysis-distilbert}
}
```

## 📄 License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

## 🤝 Contributing

Issues and pull requests are welcome! Please feel free to:
- Report bugs or issues
- Suggest improvements
- Share your use cases
- Contribute to documentation

## πŸ™ Acknowledgments

- **Hugging Face** for the Transformers library and model hosting
- **Google Research** for the original BERT and DistilBERT models
- **Stanford AI Lab** for the IMDB dataset
- **Google Colab** for providing free GPU resources for training

---

*This model was created as part of a sentiment analysis fine-tuning project using modern NLP techniques and best practices.*