---
license: apache-2.0
base_model: distilbert-base-uncased
tags:
  - sentiment-analysis
  - text-classification
  - pytorch
  - distilbert
  - fine-tuned
datasets:
  - imdb
language:
  - en
pipeline_tag: text-classification
widget:
  - text: "This movie is absolutely amazing! I loved every minute of it."
    example_title: Positive Example
  - text: "Terrible film, complete waste of time and money."
    example_title: Negative Example
---

# DistilBERT Sentiment Analysis Model

This model is a fine-tuned version of distilbert-base-uncased for binary sentiment analysis (Negative, Positive) on movie reviews.

## 🎯 Model Description

- Model Type: Text Classification
- Base Architecture: DistilBERT (distilled BERT)
- Language: English
- Task: Sentiment Analysis
- Classes: 2 (Negative, Positive)
- Parameters: ~66M
- Model Size: ~250MB
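
The reported ~66M parameter count is easy to verify once the weights are downloaded. A quick check (the repo id `your-username/sentiment-analysis-distilbert` is the placeholder used throughout this card):

```python
from transformers import AutoModelForSequenceClassification

# Placeholder repo id; replace with the actual model repository.
model = AutoModelForSequenceClassification.from_pretrained(
    "your-username/sentiment-analysis-distilbert"
)

# Sum the element counts of every parameter tensor (~66M for DistilBERT-base).
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```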

## 🚀 Quick Start

### Using Transformers Pipeline

```python
from transformers import pipeline

# Load the model
classifier = pipeline(
    "sentiment-analysis",
    model="your-username/sentiment-analysis-distilbert",
)

# Single prediction
result = classifier("This movie is fantastic!")
print(result)
# Output: [{'label': 'POSITIVE', 'score': 0.9987}]

# Batch prediction
texts = [
    "Amazing cinematography and great acting!",
    "Boring and predictable storyline.",
    "It was an okay movie, nothing extraordinary.",
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']} (Confidence: {result['score']:.3f})")
```
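
Continuing the snippet above: the pipeline also accepts inference options such as GPU placement and input truncation (a hedged example; kwarg forwarding to the tokenizer can vary slightly across transformers versions):

```python
# Run on the first GPU and truncate over-long inputs to the model's limit.
classifier = pipeline(
    "sentiment-analysis",
    model="your-username/sentiment-analysis-distilbert",
    device=0,  # use -1 (the default) for CPU
)
results = classifier(texts, truncation=True, max_length=256)
```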

### Using AutoModel and AutoTokenizer

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "your-username/sentiment-analysis-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare input
text = "This movie exceeded my expectations!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Get prediction
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Get predicted class
predicted_class = torch.argmax(predictions, dim=-1).item()
confidence = predictions[0][predicted_class].item()

labels = ["NEGATIVE", "POSITIVE"]
print(f"Sentiment: {labels[predicted_class]} (Confidence: {confidence:.3f})")
```
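
Rather than hardcoding the label list, you can usually read it from the model config, which Transformers populates with an `id2label` mapping by default (continuing the example above):

```python
# Map the predicted index to its label via the config instead of a literal list.
label = model.config.id2label[predicted_class]
print(f"Sentiment: {label} (Confidence: {confidence:.3f})")
```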

## 📊 Training Details

### Dataset

- Source: IMDB Movie Reviews Dataset
- Training Samples: 5,000 (balanced: 2,500 per class)
- Evaluation Samples: 1,000
- Data Split: 5,000 train / 1,000 validation
- Preprocessing: Tokenization with the DistilBERT tokenizer, max length 256

### Training Configuration

- Base Model: distilbert-base-uncased
- Training Framework: PyTorch + Transformers
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 8
- Epochs: 3
- Warmup Steps: 100
- Weight Decay: 0.01
- Max Sequence Length: 256 tokens
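
The exact training script was not published with this card; the following is a minimal sketch that reproduces the configuration above with the Hugging Face Trainer. The subsampling, seed, and use of the IMDB test split for validation are assumptions:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

# Load IMDB and subsample to the sizes reported above (seed is an assumption).
dataset = load_dataset("imdb")
train_ds = dataset["train"].shuffle(seed=42).select(range(5000))
eval_ds = dataset["test"].shuffle(seed=42).select(range(1000))

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Max sequence length of 256, as listed in the configuration.
    return tokenizer(batch["text"], truncation=True, max_length=256)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="sentiment-analysis-distilbert",
    learning_rate=2e-5,  # Trainer uses AdamW by default
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    warmup_steps=100,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```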

### Hardware

- Platform: Google Colab
- GPU: Tesla T4 (15GB VRAM)
- Training Time: ~30-45 minutes

## 📈 Performance

| Metric              | Score |
|---------------------|-------|
| Training Accuracy   | ~95%  |
| Validation Accuracy | ~93%  |
| Training Loss       | 0.12  |
| Validation Loss     | 0.18  |
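
To check these numbers against held-out data, one option is the `evaluate` library. A sketch, assuming a 1,000-example sample of the IMDB test split (the exact evaluation protocol behind the table is not specified):

```python
import evaluate
from datasets import load_dataset
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="your-username/sentiment-analysis-distilbert",
)

# Small IMDB test sample; the split and size here are assumptions.
test_ds = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))

preds = classifier(test_ds["text"], batch_size=8, truncation=True)
# IMDB encodes labels as 0 = negative, 1 = positive.
pred_ids = [1 if p["label"] == "POSITIVE" else 0 for p in preds]

accuracy = evaluate.load("accuracy")
print(accuracy.compute(predictions=pred_ids, references=test_ds["label"]))
```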

### Class Distribution

- Negative: 50% (2,500 samples)
- Positive: 50% (2,500 samples)

## 🎯 Intended Use

### Primary Use Cases

- Movie Review Analysis: Classify sentiment of movie reviews
- Product Review Sentiment: Analyze customer feedback
- Social Media Monitoring: Track sentiment in posts and comments
- Content Moderation: Identify negative sentiment in user-generated content

### Suitable Domains

- Entertainment and media reviews
- E-commerce product feedback
- Social media posts
- Customer service interactions
- General English text sentiment analysis

## ⚠️ Limitations and Biases

### Known Limitations

- Domain Specificity: Trained primarily on movie reviews; may not generalize well to other domains
- Language: English only, no multilingual support
- Context Length: Limited to 256 tokens; longer texts are truncated
- Cultural Bias: May reflect biases present in the IMDB dataset

### Potential Biases

- Genre Bias: May perform differently across movie genres
- Temporal Bias: Training data may reflect sentiment patterns from specific time periods
- Demographic Bias: May not equally represent all demographic groups' sentiment expressions

### Not Recommended For

- Non-English text
- Highly specialized domains (medical, legal, technical)
- Real-time critical applications
- Texts longer than 256 tokens without preprocessing (see the chunking sketch below)
- Sarcasm or irony detection
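
For documents beyond the 256-token limit, a common workaround is to score overlapping chunks and average the class probabilities. A minimal sketch; the stride and the mean-probability aggregation are assumptions, not part of this model's training:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "your-username/sentiment-analysis-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def classify_long_text(text, max_length=256, stride=128):
    # Split into overlapping windows so no passage is silently dropped.
    enc = tokenizer(
        text,
        truncation=True,
        max_length=max_length,
        stride=stride,
        return_overflowing_tokens=True,
        padding=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
        ).logits
    # Average class probabilities across all windows.
    probs = torch.softmax(logits, dim=-1).mean(dim=0)
    predicted = int(probs.argmax())
    return model.config.id2label[predicted], float(probs[predicted])

label, score = classify_long_text("A long, detailed review... " * 200)
print(f"Sentiment: {label} (Confidence: {score:.3f})")
```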

## 🔧 Technical Specifications

### Model Architecture

```
DistilBERT Base
├── Transformer Layers: 6
├── Hidden Size: 768
├── Attention Heads: 12
├── Intermediate Size: 3072
└── Classification Head: Linear(768 → 2)
```

### Input Format

- Text Encoding: UTF-8
- Tokenization: WordPiece
- Special Tokens: [CLS], [SEP]
- Max Length: 256 tokens
- Padding: Right padding with [PAD] tokens
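
The special-token and padding behavior is easy to inspect directly with the tokenizer (a small illustrative check):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "your-username/sentiment-analysis-distilbert"
)

# Pad a two-item batch to the longest sequence in it.
enc = tokenizer(["Great movie!", "Bad."], padding=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# ['[CLS]', 'great', 'movie', '!', '[SEP]']
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][1]))
# ['[CLS]', 'bad', '.', '[SEP]', '[PAD]']  <- right-padded to the batch length
```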

### Output Format

```python
{
    'label': 'POSITIVE',  # One of: NEGATIVE, POSITIVE
    'score': 0.9987       # Confidence score (0-1)
}
```
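
To get scores for both classes at once, recent transformers versions accept `top_k` on the text-classification pipeline (older versions used `return_all_scores=True` instead):

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="your-username/sentiment-analysis-distilbert",
)

# top_k=None returns a score for every label instead of only the best one.
results = classifier("A mixed bag of a film.", top_k=None)
print(results)
# [{'label': 'NEGATIVE', 'score': ...}, {'label': 'POSITIVE', 'score': ...}]
```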

πŸ“ Citation

If you use this model in your research or applications, please cite:

@misc{sentiment-analysis-distilbert,
  title={Fine-tuned DistilBERT for Sentiment Analysis},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/your-username/sentiment-analysis-distilbert}
}

## 📄 License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

## 🤝 Contributing

Issues and pull requests are welcome! Please feel free to:

- Report bugs or issues
- Suggest improvements
- Share your use cases
- Contribute to documentation

πŸ™ Acknowledgments

  • Hugging Face for the Transformers library and model hosting
  • Google Research for the original BERT and DistilBERT models
  • Stanford AI Lab for the IMDB dataset
  • Google Colab for providing free GPU resources for training

This model was created as part of a sentiment analysis fine-tuning project using modern NLP techniques and best practices.