---
language: en
license: mit
tags:
- text-classification
- sentiment-analysis
- distilbert
- imdb
- transformers
- mlops
datasets:
- amarshiv86/sentiment-analysis-imdb-dataset
metrics:
- accuracy
- f1
- roc_auc
base_model: distilbert-base-uncased
---

# 🎭 Sentiment Analysis — IMDB Reviews

A binary sentiment classifier fine-tuned on IMDB movie reviews, predicting **POSITIVE** or **NEGATIVE** sentiment with confidence scores.

---

## 📊 Model Performance

| Metric    | Score |
|-----------|-------|
| Accuracy  | 0.894 |
| F1 Score  | 0.893 |
| ROC-AUC   | 0.960 |
| Precision | 0.884 |
| Recall    | 0.902 |

### Confusion Matrix

![Confusion Matrix](artifacts/confusion_matrix.png)

---

## 🤖 Model Details

| Property | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Task | Binary text classification |
| Labels | `NEGATIVE` (0), `POSITIVE` (1) |
| Max token length | 256 |
| Training samples | 5,000 (IMDB subset) |
| Epochs | 2 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Framework | HuggingFace Transformers + Trainer API |
| Experiment tracking | MLflow |

---

## 🚀 How to Use

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "amarshiv86/sentiment-analysis-imdb-model"

# The weights live in the repo's model/ subfolder, so pass subfolder=
# rather than appending "/model" to the repo id (which is not a valid id).
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, subfolder="model")
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, subfolder="model")

clf = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, truncation=True)

results = clf([
    "This movie was absolutely fantastic, loved every minute!",
    "Terrible film, complete waste of time.",
])

for r in results:
    print(f"{r['label']} — {r['score']:.1%} confidence")
```

---

## 🔁 MLOps Pipeline

Automatically retrained via GitHub Actions whenever `src/` or `params.yaml` changes:

```
GitHub Push
    ↓
GitHub Actions
    ↓
prepare.py → train.py → evaluate.py
                ↓            ↓
          model files   metrics.json
                        confusion_matrix.png
    ↓
HuggingFace Hub (this repo)
```

---

## 📁 Repository Structure

```
amarshiv86/sentiment-analysis-imdb-model
├── model/
│   ├── model.safetensors        # fine-tuned weights (268 MB)
│   ├── config.json              # model architecture config
│   ├── tokenizer.json           # tokenizer vocab
│   └── tokenizer_config.json    # tokenizer settings
├── artifacts/
│   └── confusion_matrix.png     # evaluation plot
└── metrics.json                 # latest eval metrics
```

---

## 📄 Dataset

Trained on a 5,000-sample subset of the IMDB dataset.

Full processed dataset: [amarshiv86/sentiment-analysis-imdb-dataset](https://huggingface.co/datasets/amarshiv86/sentiment-analysis-imdb-dataset)

---

## 📄 License

MIT — free to use, modify, and distribute.
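### How the confidence score is computed

The confidence printed by the usage example is simply the softmax probability of the winning class over the model's two output logits. As a minimal illustration (pure standard library; `score_from_logits` and the sample logits are hypothetical, not part of this repo), the mapping from raw logits to `(label, score)` looks like:

```python
import math

# Label mapping from the model card: NEGATIVE = 0, POSITIVE = 1
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def score_from_logits(logits):
    """Convert the classifier's raw logits to (label, confidence),
    mirroring what the sentiment-analysis pipeline reports."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))
    return ID2LABEL[idx], probs[idx]

label, score = score_from_logits([-1.2, 3.4])  # hypothetical logits
print(f"{label} — {score:.1%} confidence")     # POSITIVE — 99.0% confidence
```

This is why scores cluster near 1.0 for clear-cut reviews: a logit gap of a few units already pushes the softmax probability above 0.95.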