---
language: en
license: mit
tags:
- text-classification
- sentiment-analysis
- distilbert
- imdb
- transformers
- mlops
datasets:
- amarshiv86/sentiment-analysis-imdb-dataset
metrics:
- accuracy
- f1
- roc_auc
base_model: distilbert-base-uncased
---

# 🎭 Sentiment Analysis — IMDB Reviews

A binary sentiment classifier fine-tuned on IMDB movie reviews, predicting **POSITIVE** or **NEGATIVE** sentiment with confidence scores.

---

## 📊 Model Performance

| Metric    | Score |
|-----------|-------|
| Accuracy  | 0.894 |
| F1 Score  | 0.893 |
| ROC-AUC   | 0.960 |
| Precision | 0.884 |
| Recall    | 0.902 |

### Confusion Matrix

![Confusion Matrix](artifacts/confusion_matrix.png)

---

## 🤖 Model Details

| Property | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Task | Binary text classification |
| Labels | `NEGATIVE` (0), `POSITIVE` (1) |
| Max token length | 256 |
| Training samples | 5,000 (IMDB subset) |
| Epochs | 2 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Framework | HuggingFace Transformers + Trainer API |
| Experiment tracking | MLflow |

---

## 🚀 How to Use

```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

MODEL_PATH = "amarshiv86/sentiment-analysis-imdb-model"

# The weights live in the repo's model/ subfolder, so pass subfolder=
# rather than appending "/model" to the repo id (which is not a valid id).
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, subfolder="model")
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH, subfolder="model")

clf = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, truncation=True)

results = clf([
    "This movie was absolutely fantastic, loved every minute!",
    "Terrible film, complete waste of time.",
])

for r in results:
    print(f"{r['label']} — {r['score']:.1%} confidence")
```

---

## 🔁 MLOps Pipeline

Automatically retrained via GitHub Actions whenever `src/` or `params.yaml` changes:

```
GitHub Push
    ↓
GitHub Actions
    ↓
prepare.py → train.py → evaluate.py
                ↓            ↓
          model files   metrics.json
                        confusion_matrix.png
    ↓
HuggingFace Hub (this repo)
```

---

## 📁 Repository Structure

```
amarshiv86/sentiment-analysis-imdb-model
├── model/
│   ├── model.safetensors        # fine-tuned weights (268 MB)
│   ├── config.json              # model architecture config
│   ├── tokenizer.json           # tokenizer vocab
│   └── tokenizer_config.json    # tokenizer settings
├── artifacts/
│   └── confusion_matrix.png     # evaluation plot
└── metrics.json                 # latest eval metrics
```

---

## 📄 Dataset

Trained on a 5,000-sample subset of the IMDB dataset.

Full processed dataset: [amarshiv86/sentiment-analysis-imdb-dataset](https://huggingface.co/datasets/amarshiv86/sentiment-analysis-imdb-dataset)

---

## 📄 License

MIT — free to use, modify, and distribute.
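### How the confidence score is computed

The confidence printed by the usage example is simply the softmax probability of the winning class over the model's two output logits. As a minimal illustration (pure standard library; `score_from_logits` and the sample logits are hypothetical, not part of this repo), the mapping from raw logits to `(label, score)` looks like:

```python
import math

# Label mapping from the model card: NEGATIVE = 0, POSITIVE = 1
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def score_from_logits(logits):
    """Convert the classifier's raw logits to (label, confidence),
    mirroring what the sentiment-analysis pipeline reports."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = probs.index(max(probs))
    return ID2LABEL[idx], probs[idx]

label, score = score_from_logits([-1.2, 3.4])  # hypothetical logits
print(f"{label} — {score:.1%} confidence")     # POSITIVE — 99.0% confidence
```

This is why scores cluster near 1.0 for clear-cut reviews: a logit gap of a few units already pushes the softmax probability above 0.95.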