---
language: en
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- sentiment-analysis
- distilbert
- imdb
- mlops
datasets:
- stanfordnlp/imdb
base_model: distilbert-base-uncased
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: mlops-group-sentiment
  results:
  - task:
      type: text-classification
      name: Sentiment Classification
    dataset:
      type: stanfordnlp/imdb
      name: IMDB
    metrics:
    - type: accuracy
      value: 0.90
      name: Test Accuracy
    - type: f1
      value: 0.90
      name: Test F1 (weighted)
---
# mlops-group-sentiment
A `distilbert-base-uncased` model fine-tuned on the IMDB movie reviews dataset
for binary sentiment classification (positive / negative).
This model is the final artifact of an MLOps group project at IIT Jodhpur
(Course CSL7040), demonstrating an end-to-end production ML pipeline: version
control on GitHub, GPU training on Kaggle, experiment tracking on Weights &
Biases, container packaging via Docker, and deployment to the Hugging Face Hub.
## How to Use
```python
from transformers import pipeline

# Load the fine-tuned model from the Hugging Face Hub.
classifier = pipeline("sentiment-analysis", model="pujaniitj/mlops-group-sentiment")

result = classifier("This movie was fantastic!")
print(result)
# Example output (the exact score will vary):
# [{'label': 'positive', 'score': 0.9876}]
```
## Intended Use
**Primary use case**: Classifying English-language movie reviews as positive
or negative sentiment.
**Out-of-scope uses**:
- Non-English text (the model was trained only on English IMDB reviews)
- Text from other domains, e.g. tweets, product reviews, news articles, or customer
support transcripts. Performance is expected to degrade outside the movie-review domain.
- Fine-grained sentiment (beyond binary pos/neg, e.g. 5-star ratings)
- High-stakes decisions or content moderation without human review
## Model Description
- **Base architecture**: DistilBERT (`distilbert-base-uncased`)
- **Difference from base**: Adds a fine-tuned classification head (2 output labels)
- **Parameters**: ~66 million
- **Tokenizer**: WordPiece (DistilBERT default)
- **Max sequence length**: 256 tokens
- **Labels**: `0 → negative`, `1 → positive`
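
To make the label mapping concrete, the sketch below turns a pair of raw logits into a label and a confidence score with a softmax, which is how the output of a two-class classification head is usually read. `ID2LABEL` mirrors the mapping listed above; the function name `decode_logits` and the example logits are illustrative assumptions, not values produced by this model.

```python
import math

# Mirrors the mapping stated in this model card: 0 -> negative, 1 -> positive.
ID2LABEL = {0: "negative", 1: "positive"}


def decode_logits(logits):
    """Turn raw class logits into a (label, probability) pair via softmax."""
    # Subtract the max logit for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The predicted class is the index with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits, chosen only for illustration.
label, score = decode_logits([-1.0, 2.0])
print(label, round(score, 4))
```

The `pipeline` call in "How to Use" performs the equivalent step internally and reports the winning label with its score.
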
## Training Data
- **Dataset**: [IMDB Movie Reviews](https://huggingface.co/datasets/stanfordnlp/imdb)
- **Train size**: 25,000 reviews (12,500 positive + 12,500 negative — perfectly balanced)
- **Test size**: 25,000 reviews (same balance)
- **Train/Validation split**: 90/10 of the train set, with `seed=42`
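
The split sizes implied above can be checked with a small standard-library sketch. It shuffles example indices with a fixed seed and holds out 10% for validation. This is only an illustration of a seeded 90/10 split: the helper name `split_indices` is hypothetical, and the training notebooks may instead use a library splitter such as `Dataset.train_test_split(test_size=0.1, seed=42)` from `datasets`, which would select different examples.

```python
import random


def split_indices(num_examples, val_fraction=0.1, seed=42):
    """Shuffle example indices with a fixed seed and split off a validation set."""
    indices = list(range(num_examples))
    # A dedicated Random instance keeps the shuffle reproducible and isolated.
    random.Random(seed).shuffle(indices)
    num_val = int(round(num_examples * val_fraction))
    return indices[num_val:], indices[:num_val]


train_idx, val_idx = split_indices(25_000)
print(len(train_idx), len(val_idx))  # 22500 2500
```

Under this reading, 22,500 reviews are used for training and 2,500 for validation.
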
## Training Procedure
### Hyperparameters
| Setting | Value |
|----------------------|--------|
| Learning rate | 3e-5 |
| Train batch size | 16 |
| Eval batch size | 32 |
| Epochs | 3 |
| Max sequence length | 256 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Mixed precision | fp16 |
| Seed | 42 |
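
As a rough guide to what these settings imply, the sketch below estimates the number of optimizer steps and warmup steps. It assumes 22,500 training examples (90% of the 25,000-review train set), no gradient accumulation, and step counts rounded up. The table does not say whether the batch size of 16 is global or per GPU, so both readings are shown; the helper name `schedule_summary` is hypothetical.

```python
import math


def schedule_summary(num_examples, batch_size, epochs, warmup_ratio, num_devices=1):
    """Estimate optimizer steps and warmup steps for a training run."""
    # If the batch size is per device, the effective batch grows with device count.
    effective_batch = batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Assumption: 22,500 = 90% of the 25,000-review train set.
print(schedule_summary(22_500, 16, 3, 0.1))                 # batch size read as global
print(schedule_summary(22_500, 16, 3, 0.1, num_devices=2))  # batch size read as per-GPU
```

The two readings differ by a factor of two in total steps, so the per-device interpretation matters when reproducing the learning-rate schedule.
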
### Training Environment
- **Platform**: Kaggle Notebook
- **Hardware**: 2× NVIDIA Tesla T4 GPUs
- **Training time**: ~17 minutes
### Experiment Tracking
Two configurations were trained and compared via Weights & Biases:
| Run | Learning rate | Test F1 | Test Accuracy | Test Loss |
|------|---------------|---------|---------------|-----------|
| v1 (this model) | 3e-5 | ~0.90 | ~0.90 | ~0.70 |
| v2 (discarded) | 5e-5 | ~0.91 | ~0.91 | ~0.85 |
> The values above are approximate. See the W&B experiment dashboard listed under
> Project Resources for the exact figures.
**Why v1 was selected**: Although v2 achieved a marginally higher F1 (~0.5%),
it showed clear signs of overfitting: its eval loss climbed sharply across
epochs while v1's remained more stable, which is also reflected in v2's higher
test loss (~0.85 vs. ~0.70). v1 also delivers ~25% faster inference,
making it the better choice for a production deployment.
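
One simple way to express the overfitting signal described above is to check whether a run's eval loss ends noticeably above its best value. The sketch below is a minimal illustration of that idea, not the selection procedure the project actually used; the function name `rises_after_minimum`, the tolerance, and the per-epoch losses are all hypothetical and are not taken from the W&B runs.

```python
def rises_after_minimum(eval_losses, tolerance=0.0):
    """Return True if eval loss ends above its minimum by more than `tolerance`."""
    if len(eval_losses) < 2:
        return False
    best = min(eval_losses)
    return eval_losses[-1] - best > tolerance


# Hypothetical per-epoch eval losses, NOT taken from the W&B runs.
stable_run = [0.30, 0.28, 0.29]
climbing_run = [0.29, 0.35, 0.48]

print(rises_after_minimum(stable_run, tolerance=0.05))
print(rises_after_minimum(climbing_run, tolerance=0.05))
```

A check like this only summarizes the trend; the full loss curves on the W&B dashboard remain the better basis for a decision.
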
## Evaluation Results
Evaluation on the held-out IMDB test set (25,000 reviews):
| Metric | Value |
|---------------------|-------|
| Accuracy | ~0.90 |
| F1 (weighted) | ~0.90 |
| Precision (weighted)| ~0.90 |
| Recall (weighted) | ~0.90 |
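
For readers unfamiliar with "weighted" metrics, the sketch below computes them from a two-class confusion matrix: each class's precision, recall, and F1 is averaged with a weight proportional to that class's number of true examples, the same definition as `average="weighted"` in scikit-learn. The confusion counts are hypothetical, chosen only to land near the reported accuracy, and the function name `weighted_metrics` is an assumption.

```python
def weighted_metrics(tn, fp, fn, tp):
    """Compute accuracy and support-weighted precision, recall, and F1 for two classes."""
    total = tn + fp + fn + tp
    # Per class: (correct predictions, predicted count, actual count).
    per_class = [(tn, tn + fn, tn + fp), (tp, tp + fp, tp + fn)]
    precision = recall = f1 = 0.0
    for correct, predicted, actual in per_class:
        p = correct / predicted if predicted else 0.0
        r = correct / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = actual / total  # support-based weight
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": (tn + tp) / total, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical confusion counts on a balanced 25,000-review test set.
metrics = weighted_metrics(tn=11_300, fp=1_200, fn=1_300, tp=11_200)
print(metrics)
```

Support-weighted recall always equals accuracy, and on a balanced test set the weighted precision and F1 tend to sit close to it as well, which is consistent with the four near-identical values in the table.
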
## Limitations and Biases
- **Domain**: Trained only on movie reviews. Expect degraded performance on
other domains.
- **Length**: Inputs are truncated to 256 tokens (~200 words). Longer reviews
may lose tail information that matters for sentiment.
- **Language**: English only.
- **Demographic biases**: IMDB reviewers historically skew toward certain
demographics (e.g., predominantly male, English-speaking). The model may
inherit these biases — e.g., it may misclassify reviews using vernacular or
cultural references underrepresented in IMDB.
- **Sarcasm and irony**: Like most BERT-based classifiers, the model can
struggle with sarcastic or ironic text where the surface sentiment opposes
the intended meaning.
## Project Resources
- **GitHub repository**: https://github.com/pujaniitj/mlops-group-project-iitj
- **W&B experiment dashboard**: https://wandb.ai/pujaniitj-iit-jodpur/MLops_group_8
- **Training notebook (v1)**: https://www.kaggle.com/code/pujaniitj/mlops-group-8-imdb-v1
- **Training notebook (v2)**: https://www.kaggle.com/code/pujaniitj/mlops-group-8-imdb-v2
## Acknowledgments
- **Base model**: [DistilBERT](https://huggingface.co/distilbert-base-uncased)
by Sanh et al. (Hugging Face)
- **Dataset**: [IMDB](https://huggingface.co/datasets/stanfordnlp/imdb)
by Maas et al. (Stanford NLP)
- **Training infrastructure**: [Kaggle Notebooks](https://www.kaggle.com)
- **Experiment tracking**: [Weights & Biases](https://wandb.ai)