	---
	language: en
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- text-classification
	- sentiment-analysis
	- distilbert
	- imdb
	- mlops
	datasets:
	- stanfordnlp/imdb
	base_model: distilbert-base-uncased
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	model-index:
	- name: mlops-group-sentiment
	  results:
	  - task:
	      type: text-classification
	      name: Sentiment Classification
	    dataset:
	      type: stanfordnlp/imdb
	      name: IMDB
	    metrics:
	    - type: accuracy
	      value: 0.90
	      name: Test Accuracy
	    - type: f1
	      value: 0.90
	      name: Test F1 (weighted)
	---

	# mlops-group-sentiment

	A `distilbert-base-uncased` model fine-tuned on the IMDB movie reviews dataset
	for binary sentiment classification (positive / negative).

	This model is the final artifact of an MLOps group project at IIT Jodhpur
	(Course CSL7040), demonstrating an end-to-end production ML pipeline: version
	control on GitHub, GPU training on Kaggle, experiment tracking on Weights &
	Biases, container packaging via Docker, and deployment to the Hugging Face Hub.

	## How to Use

	```python
	from transformers import pipeline

	classifier = pipeline("sentiment-analysis", model="pujaniitj/mlops-group-sentiment")
	result = classifier("This movie was fantastic!")
	print(result)
	# Example output (the exact score will vary): [{'label': 'positive', 'score': 0.9876}]
	```
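
For batches of longer reviews, it can help to truncate inputs to the 256-token limit used in training. The sketch below is one way to do that, not the project's own inference code. It assumes the text-classification pipeline forwards the `truncation` and `max_length` keyword arguments to the tokenizer; `load_classifier` and `classify_reviews` are hypothetical helper names introduced here for illustration.

```python
from typing import Callable, Dict, List

MODEL_ID = "pujaniitj/mlops-group-sentiment"  # model ID from this card
MAX_LENGTH = 256  # max sequence length stated in this card


def load_classifier():
    """Build the Hugging Face pipeline (downloads the model on first use)."""
    # Imported lazily so classify_reviews can be exercised without transformers.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID)


def classify_reviews(classifier: Callable, reviews: List[str]) -> List[Dict]:
    """Classify a batch of reviews, truncating each to MAX_LENGTH tokens."""
    # ASSUMPTION: the pipeline passes these keyword arguments to its tokenizer.
    outputs = classifier(reviews, truncation=True, max_length=MAX_LENGTH)
    return [
        {"text": text, "label": out["label"], "score": float(out["score"])}
        for text, out in zip(reviews, outputs)
    ]
```

A typical call would be `classify_reviews(load_classifier(), ["This movie was fantastic!", "A dull, overlong mess."])`. Passing the classifier in as an argument keeps the helper easy to test with a stub.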

	## Intended Use

	Primary use case: classifying English-language movie reviews as positive
	or negative.

	Out-of-scope uses:
	- Non-English text (the model was trained only on English IMDB reviews)
	- Text from other domains, e.g. tweets, product reviews, news articles, or customer
	support transcripts. Performance is likely to degrade outside the movie-review domain.
	- Fine-grained sentiment (beyond binary pos/neg, e.g. 5-star ratings)
	- High-stakes decisions or content moderation without human review

	## Model Description

	- Base architecture: DistilBERT (`distilbert-base-uncased`)
	- Difference from base: adds a fine-tuned sequence-classification head (2 output labels)
	- Parameters: ~66 million
	- Tokenizer: WordPiece (DistilBERT default)
	- Max sequence length: 256 tokens
	- Labels: `0 → negative`, `1 → positive`

	## Training Data

	- Dataset: [IMDB Movie Reviews](https://huggingface.co/datasets/stanfordnlp/imdb)
	- Train size: 25,000 reviews (12,500 positive + 12,500 negative — perfectly balanced)
	- Test size: 25,000 reviews (same balance)
	- Train/Validation split: 90/10 of the train set, with `seed=42`
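
The 90/10 split works out to 22,500 training and 2,500 validation reviews. The snippet below is a standard-library sketch of a seeded split, included only to make the arithmetic and the role of the seed concrete. It is not the project's split code (which may well use a library call such as `datasets.Dataset.train_test_split`), and it will not reproduce the same partition. `split_indices` is a hypothetical helper, and the card does not say whether the real split was stratified.

```python
import random

TRAIN_SIZE = 25_000          # IMDB train split size, from this card
VALIDATION_FRACTION = 0.10   # 90/10 train/validation split
SEED = 42


def split_indices(n_examples, validation_fraction, seed):
    """Shuffle example indices deterministically and cut off a validation slice."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run
    n_validation = round(n_examples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]


train_idx, val_idx = split_indices(TRAIN_SIZE, VALIDATION_FRACTION, SEED)
print(len(train_idx), len(val_idx))  # 22500 2500
```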

	## Training Procedure

	### Hyperparameters

	| Setting             | Value |
	|---------------------|-------|
	| Learning rate       | 3e-5  |
	| Train batch size    | 16    |
	| Eval batch size     | 32    |
	| Epochs              | 3     |
	| Max sequence length | 256   |
	| Warmup ratio        | 0.1   |
	| Weight decay        | 0.01  |
	| Optimizer           | AdamW |
	| Mixed precision     | fp16  |
	| Seed                | 42    |
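
To show how the warmup ratio translates into optimizer steps, here is a rough sketch of the schedule these settings imply. It rests on assumptions the card does not confirm: that the scheduler is linear warmup followed by linear decay (the Hugging Face `Trainer` default), that 16 is the effective global batch size (with two GPUs it could instead be per-device, which would halve the step counts), and that training used the 22,500-review portion of the train split.

```python
import math

TRAIN_EXAMPLES = 22_500   # 90% of the 25,000-review train split
BATCH_SIZE = 16           # ASSUMPTION: effective (global) batch size
EPOCHS = 3
PEAK_LR = 3e-5
WARMUP_RATIO = 0.1

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def learning_rate(step):
    """ASSUMED schedule: linear warmup to PEAK_LR, then linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)  # 1407 4221 423
```

Under these assumptions the learning rate climbs for the first 423 steps, peaks at 3e-5, and then decays to zero by step 4221.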

	### Training Environment

	- Platform: Kaggle Notebook
	- Hardware: 2× NVIDIA Tesla T4 GPUs
	- Training time: ~17 minutes

	### Experiment Tracking

	Two configurations were trained and compared via Weights & Biases:

	| Run             | Learning rate | Test F1 | Test Accuracy | Test Loss |
	|-----------------|---------------|---------|---------------|-----------|
	| v1 (this model) | 3e-5          | ~0.90   | ~0.90         | ~0.70     |
	| v2 (discarded)  | 5e-5          | ~0.91   | ~0.91         | ~0.85     |

	> The values above are approximate. The exact decimals are in the W&B run
	> summaries.

	Why v1 was selected: v2 achieved a marginally higher F1 (~0.5%), but it
	showed clear signs of overfitting: its eval loss climbed sharply across
	epochs while v1's remained more stable, and its final test loss was higher
	(~0.85 vs. ~0.70). v1 also delivers ~25% faster inference, making it the
	better choice for a production deployment.

	## Evaluation Results

	Evaluation on the held-out IMDB test set (25,000 reviews):

	| Metric               | Value |
	|----------------------|-------|
	| Accuracy             | ~0.90 |
	| F1 (weighted)        | ~0.90 |
	| Precision (weighted) | ~0.90 |
	| Recall (weighted)    | ~0.90 |
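
For readers unfamiliar with "weighted" averaging, the sketch below computes the four metrics in plain Python: each class's precision, recall, and F1 is weighted by that class's share of the true labels. This is an illustration only. The card does not say which library the project used for evaluation, `weighted_metrics` is a hypothetical helper, and the labels in the example are toy values, not model outputs.

```python
def weighted_metrics(y_true, y_pred, labels=(0, 1)):
    """Accuracy plus support-weighted precision, recall, and F1."""
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        p_c = tp / (tp + fp) if tp + fp else 0.0
        r_c = tp / (tp + fn) if tp + fn else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = (tp + fn) / total  # class support as a fraction of all examples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (0 = negative, 1 = positive).
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(weighted_metrics(toy_true, toy_pred))
```

Support-weighted recall is always equal to accuracy, and on a balanced test set with similar per-class error rates the weighted precision and F1 land close to it as well, which is consistent with all four rows showing roughly the same value.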

	## Limitations and Biases

	- Domain: Only trained on movie reviews. Expect degraded performance on
	other domains.
	- Length: Inputs are truncated to 256 tokens (~200 words). Longer reviews
	may lose tail information that matters for sentiment.
	- Language: English only.
	- Demographic biases: IMDB reviewers historically skew toward certain
	demographics (e.g., predominantly male, English-speaking). The model may
	inherit these biases — e.g., it may misclassify reviews using vernacular or
	cultural references underrepresented in IMDB.
	- Sarcasm and irony: Like most BERT-based classifiers, the model can
	struggle with sarcastic or ironic text where the surface sentiment opposes
	the intended meaning.

	## Project Resources

	- GitHub repository: https://github.com/pujaniitj/mlops-group-project-iitj
	- W&B experiment dashboard: https://wandb.ai/pujaniitj-iit-jodpur/MLops_group_8
	- Training notebook (v1): https://www.kaggle.com/code/pujaniitj/mlops-group-8-imdb-v1
	- Training notebook (v2): https://www.kaggle.com/code/pujaniitj/mlops-group-8-imdb-v2

	## Acknowledgments

	- Base model: [DistilBERT](https://huggingface.co/distilbert-base-uncased)
	by Sanh et al. (Hugging Face)
	- Dataset: [IMDB](https://huggingface.co/datasets/stanfordnlp/imdb)
	by Maas et al. (Stanford NLP)
	- Training infrastructure: [Kaggle Notebooks](https://www.kaggle.com)
	- Experiment tracking: [Weights & Biases](https://wandb.ai)