MLOPS_group-v4 — SMS Spam Classification

This repository is part of the MLOps Group 36 Project for the PGD AI Programme, IIT Jodhpur.

The project implements an end-to-end MLOps pipeline for SMS spam classification with DistilBERT. The toolchain comprises GitHub, Kaggle, Weights & Biases, Hugging Face Hub, Docker, and GitHub Actions.

Contributor

Anu Kumar
Roll Number: G25AIT2016

Project Links

Resource	Link
GitHub Repository	https://github.com/g25ait2032-prog/mlops-group36-iitj
Kaggle Notebook - G25AIT2016	https://www.kaggle.com/code/anukumarkg25ait2016/mlops-group36-data-preprocessing-g25ait2016
W&B Run - G25AIT2016	https://wandb.ai/g25ait2032-iit-jodhpur/MLOPS_Group/runs/j5fk4zll
W&B Project Dashboard	https://wandb.ai/g25ait2032-iit-jodhpur/MLOPS_Group
Hugging Face Model	https://huggingface.co/Aukrk/MLOPS_group-v4

Model Details

Item	Value
Base model	distilbert-base-uncased
Task	Binary text classification
Classes	ham, spam
Dataset	UCI SMS Spam Collection
Framework	Hugging Face Transformers
Output labels	0 = ham, 1 = spam

Contribution Summary

This repository is linked to the G25AIT2016 Task 2 workflow.

The completed contribution includes:

Loading the UCI SMS Spam dataset
Cleaning and normalising SMS text
Removing missing and duplicate records
Creating stratified train, validation, and test splits
Creating id2label.json and label2id.json
Running data sanity checks
Logging data-preparation metrics to W&B
Publishing this Hugging Face model repository for project traceability
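
The preparation steps listed above can be sketched in plain Python. This is a minimal illustration and not the notebook's actual code: the function names (clean_text, prepare, stratified_split, has_leakage), the normalisation rules (lowercasing and whitespace collapsing), the seed, and the 70/15/15 ratios are all assumptions. The ratios are only inferred from the row counts reported in this card.

```python
import random
import re
from collections import defaultdict


def clean_text(text):
    """Lowercase and collapse whitespace (assumed normalisation rules)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def prepare(records):
    """Drop missing rows, clean the text, and remove duplicate messages.

    `records` is a list of (label, text) pairs, label being "ham" or "spam".
    """
    seen, cleaned = set(), []
    for label, text in records:
        if not label or not text or not text.strip():
            continue  # missing label or empty message
        norm = clean_text(text)
        if norm in seen:
            continue  # duplicate after normalisation
        seen.add(norm)
        cleaned.append((label, norm))
    return cleaned


def stratified_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Split per class so each split keeps roughly the same class ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[0]].append(row)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        val += group[:n_val]
        test += group[n_val:n_val + n_test]
        train += group[n_val + n_test:]
    return train, val, test


def has_leakage(*splits):
    """Return True if any message text appears in more than one split."""
    seen = set()
    for split in splits:
        texts = {text for _, text in split}
        if seen & texts:
            return True
        seen |= texts
    return False
```

Because duplicates are removed before splitting, has_leakage should return False for the three splits produced by stratified_split. It is kept as a separate check so it can also be run on splits loaded from disk.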

Dataset Preparation Summary

Metric	Value
Raw samples	5,574
Duplicates removed	415
Cleaned samples	5,159
Train rows	3,611
Validation rows	774
Test rows	774
Sanity checks passed	21 / 21
Leakage check	Passed
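
The figures in this table can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary keys and helper names are hypothetical, and the checks shown are not claimed to be among the 21 sanity checks run in the notebook.

```python
# Values copied from the Dataset Preparation Summary table.
summary = {
    "raw_samples": 5574,
    "duplicates_removed": 415,
    "cleaned_samples": 5159,
    "train_rows": 3611,
    "val_rows": 774,
    "test_rows": 774,
}


def check_summary(s):
    """Return a list of problems; an empty list means the counts agree."""
    problems = []
    if s["raw_samples"] - s["duplicates_removed"] != s["cleaned_samples"]:
        problems.append("raw - removed != cleaned")
    if s["train_rows"] + s["val_rows"] + s["test_rows"] != s["cleaned_samples"]:
        problems.append("splits do not sum to cleaned")
    return problems


def split_fractions(s):
    """Share of the cleaned samples that falls into each split."""
    total = s["cleaned_samples"]
    keys = ("train_rows", "val_rows", "test_rows")
    return {k: round(s[k] / total, 3) for k in keys}
```

With the reported values, check_summary returns an empty list and the split fractions round to 0.7, 0.15, and 0.15.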

Label Mapping

{
  "0": "ham",
  "1": "spam"
}
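
A minimal sketch of how the id2label.json and label2id.json files mentioned in the contribution summary could be written and read back. The helper names and the output directory are assumptions; only the mapping itself comes from this card.

```python
import json
from pathlib import Path

id2label = {0: "ham", 1: "spam"}
label2id = {label: idx for idx, label in id2label.items()}


def write_label_maps(out_dir):
    """Write both mapping files into out_dir and return the directory path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # json.dumps turns the integer keys into strings, as in the mapping above.
    (out / "id2label.json").write_text(json.dumps(id2label, indent=2))
    (out / "label2id.json").write_text(json.dumps(label2id, indent=2))
    return out


def load_id2label(path):
    """Load id2label.json, converting the string keys back to integers."""
    raw = json.loads(Path(path).read_text())
    return {int(k): v for k, v in raw.items()}
```

JSON object keys are always strings, so the integer class ids need converting back on load if they are used to index model outputs.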

How to Use

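from transformers import pipeline

# Download the model from the Hugging Face Hub and wrap it in a
# text-classification pipeline.
classifier = pipeline(
    "text-classification",
    model="Aukrk/MLOPS_group-v4"
)

text = "Congratulations! You have won a free iPhone. Click here now."
result = classifier(text)

# result is a list with one dict per input, holding "label" and "score".
print(result)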
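
The pipeline returns a list of dicts with a label and a score. Whether the label reads ham / spam or the generic LABEL_0 / LABEL_1 depends on the id2label entry in the model's config, which this card does not confirm. The helper below is a hedged sketch that handles both cases. The function names and the 0.5 threshold are assumptions.

```python
# Mapping taken from the Label Mapping section of this card.
ID2LABEL = {0: "ham", 1: "spam"}


def normalise_prediction(result):
    """Map one pipeline result dict to a ham/spam label plus its score."""
    label = result["label"]
    if label.upper().startswith("LABEL_"):
        # Generic name such as "LABEL_1": recover the class index.
        label = ID2LABEL[int(label.split("_", 1)[1])]
    return {"label": label.lower(), "score": float(result["score"])}


def is_spam(result, threshold=0.5):
    """True if the predicted class is spam with at least `threshold` score."""
    pred = normalise_prediction(result)
    return pred["label"] == "spam" and pred["score"] >= threshold
```

For a single input string, the dict to pass in is the first element of the list returned by the pipeline, for example normalise_prediction(result[0]).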

Load Model Directly

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the classification model separately, for use
# outside the pipeline helper.
tokenizer = AutoTokenizer.from_pretrained("Aukrk/MLOPS_group-v4")
model = AutoModelForSequenceClassification.from_pretrained("Aukrk/MLOPS_group-v4")
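
When the model is loaded directly, it returns raw logits instead of labelled scores. The sketch below shows one way to turn a pair of logits into a label and a probability with a softmax. It is written in plain Python so it runs without the model. The function name is an assumption, and the comment about obtaining the logits assumes the PyTorch backend.

```python
import math

# Mapping taken from the Label Mapping section of this card.
ID2LABEL = {0: "ham", 1: "spam"}


def logits_to_prediction(logits):
    """Apply a softmax to the logits and return the most probable class.

    With PyTorch, the logits for one message can be obtained as:
        model(**tokenizer(text, return_tensors="pt")).logits[0].tolist()
    """
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[idx], "score": probs[idx]}
```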

Example Inputs

Text	Expected Output
Congratulations! You have won a free prize. Click here now.	spam
Can we meet tomorrow at 5 PM?	ham

W&B Traceability

The G25AIT2016 W&B run records data-preparation metrics such as:

Raw sample count
Duplicate removal count
Cleaned sample count
Train / validation / test split sizes
Sanity check status
Leakage check status

W&B Run: https://wandb.ai/g25ait2032-iit-jodhpur/MLOPS_Group/runs/j5fk4zll
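
A sketch of how the metrics listed above could be logged to W&B. The metric key names, function names, and job_type value are assumptions, and the actual run may use different ones. The project name is taken from the run URL, and the values come from the Dataset Preparation Summary.

```python
def build_metrics():
    """Collect the data-preparation metrics reported in this card."""
    return {
        "raw_samples": 5574,
        "duplicates_removed": 415,
        "cleaned_samples": 5159,
        "train_rows": 3611,
        "val_rows": 774,
        "test_rows": 774,
        "sanity_checks_passed": 21,
        "sanity_checks_total": 21,
        "leakage_check_passed": True,
    }


def log_data_prep(metrics, project="MLOPS_Group", entity=None):
    """Log the metrics to a W&B run (needs wandb installed and a login)."""
    import wandb  # imported here so build_metrics works without wandb

    run = wandb.init(project=project, entity=entity, job_type="data-preparation")
    run.log(metrics)
    run.finish()
```

Calling log_data_prep(build_metrics()) would create a new run in the given project. Passing entity selects the team or user namespace the run is stored under.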

Model Context

This model repository is published under the G25AIT2016 Hugging Face account for Group 36 project traceability.

The model artefact follows the Group 36 DistilBERT SMS spam classification workflow and is linked to the data-preparation contribution completed by Anu Kumar (G25AIT2016).

Limitations

The dataset is relatively small (5,159 messages after cleaning) and limited to SMS messages.
The model may not generalise well to long emails, non-English messages, or modern scam formats.
Borderline messages that mix normal conversation with promotional text may be misclassified.
This is an academic MLOps demonstration and should not be used as the only spam detection control in production.

Intended Use

This repository is intended for:

Academic MLOps demonstration
SMS spam classification testing
Hugging Face deployment evidence
W&B traceability evidence
GitHub Actions / Docker inference integration

Not Intended For

Production-grade fraud detection
Legal, financial, or safety-critical filtering
Detecting all phishing or scam variants without further validation

Authors

MLOps Group 36
PGD AI Programme, IIT Jodhpur

Contributor for this repository:
Anu Kumar - G25AIT2016
