MLOPS_group-v4 — SMS Spam Classification

This repository is part of the MLOps Group 36 Project for the PGD AI Programme, IIT Jodhpur.

The project implements an end-to-end MLOps pipeline for SMS spam classification using DistilBERT, with GitHub, Kaggle, Weights & Biases, Hugging Face Hub, Docker, and GitHub Actions.

Contributor

Anu Kumar
Roll Number: G25AIT2016

Project Links

Model Details

| Item | Value |
| --- | --- |
| Base model | distilbert-base-uncased |
| Task | Binary text classification |
| Classes | ham, spam |
| Dataset | UCI SMS Spam Collection |
| Framework | Hugging Face Transformers |
| Output labels | 0 = ham, 1 = spam |

Contribution Summary

This repository is linked to the G25AIT2016 Task 2 workflow.

The completed contribution includes:

  • Loading the UCI SMS Spam dataset
  • Cleaning and normalising SMS text
  • Removing missing and duplicate records
  • Creating stratified train, validation, and test splits
  • Creating id2label.json and label2id.json
  • Running data sanity checks
  • Logging data-preparation metrics to W&B
  • Publishing this Hugging Face model repository for project traceability
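
The cleaning, de-duplication, and stratified-split steps listed above could look roughly like the following. This is a minimal standard-library sketch, not the project's actual code: the normalisation rules (lower-casing, whitespace collapsing), the 70/15/15 fractions (chosen because they match the reported row counts), the seed, and all function names are assumptions made for illustration.

```python
import random
import re
from collections import defaultdict


def normalise(text):
    """Lower-case the message and collapse runs of whitespace (assumed rules)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean(records):
    """Drop missing or empty messages, then drop duplicates after normalisation."""
    seen, cleaned = set(), []
    for label, text in records:
        if label is None or text is None:
            continue
        text = normalise(text)
        # Duplicates are detected on the normalised text; the first copy wins.
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append((label, text))
    return cleaned


def stratified_split(records, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split each class separately so every split keeps a similar class ratio."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[0]].append(record)
    rng = random.Random(seed)  # hypothetical seed, fixed for reproducibility
    train, val, test = [], [], []
    for label in sorted(by_label):
        rows = by_label[label]
        rng.shuffle(rows)
        n_train = round(len(rows) * fractions[0])
        n_val = round(len(rows) * fractions[1])
        train += rows[:n_train]
        val += rows[n_train:n_train + n_val]
        test += rows[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

Splitting per class rather than over the whole dataset is what keeps the spam ratio comparable across the three splits, which matters here because spam is the minority class.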

Dataset Preparation Summary

| Metric | Value |
| --- | --- |
| Raw samples | 5,574 |
| Duplicates removed | 415 |
| Cleaned samples | 5,159 |
| Train rows | 3,611 |
| Validation rows | 774 |
| Test rows | 774 |
| Sanity checks passed | 21 / 21 |
| Leakage check | Passed |
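
The figures above are internally consistent, and a check of that kind is easy to automate. The sketch below is illustrative only: the 21 sanity checks actually run by the project are not listed in this card, so the check names and the `has_leakage` helper are hypothetical examples of what such checks might cover.

```python
def check_counts(raw, removed, cleaned, train, val, test):
    """Return named boolean checks over the reported dataset counts."""
    return {
        "cleaned_equals_raw_minus_removed": raw - removed == cleaned,
        "splits_sum_to_cleaned": train + val + test == cleaned,
        "all_splits_non_empty": min(train, val, test) > 0,
    }


def has_leakage(train_texts, val_texts, test_texts):
    """Return True if any message appears in more than one split."""
    train, val, test = set(train_texts), set(val_texts), set(test_texts)
    return bool(train & val or train & test or val & test)


# Counts taken from the dataset preparation summary.
checks = check_counts(raw=5574, removed=415, cleaned=5159,
                      train=3611, val=774, test=774)
for name, passed in checks.items():
    print(f"{name}: {'passed' if passed else 'FAILED'}")
```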

Label Mapping

{
  "0": "ham",
  "1": "spam"
}
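
One way the id2label.json and label2id.json files could be generated is sketched below. The file names come from the contribution summary; the `build_mappings` and `write_mappings` helpers and the output directory argument are assumptions for illustration, not the project's actual script.

```python
import json
from pathlib import Path

LABELS = ["ham", "spam"]  # index in this list is the class id


def build_mappings(labels):
    """Build both lookup directions from one ordered list of class names."""
    id2label = {str(i): name for i, name in enumerate(labels)}
    label2id = {name: i for i, name in enumerate(labels)}
    return id2label, label2id


def write_mappings(out_dir, labels=LABELS):
    """Write id2label.json and label2id.json into out_dir and return both dicts."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    id2label, label2id = build_mappings(labels)
    (out / "id2label.json").write_text(json.dumps(id2label, indent=2))
    (out / "label2id.json").write_text(json.dumps(label2id, indent=2))
    return id2label, label2id
```

Deriving both files from a single list keeps the two mappings from drifting apart. JSON object keys are always strings, which is why the ids appear as "0" and "1" in id2label.json.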

How to Use

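from transformers import pipeline

# Load the model from the Hugging Face Hub and wrap it in a
# text-classification pipeline.
classifier = pipeline(
    "text-classification",
    model="Aukrk/MLOPS_group-v4"
)

text = "Congratulations! You have won a free iPhone. Click here now."
result = classifier(text)

# The pipeline returns a list with one dict per input, each holding a label and a score.
print(result)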

Load Model Directly

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Aukrk/MLOPS_group-v4")
model = AutoModelForSequenceClassification.from_pretrained("Aukrk/MLOPS_group-v4")
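
Once the tokenizer and model are loaded, a prediction still requires a forward pass and a conversion from logits to a label. The sketch below shows one way to do that. It assumes PyTorch is installed and that the repository's config carries the ham/spam `id2label` mapping; the `decode` and `classify` helpers are hypothetical names, and the heavy libraries are imported lazily so the pure-Python helpers can be used on their own.

```python
import math

MODEL_ID = "Aukrk/MLOPS_group-v4"


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, id2label):
    """Pick the most probable class and return its label and probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


def classify(text):
    """Run one message through the model; downloads the weights on first use."""
    # Lazy imports: only needed when a real prediction is requested.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # model.config.id2label is keyed by integer class id.
    return decode(logits, model.config.id2label)
```

Calling `classify("Can we meet tomorrow at 5 PM?")` would return a dict with a `label` and a `score`. For repeated calls it would be better to load the tokenizer and model once and reuse them.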

Example Inputs

| Text | Expected Output |
| --- | --- |
| Congratulations! You have won a free prize. Click here now. | spam |
| Can we meet tomorrow at 5 PM? | ham |
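
The example inputs above can double as a small smoke test, for instance in a CI job. The sketch below is one possible shape for such a test, not part of the project: `failing_examples` is a hypothetical helper that accepts any callable mapping a message to a label, so it can be run against the real pipeline or against a stub.

```python
EXAMPLES = [
    ("Congratulations! You have won a free prize. Click here now.", "spam"),
    ("Can we meet tomorrow at 5 PM?", "ham"),
]


def failing_examples(classify, examples=EXAMPLES):
    """Return (text, expected, got) for every example the classifier gets wrong."""
    failures = []
    for text, expected in examples:
        got = classify(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures


# With the Transformers pipeline from "How to Use", and assuming it emits the
# ham/spam label names, the check could be wired up as:
#   failures = failing_examples(lambda t: classifier(t)[0]["label"])
#   assert not failures, failures
```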

W&B Traceability

The G25AIT2016 W&B run records data-preparation metrics such as:

  • Raw sample count
  • Duplicate removal count
  • Cleaned sample count
  • Train / validation / test split sizes
  • Sanity check status
  • Leakage check status

W&B Run: https://wandb.ai/g25ait2032-iit-jodhpur/MLOPS_Group/runs/j5fk4zll
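
Logging metrics like these to W&B takes only a few lines. The following is a hedged sketch rather than the code behind the run linked above: the metric key names, the `job_type` value, and the `log_to_wandb` helper are assumptions, and the project name is taken from the run URL. It requires the `wandb` package and an authenticated session.

```python
# Values copied from the dataset preparation summary; key names are hypothetical.
DATA_PREP_METRICS = {
    "raw_samples": 5574,
    "duplicates_removed": 415,
    "cleaned_samples": 5159,
    "train_rows": 3611,
    "val_rows": 774,
    "test_rows": 774,
    "sanity_checks_passed": 21,
    "sanity_checks_total": 21,
    "leakage_check_passed": True,
}


def log_to_wandb(metrics, project="MLOPS_Group"):
    """Record the data-preparation metrics in a new W&B run."""
    import wandb  # imported lazily so the metrics dict is usable without wandb

    run = wandb.init(project=project, job_type="data-preparation")
    run.log(metrics)
    run.finish()
```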

Model Context

This model repository is published under the G25AIT2016 Hugging Face account for Group 36 project traceability.

The model artefact follows the Group 36 DistilBERT SMS spam classification workflow and is linked with the data-preparation contribution completed by Anu Kumar - G25AIT2016.

Limitations

  • The dataset is relatively small and focused on SMS messages.
  • The model may not generalise well to long emails, non-English messages, or modern scam formats.
  • Boundary cases mixing normal conversation and promotional text may be misclassified.
  • This is an academic MLOps demonstration and should not be used as the only spam detection control in production.

Intended Use

This repository is intended for:

  • Academic MLOps demonstration
  • SMS spam classification testing
  • Hugging Face deployment evidence
  • W&B traceability evidence
  • GitHub Actions / Docker inference integration

Not Intended For

  • Production-grade fraud detection
  • Legal, financial, or safety-critical filtering
  • Detecting all phishing or scam variants without further validation

Authors

MLOps Group 36
PGD AI Programme, IIT Jodhpur

Contributor for this repository:
Anu Kumar - G25AIT2016
