# Urdu Spam Detection Model

## Description

This model classifies Urdu text into two labels:
- 0 → Not Spam
- 1 → Spam
It is designed for AI-powered emergency helpline systems (e.g., 1122/911) to filter prank or irrelevant calls in real time.
## Model Details

- Base Model: FacebookAI/xlm-roberta-base
- Task: Binary Text Classification
- Language: Urdu
- Framework: Hugging Face Transformers
## Training

- Dataset: hamza-amin/urdu-spam-dataset
- Epochs: 3
- Batch size: 16
- Learning rate: 2e-5
## Evaluation Results
| Metric | Score |
|---|---|
| Accuracy | 0.983 |
| F1 Score | 0.982 |
| Precision | 0.993 |
| Recall | 0.972 |
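As a quick consistency check, the reported F1 score follows directly from the precision and recall in the table:

```python
# Verify that the reported F1 matches the harmonic mean of precision and recall:
# F1 = 2 * P * R / (P + R)
precision = 0.993
recall = 0.972

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # → 0.982, matching the table
```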
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "hamza-amin/urdu-spam-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    return probs.tolist()

print(predict("یہ ایک ایمرجنسی ہے فوراً مدد کریں"))  # "This is an emergency, help immediately"
print(predict("ہیلو بس مذاق کر رہا تھا"))  # "Hello, I was just joking"
```
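`predict` returns raw probabilities of shape `[[P(not spam), P(spam)]]`. A small helper (illustrative, not part of the model's API) can map that output to the 0/1 labels described above:

```python
# Map a [[P(not spam), P(spam)]] probability list to a human-readable label.
LABELS = {0: "Not Spam", 1: "Spam"}

def to_label(probs):
    # probs is a nested list like [[0.03, 0.97]], as returned by predict()
    row = probs[0]
    idx = max(range(len(row)), key=lambda i: row[i])  # argmax without NumPy
    return LABELS[idx]

print(to_label([[0.03, 0.97]]))  # → Spam
print(to_label([[0.91, 0.09]]))  # → Not Spam
```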
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 3
- mixed_precision_training: Native AMP
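These hyperparameters correspond to a standard Hugging Face `Trainer` setup. A sketch of the matching `TrainingArguments` follows; the output directory is an assumption, and this is not the exact script used to train the model:

```python
from transformers import TrainingArguments

# Config sketch only: reproduces the hyperparameters listed above.
args = TrainingArguments(
    output_dir="urdu-spam-classifier",   # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    fp16=True,                           # Native AMP mixed precision
)
```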
## Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 0.1146 | 1.0 | 169 | 0.1139 | 0.9834 | 0.9826 | 0.9860 | 0.9792 |
| 0.0902 | 2.0 | 338 | 0.0954 | 0.9834 | 0.9825 | 0.9929 | 0.9722 |
| 0.0085 | 3.0 | 507 | 0.0972 | 0.9867 | 0.9859 | 1.0000 | 0.9722 |
## Use Cases
- Emergency call filtering
- Spam detection in voice agents
- Urdu NLP classification tasks
## Limitations

- Trained entirely on synthetic data
- May struggle with:
  - slang and mixed Urdu-English (code-switching)
  - very short inputs
  - noisy speech-to-text output
## Future Work
- Add urgency classification (multi-class)
- Train on real call data
- Optimize for real-time inference (ONNX / quantization)
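For the ONNX export mentioned above, one possible route is Hugging Face Optimum's exporter CLI (a sketch; assumes `optimum` with its ONNX extras is installed, and the output directory is arbitrary):

```shell
# Export the classifier to ONNX for faster CPU inference
# (requires: pip install "optimum[exporters]")
optimum-cli export onnx --model hamza-amin/urdu-spam-classifier onnx_model/
```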
## License

MIT