Urdu Spam Detection Model

Description

This model classifies Urdu text into:

  • 0 → Not Spam
  • 1 → Spam

It is designed for AI-powered emergency helpline systems (e.g., 1122/911) to filter prank or irrelevant calls in real time.


Model Details

  • Base Model: xlm-roberta-base
  • Task: Binary Text Classification
  • Language: Urdu
  • Framework: Hugging Face Transformers
  • Parameters: ~0.3B (F32)

Training

  • Dataset: hamza-amin/urdu-spam-dataset
  • Epochs: 3
  • Batch size: 16
  • Learning rate: 2e-5

Evaluation Results

Metric      Score
Accuracy    0.983
F1 Score    0.982
Precision   0.993
Recall      0.972
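As a quick sanity check, the reported F1 score is consistent with the precision and recall above via F1 = 2PR / (P + R). A minimal pure-Python verification (values taken from the table):

```python
# Check that the reported F1 follows from the reported precision and recall
precision = 0.993
recall = 0.972

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # → 0.982
```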

Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "hamza-amin/urdu-spam-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():  # no gradients needed at inference time
        outputs = model(**inputs)
    # Convert logits to class probabilities: [P(not spam), P(spam)]
    probs = torch.softmax(outputs.logits, dim=1)
    return probs.tolist()

print(predict("یہ ایک ایمرجنسی ہے فوراً مدد کریں"))  # "This is an emergency, help immediately"
print(predict("ہیلو بس مذاق کر رہا تھا"))  # "Hello, I was just joking"
```
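`predict` returns a probability pair `[P(not spam), P(spam)]`; mapping it to the labels above is a simple argmax. A minimal sketch (the `to_label` helper and the example probabilities are illustrative, not part of the model):

```python
LABELS = {0: "Not Spam", 1: "Spam"}

def to_label(probs):
    # Pick the index with the highest probability and map it to its label name
    idx = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[idx]

print(to_label([0.03, 0.97]))  # → Spam
print(to_label([0.91, 0.09]))  # → Not Spam
```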

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 3
  • mixed_precision_training: Native AMP
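With lr_scheduler_type set to linear, the learning rate decays from 2e-5 toward 0 over the total number of optimizer steps (507 for this run, per the training results below). A minimal sketch, assuming zero warmup steps (the card does not state a warmup value):

```python
# Linear learning-rate decay with zero warmup (warmup is an assumption;
# the card does not state it)
BASE_LR = 2e-5
TOTAL_STEPS = 507  # 3 epochs x 169 steps/epoch, from the training results

def lr_at(step):
    # Decay linearly from BASE_LR at step 0 to 0 at TOTAL_STEPS
    return BASE_LR * max(0.0, 1.0 - step / TOTAL_STEPS)

print(lr_at(0))    # → 2e-05
print(lr_at(507))  # → 0.0
```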

Training results

Training Loss   Epoch   Step   Validation Loss   Accuracy   F1       Precision   Recall
0.1146          1.0     169    0.1139            0.9834     0.9826   0.9860      0.9792
0.0902          2.0     338    0.0954            0.9834     0.9825   0.9929      0.9722
0.0085          3.0     507    0.0972            0.9867     0.9859   1.0000      0.9722

Use Cases

  • Emergency call filtering
  • Spam detection in voice agents
  • Urdu NLP classification tasks

Limitations

  • Trained on synthetic data; performance on real call transcripts is unverified

  • May struggle with:

    • slang / mixed Urdu-English
    • very short inputs
    • noisy speech-to-text output

Future Work

  • Add urgency classification (multi-class)
  • Train on real call data
  • Optimize for real-time inference (ONNX / quantization)

License

MIT
