# Urdu Spam Detection Model

## Description

This model classifies Urdu text into two labels:
- 0 → Not Spam
- 1 → Spam
It is designed for AI-powered emergency helpline systems (e.g., 1122/911) to filter prank or irrelevant calls in real time.
## Model Details

- Base Model: FacebookAI/xlm-roberta-base
- Task: Binary Text Classification
- Language: Urdu
- Framework: Hugging Face Transformers
## Training

- Dataset: hamza-amin/urdu-spam-dataset
- Epochs: 3
- Batch size: 16
- Learning rate: 2e-5
## Evaluation Results
| Metric | Score |
|---|---|
| Accuracy | 0.983 |
| F1 Score | 0.982 |
| Precision | 0.993 |
| Recall | 0.972 |
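As a quick consistency check, the reported F1 score follows directly from the precision and recall in the table:

```python
# Verify that the reported F1 matches the harmonic mean of precision and recall:
# F1 = 2 * P * R / (P + R)
precision = 0.993
recall = 0.972

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # → 0.982, matching the table
```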
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "hamza-amin/urdu-spam-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def predict(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    return probs.tolist()

print(predict("یہ ایک ایمرجنسی ہے فوراً مدد کریں"))  # "This is an emergency, help immediately"
print(predict("ہیلو بس مذاق کر رہا تھا"))  # "Hello, I was just joking"
```
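`predict` returns raw probabilities of shape `[[P(not spam), P(spam)]]`. A small helper (illustrative, not part of the model's API) can map that output to the 0/1 labels described above:

```python
# Map a [[P(not spam), P(spam)]] probability list to a human-readable label.
LABELS = {0: "Not Spam", 1: "Spam"}

def to_label(probs):
    # probs is a nested list like [[0.03, 0.97]], as returned by predict()
    row = probs[0]
    idx = max(range(len(row)), key=lambda i: row[i])  # argmax without NumPy
    return LABELS[idx]

print(to_label([[0.03, 0.97]]))  # → Spam
print(to_label([[0.91, 0.09]]))  # → Not Spam
```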
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
- lr_scheduler_type: linear
- num_epochs: 3
- mixed_precision_training: Native AMP
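These hyperparameters correspond to a standard Hugging Face `Trainer` setup. A sketch of the matching `TrainingArguments` follows; the output directory is an assumption, and this is not the exact script used to train the model:

```python
from transformers import TrainingArguments

# Config sketch only: reproduces the hyperparameters listed above.
args = TrainingArguments(
    output_dir="urdu-spam-classifier",   # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    fp16=True,                           # Native AMP mixed precision
)
```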
## Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|---|
| 0.1146 | 1.0 | 169 | 0.1139 | 0.9834 | 0.9826 | 0.9860 | 0.9792 |
| 0.0902 | 2.0 | 338 | 0.0954 | 0.9834 | 0.9825 | 0.9929 | 0.9722 |
| 0.0085 | 3.0 | 507 | 0.0972 | 0.9867 | 0.9859 | 1.0000 | 0.9722 |
## Use Cases
- Emergency call filtering
- Spam detection in voice agents
- Urdu NLP classification tasks
## Limitations

- Trained entirely on synthetic data
- May struggle with:
  - slang and mixed Urdu-English (code-switching)
  - very short inputs
  - noisy speech-to-text output
## Future Work
- Add urgency classification (multi-class)
- Train on real call data
- Optimize for real-time inference (ONNX / quantization)
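For the ONNX export mentioned above, one possible route is Hugging Face Optimum's exporter CLI (a sketch; assumes `optimum` with its ONNX extras is installed, and the output directory is arbitrary):

```shell
# Export the classifier to ONNX for faster CPU inference
# (requires: pip install "optimum[exporters]")
optimum-cli export onnx --model hamza-amin/urdu-spam-classifier onnx_model/
```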
## License

MIT