---
language:
- id
base_model:
- google/gemma-2-2b
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- indonesian
- chatbot
- security
---
# Indonesian Spam Detection Model
## Model Overview
**Indonesian Spam Detection Model** is a fine-tuned spam detection model based on the **Gemma 2 2B** architecture. It is designed to identify spam messages in Indonesian text, particularly in WhatsApp chatbot interactions, and was fine-tuned on a dataset of 40,000 spam messages collected over the course of a year.
### Labels
The model classifies text into two categories:
- **0**: Non-spam (legitimate message)
- **1**: Spam (unwanted/malicious message)
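In code, this mapping can be expressed as a small dict (the names below follow the list above; the model's own `config.id2label` may instead use generic names such as `LABEL_0`/`LABEL_1`, so check it before relying on string labels):

```python
ID2LABEL = {0: "Non-spam", 1: "Spam"}  # assumed mapping per the list above

def to_label(class_id: int) -> str:
    """Convert a predicted class index to its human-readable label."""
    return ID2LABEL[class_id]

print(to_label(1))  # Spam
```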
### Detection Capabilities
The model can effectively detect various types of spam including:
- Offensive and abusive language
- Profane content
- Gibberish text and random characters
- Suspicious links and URLs
- Promotional spam
- Fraudulent messages
## Use this Model
### Installation
First, install the required dependencies:
```bash
pip install transformers torch
```
### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "nahiar/spam-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example texts to classify
texts = [
    "Halo, bagaimana kabar Anda hari ini?",  # Non-spam
    "MENANG JUTAAN RUPIAH! Klik link ini sekarang: http://suspicious-link.com",  # Spam
    "adsfwcasdfad12345",  # Spam (gibberish)
    "Terima kasih atas informasinya",  # Non-spam
]

# Tokenize and predict
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(prediction, dim=1).item()
    confidence = torch.max(prediction, dim=1)[0].item()

    label = "Spam" if predicted_class == 1 else "Non-spam"
    print(f"Text: {text}")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print("-" * 50)
```
### Batch Processing
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def classify_spam_batch(texts, model_name="nahiar/spam-analysis"):
    """
    Classify multiple texts for spam detection.

    Args:
        texts (list): List of texts to classify
        model_name (str): Hugging Face model name

    Returns:
        list: List of predictions with confidence scores
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Tokenize all texts in one batch
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_classes = torch.argmax(predictions, dim=1)
        confidences = torch.max(predictions, dim=1)[0]

    results = []
    for i, text in enumerate(texts):
        results.append({
            'text': text,
            'is_spam': bool(predicted_classes[i].item()),
            'confidence': confidences[i].item(),
            'label': 'Spam' if predicted_classes[i].item() == 1 else 'Non-spam'
        })
    return results

# Example usage
texts = [
    "Selamat pagi, semoga harimu menyenangkan",
    "URGENT!!! Dapatkan uang 10 juta hanya dengan klik link ini",
    "Terima kasih sudah membantu kemarin"
]

results = classify_spam_batch(texts)
for result in results:
    print(f"Text: {result['text']}")
    print(f"Label: {result['label']} (Confidence: {result['confidence']:.4f})")
    print()
```
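Note that `classify_spam_batch` tokenizes and runs every text in a single forward pass, so very large inputs can exhaust memory. One simple mitigation (a sketch, not part of the published code) is to split the texts into fixed-size chunks and classify each chunk separately; the chunk size of 32 below is an illustrative choice:

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Usage sketch with the classify_spam_batch function defined above:
# all_results = []
# for batch in chunked(all_texts, 32):
#     all_results.extend(classify_spam_batch(batch))

print(chunked(["a", "b", "c", "d", "e"], 2))  # [['a', 'b'], ['c', 'd'], ['e']]
```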
## Model Performance
This model has been trained on a diverse dataset of Indonesian text messages and demonstrates strong performance in distinguishing between spam and legitimate messages across various contexts including:
- WhatsApp chatbot interactions
- SMS messages
- Social media content
- Customer service communications
## Limitations
- The model is primarily trained on Indonesian language text
- Performance may vary with very short messages (< 10 characters)
- Context-dependent spam (messages that are spam only in specific contexts) may be challenging
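For the short or context-dependent messages mentioned above, one practical safeguard is to route low-confidence predictions to manual review rather than trusting the hard label. This is a sketch of that idea; the 0.75 threshold is an illustrative assumption, not a value tuned for this model:

```python
def label_with_threshold(predicted_class, confidence, threshold=0.75):
    """Map a prediction to a label, flagging low-confidence cases.

    The default threshold of 0.75 is an illustrative choice; tune it
    on a validation set for your own deployment.
    """
    if confidence < threshold:
        return "Uncertain (manual review)"
    return "Spam" if predicted_class == 1 else "Non-spam"

print(label_with_threshold(1, 0.98))  # Spam
print(label_with_threshold(0, 0.60))  # Uncertain (manual review)
```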
## Repository
For more information about the training process and code implementation, visit:
[https://github.com/nahiar/spam-analysis](https://github.com/nahiar/spam-analysis)
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{spam-analysis-indo,
title={Indonesian Spam Detection Model},
author={Nahiar},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/nahiar/spam-analysis}
}
```