---
language:
- id
base_model:
- google/gemma-2-2b
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- indonesian
- chatbot
- security
---
# Indonesian Spam Detection Model
## Model Overview
**Indonesian Spam Detection Model** is a text classifier based on the **Gemma 2 2B** architecture, fine-tuned to identify spam messages in Indonesian text, particularly in WhatsApp chatbot interactions. It was fine-tuned on a dataset of 40,000 spam messages collected over one year.
### Labels
The model classifies text into two categories:
- **0**: Non-spam (legitimate message)
- **1**: Spam (unwanted/malicious message)
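
The integer ids above correspond to the label mapping stored with the model. As a quick sanity check, that mapping can be read directly from the model config; a minimal sketch (the exact label strings depend on how the config was saved and may default to `LABEL_0`/`LABEL_1`):

```python
from transformers import AutoConfig

# Inspect the id-to-label mapping saved with the model config.
# Note: the names shown are whatever was saved at training time,
# e.g. {0: "LABEL_0", 1: "LABEL_1"} if no custom names were set.
config = AutoConfig.from_pretrained("nahiar/spam-analysis")
print(config.id2label)
```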
### Detection Capabilities
The model can effectively detect various types of spam, including:
- Offensive and abusive language
- Profane content
- Gibberish text and random characters
- Suspicious links and URLs
- Promotional spam
- Fraudulent messages
## Use this Model
### Installation
First, install the required dependencies:
```bash
pip install transformers torch
```
### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "nahiar/spam-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example texts to classify
texts = [
    "Halo, bagaimana kabar Anda hari ini?",  # Non-spam
    "MENANG JUTAAN RUPIAH! Klik link ini sekarang: http://suspicious-link.com",  # Spam
    "adsfwcasdfad12345",  # Spam (gibberish)
    "Terima kasih atas informasinya",  # Non-spam
]

# Tokenize and predict
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(prediction, dim=1).item()
        confidence = torch.max(prediction, dim=1)[0].item()

    label = "Spam" if predicted_class == 1 else "Non-spam"
    print(f"Text: {text}")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print("-" * 50)
```
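
For simpler integrations, the model can also be loaded through the `pipeline` API; a minimal sketch (the label strings returned, e.g. `LABEL_0`/`LABEL_1` versus custom names, depend on the `id2label` mapping saved with the model):

```python
from transformers import pipeline

# Load the classifier as a text-classification pipeline
classifier = pipeline("text-classification", model="nahiar/spam-analysis")

result = classifier("MENANG JUTAAN RUPIAH! Klik link ini sekarang")
print(result)  # e.g. [{'label': 'LABEL_1', 'score': 0.98}]; label names depend on the saved config
```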
### Batch Processing
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def classify_spam_batch(texts, model_name="nahiar/spam-analysis"):
    """
    Classify multiple texts for spam detection.

    Args:
        texts (list): List of texts to classify
        model_name (str): Hugging Face model name

    Returns:
        list: List of predictions with confidence scores
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Tokenize all texts
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_classes = torch.argmax(predictions, dim=1)
        confidences = torch.max(predictions, dim=1)[0]

    results = []
    for i, text in enumerate(texts):
        results.append({
            'text': text,
            'is_spam': bool(predicted_classes[i].item()),
            'confidence': confidences[i].item(),
            'label': 'Spam' if predicted_classes[i].item() == 1 else 'Non-spam'
        })
    return results

# Example usage
texts = [
    "Selamat pagi, semoga harimu menyenangkan",
    "URGENT!!! Dapatkan uang 10 juta hanya dengan klik link ini",
    "Terima kasih sudah membantu kemarin"
]

results = classify_spam_batch(texts)
for result in results:
    print(f"Text: {result['text']}")
    print(f"Label: {result['label']} (Confidence: {result['confidence']:.4f})")
    print()
```
## Model Performance
This model has been trained on a diverse dataset of Indonesian text messages and demonstrates strong performance in distinguishing between spam and legitimate messages across various contexts, including:
- WhatsApp chatbot interactions
- SMS messages
- Social media content
- Customer service communications
## Limitations
- The model is primarily trained on Indonesian language text
- Performance may vary with very short messages (< 10 characters)
- Context-dependent spam (messages that are spam only in specific contexts) may be challenging
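
Given the short-message limitation above, it can help to skip or flag classification for very short inputs rather than trusting the prediction. A minimal sketch of such a guard (the `classify_fn` callable and the 10-character threshold are illustrative, not part of the model):

```python
def classify_with_guard(text, classify_fn, min_length=10):
    """Skip classification for very short texts, where predictions are less reliable.

    `classify_fn` is any callable mapping a string to a prediction dict
    (for example, a wrapper around the Quick Start code above).
    """
    if len(text.strip()) < min_length:
        # Too short to classify reliably; defer to a default policy instead
        return {"text": text, "label": "Unclassified (too short)", "confidence": None}
    return classify_fn(text)
```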
## Repository
For more information about the training process and code implementation, visit:
[https://github.com/nahiar/spam-analysis](https://github.com/nahiar/spam-analysis)
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{spam-analysis-indo,
  title={Indonesian Spam Detection Model},
  author={Nahiar},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/nahiar/spam-analysis}
}
```