---
language:
- id
base_model:
- google/gemma-2-2b
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- indonesian
- chatbot
- security
---

# Indonesian Spam Detection Model

## Model Overview

**Indonesian Spam Detection Model** is a fine-tuned spam detection model based on the **Gemma 2 2B** architecture. This model is specifically designed for identifying spam messages in Indonesian text, particularly for WhatsApp chatbot interactions. It has been fine-tuned using a comprehensive dataset of 40,000 spam messages collected over a year.

### Labels

The model classifies text into two categories:

- **0**: Non-spam (legitimate message)
- **1**: Spam (unwanted/malicious message)

### Detection Capabilities

The model can effectively detect various types of spam including:

- Offensive and abusive language
- Profane content
- Gibberish text and random characters
- Suspicious links and URLs
- Promotional spam
- Fraudulent messages

## Use this Model

### Installation

First, install the required dependencies:

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "nahiar/spam-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example texts to classify
texts = [
    "Halo, bagaimana kabar Anda hari ini?",  # Non-spam
    "MENANG JUTAAN RUPIAH! Klik link ini sekarang: http://suspicious-link.com",  # Spam
    "adsfwcasdfad12345",  # Spam (gibberish)
    "Terima kasih atas informasinya"  # Non-spam
]

# Tokenize and predict
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_class = torch.argmax(prediction, dim=1).item()
        confidence = torch.max(prediction, dim=1)[0].item()

    label = "Spam" if predicted_class == 1 else "Non-spam"
    print(f"Text: {text}")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print("-" * 50)
```

### Batch Processing

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def classify_spam_batch(texts, model_name="nahiar/spam-analysis"):
    """
    Classify multiple texts for spam detection

    Args:
        texts (list): List of texts to classify
        model_name (str): Hugging Face model name

    Returns:
        list: List of predictions with confidence scores
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Tokenize all texts
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        predicted_classes = torch.argmax(predictions, dim=1)
        confidences = torch.max(predictions, dim=1)[0]

    results = []
    for i, text in enumerate(texts):
        results.append({
            'text': text,
            'is_spam': bool(predicted_classes[i].item()),
            'confidence': confidences[i].item(),
            'label': 'Spam' if predicted_classes[i].item() == 1 else 'Non-spam'
        })

    return results

# Example usage
texts = [
    "Selamat pagi, semoga harimu menyenangkan",
    "URGENT!!! Dapatkan uang 10 juta hanya dengan klik link ini",
    "Terima kasih sudah membantu kemarin"
]

results = classify_spam_batch(texts)
for result in results:
    print(f"Text: {result['text']}")
    print(f"Label: {result['label']} (Confidence: {result['confidence']:.4f})")
    print()
```

## Model Performance

This model has been trained on a diverse dataset of Indonesian text messages and demonstrates strong performance in distinguishing between spam and legitimate messages across various contexts including:

- WhatsApp chatbot interactions
- SMS messages
- Social media content
- Customer service communications

## Limitations

- The model is primarily trained on Indonesian language text
- Performance may vary with very short messages (< 10 characters)
- Context-dependent spam (messages that are spam only in specific contexts) may be challenging

## Repository

For more information about the training process and code implementation, visit: [https://github.com/nahiar/spam-analysis](https://github.com/nahiar/spam-analysis)

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{spam-analysis-indo,
  title={Indonesian Spam Detection Model},
  author={Nahiar},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/nahiar/spam-analysis}
}
```
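
## Handling Low-Confidence Predictions

Because very short and context-dependent messages are listed as weak spots, it may help to send uncertain predictions to manual review instead of acting on them automatically. The following is a minimal sketch of that idea, not part of the published model or its repository. The helper names `softmax` and `decide`, the `min_confidence` value of 0.80, and the `"Review"` outcome are all illustrative assumptions; only the label mapping (0 = non-spam, 1 = spam) and the 10-character figure come from this model card. It uses plain Python so it can be read independently of the inference code.

```python
import math

# Label ids as documented in this model card: 0 = non-spam, 1 = spam.
LABELS = {0: "Non-spam", 1: "Spam"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(text, logits, min_confidence=0.80, min_chars=10):
    """Map raw two-class logits to a label, or to "Review" when unsure.

    min_confidence is an illustrative assumption, not a value published
    with the model; min_chars mirrors the short-message limitation.
    """
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[predicted]
    # Do not auto-decide on very short messages or low-confidence outputs.
    if len(text.strip()) < min_chars or confidence < min_confidence:
        return "Review", confidence
    return LABELS[predicted], confidence

# Made-up logits for illustration only; with the real model, pass
# outputs.logits[i].tolist() for the i-th message instead.
print(decide("Terima kasih atas informasinya", [3.2, -1.1]))
print(decide("ok", [2.5, -2.0]))
print(decide("Promo spesial hari ini saja", [0.1, 0.3]))
```

The threshold trades automation for safety: raising `min_confidence` sends more messages to review, while lowering it lets more borderline predictions through. A suitable value would need to be tuned on held-out data for the target channel.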