---
language:
- id
base_model:
- google/gemma-2-2b
pipeline_tag: text-classification
library_name: transformers
tags:
- spam-detection
- text-classification
- indonesian
- chatbot
- security
---
# Indonesian Spam Detection Model

## Model Overview

The **Indonesian Spam Detection Model** is a spam classifier fine-tuned from the **Gemma 2 2B** architecture. It is designed to identify spam in Indonesian-language text, particularly in WhatsApp chatbot interactions, and was fine-tuned on a dataset of 40,000 spam messages collected over the course of a year.
### Labels

The model classifies text into two categories:

- **0**: Non-spam (legitimate message)
- **1**: Spam (unwanted/malicious message)
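When integrating the model, the authoritative index-to-name mapping lives in `model.config.id2label` on the loaded model; the helper below is a minimal sketch that mirrors the 0/1 scheme above, assuming the repository's config follows it:

```python
# Hypothetical mapping mirroring the 0/1 scheme above; verify against
# model.config.id2label after loading the model.
ID2LABEL = {0: "Non-spam", 1: "Spam"}

def to_label(predicted_class: int) -> str:
    """Map a raw class index to its human-readable label."""
    return ID2LABEL[predicted_class]
```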
### Detection Capabilities

The model can detect various types of spam, including:

- Offensive and abusive language
- Profane content
- Gibberish text and random characters
- Suspicious links and URLs
- Promotional spam
- Fraudulent messages
## Use this Model

### Installation

First, install the required dependencies:

```bash
pip install transformers torch
```

### Quick Start
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "nahiar/spam-analysis"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example texts to classify
texts = [
    "Halo, bagaimana kabar Anda hari ini?",  # Non-spam
    "MENANG JUTAAN RUPIAH! Klik link ini sekarang: http://suspicious-link.com",  # Spam
    "adsfwcasdfad12345",  # Spam (gibberish)
    "Terima kasih atas informasinya",  # Non-spam
]

# Tokenize and predict
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    prediction = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(prediction, dim=1).item()
    confidence = torch.max(prediction, dim=1)[0].item()

    label = "Spam" if predicted_class == 1 else "Non-spam"
    print(f"Text: {text}")
    print(f"Prediction: {label} (confidence: {confidence:.4f})")
    print("-" * 50)
```
### Batch Processing

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

def classify_spam_batch(texts, model_name="nahiar/spam-analysis"):
    """
    Classify multiple texts for spam detection.

    Args:
        texts (list): List of texts to classify
        model_name (str): Hugging Face model name

    Returns:
        list: List of predictions with confidence scores
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Tokenize all texts in one batch
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_classes = torch.argmax(predictions, dim=1)
    confidences = torch.max(predictions, dim=1)[0]

    results = []
    for i, text in enumerate(texts):
        results.append({
            "text": text,
            "is_spam": bool(predicted_classes[i].item()),
            "confidence": confidences[i].item(),
            "label": "Spam" if predicted_classes[i].item() == 1 else "Non-spam",
        })
    return results

# Example usage
texts = [
    "Selamat pagi, semoga harimu menyenangkan",
    "URGENT!!! Dapatkan uang 10 juta hanya dengan klik link ini",
    "Terima kasih sudah membantu kemarin",
]

results = classify_spam_batch(texts)
for result in results:
    print(f"Text: {result['text']}")
    print(f"Label: {result['label']} (Confidence: {result['confidence']:.4f})")
    print()
```
## Model Performance

The model was trained on a diverse dataset of Indonesian text messages and distinguishes spam from legitimate messages across a range of contexts, including:

- WhatsApp chatbot interactions
- SMS messages
- Social media content
- Customer service communications
## Limitations

- The model is primarily trained on Indonesian-language text
- Performance may vary on very short messages (< 10 characters)
- Context-dependent spam (messages that are spam only in specific contexts) may be challenging
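Given the short-message caveat above, a caller may want to skip or flag inputs below a length threshold before invoking the model. A minimal sketch, using the 10-character cutoff from the limitation above (the threshold and function name are illustrative, not part of the model):

```python
MIN_CHARS = 10  # cutoff suggested by the short-message limitation above

def should_classify(text: str) -> bool:
    """Return True when the message is long enough for a reliable prediction."""
    return len(text.strip()) >= MIN_CHARS
```

Messages that fail this check can be routed to a default label or a human reviewer instead of the model.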
## Repository

For more information about the training process and code implementation, visit:
[https://github.com/nahiar/spam-analysis](https://github.com/nahiar/spam-analysis)
## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{spam-analysis-indo,
  title={Indonesian Spam Detection Model},
  author={Nahiar},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/nahiar/spam-analysis}
}
```