---
language: en
license: mit
tags:
- text-classification
- bot-detection
- social-media
- distilroberta
- pytorch
- transformers
datasets:
- custom
widget:
- text: "🔥 AMAZING DEAL! Get 90% OFF now! Limited time only! Click here: bit.ly/deal123"
  example_title: "Promotional Bot Text"
- text: "Just finished reading an interesting article about machine learning applications in healthcare."
  example_title: "Human-like Text"
- text: "Follow for follow? Like my posts and I'll like yours back! 💯"
  example_title: "Social Media Bot"
- text: "Had a wonderful dinner with my family tonight. These moments are precious."
  example_title: "Authentic Human Text"
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: distilroberta-bot-detection
  results:
  - task:
      type: text-classification
      name: Bot Detection
    metrics:
    - type: accuracy
      value: 0.9423
      name: Test Accuracy
    - type: f1
      value: 0.9424
      name: Test F1-Score (Weighted)
    - type: precision
      value: 0.9428
      name: Test Precision (Weighted)
    - type: recall
      value: 0.9423
      name: Test Recall (Weighted)
---

# Bot Detection Model - DistilRoBERTa

## Model Description

This is a DistilRoBERTa-base model fine-tuned for binary classification of social media text as either human-authored or bot-generated. Training used class weights to compensate for dataset imbalance, and the model was validated with 5-fold cross-validation.
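
The class-weighted training mentioned above can be illustrated with a short sketch. The card does not say how the weights were derived, so the inverse-frequency ("balanced") heuristic, the function name `balanced_class_weights`, and the toy 3:1 label split below are all assumptions made for illustration, not the authors' code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: this "balanced" heuristic is illustrative only; the model
    card does not state which weighting scheme was actually used.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy label set: 0 = human, 1 = bot (hypothetical 3:1 imbalance)
labels = [0] * 75 + [1] * 25
weights = balanced_class_weights(labels)
print(weights)  # the minority (bot) class receives the larger weight
```

In PyTorch, weights like these are typically converted to a tensor and passed to `torch.nn.CrossEntropyLoss(weight=...)`, so that mistakes on the rarer class contribute more to the loss.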

## Performance

### Cross-Validation Results (5-Fold)

| Metric | Mean ± Std | Range |
|--------|------------|-------|
| **Accuracy** | 0.9433 ± 0.0052 | 0.9385 - 0.9497 |
| **F1-Score (Weighted)** | 0.9434 ± 0.0051 | 0.9387 - 0.9497 |
| **Precision (Weighted)** | 0.9444 ± 0.0045 | 0.9397 - 0.9498 |

### Test Set Performance

- **Accuracy**: 0.9423
- **F1-Score (Weighted)**: 0.9424
- **Precision (Weighted)**: 0.9428
- **Recall (Weighted)**: 0.9423
- **Inference Speed**: 232.83 samples/second

## Usage

### Quick Start

```python
import re

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "junaid1993/distilroberta-bot-detection"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)


def preprocess_text(text):
    """Clean text using the same steps applied to the training data."""
    if not isinstance(text, str):
        return ""

    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove @ and # symbols
    text = re.sub(r'[@#]', '', text)
    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text.lower()


def predict_bot(text, threshold=0.5):
    """Predict whether text is bot-generated."""
    clean_text = preprocess_text(text)

    if not clean_text:
        # Nothing left after cleaning: return a neutral result
        # with the same keys as a normal prediction
        return {
            "prediction": "unknown",
            "bot_probability": 0.5,
            "human_probability": 0.5
        }

    inputs = tokenizer(
        clean_text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # Class index 1 is "bot", class index 0 is "human"
    bot_prob = probabilities[0][1].item()
    prediction = "bot" if bot_prob > threshold else "human"

    return {
        "prediction": prediction,
        "bot_probability": round(bot_prob, 4),
        "human_probability": round(probabilities[0][0].item(), 4)
    }


# Example usage
text = "🔥 AMAZING DEAL! Click here now!"
result = predict_bot(text)
print(f"Prediction: {result['prediction']} (Bot: {result['bot_probability']})")
```

## Training Details

### Model Architecture

- **Base Model**: distilroberta-base
- **Task**: Binary sequence classification
- **Classes**: Human (0) vs Bot (1)
- **Parameters**: ~82M

### Training Configuration

- **Epochs**: 10 (with early stopping)
- **Batch Size**: 2 per device with 8 gradient accumulation steps (effective batch size of 16 per device)
- **Learning Rate**: Automatic (AdamW optimizer)
- **Weight Decay**: 0.01
- **Mixed Precision**: FP16
- **Class Weighting**: Applied to handle dataset imbalance

### Data Preprocessing

1. URL removal
2. Special character cleaning (@ symbols, hashtags)
3. Punctuation removal
4. Number removal
5. Whitespace normalization
6. Lowercase conversion

### Validation Methodology

- **Cross-Validation**: 5-fold Stratified K-Fold
- **Test Split**: 20% holdout set
- **Metrics**: Accuracy, Precision, Recall, F1-score (both weighted and macro)

## Limitations

- **Domain**: Trained primarily on social media text; performance on other kinds of text is untested
- **Language**: English text only
- **Temporal**: Bot patterns may evolve over time, requiring retraining
- **Context**: Performance may vary with text length and complexity

## Intended Use

This model is designed for:

- Social media content moderation
- Academic research on bot detection
- Content analysis and verification

## Ethical Considerations

- This model should be used responsibly and must not be used for harassment
- Results should be interpreted with appropriate confidence thresholds
- Human oversight is recommended for critical decisions
- Regular model updates may be needed as bot techniques evolve

## Citation

```bibtex
@misc{distilroberta-bot-detection-2024,
  title={Bot Detection Model using DistilRoBERTa},
  author={Junaid Ahmed and Dariusz Jemielniak and Leon Ciechanowski},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/junaid1993/distilroberta-bot-detection}
}
```

## License

MIT License

---

**Model Card Created**: 2025-08-23

**Framework**: PyTorch + Transformers

**Validation**: 5-Fold Cross-Validation with Class Weighting
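
The 5-fold stratified cross-validation noted above can be sketched in plain Python. This is a simplified stand-in for a library splitter such as scikit-learn's `StratifiedKFold`; the function name `stratified_kfold_indices`, the seed, and the toy labels are hypothetical, and the card does not show the splitting code that was actually used.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, k=5, seed=42):
    """Split sample indices into k folds that preserve class proportions.

    Assumption: illustrative only; not the authors' implementation.
    """
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Toy data: 0 = human, 1 = bot (hypothetical 3:1 imbalance)
labels = [0] * 75 + [1] * 25
folds = stratified_kfold_indices(labels, k=5)

for i, val_idx in enumerate(folds):
    bots = sum(labels[j] for j in val_idx)
    print(f"fold {i}: {len(val_idx)} validation samples, {bots} bots")
```

Each fold serves once as the validation set while the remaining four are used for training, so every fold sees the same human-to-bot ratio as the full dataset.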