---
language: "en"
license: "apache-2.0"
library_name: "scikit-learn"
tags:
  - "bot-detection"
  - "twitter"
  - "classification"
  - "scikit-learn"
  - "random-forest"
---

# Twitter Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.

- **Model Version:** v2
- **Training Date:** 2025-11-27 12:08:54
- **Framework:** scikit-learn 1.5.2
- **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning

---

## 📊 Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.8771 (87.71%) |
| **Precision**         | 0.8595 (85.95%) |
| **Recall**            | 0.7558 (75.58%) |
| **F1-Score**          | 0.8043 (80.43%) |
| **ROC-AUC**           | 0.9354 (93.54%) |
| **Average Precision** | 0.9008 (90.08%) |

Precision, recall, and F1 refer to the bot class (1).

### Model Improvement

- **Baseline ROC-AUC:** 0.9314
- **Tuned ROC-AUC:** 0.9354
- **Improvement:** 0.0040 absolute (0.43% relative to the baseline)

---

## 🗂️ Files

| File                           | Description                            |
| ------------------------------ | -------------------------------------- |
| `twitter_bot_detection_v2.pkl` | Trained Random Forest model            |
| `twitter_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `twitter_features_v2.json`     | List of features used by the model     |
| `twitter_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                      | All visualization plots (13 images)    |
| `README.md`                    | This file                              |

---

## 🎯 Dataset Information

### Training Configuration

- **Training Samples:** 29,951
- **Test Samples:** 7,487
- **Total Samples:** 37,438
- **Number of Features:** 12
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

**Training Set:**

- Human (0): 20,028 (66.87%)
- Bot (1): 9,923 (33.13%)

**Test Set:**

- Human (0): 4,985 (66.58%)
- Bot (1): 2,502 (33.42%)

---

## 🔧 Features (12)

1. `has_custom_cover_image`
2. `description_length`
3. `favourites_count`
4. `followers_count`
5. `friends_count`
6. `followers_to_friends_ratio`
7. `has_location`
8. `username_digit_count`
9. `username_length`
10. `statuses_count`
11. `is_verified`
12. `account_age_days`

---

## 🏆 Top 5 Most Important Features

1. **followers_count** - 0.1895
2. **favourites_count** - 0.1813
3. **friends_count** - 0.1494
4. **statuses_count** - 0.1244
5. **account_age_days** - 0.1010

---

## ⚙️ Hyperparameters

### Best Parameters (from GridSearchCV)

- **class_weight:** balanced
- **max_depth:** 20
- **max_features:** sqrt
- **min_samples_leaf:** 1
- **min_samples_split:** 2
- **n_estimators:** 300

### Parameter Search Space

- **n_estimators:** [100, 200, 300]
- **max_depth:** [10, 15, 20, None]
- **min_samples_split:** [2, 5, 10]
- **min_samples_leaf:** [1, 2, 4]
- **max_features:** ['sqrt', 'log2']
- **bootstrap:** [True, False]

**Total combinations tested:** 540

---

## 📈 Cross-Validation Results

### Mean Scores (5-Fold Stratified CV)

- **Accuracy:** 0.8750 (±0.0053)
- **Precision:** 0.8658 (±0.0089)
- **Recall:** 0.7368 (±0.0113)
- **F1-Score:** 0.7961 (±0.0092)
- **ROC-AUC:** 0.9325 (±0.0037)

---

## 🖼️ Visualizations

All visualizations are saved in the `images/` directory:

1. **01_class_distribution.png** - Training/Test set class distribution
2. **02_feature_correlation.png** - Feature correlation with target variable
3. **03_correlation_matrix.png** - Feature correlation heatmap
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
5. **05_baseline_roc_curve.png** - Baseline ROC curve
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
7. **07_baseline_feature_importance.png** - Baseline feature importance
8. **08_cross_validation.png** - Cross-validation score distribution
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
10. **10_tuned_roc_curve.png** - Tuned ROC curve
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
12. **12_tuned_feature_importance.png** - Tuned feature importance
13. **13_model_comparison.png** - Baseline vs Tuned comparison

---

## 🚀 Usage Example

```python
import joblib
import pandas as pd

# Load model and scaler
model = joblib.load('twitter_bot_detection_v2.pkl')
scaler = joblib.load('twitter_scaler_v2.pkl')

# Prepare your data. The values below are illustrative raw (unscaled) inputs;
# replace them with real account data. Keep the keys in the feature order
# listed in this README so the columns match what the scaler and model expect.
data = {
    'has_custom_cover_image': 1,
    'description_length': 80,
    'favourites_count': 1200,
    'followers_count': 150,
    'friends_count': 300,
    'followers_to_friends_ratio': 0.5,
    'has_location': 1,
    'username_digit_count': 2,
    'username_length': 11,
    'statuses_count': 2500,
    'is_verified': 0,
    'account_age_days': 1800,
}

# Create a single-row DataFrame
df = pd.DataFrame([data])

# Scale features with the fitted MinMaxScaler
df_scaled = scaler.transform(df)

# Predict the class label and the per-class probabilities
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```

---

## 📋 Confusion Matrix Breakdown

### Tuned Model (Test Set)

```
                 Predicted
               Human     Bot
Actual Human    4676     309
       Bot       611    1891
```

- **True Negatives (TN):** 4,676 (Correctly identified humans)
- **False Positives (FP):** 309 (Humans incorrectly classified as bots)
- **False Negatives (FN):** 611 (Bots incorrectly classified as humans)
- **True Positives (TP):** 1,891 (Correctly identified bots)

---

## 🔍 Model Interpretation

### Strengths

- ROC-AUC of 0.9354 on the test set indicates strong discrimination between bots and humans
- High precision (0.8595) keeps false alarms low: only 309 of 4,985 humans (about 6.2%) are flagged as bots. Recall is lower (0.7558), so 611 of 2,502 bots (about 24.4%) go undetected
- Cross-validation scores are stable across folds (ROC-AUC 0.9325 ±0.0037)

### Key Insights

1. The five most important features (`followers_count`, `favourites_count`, `friends_count`, `statuses_count`, `account_age_days`) account for about 75% of total feature importance
2. GridSearchCV improved ROC-AUC over the baseline by 0.0040 (0.43% relative)
3. Test ROC-AUC (0.9354) is in line with the cross-validated mean (0.9325 ±0.0037), which suggests the model generalizes to unseen data

---

## 📝 Notes

- **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
- **Missing Values:** Filled with 0 during preprocessing
- **Class Balance:** Imbalanced dataset (about 67% human, 33% bot); the tuned model uses `class_weight='balanced'`
- **Model Type:** Random Forest, a bagged tree ensemble that is generally less prone to overfitting than a single decision tree

---

## 🔄 Model Updates

To retrain the model:

1. Place new training data in `../data/train_twitter.csv`
2. Run the training notebook: `5_enhanced_training.ipynb`
3. Update this README with new metrics

---

## 📧 Contact & Support

For questions or issues regarding this model, please refer to the main project documentation.

---

- **Generated:** 2025-11-27 12:08:54
- **Notebook:** `5_enhanced_training.ipynb`
- **Platform:** Twitter
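
The test-set accuracy, precision, recall, and F1 reported above can be cross-checked against the four counts in the Confusion Matrix Breakdown. The sketch below uses only those counts and the standard metric definitions; the variable names are illustrative and are not taken from the training notebook.

```python
# Counts copied from the tuned-model confusion matrix (test set).
tn, fp, fn, tp = 4676, 309, 611, 1891

total = tn + fp + fn + tp

# Standard definitions, with "bot" as the positive class.
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Share of humans wrongly flagged as bots, and share of bots missed.
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)

print(f"Samples:   {total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"FPR:       {false_positive_rate:.4f}")
print(f"FNR:       {false_negative_rate:.4f}")
```

ROC-AUC and Average Precision cannot be recovered this way, since they depend on the predicted probabilities rather than on a single decision threshold.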
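
For readers who want to size a retraining run, the number of candidates in a grid search is the product of the option counts per parameter. The sketch below enumerates the Parameter Search Space exactly as listed in this README, using plain `itertools` rather than the training notebook's code. Note that the listed space yields 432 candidates, not the 540 reported above, and that `class_weight` appears among the best parameters but not in the listed space, so the grid actually searched may have differed from the one documented here; that is an inference, not something this README confirms.

```python
from itertools import product

# Search space as listed in this README (it may not match the notebook exactly).
search_space = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

cv_folds = 5  # from the Training Configuration section

# Every combination of one value per parameter is one candidate model.
n_candidates = sum(1 for _ in product(*search_space.values()))

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(f"Candidates in listed grid: {n_candidates}")
print(f"Model fits with {cv_folds}-fold CV: {n_fits}")
```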
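
Several of the 12 features are derived rather than read directly from an account record. This README does not document how they were computed, so the following is only a minimal sketch under stated assumptions: the raw field names (`screen_name`, `created_at`) are hypothetical, the ratio is assumed to be followers divided by friends, and a zero friend count is assumed to map to 0.0. Check the training notebook before relying on any of these choices.

```python
from datetime import datetime


def derive_features(account: dict, now: datetime) -> dict:
    """Compute the derived features from a raw account record.

    The field names and formulas are assumptions for illustration only.
    """
    username = account["screen_name"]  # hypothetical raw field
    followers = account["followers_count"]
    friends = account["friends_count"]

    # Assumed definition: followers / friends, with 0.0 when friends is zero.
    ratio = followers / friends if friends > 0 else 0.0

    return {
        "username_length": len(username),
        "username_digit_count": sum(ch.isdigit() for ch in username),
        "followers_to_friends_ratio": ratio,
        "account_age_days": (now - account["created_at"]).days,
    }


example = derive_features(
    {
        "screen_name": "user12345",
        "followers_count": 150,
        "friends_count": 300,
        "created_at": datetime(2020, 1, 1),
    },
    now=datetime(2020, 1, 31),
)
print(example)
```

The resulting values would be merged with the remaining raw features before calling `scaler.transform`, as in the Usage Example.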