---
language: "en"
license: "apache-2.0"
library_name: "scikit-learn"
tags:
- "bot-detection"
- "twitter"
- "classification"
- "scikit-learn"
- "random-forest"
---


# Twitter Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.

**Model Version:** v2  
**Training Date:** 2025-11-27 12:08:54  
**Framework:** scikit-learn 1.5.2  
**Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning


---

## Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.8771 (87.71%) |
| **Precision**         | 0.8595 (85.95%) |
| **Recall**            | 0.7558 (75.58%) |
| **F1-Score**          | 0.8043 (80.43%) |
| **ROC-AUC**           | 0.9354 (93.54%) |
| **Average Precision** | 0.9008 (90.08%) |

### Model Improvement

- **Baseline ROC-AUC:** 0.9314
- **Tuned ROC-AUC:** 0.9354
- **Improvement:** +0.0040 absolute (0.43% relative)


---

## Files

| File                           | Description                            |
| ------------------------------ | -------------------------------------- |
| `twitter_bot_detection_v2.pkl` | Trained Random Forest model            |
| `twitter_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `twitter_features_v2.json`     | List of features used by the model     |
| `twitter_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                      | All visualization plots (13 images)    |
| `README.md`                    | This file                              |


---

## Dataset Information

### Training Configuration

- **Training Samples:** 29,951
- **Test Samples:** 7,487
- **Total Samples:** 37,438
- **Number of Features:** 12
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

**Training Set:**

- Human (0): 20,028 (66.87%)
- Bot (1): 9,923 (33.13%)

**Test Set:**

- Human (0): 4,985 (66.58%)
- Bot (1): 2,502 (33.42%)


---

## Features (12)

1. `has_custom_cover_image`
2. `description_length`
3. `favourites_count`
4. `followers_count`
5. `friends_count`
6. `followers_to_friends_ratio`
7. `has_location`
8. `username_digit_count`
9. `username_length`
10. `statuses_count`
11. `is_verified`
12. `account_age_days`

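
Several of these features are derived rather than read directly from a profile. As a rough sketch of how such a row might be assembled, the function below builds the 12 features from a raw profile dictionary. The helper name `extract_features`, the raw field names (`screen_name`, `description`, `location`, `profile_banner_url`, `verified`, `created_at`), and the choice to set the ratio to 0 when `friends_count` is 0 are all assumptions made for illustration; the preprocessing in the training notebook may differ.

```python
from datetime import datetime, timezone

# Feature order as listed in this README
FEATURES = [
    "has_custom_cover_image", "description_length", "favourites_count",
    "followers_count", "friends_count", "followers_to_friends_ratio",
    "has_location", "username_digit_count", "username_length",
    "statuses_count", "is_verified", "account_age_days",
]


def extract_features(profile: dict, now: datetime) -> dict:
    """Hypothetical helper: map a raw profile dict to the 12 model features."""
    username = profile.get("screen_name") or ""
    description = profile.get("description") or ""
    followers = profile.get("followers_count") or 0
    friends = profile.get("friends_count") or 0
    row = {
        "has_custom_cover_image": int(bool(profile.get("profile_banner_url"))),
        "description_length": len(description),
        "favourites_count": profile.get("favourites_count") or 0,
        "followers_count": followers,
        "friends_count": friends,
        # Assumption: the ratio is 0 when the account follows nobody
        "followers_to_friends_ratio": followers / friends if friends else 0.0,
        "has_location": int(bool(profile.get("location"))),
        "username_digit_count": sum(ch.isdigit() for ch in username),
        "username_length": len(username),
        "statuses_count": profile.get("statuses_count") or 0,
        "is_verified": int(bool(profile.get("verified"))),
        # Assumption: created_at is a timezone-aware datetime
        "account_age_days": (now - profile["created_at"]).days,
    }
    # Return the features in the documented order
    return {name: row[name] for name in FEATURES}


# Illustrative profile; every value here is made up
example = extract_features(
    {
        "screen_name": "user_2024x",
        "description": "Hello world",
        "followers_count": 150,
        "friends_count": 300,
        "favourites_count": 250,
        "statuses_count": 1200,
        "location": "Berlin",
        "verified": False,
        "created_at": datetime(2023, 11, 27, tzinfo=timezone.utc),
    },
    now=datetime(2025, 11, 27, tzinfo=timezone.utc),
)
print(example)
```
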

---

## Top 5 Most Important Features

1. **followers_count** - 0.1895
2. **favourites_count** - 0.1813
3. **friends_count** - 0.1494
4. **statuses_count** - 0.1244
5. **account_age_days** - 0.1010

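
scikit-learn normalizes a random forest's `feature_importances_` so that they sum to 1 across all features. Under that convention, the short sketch below ranks the five values listed above and accumulates them to show how much of the total importance they cover. The variable names are illustrative; only the five importance values come from this README.

```python
# Importance values copied from the list above
top_features = {
    "followers_count": 0.1895,
    "favourites_count": 0.1813,
    "friends_count": 0.1494,
    "statuses_count": 0.1244,
    "account_age_days": 0.1010,
}

# Rank from most to least important
ranked = sorted(top_features.items(), key=lambda kv: kv[1], reverse=True)

# Running total of importance as features are added in rank order
cumulative = 0.0
for name, importance in ranked:
    cumulative += importance
    print(f"{name:<18} {importance:.4f}  cumulative {cumulative:.4f}")

top5_share = sum(top_features.values())
remaining_share = 1.0 - top5_share  # shared by the other seven features
print(f"Top 5 share: {top5_share:.4f}, remaining: {remaining_share:.4f}")
```
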

---

## Hyperparameters

### Best Parameters (from GridSearchCV)

- **class_weight:** balanced
- **max_depth:** 20
- **max_features:** sqrt
- **min_samples_leaf:** 1
- **min_samples_split:** 2
- **n_estimators:** 300

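
The `class_weight: balanced` setting reweights samples inversely to class frequency. scikit-learn documents the formula as `n_samples / (n_classes * np.bincount(y))`; the sketch below applies it to the training-set class counts reported in this README. It is a plain-Python illustration of the arithmetic, not code taken from the training notebook.

```python
# Training-set class counts from the Class Distribution section
train_counts = {0: 20_028, 1: 9_923}  # 0 = human, 1 = bot

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "balanced" weight per class: n_samples / (n_classes * class_count)
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.4f}")
```
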

### Parameter Search Space

- **n_estimators:** [100, 200, 300]
- **max_depth:** [10, 15, 20, None]
- **min_samples_split:** [2, 5, 10]
- **min_samples_leaf:** [1, 2, 4]
- **max_features:** ['sqrt', 'log2']
- **bootstrap:** [True, False]

**Total combinations tested:** 540

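
An exhaustive grid search evaluates every combination of the listed values, so the number of candidates is the product of the list lengths. The sketch below enumerates the grid exactly as written above; it is a counting aid only and does not reproduce the notebook's search. Enumerated this way, the lists yield 432 candidates, which differs from the 540 reported above, so the grid actually searched may not match this list exactly and is worth checking against the notebook.

```python
from itertools import product

# Search space exactly as listed in this README
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# Every combination an exhaustive grid search would evaluate
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(combinations)

# Each candidate is fitted once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print(f"candidates: {n_candidates}, fits with {n_folds}-fold CV: {n_fits}")
```
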

---

## Cross-Validation Results

### Mean Scores (5-Fold Stratified CV)

- **Accuracy:** 0.8750 (±0.0053)
- **Precision:** 0.8658 (±0.0089)
- **Recall:** 0.7368 (±0.0113)
- **F1-Score:** 0.7961 (±0.0092)
- **ROC-AUC:** 0.9325 (±0.0037)

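
A summary of this shape can be produced with scikit-learn's `cross_validate` and a `StratifiedKFold` splitter. The sketch below is a minimal illustration on synthetic data from `make_classification`, so its scores will not match the numbers above. The shuffling, the reduced `n_estimators=50` (chosen only to keep the example fast), and reading the ± values as the standard deviation across folds are assumptions, not details confirmed by the training notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data: 12 features, roughly a 67% / 33% class split
X, y = make_classification(
    n_samples=600,
    n_features=12,
    n_informative=6,
    weights=[0.67, 0.33],
    random_state=42,
)

# Tuned settings from this README, with fewer trees for speed
model = RandomForestClassifier(
    n_estimators=50,
    max_depth=20,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Mean and standard deviation across the five folds
summary = {}
for metric in scoring:
    fold_scores = scores[f"test_{metric}"]
    summary[metric] = (fold_scores.mean(), fold_scores.std())
    print(f"{metric:<10} {fold_scores.mean():.4f} (±{fold_scores.std():.4f})")
```
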

---

## Visualizations

All visualizations are saved in the `images/` directory:

1. **01_class_distribution.png** - Training/Test set class distribution
2. **02_feature_correlation.png** - Feature correlation with target variable
3. **03_correlation_matrix.png** - Feature correlation heatmap
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
5. **05_baseline_roc_curve.png** - Baseline ROC curve
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
7. **07_baseline_feature_importance.png** - Baseline feature importance
8. **08_cross_validation.png** - Cross-validation score distribution
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
10. **10_tuned_roc_curve.png** - Tuned ROC curve
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
12. **12_tuned_feature_importance.png** - Tuned feature importance
13. **13_model_comparison.png** - Baseline vs Tuned comparison


---

## Usage Example

```python
import joblib
import pandas as pd

# Load the trained model and the scaler fitted on the training data
model = joblib.load('twitter_bot_detection_v2.pkl')
scaler = joblib.load('twitter_scaler_v2.pkl')

# Prepare your data: raw (unscaled) feature values, in the feature order
# listed in this README. The values below are illustrative only.
data = {
    'has_custom_cover_image': 1,
    'description_length': 80,
    'favourites_count': 250,
    'followers_count': 150,
    'friends_count': 300,
    'followers_to_friends_ratio': 0.5,
    'has_location': 1,
    'username_digit_count': 2,
    'username_length': 12,
    'statuses_count': 1200,
    'is_verified': 0,
    'account_age_days': 900,
}

# Create a single-row DataFrame
df = pd.DataFrame([data])

# Scale features with the fitted MinMaxScaler
df_scaled = scaler.transform(df)

# Predict the class label and the class probabilities
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```


---

## Confusion Matrix Breakdown

### Tuned Model (Test Set)

```
                 Predicted
                 Human    Bot
Actual  Human     4676    309
        Bot        611   1891
```

- **True Negatives (TN):** 4,676 (Correctly identified humans)
- **False Positives (FP):** 309 (Humans incorrectly classified as bots)
- **False Negatives (FN):** 611 (Bots incorrectly classified as humans)
- **True Positives (TP):** 1,891 (Correctly identified bots)

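
The headline test-set metrics follow directly from these four counts. As a quick check, the sketch below recomputes accuracy, precision, recall, and F1 using the standard definitions, with bot (1) as the positive class; the variable names are illustrative.

```python
# Counts from the tuned-model confusion matrix above
tn, fp, fn, tp = 4676, 309, 611, 1891
total = tn + fp + fn + tp

accuracy = (tp + tn) / total                 # share of all accounts classified correctly
precision = tp / (tp + fp)                   # share of predicted bots that are bots
recall = tp / (tp + fn)                      # share of actual bots that are caught
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)         # share of humans flagged as bots

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"FPR:       {false_positive_rate:.4f}")
```
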
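
---

## Model Interpretation

### Strengths

- High test-set ROC-AUC (0.9354) indicates strong separation between bot and human accounts
- Low false-positive rate: 309 of 4,985 humans (6.2%) are flagged as bots, giving a precision of 0.8595; recall is lower at 0.7558, so roughly one in four bots is missed
- Stable cross-validation performance (ROC-AUC 0.9325 ± 0.0037 across 5 folds)

### Key Insights

1. The five most important features (`followers_count`, `favourites_count`, `friends_count`, `statuses_count`, `account_age_days`) account for about 75% of total feature importance
2. GridSearchCV raised ROC-AUC from 0.9314 to 0.9354 (+0.0040, a 0.43% relative gain over the baseline)
3. Test-set ROC-AUC (0.9354) is in line with the cross-validation mean (0.9325), which suggests the model generalizes to unseen data
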

---

## Notes

- **Feature Scaling:** All features are scaled to the [0, 1] range with MinMaxScaler; apply the saved scaler to raw feature values before prediction
- **Missing Values:** Filled with 0 during preprocessing
- **Class Balance:** Imbalanced dataset (about 67% human, 33% bot), addressed with `class_weight: balanced`
- **Model Type:** Random Forest, a bagged ensemble of decision trees that is less prone to overfitting than a single tree

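
To make the first two notes concrete, the sketch below fills missing values with 0 and applies the min-max formula `(x - min) / (max - min)` by hand, with the minimum and maximum taken from the training column. The helper names and sample values are illustrative, and returning 0 for a constant column is a simplification. The last line shows that a new value above the training maximum maps outside [0, 1], which is also how scikit-learn's `MinMaxScaler` behaves with its default `clip=False`.

```python
def fit_min_max(column):
    """Return (min, max) of a training column after filling missing values with 0."""
    filled = [0.0 if value is None else float(value) for value in column]
    return min(filled), max(filled)


def min_max_scale(value, lo, hi):
    """Scale one value with the training min and max; missing values become 0 first."""
    value = 0.0 if value is None else float(value)
    if hi == lo:
        return 0.0  # simplification for a constant training column
    return (value - lo) / (hi - lo)


# Illustrative training column with one missing entry
train_followers = [10, None, 250, 1000]
lo, hi = fit_min_max(train_followers)

scaled_train = [min_max_scale(v, lo, hi) for v in train_followers]
scaled_new = min_max_scale(2500, lo, hi)  # above the training maximum

print(scaled_train)
print(scaled_new)
```
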

---

## Model Updates

To retrain the model:

1. Place new training data in `../data/train_twitter.csv`
2. Run the training notebook: `5_enhanced_training.ipynb`
3. Update this README with new metrics


---

## Contact & Support

For questions or issues regarding this model, please refer to the main project documentation.

---

**Generated:** 2025-11-27 12:08:54  
**Notebook:** `5_enhanced_training.ipynb`  
**Platform:** Twitter