---
language: "en"
license: "apache-2.0"
library_name: "scikit-learn"
tags:
- "bot-detection"
- "twitter"
- "classification"
- "scikit-learn"
- "random-forest"
---
# Twitter Bot Detection Model
## Overview
This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.
**Model Version:** v2
**Training Date:** 2025-11-27 12:08:54
**Framework:** scikit-learn 1.5.2
**Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
---
## 📊 Model Performance
### Final Metrics (Test Set)
| Metric | Score |
| --------------------- | --------------- |
| **Accuracy** | 0.8771 (87.71%) |
| **Precision** | 0.8595 (85.95%) |
| **Recall** | 0.7558 (75.58%) |
| **F1-Score** | 0.8043 (80.43%) |
| **ROC-AUC** | 0.9354 (93.54%) |
| **Average Precision** | 0.9008 (90.08%) |
### Model Improvement
- **Baseline ROC-AUC:** 0.9314
- **Tuned ROC-AUC:** 0.9354
- **Improvement:** 0.0040 absolute (0.43% relative to the baseline)
---
## 🗂️ Files
| File | Description |
| ------------------------------ | -------------------------------------- |
| `twitter_bot_detection_v2.pkl` | Trained Random Forest model |
| `twitter_scaler_v2.pkl` | MinMaxScaler for feature normalization |
| `twitter_features_v2.json` | List of features used by the model |
| `twitter_metrics_v2.txt` | Detailed performance metrics report |
| `images/` | All visualization plots (13 images) |
| `README.md` | This file |
---
## 🎯 Dataset Information
### Training Configuration
- **Training Samples:** 29,951
- **Test Samples:** 7,487
- **Total Samples:** 37,438
- **Number of Features:** 12
- **Cross-Validation Folds:** 5
- **Random State:** 42
### Class Distribution
**Training Set:**
- Human (0): 20,028 (66.87%)
- Bot (1): 9,923 (33.13%)
**Test Set:**
- Human (0): 4,985 (66.58%)
- Bot (1): 2,502 (33.42%)
---
## 🔧 Features (12)
1. `has_custom_cover_image`
2. `description_length`
3. `favourites_count`
4. `followers_count`
5. `friends_count`
6. `followers_to_friends_ratio`
7. `has_location`
8. `username_digit_count`
9. `username_length`
10. `statuses_count`
11. `is_verified`
12. `account_age_days`
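
This README does not say how the 12 columns are derived from a raw profile. The sketch below shows one plausible derivation. The helper `build_features` is hypothetical, the input keys (`screen_name`, `profile_banner_url`, `verified`, and so on) assume a Twitter v1.1-style user object, and both the zero-friends handling and the reference date are assumptions. Check the training notebook before relying on any of it.

```python
from datetime import datetime, timezone

# Feature names in the order listed above.
FEATURES = [
    "has_custom_cover_image", "description_length", "favourites_count",
    "followers_count", "friends_count", "followers_to_friends_ratio",
    "has_location", "username_digit_count", "username_length",
    "statuses_count", "is_verified", "account_age_days",
]


def build_features(profile: dict, now: datetime) -> dict:
    """Hypothetical helper: map a raw profile dict to the 12 model features."""
    username = profile.get("screen_name") or ""
    description = profile.get("description") or ""
    followers = profile.get("followers_count") or 0
    friends = profile.get("friends_count") or 0
    row = {
        "has_custom_cover_image": int(bool(profile.get("profile_banner_url"))),
        "description_length": len(description),
        "favourites_count": profile.get("favourites_count") or 0,
        "followers_count": followers,
        "friends_count": friends,
        # Assumption: the ratio is 0 when the account follows nobody.
        "followers_to_friends_ratio": followers / friends if friends else 0.0,
        "has_location": int(bool(profile.get("location"))),
        "username_digit_count": sum(ch.isdigit() for ch in username),
        "username_length": len(username),
        "statuses_count": profile.get("statuses_count") or 0,
        "is_verified": int(bool(profile.get("verified"))),
        "account_age_days": (now - profile["created_at"]).days,
    }
    # Return the features in the documented order.
    return {name: row[name] for name in FEATURES}


# Illustrative profile, not real account data.
example = build_features(
    {
        "screen_name": "user12345", "description": "hello",
        "followers_count": 150, "friends_count": 300,
        "favourites_count": 10, "statuses_count": 42,
        "location": "", "verified": False, "profile_banner_url": None,
        "created_at": datetime(2020, 1, 1, tzinfo=timezone.utc),
    },
    now=datetime(2025, 11, 27, tzinfo=timezone.utc),
)
print(example)
```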
---
## 🏆 Top 5 Most Important Features
1. **followers_count** - 0.1895
2. **favourites_count** - 0.1813
3. **friends_count** - 0.1494
4. **statuses_count** - 0.1244
5. **account_age_days** - 0.1010
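
The scores above look like the forest's impurity-based `feature_importances_`, although this README does not say so explicitly. Under that assumption, a ranking like this one can be produced as sketched below. The helper `top_features` and the synthetic stand-in data are illustrative only, so the printed numbers will not match the real model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_features(model, feature_names, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1][:k]  # indices, largest first
    return [(feature_names[i], float(importances[i])) for i in order]


# Stand-in data and model so the sketch runs without the real pickle.
X, y = make_classification(n_samples=300, n_features=12, random_state=42)
names = [f"feature_{i}" for i in range(12)]
demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

ranking = top_features(demo_model, names)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

With the real files, `demo_model` would be replaced by the loaded `twitter_bot_detection_v2.pkl` and `names` by the 12 feature names.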
---
## ⚙️ Hyperparameters
### Best Parameters (from GridSearchCV)
- **class_weight:** balanced
- **max_depth:** 20
- **max_features:** sqrt
- **min_samples_leaf:** 1
- **min_samples_split:** 2
- **n_estimators:** 300
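
To make `class_weight: balanced` concrete, the sketch below computes the weights scikit-learn derives for that setting, `n_samples / (n_classes * count)`, using the training-set class counts from this README. It is only an illustration: during cross-validation the weights are recomputed on each fold's training portion, so the values used in practice differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Training-set labels rebuilt from the counts in this README:
# 20,028 humans (0) and 9,923 bots (1).
y_train = np.array([0] * 20028 + [1] * 9923)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
class_weights = dict(zip(["human", "bot"], weights.tolist()))
print(class_weights)
```

The bot class ends up with roughly twice the weight of the human class, which offsets the roughly 2:1 class ratio.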
### Parameter Search Space
- **n_estimators:** [100, 200, 300]
- **max_depth:** [10, 15, 20, None]
- **min_samples_split:** [2, 5, 10]
- **min_samples_leaf:** [1, 2, 4]
- **max_features:** ['sqrt', 'log2']
- **bootstrap:** [True, False]
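**Total combinations tested:** 540 (as reported by the training run; the six value lists above alone multiply out to 3 × 4 × 3 × 3 × 2 × 2 = 432)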
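
A search over the space listed above could be set up roughly as follows. This is a sketch, not the notebook's actual code: the `roc_auc` scoring, the shuffled `StratifiedKFold`, and fixing `class_weight="balanced"` on the base estimator are assumptions inferred from the metrics and settings reported in this README, and `X_train_scaled` / `y_train` are placeholder names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Search space copied from the list above.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

search = GridSearchCV(
    # Assumption: class_weight is fixed rather than searched.
    estimator=RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",  # assumption: tuning optimizes ROC-AUC
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
)

# Fitting is left commented out because it needs the training data:
# search.fit(X_train_scaled, y_train)
# print(search.best_params_)
```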
---
## 📈 Cross-Validation Results
### Mean Scores (5-Fold Stratified CV)
- **Accuracy:** 0.8750 (±0.0053)
- **Precision:** 0.8658 (±0.0089)
- **Recall:** 0.7368 (±0.0113)
- **F1-Score:** 0.7961 (±0.0092)
- **ROC-AUC:** 0.9325 (±0.0037)
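
Scores in this form can be produced with `cross_validate` and a stratified 5-fold splitter, as sketched below. The data here is synthetic so that the snippet runs on its own, and reading the ± value as the standard deviation across folds is an assumption, since this README does not define it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in with roughly the same 67/33 class split and 12 features.
X, y = make_classification(
    n_samples=600, n_features=12, weights=[0.67, 0.33], random_state=42
)

model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Mean and standard deviation of each metric across the 5 folds.
summary = {
    metric: (results[f"test_{metric}"].mean(), results[f"test_{metric}"].std())
    for metric in scoring
}
for metric, (mean, std) in summary.items():
    print(f"{metric}: {mean:.4f} (±{std:.4f})")
```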
---
## 🖼️ Visualizations
All visualizations are saved in the `images/` directory:
1. **01_class_distribution.png** - Training/Test set class distribution
2. **02_feature_correlation.png** - Feature correlation with target variable
3. **03_correlation_matrix.png** - Feature correlation heatmap
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
5. **05_baseline_roc_curve.png** - Baseline ROC curve
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
7. **07_baseline_feature_importance.png** - Baseline feature importance
8. **08_cross_validation.png** - Cross-validation score distribution
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
10. **10_tuned_roc_curve.png** - Tuned ROC curve
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
12. **12_tuned_feature_importance.png** - Tuned feature importance
13. **13_model_comparison.png** - Baseline vs Tuned comparison
---
## 🚀 Usage Example
```python
import joblib
import pandas as pd

# Load the trained model and the scaler fitted during training
model = joblib.load('twitter_bot_detection_v2.pkl')
scaler = joblib.load('twitter_scaler_v2.pkl')

# Raw (unscaled) feature values for one account; the numbers are illustrative.
# Keys are listed in the same order as the Features section.
data = {
    'has_custom_cover_image': 1,
    'description_length': 80,
    'favourites_count': 250,
    'followers_count': 150,
    'friends_count': 300,
    'followers_to_friends_ratio': 0.5,
    'has_location': 1,
    'username_digit_count': 2,
    'username_length': 11,
    'statuses_count': 1200,
    'is_verified': 0,
    'account_age_days': 1500,
}

# Build a one-row DataFrame (column order follows the dict above)
df = pd.DataFrame([data])

# Scale with the fitted scaler; transform() returns a NumPy array
df_scaled = scaler.transform(df)

# Predict the class (0 = Human, 1 = Bot) and the class probabilities
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]
print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```
---
## 📋 Confusion Matrix Breakdown
### Tuned Model (Test Set)
```
                 Predicted
                 Human    Bot
Actual  Human     4676    309
        Bot        611   1891
```
- **True Negatives (TN):** 4,676 (Correctly identified humans)
- **False Positives (FP):** 309 (Humans incorrectly classified as bots)
- **False Negatives (FN):** 611 (Bots incorrectly classified as humans)
- **True Positives (TP):** 1,891 (Correctly identified bots)
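
As a consistency check, the headline test-set metrics can be recomputed from these four counts. The sketch below uses only the numbers in the breakdown above; the per-class figures for humans are derived here and are not reported elsewhere in this README.

```python
# Counts from the tuned model's confusion matrix (bot = positive class).
tn, fp, fn, tp = 4676, 309, 611, 1891

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted bots that are bots
recall = tp / (tp + fn)      # share of actual bots that are caught
f1 = 2 * precision * recall / (precision + recall)

# The same two ratios with human as the class of interest.
human_precision = tn / (tn + fn)
human_recall = tn / (tn + fp)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"Human precision: {human_precision:.4f}")
print(f"Human recall:    {human_recall:.4f}")
```

The first four values reproduce the Accuracy, Precision, Recall, and F1-Score rows of the Final Metrics table.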
---
## 🔍 Model Interpretation
### Strengths
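- High ROC-AUC (0.9354) indicates strong separation between bot and human accounts
- Comparable precision for both classes (bot 0.8595, human 0.8844, derived from the confusion matrix above); bot recall (0.7558) trails human recall (0.9380), so about one in four bots is missed
- Stable cross-validation performance (ROC-AUC 0.9325 ±0.0037 across 5 folds)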
### Key Insights
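1. The five most important features (`followers_count`, `favourites_count`, `friends_count`, `statuses_count`, `account_age_days`) have a combined importance of 0.7456
2. GridSearchCV tuning raised ROC-AUC from 0.9314 to 0.9354 (+0.0040, or 0.43% relative to the baseline)
3. Test-set scores are close to the cross-validation means (for example, ROC-AUC 0.9354 vs. 0.9325), which suggests the model generalizes to unseen data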
---
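## 📝 Notes
- **Feature Scaling:** All features are scaled to the [0, 1] range with MinMaxScaler (`twitter_scaler_v2.pkl`)
- **Missing Values:** Filled with 0 during preprocessing
- **Class Balance:** Imbalanced (about 67% human, 33% bot), handled with `class_weight: balanced`
- **Model Type:** Random Forest, a bagged ensemble of decision trees that is less prone to overfitting than a single tree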
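
The two preprocessing notes above (fill missing values with 0, then min-max scale) can be sketched as follows. The tiny two-column frame is made up for illustration, and the exact order of operations in the training notebook is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up sample with one missing value.
raw = pd.DataFrame({
    "followers_count": [10, 2500, np.nan],
    "friends_count": [100, 50, 400],
})

filled = raw.fillna(0)   # missing values become 0
scaler = MinMaxScaler()  # maps each column to [0, 1] using its min and max
scaled = pd.DataFrame(scaler.fit_transform(filled), columns=filled.columns)
print(scaled)
```

At inference time the already fitted scaler should be applied with `transform` only. Values outside the range seen during training then fall outside [0, 1] unless the scaler was created with `clip=True`.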
---
## 🔄 Model Updates
To retrain the model:
1. Place new training data in `../data/train_twitter.csv`
2. Run the training notebook: `5_enhanced_training.ipynb`
3. Update this README with new metrics
---
## 📧 Contact & Support
For questions or issues regarding this model, please refer to the main project documentation.
---
**Generated:** 2025-11-27 12:08:54
**Notebook:** `5_enhanced_training.ipynb`
**Platform:** Twitter