---
language: "en"
license: "apache-2.0"
library_name: "scikit-learn"
tags:
  - "bot-detection"
  - "twitter"
  - "classification"
  - "scikit-learn"
  - "random-forest"
---

# Twitter Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.

- **Model Version:** v2
- **Training Date:** 2025-11-27 12:08:54
- **Framework:** scikit-learn 1.5.2
- **Algorithm:** Random Forest Classifier with GridSearchCV hyperparameter tuning

---

## 📊 Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.8771 (87.71%) |
| **Precision**         | 0.8595 (85.95%) |
| **Recall**            | 0.7558 (75.58%) |
| **F1-Score**          | 0.8043 (80.43%) |
| **ROC-AUC**           | 0.9354 (93.54%) |
| **Average Precision** | 0.9008 (90.08%) |

### Model Improvement

- **Baseline ROC-AUC:** 0.9314
- **Tuned ROC-AUC:** 0.9354
- **Improvement:** 0.0040 absolute (0.43% relative to the baseline)
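The improvement figures can be read as an absolute ROC-AUC gain and that gain expressed relative to the baseline. A quick check of the arithmetic, using only the two scores reported above:

```python
baseline_auc = 0.9314
tuned_auc = 0.9354

# Absolute gain in ROC-AUC, and the same gain relative to the baseline score
absolute_gain = tuned_auc - baseline_auc
relative_gain_pct = absolute_gain / baseline_auc * 100

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain_pct:.2f}%")
```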

---

## 🗂️ Files

| File                           | Description                            |
| ------------------------------ | -------------------------------------- |
| `twitter_bot_detection_v2.pkl` | Trained Random Forest model            |
| `twitter_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `twitter_features_v2.json`     | List of features used by the model     |
| `twitter_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                      | All visualization plots (13 images)    |
| `README.md`                    | This file                              |

---

## 🎯 Dataset Information

### Training Configuration

- **Training Samples:** 29,951
- **Test Samples:** 7,487
- **Total Samples:** 37,438
- **Number of Features:** 12
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

**Training Set:**

- Human (0): 20,028 (66.87%)
- Bot (1): 9,923 (33.13%)

**Test Set:**

- Human (0): 4,985 (66.58%)
- Bot (1): 2,502 (33.42%)

---

## 🔧 Features (12)

1. `has_custom_cover_image`
2. `description_length`
3. `favourites_count`
4. `followers_count`
5. `friends_count`
6. `followers_to_friends_ratio`
7. `has_location`
8. `username_digit_count`
9. `username_length`
10. `statuses_count`
11. `is_verified`
12. `account_age_days`
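Several of these features are derived from raw profile fields rather than read directly. This README does not document the exact derivations, so the sketch below is a guess based on the feature names alone: the function `derive_features`, its arguments, and the zero-friends fallback are all hypothetical, and the training notebook may compute them differently.

```python
from typing import Optional


def derive_features(username: str, description: Optional[str],
                    followers: int, friends: int) -> dict:
    """Hypothetical derivations inferred from the feature names only."""
    return {
        "username_length": len(username),
        "username_digit_count": sum(ch.isdigit() for ch in username),
        "description_length": len(description or ""),
        # Assumption: fall back to the follower count when friends == 0
        "followers_to_friends_ratio": followers / friends if friends else float(followers),
    }


example = derive_features("user_4821", "Coffee and code.", followers=150, friends=300)
print(example)
```

Whatever derivation is used at inference time has to match the one used during training, otherwise the scaler and the model see values on a different footing.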

---

## 🏆 Top 5 Most Important Features

1. **followers_count** - 0.1895
2. **favourites_count** - 0.1813
3. **friends_count** - 0.1494
4. **statuses_count** - 0.1244
5. **account_age_days** - 0.1010
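A ranking like this can be produced by pairing feature names with their importance scores and sorting. The helper `top_features` below is illustrative and not taken from the training notebook; with the loaded model, the scores would presumably come from the forest's `feature_importances_` attribute, which scikit-learn normalizes to sum to 1. Here it is applied to the five values reported above:

```python
def top_features(names, importances, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Importances reported in this README for the five leading features
reported = {
    "account_age_days": 0.1010,
    "friends_count": 0.1494,
    "followers_count": 0.1895,
    "statuses_count": 0.1244,
    "favourites_count": 0.1813,
}

top = top_features(reported.keys(), reported.values())
share = sum(score for _, score in top)

for name, score in top:
    print(f"{name}: {score:.4f}")
print(f"Combined share of total importance: {share:.4f}")
```

On these numbers the five features together carry roughly three quarters of the total importance, leaving about a quarter for the remaining seven.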

---

## ⚙️ Hyperparameters

### Best Parameters (from GridSearchCV)

- **class_weight:** balanced
- **max_depth:** 20
- **max_features:** sqrt
- **min_samples_leaf:** 1
- **min_samples_split:** 2
- **n_estimators:** 300
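For context on `class_weight: balanced`: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training-set class counts from this README. The helper name is made up for illustration, and the weights used inside each cross-validation fold would differ slightly because each fold trains on a subset.

```python
def balanced_class_weights(class_counts):
    """Weights per class as n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


# Training-set counts reported above: 0 = human, 1 = bot
train_counts = {0: 20028, 1: 9923}
weights = balanced_class_weights(train_counts)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

Under this formula, errors on the minority bot class count about twice as much as errors on the human class.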

### Parameter Search Space

- **n_estimators:** [100, 200, 300]
- **max_depth:** [10, 15, 20, None]
- **min_samples_split:** [2, 5, 10]
- **min_samples_leaf:** [1, 2, 4]
- **max_features:** ['sqrt', 'log2']
- **bootstrap:** [True, False]

**Total combinations tested:** 540
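The size of an exhaustive grid is the product of the number of values per parameter, and the number of model fits is that product times the number of folds. A small check against the search space exactly as listed above (5 folds, per the training configuration):

```python
from math import prod

# Search space exactly as listed in this README
search_space = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

n_combinations = prod(len(values) for values in search_space.values())
cv_folds = 5
n_fits = n_combinations * cv_folds

print(f"Combinations in the listed grid: {n_combinations}")
print(f"Fits with {cv_folds}-fold CV: {n_fits}")
```

The listed grid yields 432 combinations, not the 540 reported, so the grid actually searched may differ from this list. For instance, `class_weight` appears among the best parameters but not in the listed search space.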

---

## 📈 Cross-Validation Results

### Mean Scores (5-Fold Stratified CV)

- **Accuracy:** 0.8750 (±0.0053)
- **Precision:** 0.8658 (±0.0089)
- **Recall:** 0.7368 (±0.0113)
- **F1-Score:** 0.7961 (±0.0092)
- **ROC-AUC:** 0.9325 (±0.0037)
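Scores of this shape can be produced with `StratifiedKFold` and `cross_validate`. The sketch below is only an illustration of the procedure: it runs on synthetic stand-in data rather than the Twitter dataset, uses a smaller forest to stay fast, and assumes the ± values above are standard deviations across folds, which this README does not state.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (NOT the Twitter dataset): 12 features, ~33% positives
X, y = make_classification(
    n_samples=600, n_features=12, weights=[0.67, 0.33], random_state=42
)

clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# cross_validate returns one array of per-fold scores per metric
scores = cross_validate(clf, X, y, cv=cv, scoring=metrics)

for name in metrics:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.4f} (±{fold_scores.std():.4f})")
```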

---

## 🖼️ Visualizations

All visualizations are saved in the `images/` directory:

1. **01_class_distribution.png** - Training/Test set class distribution
2. **02_feature_correlation.png** - Feature correlation with target variable
3. **03_correlation_matrix.png** - Feature correlation heatmap
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
5. **05_baseline_roc_curve.png** - Baseline ROC curve
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
7. **07_baseline_feature_importance.png** - Baseline feature importance
8. **08_cross_validation.png** - Cross-validation score distribution
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
10. **10_tuned_roc_curve.png** - Tuned ROC curve
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
12. **12_tuned_feature_importance.png** - Tuned feature importance
13. **13_model_comparison.png** - Baseline vs Tuned comparison

---

## 🚀 Usage Example

```python
import joblib
import pandas as pd

# Load the trained model and the fitted MinMaxScaler
model = joblib.load('twitter_bot_detection_v2.pkl')
scaler = joblib.load('twitter_scaler_v2.pkl')

# Raw (unscaled) feature values for one account. The numbers are
# illustrative only; the key order is assumed to match the feature
# order used when the scaler and model were fitted.
data = {
    'has_custom_cover_image': 1,
    'description_length': 80,
    'favourites_count': 1200,
    'followers_count': 150,
    'friends_count': 300,
    'followers_to_friends_ratio': 0.5,
    'has_location': 1,
    'username_digit_count': 2,
    'username_length': 12,
    'statuses_count': 3400,
    'is_verified': 0,
    'account_age_days': 1500,
}

# Create a single-row DataFrame (columns keep the dict's key order)
df = pd.DataFrame([data])

# Scale the raw values with the scaler fitted at training time
df_scaled = scaler.transform(df)

# Predict the class label and the per-class probabilities
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```

---

## 📋 Confusion Matrix Breakdown

### Tuned Model (Test Set)

```
                  Predicted
                Human     Bot
Actual  Human    4676     309
        Bot       611    1891
```

- **True Negatives (TN):** 4,676 (Correctly identified humans)
- **False Positives (FP):** 309 (Humans incorrectly classified as bots)
- **False Negatives (FN):** 611 (Bots incorrectly classified as humans)
- **True Positives (TP):** 1,891 (Correctly identified bots)
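The headline test-set metrics follow directly from these four counts. A short check, treating "bot" as the positive class:

```python
# Counts from the tuned-model confusion matrix above
tn, fp, fn, tp = 4676, 309, 611, 1891

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted bots that are bots
recall = tp / (tp + fn)      # share of actual bots that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

The results agree with the accuracy, precision, recall, and F1 values in the Final Metrics table.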

---

## 🔍 Model Interpretation

### Strengths

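- High ROC-AUC (0.9354) indicates strong separation between bot and human accounts across thresholds
- Precision (0.8595) is higher than recall (0.7558) for the bot class: few humans are flagged as bots (309 of 4,985), but about a quarter of bots are missed (611 of 2,502)
- Stable cross-validation performance: ROC-AUC of 0.9325 (±0.0037) across 5 folds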

### Key Insights

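1. The five most important features (`followers_count`, `favourites_count`, `friends_count`, `statuses_count`, `account_age_days`) account for about 75% of total feature importance
2. GridSearchCV raised ROC-AUC from 0.9314 to 0.9354, a gain of 0.0040 (about 0.43% relative to the baseline)
3. Test-set ROC-AUC (0.9354) is close to the cross-validated mean (0.9325 ±0.0037), which suggests the model generalizes to unseen data from the same distribution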

---

## 📝 Notes

- **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
- **Missing Values:** Filled with 0 during preprocessing
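- **Class Balance:** Imbalanced dataset (about 67% human, 33% bot); the tuned model uses `class_weight: balanced`
- **Model Type:** Random Forest, a bagged ensemble of decision trees that is less prone to overfitting than a single tree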
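To illustrate the scaling note: with its default `feature_range` of (0, 1), min-max scaling maps a value to `(x - min) / (max - min)`, using the minimum and maximum seen during training. The sketch below is a plain-Python rendering of that formula, not the saved scaler itself, and the training range of 0 to 1,000 followers is made up for the example.

```python
def min_max_scale(value, train_min, train_max):
    """Scale one value using the min and max observed during training."""
    if train_max <= train_min:
        raise ValueError("train_max must be greater than train_min")
    return (value - train_min) / (train_max - train_min)


# Hypothetical training range for followers_count: 0 to 1000
inside = min_max_scale(150, train_min=0, train_max=1000)
outside = min_max_scale(2000, train_min=0, train_max=1000)

print(f"150 followers  -> {inside}")
print(f"2000 followers -> {outside}")
```

A value beyond the training range lands outside [0, 1]; `MinMaxScaler` behaves the same way unless it was created with `clip=True`, so very large accounts may produce scaled inputs above 1.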

---

## 🔄 Model Updates

To retrain the model:

1. Place new training data in `../data/train_twitter.csv`
2. Run the training notebook: `5_enhanced_training.ipynb`
3. Update this README with new metrics

---

## 📧 Contact & Support

For questions or issues regarding this model, please refer to the main project documentation.

---

- **Generated:** 2025-11-27 12:08:54
- **Notebook:** `5_enhanced_training.ipynb`
- **Platform:** Twitter