---
language: "en"
license: "apache-2.0"
library_name: "scikit-learn"
tags:
  - "bot-detection"
  - "twitter"
  - "classification"
  - "scikit-learn"
  - "random-forest"
---

# Twitter Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.

- **Model Version:** v2
- **Training Date:** 2025-11-27 12:08:54
- **Framework:** scikit-learn 1.5.2
- **Algorithm:** Random Forest Classifier with GridSearchCV hyperparameter tuning

---

## 📊 Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.8771 (87.71%) |
| **Precision**         | 0.8595 (85.95%) |
| **Recall**            | 0.7558 (75.58%) |
| **F1-Score**          | 0.8043 (80.43%) |
| **ROC-AUC**           | 0.9354 (93.54%) |
| **Average Precision** | 0.9008 (90.08%) |

### Model Improvement

- **Baseline ROC-AUC:** 0.9314
- **Tuned ROC-AUC:** 0.9354
- **Improvement:** +0.0040 absolute (0.43% relative to the baseline)
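
The improvement figure mixes an absolute difference and a percentage, so it helps to see how the two relate. The short sketch below recomputes both from the two ROC-AUC values reported above; the variable names are illustrative only.

```python
# ROC-AUC values as reported in this README
baseline_auc = 0.9314
tuned_auc = 0.9354

# Absolute gain is the plain difference between the two scores
absolute_gain = tuned_auc - baseline_auc

# Relative gain expresses that difference as a percentage of the baseline
relative_gain_pct = absolute_gain / baseline_auc * 100

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain_pct:.2f}%")
```

The percentage is therefore a relative gain over the baseline, not a gain of 0.43 percentage points.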

---

## 🗂️ Files

| File                           | Description                            |
| ------------------------------ | -------------------------------------- |
| `twitter_bot_detection_v2.pkl` | Trained Random Forest model            |
| `twitter_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `twitter_features_v2.json`     | List of features used by the model     |
| `twitter_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                      | All visualization plots (13 images)    |
| `README.md`                    | This file                              |

---

## 🎯 Dataset Information

### Training Configuration

- **Training Samples:** 29,951
- **Test Samples:** 7,487
- **Total Samples:** 37,438
- **Number of Features:** 12
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

**Training Set:**

- Human (0): 20,028 (66.87%)
- Bot (1): 9,923 (33.13%)

**Test Set:**

- Human (0): 4,985 (66.58%)
- Bot (1): 2,502 (33.42%)
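
The selected hyperparameters listed later in this README include `class_weight: balanced`. scikit-learn documents that mode as `n_samples / (n_classes * np.bincount(y))`. As a rough sketch, the same formula applied by hand to the training counts above gives the per-class weights; the dictionary keys are illustrative labels, not names from the training notebook.

```python
# Training-set class counts as reported above
train_counts = {"human": 20_028, "bot": 9_923}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "balanced" weighting: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label}: {weight:.4f}")
```

Under this formula the minority (bot) class receives roughly twice the weight of the human class, which compensates for the 2:1 imbalance.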

---

## 🔧 Features (12)

1. `has_custom_cover_image`
2. `description_length`
3. `favourites_count`
4. `followers_count`
5. `friends_count`
6. `followers_to_friends_ratio`
7. `has_location`
8. `username_digit_count`
9. `username_length`
10. `statuses_count`
11. `is_verified`
12. `account_age_days`
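
This README does not describe how the derived features are computed, so the following is only a sketch of one plausible derivation. The raw field names (`screen_name`, `created_at`, `location`, `description`), the zero-friends fallback, and the function `derive_features` itself are assumptions for illustration and may differ from the training notebook.

```python
from datetime import datetime, timezone


def derive_features(profile: dict, now: datetime) -> dict:
    """Derive a few of the listed features from a raw profile (hypothetical schema)."""
    username = profile["screen_name"]
    friends = profile["friends_count"]
    return {
        "username_length": len(username),
        "username_digit_count": sum(ch.isdigit() for ch in username),
        # Assumption: ratio falls back to 0.0 when the account follows nobody
        "followers_to_friends_ratio": (
            profile["followers_count"] / friends if friends else 0.0
        ),
        "account_age_days": (now - profile["created_at"]).days,
        "has_location": int(bool(profile.get("location"))),
        "description_length": len(profile.get("description") or ""),
    }


example_profile = {
    "screen_name": "user12345",
    "followers_count": 10,
    "friends_count": 40,
    "created_at": datetime(2020, 1, 1, tzinfo=timezone.utc),
    "location": "",
    "description": None,
}
features = derive_features(example_profile, now=datetime(2021, 1, 1, tzinfo=timezone.utc))
print(features)
```

Whatever derivation is used at inference time must match the one used during training, otherwise the scaler and model will see values on a different scale.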

---

## 🏆 Top 5 Most Important Features

1. **followers_count** - 0.1895
2. **favourites_count** - 0.1813
3. **friends_count** - 0.1494
4. **statuses_count** - 0.1244
5. **account_age_days** - 0.1010
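
Random Forest importances in scikit-learn are normalized to sum to 1, so the listed values can be read as shares. A small sketch, using only the numbers above, shows how much of the total the top five features cover:

```python
# Importances of the five top-ranked features, as listed above
top_importances = {
    "followers_count": 0.1895,
    "favourites_count": 0.1813,
    "friends_count": 0.1494,
    "statuses_count": 0.1244,
    "account_age_days": 0.1010,
}

# Combined share of the top five; the other seven features hold the rest
top5_share = sum(top_importances.values())
remaining_share = 1.0 - top5_share

print(f"Top-5 share: {top5_share:.4f}")
print(f"Remaining 7 features: {remaining_share:.4f}")
```

All five are activity or account-history counts; the profile and username features together account for the remaining quarter.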

---

## ⚙️ Hyperparameters

### Best Parameters (from GridSearchCV)

- **class_weight:** balanced
- **max_depth:** 20
- **max_features:** sqrt
- **min_samples_leaf:** 1
- **min_samples_split:** 2
- **n_estimators:** 300

### Parameter Search Space

- **n_estimators:** [100, 200, 300]
- **max_depth:** [10, 15, 20, None]
- **min_samples_split:** [2, 5, 10]
- **min_samples_leaf:** [1, 2, 4]
- **max_features:** ['sqrt', 'log2']
- **bootstrap:** [True, False]

**Total combinations tested:** 540
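
As a sanity check, the size of a grid is the product of the number of options per parameter. The sketch below enumerates the search space exactly as listed in this section; note that it does not include `class_weight`, which appears among the best parameters, so the grid actually used by the notebook may have differed from the one shown here.

```python
from math import prod

# Search space as listed in this README
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# Number of distinct parameter combinations in the listed grid
n_combinations = prod(len(options) for options in param_grid.values())

# With 5-fold cross-validation, each combination is fitted 5 times
n_fits = n_combinations * 5

print(f"Combinations: {n_combinations}")
print(f"Fits with 5-fold CV: {n_fits}")
```

Taken at face value, the listed grid yields 432 combinations, which does not match the stated total of 540; the training notebook is the authoritative source for the grid that was actually searched.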

---

## 📈 Cross-Validation Results

### Mean Scores (5-Fold Stratified CV)

- **Accuracy:** 0.8750 (±0.0053)
- **Precision:** 0.8658 (±0.0089)
- **Recall:** 0.7368 (±0.0113)
- **F1-Score:** 0.7961 (±0.0092)
- **ROC-AUC:** 0.9325 (±0.0037)
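
One way to read these numbers is to compare each test-set metric with its cross-validated mean, measured in units of the fold standard deviation. The sketch below assumes the ± values are standard deviations across folds, which this README does not state explicitly.

```python
# (mean, std) from 5-fold CV, as reported above
cv_scores = {
    "accuracy": (0.8750, 0.0053),
    "precision": (0.8658, 0.0089),
    "recall": (0.7368, 0.0113),
    "f1": (0.7961, 0.0092),
    "roc_auc": (0.9325, 0.0037),
}

# Test-set metrics from the "Final Metrics" table
test_scores = {
    "accuracy": 0.8771,
    "precision": 0.8595,
    "recall": 0.7558,
    "f1": 0.8043,
    "roc_auc": 0.9354,
}

# Distance of each test score from the CV mean, in standard deviations
deviations = {
    name: (test_scores[name] - mean) / std
    for name, (mean, std) in cv_scores.items()
}

for name, z in deviations.items():
    print(f"{name}: {z:+.2f} std from CV mean")
```

Every test metric lands within two standard deviations of its cross-validated mean, with recall showing the largest gap.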

---

## 🖼️ Visualizations

All visualizations are saved in the `images/` directory:

1. **01_class_distribution.png** - Training/Test set class distribution
2. **02_feature_correlation.png** - Feature correlation with target variable
3. **03_correlation_matrix.png** - Feature correlation heatmap
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
5. **05_baseline_roc_curve.png** - Baseline ROC curve
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
7. **07_baseline_feature_importance.png** - Baseline feature importance
8. **08_cross_validation.png** - Cross-validation score distribution
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
10. **10_tuned_roc_curve.png** - Tuned ROC curve
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
12. **12_tuned_feature_importance.png** - Tuned feature importance
13. **13_model_comparison.png** - Baseline vs Tuned comparison

---

## 🚀 Usage Example

```python
import joblib
import pandas as pd

# Load model and scaler
model = joblib.load('twitter_bot_detection_v2.pkl')
scaler = joblib.load('twitter_scaler_v2.pkl')

# Prepare your data: illustrative RAW (unscaled) values for one account.
# The scaler below maps them to [0, 1]; do not pre-scale them yourself.
data = {
    'has_custom_cover_image': 1,
    'description_length': 80,
    'favourites_count': 1500,
    'followers_count': 320,
    'friends_count': 410,
    'followers_to_friends_ratio': 0.78,
    'has_location': 1,
    'username_digit_count': 2,
    'username_length': 11,
    'statuses_count': 2400,
    'is_verified': 0,
    'account_age_days': 1825,
}

# Create DataFrame
df = pd.DataFrame([data])

# Scale features
df_scaled = scaler.transform(df)

# Predict
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```
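
The scaler and model operate on column position, so input features need to arrive in the order used during training. The helper below is a minimal sketch of a guard for that; it assumes the training order matches the numbered feature list in this README, and `to_feature_row` is a hypothetical name, not part of the shipped artifacts.

```python
# Assumption: this order matches the one used when the scaler was fitted
FEATURE_ORDER = [
    "has_custom_cover_image",
    "description_length",
    "favourites_count",
    "followers_count",
    "friends_count",
    "followers_to_friends_ratio",
    "has_location",
    "username_digit_count",
    "username_length",
    "statuses_count",
    "is_verified",
    "account_age_days",
]


def to_feature_row(record: dict, feature_order=FEATURE_ORDER) -> list:
    """Return the record's values as floats in training order, or fail loudly."""
    missing = [name for name in feature_order if name not in record]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(record[name]) for name in feature_order]


# Keys supplied in reverse order still come out in training order
record = {name: index for index, name in enumerate(reversed(FEATURE_ORDER))}
row = to_feature_row(record)
print(row)
```

If `twitter_features_v2.json` stores the feature names in training order, loading the list from that file would be preferable to hard-coding it.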

---

## 📋 Confusion Matrix Breakdown

### Tuned Model (Test Set)

```
                Predicted
              Human    Bot
Actual Human     4676     309
       Bot        611    1891
```

- **True Negatives (TN):** 4,676 (Correctly identified humans)
- **False Positives (FP):** 309 (Humans incorrectly classified as bots)
- **False Negatives (FN):** 611 (Bots incorrectly classified as humans)
- **True Positives (TP):** 1,891 (Correctly identified bots)
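
The headline metrics can be recomputed from these four counts, which is a useful consistency check. The sketch below uses the standard definitions with the bot class as positive; only the counts come from this README.

```python
# Confusion-matrix counts for the tuned model on the test set
tn, fp, fn, tp = 4676, 309, 611, 1891

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # share of predicted bots that are bots
recall = tp / (tp + fn)             # share of actual bots that are caught
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of humans flagged as bots

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
print(f"FPR:       {false_positive_rate:.4f}")
```

The results agree with the Final Metrics table to four decimal places.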

---

## 🔍 Model Interpretation

### Strengths

- High ROC-AUC score (0.9354) indicates excellent discrimination capability
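- High precision on the bot class (0.8595) with a low false-positive rate: 309 of 4,985 humans (6.2%) are flagged as bots. Recall is lower (0.7558), so roughly one in four bots is missed
- Stable cross-validation performance: the spread across the 5 folds is at most ±0.0113 on every metric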

### Key Insights

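1. The five most important features (`followers_count`, `favourites_count`, `friends_count`, `statuses_count`, `account_age_days`) account for about 75% of total feature importance
2. GridSearchCV raised ROC-AUC from 0.9314 to 0.9354, a 0.43% relative gain over the baseline
3. Test-set ROC-AUC (0.9354) is in line with the cross-validated mean (0.9325 ±0.0037), which suggests the model generalizes to unseen data from the same distribution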

---

## 📝 Notes

- **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
- **Missing Values:** Filled with 0 during preprocessing
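- **Class Balance:** Imbalanced dataset (about 67% human, 33% bot); the selected model uses `class_weight: balanced` to compensate
- **Model Type:** Random Forest, a bagging ensemble of decision trees that is generally less prone to overfitting than a single tree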
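
For reference, min-max scaling maps each feature with `(x - min) / (max - min)`, where `min` and `max` come from the training data. The sketch below applies that formula by hand and fills missing values with 0 first, as described above; the bounds used in the example are hypothetical, and the zero-range fallback is a simplification rather than a description of scikit-learn's internals.

```python
def min_max_scale(value, data_min, data_max):
    """Scale one value to the training range [data_min, data_max]."""
    # Missing values are filled with 0 before scaling
    if value is None:
        value = 0
    span = data_max - data_min
    if span == 0:
        # Simplification: a constant feature is mapped to 0.0
        return 0.0
    return (value - data_min) / span


# Hypothetical training bounds for followers_count
followers_min, followers_max = 0, 1000

inside = min_max_scale(250, followers_min, followers_max)
outside = min_max_scale(1500, followers_min, followers_max)
missing = min_max_scale(None, followers_min, followers_max)

print(inside, outside, missing)
```

Values outside the range seen during training scale to outside [0, 1]; `MinMaxScaler` does not clip them unless it was created with `clip=True`.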

---

## 🔄 Model Updates

To retrain the model:

1. Place new training data in `../data/train_twitter.csv`
2. Run the training notebook: `5_enhanced_training.ipynb`
3. Update this README with new metrics

---

## 📧 Contact & Support

For questions or issues regarding this model, please refer to the main project documentation.

---

- **Generated:** 2025-11-27 12:08:54
- **Notebook:** `5_enhanced_training.ipynb`
- **Platform:** Twitter