	---
	language: "en"
	license: "apache-2.0"
	library_name: "scikit-learn"
	tags:
	- "bot-detection"
	- "twitter"
	- "classification"
	- "scikit-learn"
	- "random-forest"
	---

	# Twitter Bot Detection Model

	## Overview

	This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.

	- Model Version: v2
	- Training Date: 2025-11-27 12:08:54
	- Framework: scikit-learn 1.5.2
	- Algorithm: Random Forest Classifier with GridSearchCV Hyperparameter Tuning

	---

	## 📊 Model Performance

	### Final Metrics (Test Set)

	| Metric                | Score           |
	| --------------------- | --------------- |
	| Accuracy              | 0.8771 (87.71%) |
	| Precision             | 0.8595 (85.95%) |
	| Recall                | 0.7558 (75.58%) |
	| F1-Score              | 0.8043 (80.43%) |
	| ROC-AUC               | 0.9354 (93.54%) |
	| Average Precision     | 0.9008 (90.08%) |

	### Model Improvement

	- Baseline ROC-AUC: 0.9314
	- Tuned ROC-AUC: 0.9354
	- Improvement: 0.0040 absolute (0.43% relative to the baseline)
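
The improvement figures follow directly from the two ROC-AUC values. A quick check, assuming the percentage is measured relative to the baseline score:

```python
baseline_auc = 0.9314
tuned_auc = 0.9354

# Absolute gain in ROC-AUC points
absolute_gain = tuned_auc - baseline_auc

# Gain expressed as a percentage of the baseline score (assumed interpretation)
relative_gain_pct = absolute_gain / baseline_auc * 100

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain_pct:.2f}%")
```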

	---

	## 🗂️ Files

	| File                           | Description                            |
	| ------------------------------ | -------------------------------------- |
	| `twitter_bot_detection_v2.pkl` | Trained Random Forest model            |
	| `twitter_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
	| `twitter_features_v2.json`     | List of features used by the model     |
	| `twitter_metrics_v2.txt`       | Detailed performance metrics report    |
	| `images/`                      | All visualization plots (13 images)    |
	| `README.md`                    | This file                              |

	---

	## 🎯 Dataset Information

	### Training Configuration

	- Training Samples: 29,951
	- Test Samples: 7,487
	- Total Samples: 37,438
	- Number of Features: 12
	- Cross-Validation Folds: 5
	- Random State: 42

	### Class Distribution

	Training Set:

	- Human (0): 20,028 (66.87%)
	- Bot (1): 9,923 (33.13%)

	Test Set:

	- Human (0): 4,985 (66.58%)
	- Bot (1): 2,502 (33.42%)
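
Because the tuned model uses `class_weight: balanced`, the training-set counts above determine how strongly each class is weighted. The sketch below applies scikit-learn's documented "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the full training set. Weights computed inside each cross-validation fold would differ slightly.

```python
# Training-set class counts from this README (0 = Human, 1 = Bot)
train_counts = {0: 20_028, 1: 9_923}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "balanced" heuristic: n_samples / (n_classes * class_count)
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

The minority class (Bot) receives roughly twice the weight of the majority class, which offsets the 2:1 imbalance during training.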

	---

	## 🔧 Features (12)

	1. `has_custom_cover_image`
	2. `description_length`
	3. `favourites_count`
	4. `followers_count`
	5. `friends_count`
	6. `followers_to_friends_ratio`
	7. `has_location`
	8. `username_digit_count`
	9. `username_length`
	10. `statuses_count`
	11. `is_verified`
	12. `account_age_days`
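
This README does not describe how the features are computed. As a rough illustration only, several of them could be derived from raw profile fields as sketched below. The function name, its arguments, and the handling of a zero friends count are assumptions, not the training notebook's actual logic.

```python
def derive_profile_features(username, description, location,
                            followers_count, friends_count):
    """Hypothetical derivation of five of the listed features."""
    description = description or ""
    location = location or ""
    return {
        "username_length": len(username),
        "username_digit_count": sum(ch.isdigit() for ch in username),
        "description_length": len(description),
        "has_location": int(bool(location.strip())),
        # Assumption: a zero friends count maps to a ratio of 0.0
        # to avoid division by zero.
        "followers_to_friends_ratio": (
            followers_count / friends_count if friends_count else 0.0
        ),
    }


# Made-up account, for illustration only
features = derive_profile_features("user12345", "Coffee and code", "", 120, 480)
print(features)
```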

	---

	## 🏆 Top 5 Most Important Features

	1. `followers_count` - 0.1895
	2. `favourites_count` - 0.1813
	3. `friends_count` - 0.1494
	4. `statuses_count` - 0.1244
	5. `account_age_days` - 0.1010

	---

	## ⚙️ Hyperparameters

	### Best Parameters (from GridSearchCV)

	- class_weight: balanced
	- max_depth: 20
	- max_features: sqrt
	- min_samples_leaf: 1
	- min_samples_split: 2
	- n_estimators: 300

	### Parameter Search Space

	- n_estimators: [100, 200, 300]
	- max_depth: [10, 15, 20, None]
	- min_samples_split: [2, 5, 10]
	- min_samples_leaf: [1, 2, 4]
	- max_features: ['sqrt', 'log2']
	- bootstrap: [True, False]

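	Total combinations tested: 540 (as reported by the training run; the grid listed above enumerates 3 × 4 × 3 × 3 × 2 × 2 = 432 combinations, so this figure may count something other than unique combinations, such as individual cross-validation fits)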
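
The size of an exhaustive grid is the product of the number of options per parameter. The sketch below counts the combinations in the search space exactly as listed above. It does not reproduce the training run, and the dictionary name `param_grid` is only illustrative.

```python
from math import prod

# Search space as listed in this README
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# Number of unique parameter combinations in an exhaustive grid
n_combinations = prod(len(values) for values in param_grid.values())

# Each combination is fitted once per cross-validation fold
cv_folds = 5
n_fits = n_combinations * cv_folds

print(f"Combinations: {n_combinations}, fits with {cv_folds}-fold CV: {n_fits}")
```

The listed grid yields 432 combinations, which does not match the reported total of 540. The grid used in the notebook may therefore differ from the one documented here.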

	---

	## 📈 Cross-Validation Results

	### Mean Scores (5-Fold Stratified CV)

	- Accuracy: 0.8750 (±0.0053)
	- Precision: 0.8658 (±0.0089)
	- Recall: 0.7368 (±0.0113)
	- F1-Score: 0.7961 (±0.0092)
	- ROC-AUC: 0.9325 (±0.0037)

	---

	## 🖼️ Visualizations

	All visualizations are saved in the `images/` directory:

	1. 01_class_distribution.png - Training/Test set class distribution
	2. 02_feature_correlation.png - Feature correlation with target variable
	3. 03_correlation_matrix.png - Feature correlation heatmap
	4. 04_baseline_confusion_matrix.png - Baseline model confusion matrix
	5. 05_baseline_roc_curve.png - Baseline ROC curve
	6. 06_baseline_precision_recall.png - Baseline Precision-Recall curve
	7. 07_baseline_feature_importance.png - Baseline feature importance
	8. 08_cross_validation.png - Cross-validation score distribution
	9. 09_tuned_confusion_matrix.png - Tuned model confusion matrix
	10. 10_tuned_roc_curve.png - Tuned ROC curve
	11. 11_tuned_precision_recall.png - Tuned Precision-Recall curve
	12. 12_tuned_feature_importance.png - Tuned feature importance
	13. 13_model_comparison.png - Baseline vs Tuned comparison

	---

	## 🚀 Usage Example

	```python
	import joblib
	import pandas as pd

	# Load the trained model and the fitted scaler
	model = joblib.load('twitter_bot_detection_v2.pkl')
	scaler = joblib.load('twitter_scaler_v2.pkl')

	# Prepare your data (example).
	# The 0.5 values are placeholders: replace them with the raw (unscaled)
	# feature values of the account. Keys follow the feature order listed above.
	data = {
	    'has_custom_cover_image': 0.5,
	    'description_length': 0.5,
	    'favourites_count': 0.5,
	    'followers_count': 0.5,
	    'friends_count': 0.5,
	    'followers_to_friends_ratio': 0.5,
	    'has_location': 0.5,
	    'username_digit_count': 0.5,
	    'username_length': 0.5,
	    'statuses_count': 0.5,
	    'is_verified': 0.5,
	    'account_age_days': 0.5,
	}

	# Create a single-row DataFrame
	df = pd.DataFrame([data])

	# Scale features with the scaler fitted during training
	df_scaled = scaler.transform(df)

	# Predict the class label and the class probabilities
	# (index 0 = Human, index 1 = Bot)
	prediction = model.predict(df_scaled)[0]
	probability = model.predict_proba(df_scaled)[0]

	print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
	print(f"Bot Probability: {probability[1]:.4f}")
	print(f"Human Probability: {probability[0]:.4f}")
	```
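
`model.predict` effectively labels an account as a bot when its bot probability exceeds 0.5. Since recall on the bot class (0.7558) is lower than precision (0.8595), some applications may prefer a lower cut-off. A minimal sketch of applying a custom threshold to `predict_proba` output follows. The helper name, the probabilities, and the 0.4 threshold are made up for illustration and are not tuned values.

```python
def label_with_threshold(bot_probabilities, threshold=0.5):
    """Label each account 'Bot' if its bot probability exceeds the threshold."""
    return ["Bot" if p > threshold else "Human" for p in bot_probabilities]


# Hypothetical bot probabilities, e.g. model.predict_proba(X)[:, 1]
example_probabilities = [0.12, 0.44, 0.51, 0.93]

default_labels = label_with_threshold(example_probabilities)
lenient_labels = label_with_threshold(example_probabilities, threshold=0.4)

print(default_labels)
print(lenient_labels)
```

Lowering the threshold generally catches more bots at the cost of more humans being flagged, so any chosen value should be validated on held-out data.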

	---

	## 📋 Confusion Matrix Breakdown

	### Tuned Model (Test Set)

	```
	                 Predicted
	                 Human    Bot
	Actual  Human     4676    309
	        Bot        611   1891
	```

	- True Negatives (TN): 4,676 (Correctly identified humans)
	- False Positives (FP): 309 (Humans incorrectly classified as bots)
	- False Negatives (FN): 611 (Bots incorrectly classified as humans)
	- True Positives (TP): 1,891 (Correctly identified bots)
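
The headline test-set metrics can be recomputed from these four counts, with Bot as the positive class. A short check:

```python
# Confusion matrix counts from the tuned model (test set)
tn, fp, fn, tp = 4676, 309, 611, 1891

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted bots that are bots
recall = tp / (tp + fn)      # share of actual bots that are detected
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```

The results agree with the Final Metrics table: 0.8771, 0.8595, 0.7558, and 0.8043.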

	---

	## 🔍 Model Interpretation

	### Strengths

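	- High ROC-AUC score (0.9354) indicates strong discrimination between bot and human accounts
	- High precision on the bot class (0.8595): only 309 of 4,985 humans (6.2%) are misclassified as bots. Recall is lower (0.7558), so about 24% of bots (611 of 2,502) are missed
	- Stable cross-validation performance, with a spread of at most ±0.0113 across folds on every metric

	### Key Insights

	1. The five most important features (`followers_count`, `favourites_count`, `friends_count`, `statuses_count`, `account_age_days`) account for about 75% of the total feature importance
	2. GridSearchCV tuning raised ROC-AUC from 0.9314 to 0.9354, a relative improvement of 0.43%
	3. Test-set metrics are close to the cross-validation means (for example, ROC-AUC 0.9354 vs 0.9325), which suggests the model generalizes to unseen data from the same distribution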

	---

	## 📝 Notes

	- Feature Scaling: All features are scaled using MinMaxScaler to [0, 1] range
	- Missing Values: Filled with 0 during preprocessing
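	- Class Balance: Imbalanced dataset (roughly 67% human, 33% bot); the tuned model uses `class_weight: balanced` to compensate
	- Model Type: Random Forest, an ensemble method that is generally less prone to overfitting than a single decision tree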
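
For reference, min-max scaling maps each feature with `(x - min) / (max - min)`, where `min` and `max` come from the training data. The sketch below mirrors that formula in plain Python. The function name and the example bounds are hypothetical, not the values stored in `twitter_scaler_v2.pkl`.

```python
def min_max_scale(value, data_min, data_max):
    """Scale a value using the min and max observed during training."""
    # A constant feature has zero range; use 1.0 to avoid division by zero,
    # which matches how scikit-learn handles a zero data range.
    span = (data_max - data_min) or 1.0
    return (value - data_min) / span


# Hypothetical training bounds for followers_count: min 0, max 1000
in_range = min_max_scale(250, 0, 1000)
out_of_range = min_max_scale(2000, 0, 1000)

print(in_range)
print(out_of_range)
```

Values outside the training range fall outside [0, 1] at inference time. `MinMaxScaler` only clips them when it is constructed with `clip=True`, and this README does not say whether that option was used.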

	---

	## 🔄 Model Updates

	To retrain the model:

	1. Place new training data in `../data/train_twitter.csv`
	2. Run the training notebook: `5_enhanced_training.ipynb`
	3. Update this README with new metrics

	---

	## 📧 Contact & Support

	For questions or issues regarding this model, please refer to the main project documentation.

	---

	Generated: 2025-11-27 12:08:54
	Notebook: `5_enhanced_training.ipynb`
	Platform: Twitter