Upload folder using huggingface_hub
- .gitattributes +9 -0
- README.md +184 -196
- images/01_class_distribution.png +3 -0
- images/02_feature_correlation.png +3 -0
- images/03_correlation_matrix.png +3 -0
- images/04_baseline_confusion_matrix.png +0 -0
- images/05_baseline_roc_curve.png +3 -0
- images/06_baseline_precision_recall.png +0 -0
- images/07_baseline_feature_importance.png +3 -0
- images/08_cross_validation.png +3 -0
- images/09_tuned_confusion_matrix.png +0 -0
- images/10_tuned_roc_curve.png +3 -0
- images/11_tuned_precision_recall.png +0 -0
- images/12_tuned_feature_importance.png +3 -0
- images/13_model_comparison.png +3 -0
- instagram_bot_detection_v2.pkl +3 -0
- instagram_features_v2.json +12 -0
- instagram_metrics_v2.txt +39 -0
- instagram_model_comparison.csv +7 -0
- instagram_scaler_v2.pkl +3 -0
.gitattributes
CHANGED
|
@@ -1 +1,10 @@
|
|
| 1 |
*.pkl filter=lfs diff=lfs merge=lfs -text
|
| 2 |
+
images/01_class_distribution.png filter=lfs diff=lfs merge=lfs -text
|
| 3 |
+
images/02_feature_correlation.png filter=lfs diff=lfs merge=lfs -text
|
| 4 |
+
images/03_correlation_matrix.png filter=lfs diff=lfs merge=lfs -text
|
| 5 |
+
images/05_baseline_roc_curve.png filter=lfs diff=lfs merge=lfs -text
|
| 6 |
+
images/07_baseline_feature_importance.png filter=lfs diff=lfs merge=lfs -text
|
| 7 |
+
images/08_cross_validation.png filter=lfs diff=lfs merge=lfs -text
|
| 8 |
+
images/10_tuned_roc_curve.png filter=lfs diff=lfs merge=lfs -text
|
| 9 |
+
images/12_tuned_feature_importance.png filter=lfs diff=lfs merge=lfs -text
|
| 10 |
+
images/13_model_comparison.png filter=lfs diff=lfs merge=lfs -text
|
README.md
CHANGED
|
@@ -1,255 +1,243 @@
|
|
| 1 |
-
|
| 2 |
-
language: en
|
| 3 |
-
license: mit
|
| 4 |
-
tags:
|
| 5 |
-
- bot-detection
|
| 6 |
-
- instagram
|
| 7 |
-
- random-forest
|
| 8 |
-
- sklearn
|
| 9 |
-
- social-media
|
| 10 |
-
- classification
|
| 11 |
-
metrics:
|
| 12 |
-
- accuracy
|
| 13 |
-
- precision
|
| 14 |
-
- recall
|
| 15 |
-
- f1
|
| 16 |
-
- roc-auc
|
| 17 |
-
library_name: scikit-learn
|
| 18 |
-
---
|
| 19 |
-
|
| 20 |
-
# Instagram Bot Detection Model
|
| 21 |
-
|
| 22 |
-
## Model Description
|
| 23 |
-
|
| 24 |
-
This Random Forest classifier is designed to detect bot accounts on Instagram based on profile features and behavioral patterns. The model analyzes various account characteristics to determine whether an account is likely automated (bot) or genuine (human).
|
| 25 |
-
|
| 26 |
-
## Model Details
|
| 27 |
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
- **Task**: Binary Classification (Bot vs Human)
|
| 31 |
-
- **Language**: Python
|
| 32 |
-
- **License**: MIT
|
| 33 |
|
| 34 |
-
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
-
|
| 39 |
-
- **Accuracy**: Near-perfect accuracy in distinguishing bots from legitimate accounts
|
| 40 |
|
| 41 |
-
|
| 42 |
|
| 43 |
-
|
| 44 |
|
| 45 |
-
|
| 46 |
|
| 47 |
-
|
| 48 |
-
2. **IsVerified** - Whether the account has a verification badge
|
| 49 |
-
3. **HasProfilePic** - Whether the account has a profile picture
|
| 50 |
-
4. **FollowingCount** - Number of accounts being followed
|
| 51 |
-
5. **FollowerCount** - Number of followers
|
| 52 |
-
6. **HasExternalUrl** - Whether there's an external URL in the profile
|
| 53 |
-
7. **HasBio** - Whether the account has a bio description
|
| 54 |
-
8. **HasPosts** - Whether the account has made any posts
|
| 55 |
-
9. **PostsCount** - Total number of posts
|
| 56 |
-
10. **FollowToFollowerRatio** - Ratio of following to followers
|
| 57 |
-
11. **IsBusinessAccount** - Whether the account is a business account
|
| 58 |
-
12. **HasHighlights** - Whether the account has story highlights
|
| 59 |
|
| 60 |
-
|
| 61 |
|
| 62 |
-
|
| 63 |
|
| 64 |
-
|
| 65 |
-
- Content moderation and platform integrity
|
| 66 |
-
- Research on social media bot behavior
|
| 67 |
-
- Automated account screening
|
| 68 |
-
- Spam detection systems
|
| 69 |
|
| 70 |
-
###
|
| 71 |
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
-
|
| 75 |
-
-
|
| 76 |
|
| 77 |
-
|
| 78 |
|
| 79 |
-
|
| 80 |
|
| 81 |
-
|
| 82 |
-
pip install scikit-learn pandas numpy joblib
|
| 83 |
-
```
|
| 84 |
|
| 85 |
-
|
| 86 |
|
| 87 |
-
|
| 88 |
-
import joblib
|
| 89 |
-
import pandas as pd
|
| 90 |
-
import numpy as np
|
| 91 |
-
from sklearn.preprocessing import MinMaxScaler
|
| 92 |
-
|
| 93 |
-
# Load the model
|
| 94 |
-
model = joblib.load('IG_BOT_Detection_Model_v1.pkl')
|
| 95 |
-
|
| 96 |
-
# Prepare your data
|
| 97 |
-
features = ['IsPrivate', 'IsVerified', 'HasProfilePic', 'FollowingCount',
|
| 98 |
-
'FollowerCount', 'HasExternalUrl', 'HasBio', 'HasPosts',
|
| 99 |
-
'PostsCount', 'FollowToFollowerRatio', 'IsBusinessAccount',
|
| 100 |
-
'HasHighlights']
|
| 101 |
-
|
| 102 |
-
# Example account data
|
| 103 |
-
account_data = {
|
| 104 |
-
'IsPrivate': 0,
|
| 105 |
-
'IsVerified': 0,
|
| 106 |
-
'HasProfilePic': 1,
|
| 107 |
-
'FollowingCount': 7500,
|
| 108 |
-
'FollowerCount': 150,
|
| 109 |
-
'HasExternalUrl': 1,
|
| 110 |
-
'HasBio': 0,
|
| 111 |
-
'HasPosts': 1,
|
| 112 |
-
'PostsCount': 20,
|
| 113 |
-
'FollowToFollowerRatio': 50.0,
|
| 114 |
-
'IsBusinessAccount': 0,
|
| 115 |
-
'HasHighlights': 0
|
| 116 |
-
}
|
| 117 |
|
| 118 |
-
|
| 119 |
-
df = pd.DataFrame([account_data])
|
| 120 |
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
|
| 126 |
-
|
| 127 |
-
prediction = model.predict(df_scaled)
|
| 128 |
-
probability = model.predict_proba(df_scaled)
|
| 129 |
|
| 130 |
-
|
| 131 |
-
print(f"Confidence - Human: {probability[0][0]:.2%}, Bot: {probability[0][1]:.2%}")
|
| 132 |
-
```
|
| 133 |
|
| 134 |
-
###
|
| 135 |
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
|
| 141 |
-
|
| 142 |
-
probabilities = model.predict_proba(accounts_scaled)
|
| 143 |
|
| 144 |
-
|
| 145 |
-
accounts_df['is_bot'] = predictions
|
| 146 |
-
accounts_df['bot_probability'] = probabilities[:, 1]
|
| 147 |
|
| 148 |
-
|
| 149 |
-
likely_bots = accounts_df[accounts_df['bot_probability'] > 0.8]
|
| 150 |
-
```
|
| 151 |
|
| 152 |
-
|
| 153 |
|
| 154 |
-
|
| 155 |
|
| 156 |
-
|
| 157 |
-
- Various account types and behavioral patterns
|
| 158 |
-
- Features extracted from public profile information
|
| 159 |
-
- Diverse account ages and activity levels
|
| 160 |
|
| 161 |
-
|
| 162 |
|
| 163 |
-
|
| 164 |
|
| 165 |
-
|
| 166 |
|
| 167 |
-
|
| 168 |
-
2. Calculation of derived features (e.g., FollowToFollowerRatio)
|
| 169 |
-
3. MinMax normalization of all features to [0, 1] range
|
| 170 |
-
4. Train-test split with stratification to maintain class balance
|
| 171 |
|
| 172 |
-
|
| 173 |
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
|
| 179 |
-
|
| 180 |
|
| 181 |
-
|
| 182 |
|
| 183 |
-
|
| 184 |
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
- Limited to profile-level features; does not analyze content or engagement patterns
|
| 190 |
|
| 191 |
-
|
| 192 |
|
| 193 |
-
|
| 194 |
-
- Could have regional or cultural biases depending on training data composition
|
| 195 |
-
- May misclassify legitimate accounts with unusual behavior patterns
|
| 196 |
-
- Potential bias against new accounts or accounts with low activity
|
| 197 |
|
| 198 |
-
###
|
| 199 |
|
| 200 |
-
-
|
| 201 |
-
-
|
| 202 |
-
-
|
| 203 |
-
-
|
| 204 |
-
- Combine with content analysis and engagement pattern detection
|
| 205 |
|
| 206 |
-
|
| 207 |
|
| 208 |
-
|
| 209 |
-
- Consider privacy implications when analyzing user accounts
|
| 210 |
-
- Ensure compliance with Instagram's terms of service and relevant privacy laws (GDPR, CCPA, etc.)
|
| 211 |
-
- Implement appropriate safeguards against misuse
|
| 212 |
-
- Provide transparency to users about automated detection systems
|
| 213 |
-
- Allow for appeals and manual review processes
|
| 214 |
|
| 215 |
-
|
| 216 |
|
| 217 |
-
|
| 218 |
|
| 219 |
-
|
| 220 |
|
| 221 |
-
|
| 222 |
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
|
| 227 |
-
year={2024},
|
| 228 |
-
publisher={Hugging Face},
|
| 229 |
-
howpublished={\url{https://huggingface.co/your-username/instagram-bot-detection}}
|
| 230 |
-
}
|
| 231 |
-
```
|
| 232 |
|
| 233 |
-
|
| 234 |
|
| 235 |
-
|
| 236 |
-
- [Twitter Bot Detection](https://huggingface.co/your-username/twitter-bot-detection)
|
| 237 |
|
| 238 |
-
|
| 239 |
|
| 240 |
-
|
| 241 |
|
| 242 |
-
##
|
| 243 |
|
| 244 |
-
|
| 245 |
-
- **Last Updated**: November 2024
|
| 246 |
-
- **Status**: Active
|
| 247 |
|
| 248 |
-
|
| 249 |
|
| 250 |
-
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
-
- Support for new Instagram features (Reels, etc.)
|
| 254 |
-
- Integration of content-based features
|
| 255 |
-
- Multi-model ensemble approach
|
| 1 |
+
# Instagram Bot Detection Model
|
| 2 |
|
| 3 |
+
## Overview
|
| 4 |
+
This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.
|
| 5 |
|
| 6 |
+
**Model Version:** v2
|
| 7 |
+
**Training Date:** 2025-11-27 11:38:28
|
| 8 |
+
**Framework:** scikit-learn 1.5.2
|
| 9 |
+
**Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
|
| 10 |
|
| 11 |
+
---
|
| 12 |
|
| 13 |
+
## 📊 Model Performance
|
| 14 |
|
| 15 |
+
### Final Metrics (Test Set)
|
| 16 |
+
| Metric | Score |
|
| 17 |
+
|--------|-------|
|
| 18 |
+
| **Accuracy** | 0.9860 (98.60%) |
|
| 19 |
+
| **Precision** | 0.9918 (99.18%) |
|
| 20 |
+
| **Recall** | 0.9796 (97.96%) |
|
| 21 |
+
| **F1-Score** | 0.9857 (98.57%) |
|
| 22 |
+
| **ROC-AUC** | 0.9990 (99.90%) |
|
| 23 |
+
| **Average Precision** | 0.9990 (99.90%) |
|
| 24 |
|
| 25 |
+
### Model Improvement
|
| 26 |
+
- **Baseline ROC-AUC:** 0.9988
|
| 27 |
+
- **Tuned ROC-AUC:** 0.9990
|
| 28 |
+
- **Improvement:** +0.0002 ROC-AUC (about 0.02% relative to the baseline)
|
| 29 |
|
| 30 |
+
---
|
| 31 |
|
| 32 |
+
## 🗂️ Files
|
| 33 |
|
| 34 |
+
| File | Description |
|
| 35 |
+
|------|-------------|
|
| 36 |
+
| `instagram_bot_detection_v2.pkl` | Trained Random Forest model |
|
| 37 |
+
| `instagram_scaler_v2.pkl` | MinMaxScaler for feature normalization |
|
| 38 |
+
| `instagram_features_v2.json` | List of features used by the model |
|
| 39 |
+
| `instagram_metrics_v2.txt` | Detailed performance metrics report |
|
| 40 |
+
| `images/` | All visualization plots (13 images) |
|
| 41 |
+
| `README.md` | This file |
|
| 42 |
|
| 43 |
+
---
|
| 44 |
|
| 45 |
+
## 🎯 Dataset Information
|
| 46 |
|
| 47 |
+
### Training Configuration
|
| 48 |
+
- **Training Samples:** 4,000
|
| 49 |
+
- **Test Samples:** 1,000
|
| 50 |
+
- **Total Samples:** 5,000
|
| 51 |
+
- **Number of Features:** 10
|
| 52 |
+
- **Cross-Validation Folds:** 5
|
| 53 |
+
- **Random State:** 42
|
| 54 |
|
| 55 |
+
### Class Distribution
|
| 56 |
+
**Training Set:**
|
| 57 |
+
- Human (0): 1,991 (49.78%)
|
| 58 |
+
- Bot (1): 2,009 (50.22%)
|
| 59 |
|
| 60 |
+
**Test Set:**
|
| 61 |
+
- Human (0): 509 (50.90%)
|
| 62 |
+
- Bot (1): 491 (49.10%)
|
| 63 |
|
| 64 |
+
---
|
| 65 |
|
| 66 |
+
## 🔧 Features (10)
|
| 67 |
|
| 68 |
+
1. `profile_pic`
|
| 69 |
+
2. `username_num_ratio`
|
| 70 |
+
3. `username_is_numeric`
|
| 71 |
+
4. `fullname_words`
|
| 72 |
+
5. `fullname_num_ratio`
|
| 73 |
+
6. `is_name_number_only`
|
| 74 |
+
7. `name_equals_username`
|
| 75 |
+
8. `followers`
|
| 76 |
+
9. `follows`
|
| 77 |
+
10. `followers_to_follows_ratio`
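The README names the ten features but does not define how they are computed. The sketch below is one plausible reading of the names, not the author's preprocessing: the function `derive_features`, its arguments, the digit-ratio definition, and the zero-division handling are all assumptions made for illustration.

```python
def derive_features(username, fullname, has_profile_pic, followers, follows):
    """Hypothetical derivation of the ten features from raw profile fields."""

    def digit_ratio(text):
        # Share of characters that are digits; 0.0 for an empty string.
        return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0

    return {
        "profile_pic": int(has_profile_pic),
        "username_num_ratio": digit_ratio(username),
        "username_is_numeric": int(username.isdigit()),
        "fullname_words": len(fullname.split()),
        "fullname_num_ratio": digit_ratio(fullname),
        "is_name_number_only": int(fullname.replace(" ", "").isdigit()),
        "name_equals_username": int(fullname.strip().lower() == username.strip().lower()),
        "followers": followers,
        "follows": follows,
        # Assumed guard against division by zero; the original handling is unknown.
        "followers_to_follows_ratio": followers / follows if follows else 0.0,
    }


# Made-up account used only to exercise the function.
row = derive_features("user12345", "Jane Doe", True, 150, 300)
print(row)
```

The dictionary keys are emitted in the same order as the feature list above, which matters if the row is later passed to the saved scaler.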
|
| 78 |
|
| 79 |
+
---
|
| 80 |
|
| 81 |
+
## 🏆 Top 5 Most Important Features
|
| 82 |
|
| 83 |
+
1. **profile_pic** - 0.3314
|
| 84 |
+
2. **followers** - 0.2313
|
| 85 |
+
3. **username_num_ratio** - 0.1665
|
| 86 |
+
4. **followers_to_follows_ratio** - 0.1308
|
| 87 |
+
5. **follows** - 0.0923
|
| 88 |
|
| 89 |
+
---
|
| 90 |
|
| 91 |
+
## ⚙️ Hyperparameters
|
| 92 |
|
| 93 |
+
### Best Parameters (from GridSearchCV)
|
| 94 |
+
- **class_weight:** balanced
|
| 95 |
+
- **max_depth:** 15
|
| 96 |
+
- **max_features:** sqrt
|
| 97 |
+
- **min_samples_leaf:** 1
|
| 98 |
+
- **min_samples_split:** 2
|
| 99 |
+
- **n_estimators:** 100
|
| 100 |
|
| 101 |
+
### Parameter Search Space
|
| 102 |
+
- **n_estimators:** [100, 200, 300]
|
| 103 |
+
- **max_depth:** [10, 15, 20, None]
|
| 104 |
+
- **min_samples_split:** [2, 5, 10]
|
| 105 |
+
- **min_samples_leaf:** [1, 2, 4]
|
| 106 |
+
- **max_features:** ['sqrt', 'log2']
|
| 107 |
+
- **bootstrap:** [True, False]
|
| 108 |
|
| 109 |
+
**Total combinations tested:** 540
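The size of an exhaustive grid is the product of the number of options per parameter. The check below assumes the six lists above are the complete grid. Under that assumption the count is 432 rather than 540, so the reported figure may count something else. One unconfirmed possibility is 108 combinations of the first four parameters times 5 folds. Note also that `class_weight` appears among the best parameters but not in the listed search space.

```python
from math import prod

# Search space exactly as listed in the README.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# Number of distinct parameter combinations in a full grid.
n_combinations = prod(len(values) for values in param_grid.values())

# Each combination is fitted once per cross-validation fold.
cv_folds = 5
n_fits = n_combinations * cv_folds

print(f"combinations: {n_combinations}, fits: {n_fits}")
```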
|
| 110 |
|
| 111 |
+
---
|
| 112 |
|
| 113 |
+
## 📈 Cross-Validation Results
|
| 114 |
|
| 115 |
+
### Mean Scores (5-Fold Stratified CV)
|
| 116 |
+
- **Accuracy:** 0.9848 (±0.0051)
|
| 117 |
+
- **Precision:** 0.9900 (±0.0066)
|
| 118 |
+
- **Recall:** 0.9796 (±0.0081)
|
| 119 |
+
- **F1-Score:** 0.9847 (±0.0051)
|
| 120 |
+
- **ROC-AUC:** 0.9986 (±0.0011)
|
| 121 |
|
| 122 |
+
---
|
| 123 |
|
| 124 |
+
## 🖼️ Visualizations
|
| 125 |
|
| 126 |
+
All visualizations are saved in the `images/` directory:
|
| 127 |
|
| 128 |
+
1. **01_class_distribution.png** - Training/Test set class distribution
|
| 129 |
+
2. **02_feature_correlation.png** - Feature correlation with target variable
|
| 130 |
+
3. **03_correlation_matrix.png** - Feature correlation heatmap
|
| 131 |
+
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
|
| 132 |
+
5. **05_baseline_roc_curve.png** - Baseline ROC curve
|
| 133 |
+
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
|
| 134 |
+
7. **07_baseline_feature_importance.png** - Baseline feature importance
|
| 135 |
+
8. **08_cross_validation.png** - Cross-validation score distribution
|
| 136 |
+
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
|
| 137 |
+
10. **10_tuned_roc_curve.png** - Tuned ROC curve
|
| 138 |
+
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
|
| 139 |
+
12. **12_tuned_feature_importance.png** - Tuned feature importance
|
| 140 |
+
13. **13_model_comparison.png** - Baseline vs Tuned comparison
|
| 141 |
|
| 142 |
+
---
|
| 143 |
|
| 144 |
+
## 🚀 Usage Example
|
| 145 |
|
| 146 |
+
```python
|
| 147 |
+
import joblib
|
| 148 |
+
import pandas as pd
|
| 149 |
+
import numpy as np
|
| 150 |
|
| 151 |
+
# Load model and scaler
|
| 152 |
+
model = joblib.load('instagram_bot_detection_v2.pkl')
|
| 153 |
+
scaler = joblib.load('instagram_scaler_v2.pkl')
|
| 154 |
+
|
| 155 |
+
# Prepare your data (placeholder values; replace with the account's raw, unscaled features)
|
| 156 |
+
data = {
|
| 157 |
+
'profile_pic': 0.5,
|
| 158 |
+
'username_num_ratio': 0.5,
|
| 159 |
+
'username_is_numeric': 0.5,
|
| 160 |
+
'fullname_words': 0.5,
|
| 161 |
+
'fullname_num_ratio': 0.5,
|
| 162 |
+
'is_name_number_only': 0.5,
|
| 163 |
+
'name_equals_username': 0.5,
|
| 164 |
+
'followers': 0.5,
|
| 165 |
+
'follows': 0.5,
|
| 166 |
+
'followers_to_follows_ratio': 0.5,
|
| 167 |
+
}
|
| 168 |
|
| 169 |
+
# Create DataFrame
|
| 170 |
+
df = pd.DataFrame([data])
|
| 171 |
|
| 172 |
+
# Scale features
|
| 173 |
+
df_scaled = scaler.transform(df)
|
| 174 |
|
| 175 |
+
# Predict
|
| 176 |
+
prediction = model.predict(df_scaled)[0]
|
| 177 |
+
probability = model.predict_proba(df_scaled)[0]
|
| 178 |
|
| 179 |
+
print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
|
| 180 |
+
print(f"Bot Probability: {probability[1]:.4f}")
|
| 181 |
+
print(f"Human Probability: {probability[0]:.4f}")
|
| 182 |
+
```
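A fitted scaler and model expect columns in the order used during training. The sketch below assumes that `instagram_features_v2.json` records that order, which the README does not state explicitly. The helper `to_ordered_row` is hypothetical, and the JSON is inlined here so the example runs without the file.

```python
import json

# Contents of instagram_features_v2.json, inlined for a self-contained example.
features_json = """[
  "profile_pic", "username_num_ratio", "username_is_numeric",
  "fullname_words", "fullname_num_ratio", "is_name_number_only",
  "name_equals_username", "followers", "follows",
  "followers_to_follows_ratio"
]"""
feature_names = json.loads(features_json)


def to_ordered_row(record, feature_names):
    """Return the record's values in feature order, failing loudly on gaps."""
    missing = [name for name in feature_names if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [record[name] for name in feature_names]


# Made-up account with keys in arbitrary order.
record = {
    "followers": 150, "follows": 300, "followers_to_follows_ratio": 0.5,
    "profile_pic": 1, "fullname_words": 2, "fullname_num_ratio": 0.0,
    "username_num_ratio": 0.2, "username_is_numeric": 0,
    "is_name_number_only": 0, "name_equals_username": 0,
}
row = to_ordered_row(record, feature_names)
print(row)
```

With the real files, `feature_names` would come from `json.load` on the features file, and `[row]` would be what gets passed to `scaler.transform`.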
|
| 183 |
|
| 184 |
+
---
|
| 185 |
|
| 186 |
+
## 📋 Confusion Matrix Breakdown
|
| 187 |
|
| 188 |
+
### Tuned Model (Test Set)
|
| 189 |
+
```
|
| 190 |
+
Predicted
|
| 191 |
+
Human Bot
|
| 192 |
+
Actual Human 505 4
|
| 193 |
+
Bot 10 481
|
| 194 |
+
```
|
| 195 |
|
| 196 |
+
- **True Negatives (TN):** 505 (Correctly identified humans)
|
| 197 |
+
- **False Positives (FP):** 4 (Humans incorrectly classified as bots)
|
| 198 |
+
- **False Negatives (FN):** 10 (Bots incorrectly classified as humans)
|
| 199 |
+
- **True Positives (TP):** 481 (Correctly identified bots)
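The headline test-set metrics can be recomputed from these four counts, treating "bot" as the positive class. This is a plain-Python check of the standard formulas; the function name `classification_metrics` is only illustrative.

```python
# Counts copied from the tuned-model confusion matrix (test set).
tn, fp, fn, tp = 505, 4, 10, 481


def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive (bot) class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # of predicted bots, how many are bots
    recall = tp / (tp + fn)      # of actual bots, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, the results agree with the Final Metrics table (0.9860, 0.9918, 0.9796, 0.9857).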
|
| 200 |
|
| 201 |
+
---
|
| 202 |
|
| 203 |
+
## 🔍 Model Interpretation
|
| 204 |
|
| 205 |
+
### Strengths
|
| 206 |
+
- High ROC-AUC score (0.9990) indicates excellent discrimination capability
|
| 207 |
+
- Precision (0.9918) and recall (0.9796) are both high, with only 4 false positives and 10 false negatives on the test set
|
| 208 |
+
- Stable cross-validation performance (ROC-AUC 0.9986 ± 0.0011 across 5 folds)
|
| 209 |
|
| 210 |
+
### Key Insights
|
| 211 |
+
1. The top five features account for about 95% of total feature importance, led by `profile_pic` (0.3314)
|
| 212 |
+
2. GridSearchCV raised ROC-AUC by 0.0002 (about 0.02%) over the baseline; accuracy, precision, recall, and F1-score were unchanged
|
| 213 |
+
3. Test-set accuracy (0.9860) closely matches mean cross-validation accuracy (0.9848), which suggests the model generalizes to held-out data
|
| 214 |
|
| 215 |
+
---
|
| 216 |
|
| 217 |
+
## 📝 Notes
|
| 218 |
|
| 219 |
+
- **Feature Scaling:** All features are scaled to the [0, 1] range with MinMaxScaler
|
| 220 |
+
- **Missing Values:** Filled with 0 during preprocessing
|
| 221 |
+
- **Class Balance:** Roughly balanced (about 50% human and 50% bot in both the training and test sets)
|
| 222 |
+
- **Model Type:** Random Forest, an ensemble method that is generally less prone to overfitting than a single decision tree
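Min-max scaling maps each feature to `(x - min) / (max - min)`, using the minimum and maximum seen in the training data. The pure-Python sketch below illustrates the arithmetic only; it is not the saved `instagram_scaler_v2.pkl`, the training values are made up, and filling missing values with 0 before scaling is an assumption about the order of preprocessing steps.

```python
def fit_min_max(values):
    """Record the minimum and maximum of a training column."""
    return min(values), max(values)


def min_max_scale(value, lo, hi):
    """Scale one value with training bounds lo and hi."""
    if value is None:
        value = 0          # assumed: missing values are filled with 0 first
    if hi == lo:
        return 0.0         # constant column: avoid division by zero
    return (value - lo) / (hi - lo)


followers_train = [0, 50, 150, 1000]   # made-up training column
lo, hi = fit_min_max(followers_train)

scaled = [min_max_scale(v, lo, hi) for v in followers_train]
outside = min_max_scale(2000, lo, hi)  # value beyond the training maximum

print(scaled, outside)
```

A value outside the training range lands outside [0, 1]; scikit-learn's `MinMaxScaler` behaves the same way unless `clip=True` is set.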
|
| 223 |
|
| 224 |
+
---
|
| 225 |
|
| 226 |
+
## 🔄 Model Updates
|
| 227 |
|
| 228 |
+
To retrain the model:
|
| 229 |
+
1. Place new training data in `../data/train_instagram.csv`
|
| 230 |
+
2. Run the training notebook: `5_enhanced_training.ipynb`
|
| 231 |
+
3. Update this README with new metrics
|
| 232 |
|
| 233 |
+
---
|
| 234 |
|
| 235 |
+
## 📧 Contact & Support
|
| 236 |
|
| 237 |
+
For questions or issues regarding this model, please refer to the main project documentation.
|
| 238 |
|
| 239 |
+
---
|
| 240 |
|
| 241 |
+
**Generated:** 2025-11-27 11:38:28
|
| 242 |
+
**Notebook:** `5_enhanced_training.ipynb`
|
| 243 |
+
**Platform:** Instagram
|
images/01_class_distribution.png
ADDED
|
Git LFS Details
|
images/02_feature_correlation.png
ADDED
|
Git LFS Details
|
images/03_correlation_matrix.png
ADDED
|
Git LFS Details
|
images/04_baseline_confusion_matrix.png
ADDED
|
images/05_baseline_roc_curve.png
ADDED
|
Git LFS Details
|
images/06_baseline_precision_recall.png
ADDED
|
images/07_baseline_feature_importance.png
ADDED
|
Git LFS Details
|
images/08_cross_validation.png
ADDED
|
Git LFS Details
|
images/09_tuned_confusion_matrix.png
ADDED
|
images/10_tuned_roc_curve.png
ADDED
|
Git LFS Details
|
images/11_tuned_precision_recall.png
ADDED
|
images/12_tuned_feature_importance.png
ADDED
|
Git LFS Details
|
images/13_model_comparison.png
ADDED
|
Git LFS Details
|
instagram_bot_detection_v2.pkl
ADDED
|
@@ -0,0 +1,3 @@
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:754546a9111c11b3b9e5ebe0cab64613ac81fa25043da96875a13ab47b39060c
|
| 3 |
+
size 1994105
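The three lines above are a Git LFS pointer, not the model itself: they give the spec version, the SHA-256 of the real file, and its size in bytes. A minimal parser, assuming the simple `key value` layout shown here, could look like this (the function name `parse_lfs_pointer` is illustrative):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:754546a9111c11b3b9e5ebe0cab64613ac81fa25043da96875a13ab47b39060c
size 1994105
"""

info = parse_lfs_pointer(pointer)
print(f"{info['hash_algo']} digest, {info['size'] / 1e6:.2f} MB")
```

After downloading the real file, its size and SHA-256 can be compared against these values before calling `joblib.load`.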
|
instagram_features_v2.json
ADDED
|
@@ -0,0 +1,12 @@
|
|
| 1 |
+
[
|
| 2 |
+
"profile_pic",
|
| 3 |
+
"username_num_ratio",
|
| 4 |
+
"username_is_numeric",
|
| 5 |
+
"fullname_words",
|
| 6 |
+
"fullname_num_ratio",
|
| 7 |
+
"is_name_number_only",
|
| 8 |
+
"name_equals_username",
|
| 9 |
+
"followers",
|
| 10 |
+
"follows",
|
| 11 |
+
"followers_to_follows_ratio"
|
| 12 |
+
]
|
instagram_metrics_v2.txt
ADDED
|
@@ -0,0 +1,39 @@
|
|
| 1 |
+
======================================================================
|
| 2 |
+
INSTAGRAM Bot Detection Model - Performance Report
|
| 3 |
+
======================================================================
|
| 4 |
+
|
| 5 |
+
Date: 2025-11-27 11:38:28.480839
|
| 6 |
+
|
| 7 |
+
Training Configuration:
|
| 8 |
+
- Platform: instagram
|
| 9 |
+
- Train samples: 4000
|
| 10 |
+
- Test samples: 1000
|
| 11 |
+
- Features: 10
|
| 12 |
+
- CV Folds: 5
|
| 13 |
+
- Random State: 42
|
| 14 |
+
|
| 15 |
+
Best Hyperparameters:
|
| 16 |
+
- class_weight: balanced
|
| 17 |
+
- max_depth: 15
|
| 18 |
+
- max_features: sqrt
|
| 19 |
+
- min_samples_leaf: 1
|
| 20 |
+
- min_samples_split: 2
|
| 21 |
+
- n_estimators: 100
|
| 22 |
+
|
| 23 |
+
Performance Metrics (Test Set):
|
| 24 |
+
- Accuracy: 0.9860
|
| 25 |
+
- Precision: 0.9918
|
| 26 |
+
- Recall: 0.9796
|
| 27 |
+
- F1: 0.9857
|
| 28 |
+
- Roc_auc: 0.9990
|
| 29 |
+
- Avg_precision: 0.9990
|
| 30 |
+
|
| 31 |
+
Cross-Validation Results:
|
| 32 |
+
- Mean ROC-AUC: 0.9988
|
| 33 |
+
|
| 34 |
+
Feature Importance (Top 5):
|
| 35 |
+
- profile_pic: 0.3314
|
| 36 |
+
- followers: 0.2313
|
| 37 |
+
- username_num_ratio: 0.1665
|
| 38 |
+
- followers_to_follows_ratio: 0.1308
|
| 39 |
+
- follows: 0.0923
|
instagram_model_comparison.csv
ADDED
|
@@ -0,0 +1,7 @@
|
|
| 1 |
+
Metric,Baseline,Tuned,Improvement,Improvement %
|
| 2 |
+
Accuracy,0.986,0.986,0.0,0.0
|
| 3 |
+
Precision,0.9917525773195877,0.9917525773195877,0.0,0.0
|
| 4 |
+
Recall,0.9796334012219959,0.9796334012219959,0.0,0.0
|
| 5 |
+
F1-Score,0.985655737704918,0.985655737704918,0.0,0.0
|
| 6 |
+
ROC-AUC,0.998803612370408,0.9989916733021498,0.00018806093174172922,0.018828619501626963
|
| 7 |
+
Avg Precision,0.9988880328453673,0.9990380665868485,0.00015003374148114812,0.015020075979263841
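The last two columns can be re-derived from the first two. The sketch below parses the CSV with the standard library and recomputes them, assuming `Improvement` is `Tuned - Baseline` and `Improvement %` is that difference relative to the baseline; the stored values are consistent with this reading.

```python
import csv
import io

# Contents of instagram_model_comparison.csv, inlined for a self-contained example.
comparison_csv = """Metric,Baseline,Tuned,Improvement,Improvement %
Accuracy,0.986,0.986,0.0,0.0
Precision,0.9917525773195877,0.9917525773195877,0.0,0.0
Recall,0.9796334012219959,0.9796334012219959,0.0,0.0
F1-Score,0.985655737704918,0.985655737704918,0.0,0.0
ROC-AUC,0.998803612370408,0.9989916733021498,0.00018806093174172922,0.018828619501626963
Avg Precision,0.9988880328453673,0.9990380665868485,0.00015003374148114812,0.015020075979263841
"""

rows = list(csv.DictReader(io.StringIO(comparison_csv)))

recomputed = {}
for row in rows:
    baseline = float(row["Baseline"])
    tuned = float(row["Tuned"])
    improvement = tuned - baseline
    improvement_pct = 100 * improvement / baseline   # relative to baseline
    recomputed[row["Metric"]] = (improvement, improvement_pct)
    print(f"{row['Metric']:<14} {improvement:+.6f} ({improvement_pct:+.4f}%)")
```

Only ROC-AUC and average precision move; the threshold-based metrics are identical for the baseline and tuned models.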
|
instagram_scaler_v2.pkl
ADDED
|
@@ -0,0 +1,3 @@
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b3969a962396dc6ff11bc9767e60de5e75c8b5a07a156b069723e80f49dc48ed
|
| 3 |
+
size 1511
|