nahiar committed on
Commit 2aa2ca7 · verified · 1 Parent(s): 35d1a1d

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +6 -238
README.md CHANGED
@@ -1,243 +1,11 @@
1
- # INSTAGRAM Bot Detection Model
2
-
3
- ## Overview
4
- This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.
5
-
6
- **Model Version:** v2
7
- **Training Date:** 2025-11-27 11:38:28
8
- **Framework:** scikit-learn 1.5.2
9
- **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
10
-
11
- ---
12
-
13
- ## 📊 Model Performance
14
-
15
- ### Final Metrics (Test Set)
16
- | Metric | Score |
17
- |--------|-------|
18
- | **Accuracy** | 0.9860 (98.60%) |
19
- | **Precision** | 0.9918 (99.18%) |
20
- | **Recall** | 0.9796 (97.96%) |
21
- | **F1-Score** | 0.9857 (98.57%) |
22
- | **ROC-AUC** | 0.9990 (99.90%) |
23
- | **Average Precision** | 0.9990 (99.90%) |
24
-
25
- ### Model Improvement
26
- - **Baseline ROC-AUC:** 0.9988
27
- - **Tuned ROC-AUC:** 0.9990
28
- - **Improvement:** 0.0002 (0.02 percentage points)
29
-
30
- ---
31
-
32
- ## 🗂️ Files
33
-
34
- | File | Description |
35
- |------|-------------|
36
- | `instagram_bot_detection_v2.pkl` | Trained Random Forest model |
37
- | `instagram_scaler_v2.pkl` | MinMaxScaler for feature normalization |
38
- | `instagram_features_v2.json` | List of features used by the model |
39
- | `instagram_metrics_v2.txt` | Detailed performance metrics report |
40
- | `images/` | All visualization plots (13 images) |
41
- | `README.md` | This file |
42
-
43
- ---
44
-
45
- ## 🎯 Dataset Information
46
-
47
- ### Training Configuration
48
- - **Training Samples:** 4,000
49
- - **Test Samples:** 1,000
50
- - **Total Samples:** 5,000
51
- - **Number of Features:** 10
52
- - **Cross-Validation Folds:** 5
53
- - **Random State:** 42
54
-
55
- ### Class Distribution
56
- **Training Set:**
57
- - Human (0): 1,991 (49.78%)
58
- - Bot (1): 2,009 (50.22%)
59
-
60
- **Test Set:**
61
- - Human (0): 509 (50.90%)
62
- - Bot (1): 491 (49.10%)
63
-
64
- ---
65
-
66
- ## 🔧 Features (10)
67
-
68
- 1. `profile_pic`
69
- 2. `username_num_ratio`
70
- 3. `username_is_numeric`
71
- 4. `fullname_words`
72
- 5. `fullname_num_ratio`
73
- 6. `is_name_number_only`
74
- 7. `name_equals_username`
75
- 8. `followers`
76
- 9. `follows`
77
- 10. `followers_to_follows_ratio`
78
-
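The README names the ten features but does not say how they are computed. The following is a minimal sketch of one plausible derivation from raw profile fields. The function names (`digit_ratio`, `derive_features`), the input arguments, and every definition below are assumptions for illustration, not the author's preprocessing code.

```python
def digit_ratio(text: str) -> float:
    """Share of characters in `text` that are digits (0.0 for an empty string)."""
    return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0


def derive_features(username: str, fullname: str, has_profile_pic: bool,
                    followers: int, follows: int) -> dict:
    # Assumed definitions: the README does not specify how each feature is built.
    return {
        "profile_pic": int(has_profile_pic),
        "username_num_ratio": digit_ratio(username),
        "username_is_numeric": int(username.isdigit()),
        "fullname_words": len(fullname.split()),
        "fullname_num_ratio": digit_ratio(fullname),
        "is_name_number_only": int(fullname.replace(" ", "").isdigit()),
        "name_equals_username": int(fullname.strip().lower() == username.strip().lower()),
        "followers": followers,
        "follows": follows,
        # Guard against division by zero when the account follows nobody.
        "followers_to_follows_ratio": followers / follows if follows else 0.0,
    }


# Hypothetical account used only to exercise the function.
example = derive_features("user1234", "Jane Doe", True, 150, 300)
print(example)
```

The dictionary keys follow the feature order listed above, so the result can be passed to `pd.DataFrame([example])` in that column order.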
79
- ---
80
-
81
- ## 🏆 Top 5 Most Important Features
82
-
83
- 1. **profile_pic** - 0.3314
84
- 2. **followers** - 0.2313
85
- 3. **username_num_ratio** - 0.1665
86
- 4. **followers_to_follows_ratio** - 0.1308
87
- 5. **follows** - 0.0923
88
-
89
- ---
90
-
91
- ## ⚙️ Hyperparameters
92
-
93
- ### Best Parameters (from GridSearchCV)
94
- - **class_weight:** balanced
95
- - **max_depth:** 15
96
- - **max_features:** sqrt
97
- - **min_samples_leaf:** 1
98
- - **min_samples_split:** 2
99
- - **n_estimators:** 100
100
-
101
- ### Parameter Search Space
102
- - **n_estimators:** [100, 200, 300]
103
- - **max_depth:** [10, 15, 20, None]
104
- - **min_samples_split:** [2, 5, 10]
105
- - **min_samples_leaf:** [1, 2, 4]
106
- - **max_features:** ['sqrt', 'log2']
107
- - **bootstrap:** [True, False]
108
-
109
- **Total combinations tested:** 540
110
-
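As a quick check on the grid size, the sketch below multiplies out the six value lists exactly as written in the search space above. Taken at face value they give 432 combinations rather than 540. The listed space also omits `class_weight`, which appears among the best parameters, so the grid that was actually searched may have differed from the one documented here; the source does not explain the discrepancy. The variable names are illustrative.

```python
from math import prod

# Search space copied from the list above.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# An exhaustive grid search tries every combination of the listed values.
n_combinations = prod(len(values) for values in param_grid.values())

# With 5-fold cross-validation, each combination is fitted once per fold.
n_fits = n_combinations * 5

print(n_combinations, n_fits)
```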
111
- ---
112
-
113
- ## 📈 Cross-Validation Results
114
-
115
- ### Mean Scores (5-Fold Stratified CV)
116
- - **Accuracy:** 0.9848 (±0.0051)
117
- - **Precision:** 0.9900 (±0.0066)
118
- - **Recall:** 0.9796 (±0.0081)
119
- - **F1-Score:** 0.9847 (±0.0051)
120
- - **ROC-AUC:** 0.9986 (±0.0011)
121
-
122
  ---
123
-
124
- ## 🖼️ Visualizations
125
-
126
- All visualizations are saved in the `images/` directory:
127
-
128
- 1. **01_class_distribution.png** - Training/Test set class distribution
129
- 2. **02_feature_correlation.png** - Feature correlation with target variable
130
- 3. **03_correlation_matrix.png** - Feature correlation heatmap
131
- 4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
132
- 5. **05_baseline_roc_curve.png** - Baseline ROC curve
133
- 6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
134
- 7. **07_baseline_feature_importance.png** - Baseline feature importance
135
- 8. **08_cross_validation.png** - Cross-validation score distribution
136
- 9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
137
- 10. **10_tuned_roc_curve.png** - Tuned ROC curve
138
- 11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
139
- 12. **12_tuned_feature_importance.png** - Tuned feature importance
140
- 13. **13_model_comparison.png** - Baseline vs Tuned comparison
141
-
142
- ---
143
-
144
- ## 🚀 Usage Example
145
-
146
- ```python
147
- import joblib
148
- import pandas as pd
149
- import numpy as np
150
-
151
- # Load model and scaler
152
- model = joblib.load('instagram_bot_detection_v2.pkl')
153
- scaler = joblib.load('instagram_scaler_v2.pkl')
154
-
155
- # Prepare raw (unscaled) feature values for one account (illustrative values only)
156
- data = {
157
- 'profile_pic': 1,
158
- 'username_num_ratio': 0.25,
159
- 'username_is_numeric': 0,
160
- 'fullname_words': 2,
161
- 'fullname_num_ratio': 0.0,
162
- 'is_name_number_only': 0,
163
- 'name_equals_username': 0,
164
- 'followers': 150,
165
- 'follows': 300,
166
- 'followers_to_follows_ratio': 0.5,
167
- }
168
-
169
- # Create DataFrame
170
- df = pd.DataFrame([data])
171
-
172
- # Scale features
173
- df_scaled = scaler.transform(df)
174
-
175
- # Predict
176
- prediction = model.predict(df_scaled)[0]
177
- probability = model.predict_proba(df_scaled)[0]
178
-
179
- print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
180
- print(f"Bot Probability: {probability[1]:.4f}")
181
- print(f"Human Probability: {probability[0]:.4f}")
182
- ```
183
-
184
  ---
185
 
186
- ## 📋 Confusion Matrix Breakdown
187
-
188
- ### Tuned Model (Test Set)
189
- ```
190
- Predicted
191
- Human Bot
192
- Actual Human 505 4
193
- Bot 10 481
194
- ```
195
 
196
- - **True Negatives (TN):** 505 (Correctly identified humans)
197
- - **False Positives (FP):** 4 (Humans incorrectly classified as bots)
198
- - **False Negatives (FN):** 10 (Bots incorrectly classified as humans)
199
- - **True Positives (TP):** 481 (Correctly identified bots)
200
-
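The headline test-set metrics can be recomputed from these four counts, which is a useful consistency check. A minimal sketch using only the numbers reported above:

```python
# Confusion matrix counts from the tuned model on the test set.
tn, fp, fn, tp = 505, 4, 10, 481

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of accounts flagged as bots, how many are bots
recall = tp / (tp + fn)      # of actual bots, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The rounded values agree with the Final Metrics table (0.9860, 0.9918, 0.9796, 0.9857).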
201
- ---
202
-
203
- ## 🔍 Model Interpretation
204
-
205
- ### Strengths
206
- - High ROC-AUC score (0.9990) indicates excellent discrimination capability
207
- - High precision (0.9918) and recall (0.9796) for the bot class, with only 4 false positives and 10 false negatives on the test set
208
- - Robust cross-validation performance
209
-
210
- ### Key Insights
211
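- 1. The five most important features account for about 95% of total feature importance, led by `profile_pic` (0.3314)
212
- 2. GridSearchCV raised ROC-AUC from 0.9988 to 0.9990, a marginal gain of 0.0002 over the baseline
213
- 3. Test-set accuracy (0.9860) is in line with the cross-validation mean (0.9848 ±0.0051), which suggests the model generalizes to unseen data from the same distribution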
214
-
215
- ---
216
-
217
- ## 📝 Notes
218
-
219
- - **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
220
- - **Missing Values:** Filled with 0 during preprocessing
221
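- - **Class Balance:** Approximately balanced (about 50% bots in both the training and test sets)
222
- - **Model Type:** Random Forest, an ensemble of decision trees that is less prone to overfitting than a single tree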
223
-
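For reference, min-max scaling maps each feature to [0, 1] using the minimum and maximum seen in the training data. The sketch below shows the arithmetic in plain Python; it illustrates the formula and does not replace the saved `instagram_scaler_v2.pkl`. The function name and the sample follower counts are hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no range; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical follower counts, used only to show the transformation.
followers = [0, 50, 200, 1000]
scaled = min_max_scale(followers)
print(scaled)
```

Because the saved scaler is fitted on the training set, values outside the training range can fall outside [0, 1] at prediction time unless clipping is enabled.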
224
- ---
225
-
226
- ## 🔄 Model Updates
227
-
228
- To retrain the model:
229
- 1. Place new training data in `../data/train_instagram.csv`
230
- 2. Run the training notebook: `5_enhanced_training.ipynb`
231
- 3. Update this README with new metrics
232
-
233
- ---
234
-
235
- ## 📧 Contact & Support
236
-
237
- For questions or issues regarding this model, please refer to the main project documentation.
238
-
239
- ---
240
 
241
- **Generated:** 2025-11-27 11:38:28
242
- **Notebook:** `5_enhanced_training.ipynb`
243
- **Platform:** Instagram
1
  ---
2
+ language: "en"
3
+ license: "apache-2.0"
4
+ created: "2025-11-27T05:32:51.193018Z"
5
  ---
6
 
7
+ # nahiar/instagram-bot-detection
8
 
9
+ A short description of this model.
10
 
11
+ -- Add details for: how to use, training data, limitations, citation, and license.