Commit e7bbc2d by nahiar · verified · 1 Parent(s): a934b91

Initial upload (auto-create if missing)

README.md ADDED
@@ -0,0 +1,290 @@
+ ---
+ language: "en"
+ license: "apache-2.0"
+ created: "2025-12-30T05:32:51.193018Z"
+ tags:
+ - "bot-detection"
+ - "instagram"
+ - "classification"
+ ---
+
+ # INSTAGRAM Bot Detection Model
+
+ ## Overview
+
+ This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.
+
+ **Model Version:** v2
+ **Training Date:** 2025-12-30 11:38:28
+ **Framework:** scikit-learn 1.5.2
+ **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
+
+ ---
+
+ ## 📊 Model Performance
+
+ ### Final Metrics (Test Set)
+
+ | Metric | Score |
+ | --------------------- | --------------- |
+ | **Accuracy** | 0.9788 (97.88%) |
+ | **Precision** | 0.9923 (99.23%) |
+ | **Recall** | 0.9652 (96.52%) |
+ | **F1-Score** | 0.9786 (97.86%) |
+ | **ROC-AUC** | 0.9979 (99.79%) |
+ | **Average Precision** | 0.9923 (99.23%) |
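
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the variable names are illustrative and are not taken from the training code.

```python
# Reported test-set values from the metrics table.
precision = 0.9923
recall = 0.9652

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed from precision and recall: {f1:.4f}")
```

The recomputed value rounds to the reported F1-Score of 0.9786.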
+
+ ### Model Improvement
+
+ - **Baseline ROC-AUC:** 0.9982
+ - **Tuned ROC-AUC:** 0.9979
+
+ ---
+
+ ## 🗂️ Files
+
+ | File | Description |
+ | -------------------------------- | -------------------------------------- |
+ | `instagram_bot_detection_v2.pkl` | Trained Random Forest model |
+ | `instagram_scaler_v2.pkl` | MinMaxScaler for feature normalization |
+ | `instagram_features_v2.json` | List of features used by the model |
+ | `instagram_metrics_v2.txt` | Detailed performance metrics report |
+ | `images/` | All visualization plots (13 images) |
+ | `README.md` | This file |
+
+ ---
+
+ ## 🎯 Dataset Information
+
+ ### Training Configuration
+
+ - **Training Samples:** 4,000
+ - **Test Samples:** 1,000
+ - **Total Samples:** 5,000
+ - **Features:** 10
+ - **Cross-Validation Folds:** 5
+ - **Random State:** 42
+
+ ### Class Distribution
+
+ The model was trained with balanced class weights to handle any class imbalance in the dataset.
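
For reference, scikit-learn documents the `balanced` mode of `class_weight` as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python. The label counts are hypothetical and chosen only to show the effect; the README does not state the real class distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical distribution (NOT the real dataset): 0 = human, 1 = bot.
example_labels = [0] * 3000 + [1] * 1000
weights = balanced_class_weights(example_labels)
print(weights)
```

With a perfectly balanced dataset every weight is 1.0, so the setting has no effect in that case.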
+
+ ---
+
+ ## 🔧 Model Architecture
+
+ ### Algorithm
+
+ **Random Forest Classifier** - An ensemble learning method that operates by constructing multiple decision trees during training and outputting the class that is the mode of the classes (classification) of the individual trees.
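
The description above is the classic majority-vote formulation. The scikit-learn documentation notes that its `RandomForestClassifier` instead averages the per-tree class probabilities. The sketch below contrasts the two rules on hypothetical outputs from three trees; the helper names `hard_vote` and `soft_vote` are illustrative only.

```python
from collections import Counter


def hard_vote(tree_labels):
    """Majority vote: the most common label among the per-tree predictions."""
    return Counter(tree_labels).most_common(1)[0][0]


def soft_vote(tree_probabilities):
    """Average per-tree class probabilities; return (winning class, averaged probabilities)."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    mean = [sum(p[c] for p in tree_probabilities) / n_trees for c in range(n_classes)]
    return mean.index(max(mean)), mean


# Hypothetical per-tree outputs as [P(human), P(bot)].
tree_probs = [[0.55, 0.45], [0.60, 0.40], [0.10, 0.90]]
tree_labels = [p.index(max(p)) for p in tree_probs]

print("hard vote:", hard_vote(tree_labels))
print("soft vote:", soft_vote(tree_probs)[0])
```

In this example two weakly confident trees outvote one strongly confident tree under majority voting, while probability averaging picks the other class.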
+
+ ### Hyperparameters (Tuned via GridSearchCV)
+
+ | Parameter | Value |
+ | ------------------- | -------- |
+ | `n_estimators` | 100 |
+ | `max_depth` | 15 |
+ | `min_samples_split` | 2 |
+ | `min_samples_leaf` | 1 |
+ | `max_features` | sqrt |
+ | `class_weight` | balanced |
+
+ ### Feature Preprocessing
+
+ - **Scaler:** MinMaxScaler (normalizes features to [0, 1] range)
+ - **Missing Values:** Handled during data preprocessing
+ - **Feature Engineering:** Custom features derived from account metadata
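
To illustrate the scaling step listed above, here is a minimal plain-Python sketch of min-max normalization, `(x - min) / (max - min)` per column. It is not the saved scaler itself; the training rows are hypothetical, and mapping constant columns to 0.0 is a simplification made for this sketch.

```python
def fit_min_max(rows):
    """Record the (min, max) of each column of the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def transform_min_max(row, ranges):
    """Scale one row using the training ranges; constant columns map to 0.0 here."""
    scaled = []
    for value, (low, high) in zip(row, ranges):
        span = high - low
        scaled.append((value - low) / span if span else 0.0)
    return scaled


# Hypothetical training rows: [profile_pic, followers].
train_rows = [[0, 100], [1, 300], [1, 1100]]
ranges = fit_min_max(train_rows)

inside = transform_min_max([1, 600], ranges)    # within the training range
outside = transform_min_max([1, 2100], ranges)  # follower count above the training maximum
print(inside, outside)
```

Values outside the range seen during training fall outside [0, 1], which is also how scikit-learn's `MinMaxScaler` behaves unless it is created with `clip=True`.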
+
+ ---
+
+ ## 📈 Feature Importance
+
+ The model uses 10 features to detect bot accounts. Top 5 most important features:
+
+ | Rank | Feature | Importance | Description |
+ | ---- | ---------------------------- | ---------- | ------------------------------------------ |
+ | 1 | `profile_pic` | 0.3500 | Indicates if account has a profile picture |
+ | 2 | `followers` | 0.2587 | Number of followers |
+ | 3 | `username_num_ratio` | 0.2029 | Ratio of numbers in username |
+ | 4 | `follows` | 0.0997 | Number of accounts followed |
+ | 5 | `fullname_words` | 0.0345 | Number of words in full name |
+
+ ### All Features
+
+ 1. `profile_pic` - Profile picture presence
+ 2. `username_num_ratio` - Numeric character ratio in username
+ 3. `username_is_numeric` - Username is entirely numeric
+ 4. `fullname_words` - Number of words in full name
+ 5. `fullname_num_ratio` - Numeric character ratio in full name
+ 6. `is_name_number_only` - Full name contains only numbers
+ 7. `name_equals_username` - Full name matches username
+ 8. `followers` - Follower count
+ 9. `follows` - Following count
+ 10. `followers_to_follows_ratio` - Follower/following ratio
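
The feature list above can be sketched as a small extraction function. This is only one plausible reading of the one-line descriptions: the README does not give exact definitions, so the function name `extract_features`, the case-insensitive name comparison, and the zero-division guard are all assumptions.

```python
def extract_features(username, full_name, has_profile_pic, followers, follows):
    """Build the 10 features from raw account fields (hypothetical definitions)."""

    def digit_ratio(text):
        # Share of characters that are digits; empty strings give 0.0.
        return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0

    return {
        "profile_pic": int(has_profile_pic),
        "username_num_ratio": digit_ratio(username),
        "username_is_numeric": int(username.isdigit()),
        "fullname_words": len(full_name.split()),
        "fullname_num_ratio": digit_ratio(full_name),
        "is_name_number_only": int(full_name.replace(" ", "").isdigit()),
        # Assumption: comparison ignores case and surrounding whitespace.
        "name_equals_username": int(full_name.strip().lower() == username.strip().lower()),
        "followers": followers,
        "follows": follows,
        # Assumption: accounts that follow nobody get a ratio of 0.0.
        "followers_to_follows_ratio": followers / follows if follows else 0.0,
    }


features = extract_features("jane_doe42", "Jane Doe", True, 1200, 300)
print(features)
```

Whatever definitions are used at inference time need to match those used when the model was trained; otherwise the scaled inputs will not be comparable.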
+
+ ---
+
+ ## 🚀 Usage
+
+ ### Prerequisites
+
+ ```bash
+ pip install scikit-learn joblib numpy
+ ```
+
+ ### Loading the Model
+
+ ```python
+ import joblib
+ import numpy as np
+
+ # Load model and scaler
+ model = joblib.load('instagram_bot_detection_v2.pkl')
+ scaler = joblib.load('instagram_scaler_v2.pkl')
+
+ # Example prediction
+ features = np.array([[
+     1,     # profile_pic
+     0.15,  # username_num_ratio
+     0,     # username_is_numeric
+     2,     # fullname_words
+     0.0,   # fullname_num_ratio
+     0,     # is_name_number_only
+     0,     # name_equals_username
+     1200,  # followers
+     300,   # follows
+     4.0    # followers_to_follows_ratio
+ ]])
+
+ # Scale features
+ features_scaled = scaler.transform(features)
+
+ # Make prediction
+ prediction = model.predict(features_scaled)
+ probability = model.predict_proba(features_scaled)
+
+ print(f"Bot: {prediction[0] == 1}")
+ print(f"Probability: {probability[0][1]:.4f}")
+ ```
+
+ ### API Integration
+
+ ```python
+ def predict_instagram_bot(account_data: dict) -> dict:
+     """
+     Predict if an Instagram account is a bot.
+
+     Args:
+         account_data: Dictionary with account features
+
+     Returns:
+         Dictionary with prediction and probability
+     """
+     features = np.array([[
+         account_data['profile_pic'],
+         account_data['username_num_ratio'],
+         account_data['username_is_numeric'],
+         account_data['fullname_words'],
+         account_data['fullname_num_ratio'],
+         account_data['is_name_number_only'],
+         account_data['name_equals_username'],
+         account_data['followers'],
+         account_data['follows'],
+         account_data['followers_to_follows_ratio']
+     ]])
+
+     features_scaled = scaler.transform(features)
+     prediction = model.predict(features_scaled)[0]
+     probability = model.predict_proba(features_scaled)[0]
+
+     return {
+         'is_bot': bool(prediction),
+         'bot_probability': float(probability[1]),
+         'confidence': float(max(probability))
+     }
+ ```
+ ```
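
Both snippets above hard-code the feature order. A minimal sketch of a safer variant builds the row from an explicit feature list and fails early when a key is missing. The helper name `to_feature_row` is hypothetical, and the list is inlined here to keep the example self-contained; in practice it could be read from the features JSON file with `json.load`.

```python
# Feature order as listed in the features file of this repository.
FEATURE_ORDER = [
    "profile_pic",
    "username_num_ratio",
    "username_is_numeric",
    "fullname_words",
    "fullname_num_ratio",
    "is_name_number_only",
    "name_equals_username",
    "followers",
    "follows",
    "followers_to_follows_ratio",
]


def to_feature_row(account_data, feature_order=FEATURE_ORDER):
    """Return the feature values as floats in model order; raise if any are missing."""
    missing = [name for name in feature_order if name not in account_data]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(account_data[name]) for name in feature_order]


example_account = {
    "follows": 300,
    "followers": 1200,
    "followers_to_follows_ratio": 4.0,
    "profile_pic": 1,
    "username_num_ratio": 0.15,
    "username_is_numeric": 0,
    "fullname_words": 2,
    "fullname_num_ratio": 0.0,
    "is_name_number_only": 0,
    "name_equals_username": 0,
}
row = to_feature_row(example_account)
print(row)
```

The resulting list can be wrapped as `np.array([row])` and passed to `scaler.transform` as in the snippets above.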
+
+ ---
+
+ ## 📊 Visualization
+
+ The `images/` directory contains 13 visualization plots:
+
+ 1. **confusion_matrix.png** - Classification confusion matrix
+ 2. **roc_curve.png** - ROC curve with AUC score
+ 3. **precision_recall_curve.png** - Precision-recall trade-off
+ 4. **feature_importance.png** - Feature importance ranking
+ 5. **learning_curve.png** - Model learning curve
+ 6. **class_distribution.png** - Training data class distribution
+ 7. **prediction_distribution.png** - Prediction score distribution
+ 8. **calibration_curve.png** - Probability calibration
+ 9. **cv_scores.png** - Cross-validation scores
+ 10. **top_features.png** - Top 10 features
+ 11. **correlation_matrix.png** - Feature correlation heatmap
+ 12. **threshold_analysis.png** - Classification threshold analysis
+ 13. **model_comparison.png** - Baseline vs tuned model comparison
+
+ ---
+
+ ## 🎓 Model Training
+
+ ### Training Process
+
+ 1. **Data Preprocessing**: Feature engineering and normalization
+ 2. **Train-Test Split**: 80/20 split with stratification
+ 3. **Hyperparameter Tuning**: GridSearchCV with 5-fold cross-validation
+ 4. **Model Selection**: Best parameters based on ROC-AUC score
+ 5. **Evaluation**: Comprehensive metrics on held-out test set
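
The steps above can be sketched with scikit-learn as follows. This is not the original training script: the data is synthetic (`make_classification`), the parameter grid is a small hypothetical one, and fitting the scaler on the training split only is an assumption about how the normalization was done.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real dataset (10 features, binary label).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 80/20 split with stratification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumption: the scaler is fitted on the training split only.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Hypothetical, deliberately small grid; the real search space is not documented.
param_grid = {"n_estimators": [50], "max_depth": [5, 15]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train_scaled, y_train)

# Evaluate the refitted best model on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test_scaled)[:, 1])
print(search.best_params_, round(test_auc, 4))
```

Scores from this sketch say nothing about the real model, since the data is synthetic.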
+
+ ### Cross-Validation
+
+ - **Mean ROC-AUC:** 0.9979
+ - **Folds:** 5
+ - **Strategy:** Stratified K-Fold
+
+ ---
+
+ ## ⚠️ Limitations
+
+ 1. **Data Dependency**: Model performance depends on feature quality and data accuracy
+ 2. **Feature Availability**: All 10 features must be available for prediction
+ 3. **Temporal Drift**: Instagram's platform and bot behavior may change over time
+ 4. **Privacy**: Ensure compliance with Instagram's terms of service when collecting data
+ 5. **Threshold Sensitivity**: Default threshold is 0.5; may need adjustment based on use case
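
To make the threshold point concrete, here is a minimal sketch that applies a custom cut-off to the bot probability and shows how precision and recall move with it. The probabilities and labels are made up for illustration, and the helper names are hypothetical.

```python
def classify(bot_probability, threshold=0.5):
    """Flag an account as a bot when its probability reaches the threshold."""
    return bot_probability >= threshold


def precision_recall(probabilities, labels, threshold):
    """Compute (precision, recall) for the bot class at a given threshold."""
    predicted = [classify(p, threshold) for p in probabilities]
    tp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(predicted, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(predicted, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up scores and ground truth (1 = bot), for illustration only.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.35, 0.5, 0.7):
    print(threshold, precision_recall(probs, labels, threshold))
```

Raising the threshold trades recall for precision, which may suit moderation workflows where false accusations are costly; lowering it does the opposite.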
+
+ ---
+
+ ## 📝 License
+
+ This model is released under the **Apache License 2.0**.
+
+ ---
+
+ ## 🔄 Version History
+
+ - **v2** (2025-11-27): Current version with hyperparameter tuning
+   - ROC-AUC: 0.9979
+   - Accuracy: 97.88%
+   - 10 features
+
+ ---
+
+ ## 📧 Contact & Citation
+
+ If you use this model in your research or application, please cite:
+
+ ```bibtex
+ @misc{instagram-bot-detection-v2,
+ title={Instagram Bot Detection Model v2},
+ author={Nahiar},
+ year={2025},
+ month={November},
+ publisher={Hugging Face},
+ howpublished={\url{https://huggingface.co/nahiar/instagram-bot-detection}}
+ }
+ ```
+
+ ---
+
+ ## 🤝 Contributing
+
+ For issues, improvements, or questions, please contact the model maintainer.
images/01_class_distribution.png ADDED
images/02_feature_correlation.png ADDED
images/03_correlation_matrix.png ADDED
images/04_baseline_confussion_matrix.png ADDED
images/05_baseline_roc_curve.png ADDED
images/06_baseline_precision_recall.png ADDED
images/07_baseline_feature_important.png ADDED
images/08_cross_validation.png ADDED
images/09_tuned_confussion_matrix.png ADDED
images/10_tuned_roc_curve.png ADDED
images/11_tuned_precision_recall.png ADDED
images/12_model_comparison.png ADDED
instagram_bot_detection_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:29fbc4e428d4c5984d733bc494054af84f45ca600ec801e8eb05089faa3ad34d
+ size 1676089
instagram_features.json ADDED
@@ -0,0 +1,12 @@
+ [
+ "profile_pic",
+ "username_num_ratio",
+ "username_is_numeric",
+ "fullname_words",
+ "fullname_num_ratio",
+ "is_name_number_only",
+ "name_equals_username",
+ "followers",
+ "follows",
+ "followers_to_follows_ratio"
+ ]
instagram_features.txt ADDED
@@ -0,0 +1,10 @@
+ profile_pic
+ username_num_ratio
+ username_is_numeric
+ fullname_words
+ fullname_num_ratio
+ is_name_number_only
+ name_equals_username
+ followers
+ follows
+ followers_to_follows_ratio
instagram_metrics.txt ADDED
@@ -0,0 +1,23 @@
+ Model: Instagram Bot Detection
+ Date: 2025-12-29 22:02:58.371023
+
+ ============================================================
+ Performance Metrics
+ ============================================================
+
+ Accuracy: 0.9788
+ Precision: 0.9923
+ Recall: 0.9652
+ F1: 0.9786
+ Roc_auc: 0.9979
+ Avg_precision: 0.9982
+
+ Best Parameters:
+ class_weight: balanced
+ max_depth: 30
+ max_features: sqrt
+ min_samples_leaf: 1
+ min_samples_split: 2
+ n_estimators: 100
+
+ Cross-Validation ROC-AUC: 0.9984 (+/- 0.0013)
instagram_model_comparison.csv ADDED
@@ -0,0 +1,7 @@
+ Metric,Baseline,Tuned,Improvement,Improvement %
+ Accuracy,0.986,0.986,0.0,0.0
+ Precision,0.9917525773195877,0.9917525773195877,0.0,0.0
+ Recall,0.9796334012219959,0.9796334012219959,0.0,0.0
+ F1-Score,0.985655737704918,0.985655737704918,0.0,0.0
+ ROC-AUC,0.998803612370408,0.9989916733021498,0.00018806093174172922,0.018828619501626963
+ Avg Precision,0.9988880328453673,0.9990380665868485,0.00015003374148114812,0.015020075979263841
instagram_scaler.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cde4b5fa22e39296618b8401fd2a3d47f474b2b20051dbdaa2d8c03bf618b197
+ size 1591
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ scikit-learn>=1.3.0
+ pandas>=2.0.0
+ numpy>=1.24.0
+ joblib>=1.3.0