nahiar committed on
Commit f825cc5 · verified · 1 Parent(s): 247cc2b

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +279 -3
README.md CHANGED
@@ -8,8 +8,284 @@ tags:
8
  - "classification"
9
  ---
10
 
11
- # nahiar/instagram-bot-detection
12
 
13
- A short description of this model.
14
 
15
- -- Add details for: how to use, training data, limitations, citation, and license.

8
  - "classification"
9
  ---
10
 
11
+ # Instagram Bot Detection Model
12
 
13
+ ## Overview
14
 
15
+ This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.
16
+
17
+ **Model Version:** v2
18
+ **Training Date:** 2025-11-27 11:38:28
19
+ **Framework:** scikit-learn 1.5.2
20
+ **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
21
+
22
+ ---
23
+
24
+ ## 📊 Model Performance
25
+
26
+ ### Final Metrics (Test Set)
27
+
28
+ | Metric | Score |
29
+ | --------------------- | --------------- |
30
+ | **Accuracy** | 0.9860 (98.60%) |
31
+ | **Precision** | 0.9918 (99.18%) |
32
+ | **Recall** | 0.9796 (97.96%) |
33
+ | **F1-Score** | 0.9857 (98.57%) |
34
+ | **ROC-AUC** | 0.9990 (99.90%) |
35
+ | **Average Precision** | 0.9990 (99.90%) |
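
As a quick consistency check on the table above, the F1-score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. This is only an arithmetic sketch; the variable names are illustrative and not part of the model files.

```python
# Reported test-set metrics, copied from the table above.
precision = 0.9918
recall = 0.9796

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed from precision and recall: {f1:.4f}")
```

The recomputed value agrees with the reported F1-score to four decimal places.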
36
+
37
+ ### Model Improvement
38
+
39
+ - **Baseline ROC-AUC:** 0.9988
40
+ - **Tuned ROC-AUC:** 0.9990
41
+ - **Improvement:** 0.0002 (0.02%)
42
+
43
+ ---
44
+
45
+ ## 🗂️ Files
46
+
47
+ | File | Description |
48
+ | -------------------------------- | -------------------------------------- |
49
+ | `instagram_bot_detection_v2.pkl` | Trained Random Forest model |
50
+ | `instagram_scaler_v2.pkl` | MinMaxScaler for feature normalization |
51
+ | `instagram_features_v2.json` | List of features used by the model |
52
+ | `instagram_metrics_v2.txt` | Detailed performance metrics report |
53
+ | `images/` | All visualization plots (13 images) |
54
+ | `README.md` | This file |
55
+
56
+ ---
57
+
58
+ ## 🎯 Dataset Information
59
+
60
+ ### Training Configuration
61
+
62
+ - **Training Samples:** 4,000
63
+ - **Test Samples:** 1,000
64
+ - **Total Samples:** 5,000
65
+ - **Features:** 10
66
+ - **Cross-Validation Folds:** 5
67
+ - **Random State:** 42
68
+
69
+ ### Class Distribution
70
+
71
+ The model was trained with balanced class weights (`class_weight='balanced'`), which weight each class inversely to its frequency to compensate for any class imbalance in the dataset.
72
+
73
+ ---
74
+
75
+ ## 🔧 Model Architecture
76
+
77
+ ### Algorithm
78
+
79
+ **Random Forest Classifier** - An ensemble learning method that builds many decision trees during training and combines their outputs. The scikit-learn implementation averages the class probabilities predicted by the individual trees and returns the class with the highest mean probability, rather than taking a majority vote.
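
scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes. The toy sketch below uses made-up per-tree probabilities (not outputs of this model) to show how the two rules can disagree.

```python
from collections import Counter

# Hypothetical P(bot) from three trees for one account (illustrative values only).
tree_bot_probs = [0.45, 0.45, 0.90]

# Hard voting: each tree casts one vote for its own most likely class.
votes = Counter("bot" if p >= 0.5 else "human" for p in tree_bot_probs)
hard_vote = votes.most_common(1)[0][0]

# Soft voting: average the probabilities first, then pick the class.
mean_bot_prob = sum(tree_bot_probs) / len(tree_bot_probs)
soft_vote = "bot" if mean_bot_prob >= 0.5 else "human"

print(hard_vote, soft_vote, round(mean_bot_prob, 2))
```

Here two of the three trees lean "human", but the one confident tree pulls the averaged probability above 0.5.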
80
+
81
+ ### Hyperparameters (Tuned via GridSearchCV)
82
+
83
+ | Parameter | Value |
84
+ | ------------------- | -------- |
85
+ | `n_estimators` | 100 |
86
+ | `max_depth` | 15 |
87
+ | `min_samples_split` | 2 |
88
+ | `min_samples_leaf` | 1 |
89
+ | `max_features` | sqrt |
90
+ | `class_weight` | balanced |
91
+
92
+ ### Feature Preprocessing
93
+
94
+ - **Scaler:** MinMaxScaler (rescales each feature to the [0, 1] range observed in the training data)
95
+ - **Missing Values:** Handled during data preprocessing
96
+ - **Feature Engineering:** Custom features derived from account metadata
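
Min-max scaling maps each value to `(x - min) / (max - min)`, using the minimum and maximum seen when the scaler was fitted. The pure-Python sketch below illustrates the formula with a hypothetical training range for `followers`; for real predictions the saved `instagram_scaler_v2.pkl` should be used instead.

```python
def min_max_scale(value: float, train_min: float, train_max: float) -> float:
    """Rescale `value` using the min and max observed on the training data."""
    if train_max == train_min:
        return 0.0  # constant feature: avoid division by zero
    return (value - train_min) / (train_max - train_min)

# Hypothetical training range for `followers` (not taken from the real scaler).
followers_min, followers_max = 0.0, 10_000.0

in_range = min_max_scale(1_200, followers_min, followers_max)
out_of_range = min_max_scale(25_000, followers_min, followers_max)
print(in_range, out_of_range)
```

Values outside the training range scale to numbers outside [0, 1], which is worth keeping in mind for accounts with unusually large follower counts.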
97
+
98
+ ---
99
+
100
+ ## 📈 Feature Importance
101
+
102
+ The model uses 10 features to detect bot accounts. Top 5 most important features:
103
+
104
+ | Rank | Feature | Importance | Description |
105
+ | ---- | ---------------------------- | ---------- | ------------------------------------------ |
106
+ | 1 | `profile_pic` | 0.3314 | Indicates if account has a profile picture |
107
+ | 2 | `followers` | 0.2313 | Number of followers |
108
+ | 3 | `username_num_ratio` | 0.1665 | Ratio of numbers in username |
109
+ | 4 | `followers_to_follows_ratio` | 0.1308 | Ratio of followers to following count |
110
+ | 5 | `follows` | 0.0923 | Number of accounts followed |
111
+
112
+ ### All Features
113
+
114
+ 1. `profile_pic` - Profile picture presence
115
+ 2. `username_num_ratio` - Numeric character ratio in username
116
+ 3. `username_is_numeric` - Username is entirely numeric
117
+ 4. `fullname_words` - Number of words in full name
118
+ 5. `fullname_num_ratio` - Numeric character ratio in full name
119
+ 6. `is_name_number_only` - Full name contains only numbers
120
+ 7. `name_equals_username` - Full name matches username
121
+ 8. `followers` - Follower count
122
+ 9. `follows` - Following count
123
+ 10. `followers_to_follows_ratio` - Follower/following ratio
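
The README does not specify how the derived features are computed, so the sketch below is one plausible reading of the feature names above, not the author's pipeline. The raw input fields (`username`, `fullname`, `has_profile_pic`), the function name, and the zero-division convention for the ratio are all assumptions.

```python
def extract_features(username: str, fullname: str, has_profile_pic: bool,
                     followers: int, follows: int) -> dict:
    """Derive the 10 model features from raw account fields (assumed definitions)."""
    def digit_ratio(text: str) -> float:
        # Share of characters that are digits; 0.0 for an empty string.
        return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0

    return {
        "profile_pic": int(has_profile_pic),
        "username_num_ratio": digit_ratio(username),
        "username_is_numeric": int(username.isdigit()),
        "fullname_words": len(fullname.split()),
        "fullname_num_ratio": digit_ratio(fullname),
        "is_name_number_only": int(fullname.replace(" ", "").isdigit()),
        "name_equals_username": int(fullname.strip().lower() == username.lower()),
        "followers": followers,
        "follows": follows,
        # Assumption: treat the ratio as 0.0 when the account follows nobody.
        "followers_to_follows_ratio": followers / follows if follows else 0.0,
    }

example = extract_features("jane_doe42", "Jane Doe", True, followers=1200, follows=300)
print(example)
```

The dictionary keys are emitted in the same order as the list above, so the values can be passed to the scaler in that order if it matches the training order.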
124
+
125
+ ---
126
+
127
+ ## 🚀 Usage
128
+
129
+ ### Prerequisites
130
+
131
+ ```bash
+ # scikit-learn is pinned to the training version (1.5.2) so the pickled files load reliably
+ pip install scikit-learn==1.5.2 joblib numpy
+ ```
134
+
135
+ ### Loading the Model
136
+
137
+ ```python
+ import joblib
+ import numpy as np
+
+ # Load model and scaler
+ model = joblib.load('instagram_bot_detection_v2.pkl')
+ scaler = joblib.load('instagram_scaler_v2.pkl')
+
+ # Example prediction
+ features = np.array([[
+     1,     # profile_pic
+     0.15,  # username_num_ratio
+     0,     # username_is_numeric
+     2,     # fullname_words
+     0.0,   # fullname_num_ratio
+     0,     # is_name_number_only
+     0,     # name_equals_username
+     1200,  # followers
+     300,   # follows
+     4.0    # followers_to_follows_ratio
+ ]])
+
+ # Scale features
+ features_scaled = scaler.transform(features)
+
+ # Make prediction
+ prediction = model.predict(features_scaled)
+ probability = model.predict_proba(features_scaled)
+
+ print(f"Bot: {prediction[0] == 1}")
+ print(f"Probability: {probability[0][1]:.4f}")
+ ```
169
+
170
+ ### API Integration
171
+
172
+ ```python
+ import numpy as np
+
+ # Assumes `model` and `scaler` were loaded as shown in "Loading the Model".
+ def predict_instagram_bot(account_data: dict) -> dict:
+     """
+     Predict if an Instagram account is a bot.
+
+     Args:
+         account_data: Dictionary with account features
+
+     Returns:
+         Dictionary with prediction and probability
+     """
+     # Values must follow the feature order used during training.
+     features = np.array([[
+         account_data['profile_pic'],
+         account_data['username_num_ratio'],
+         account_data['username_is_numeric'],
+         account_data['fullname_words'],
+         account_data['fullname_num_ratio'],
+         account_data['is_name_number_only'],
+         account_data['name_equals_username'],
+         account_data['followers'],
+         account_data['follows'],
+         account_data['followers_to_follows_ratio']
+     ]])
+
+     features_scaled = scaler.transform(features)
+     prediction = model.predict(features_scaled)[0]
+     probability = model.predict_proba(features_scaled)[0]
+
+     return {
+         'is_bot': bool(prediction),
+         'bot_probability': float(probability[1]),
+         'confidence': float(max(probability))
+     }
+ ```
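
Because the scaler and model expect features in their training order, it can be safer to build the input row from `instagram_features_v2.json` than to hard-code the order. The helper below is a sketch that assumes the JSON file is a flat array of feature names (its actual layout is not documented here); `build_feature_row` is a hypothetical name, and the three-name list is a shortened stand-in for the real file.

```python
import json

def build_feature_row(account_data: dict, feature_names: list) -> list:
    """Return feature values in the given order, failing loudly on missing keys."""
    missing = [name for name in feature_names if name not in account_data]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(account_data[name]) for name in feature_names]

# Assumed file layout: a JSON array of names. With the real file this would be:
# with open("instagram_features_v2.json") as fh:
#     feature_names = json.load(fh)
feature_names = json.loads('["profile_pic", "followers", "follows"]')  # stand-in

row = build_feature_row({"follows": 300, "profile_pic": 1, "followers": 1200}, feature_names)
print(row)
```

The resulting list can be wrapped as `np.array([row])` before calling `scaler.transform`, and the explicit `KeyError` surfaces incomplete inputs before they reach the model.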
206
+
207
+ ---
208
+
209
+ ## 📊 Visualization
210
+
211
+ The `images/` directory contains 13 visualization plots:
212
+
213
+ 1. **confusion_matrix.png** - Classification confusion matrix
214
+ 2. **roc_curve.png** - ROC curve with AUC score
215
+ 3. **precision_recall_curve.png** - Precision-recall trade-off
216
+ 4. **feature_importance.png** - Feature importance ranking
217
+ 5. **learning_curve.png** - Model learning curve
218
+ 6. **class_distribution.png** - Training data class distribution
219
+ 7. **prediction_distribution.png** - Prediction score distribution
220
+ 8. **calibration_curve.png** - Probability calibration
221
+ 9. **cv_scores.png** - Cross-validation scores
222
+ 10. **top_features.png** - Top 10 features
223
+ 11. **correlation_matrix.png** - Feature correlation heatmap
224
+ 12. **threshold_analysis.png** - Classification threshold analysis
225
+ 13. **model_comparison.png** - Baseline vs tuned model comparison
226
+
227
+ ---
228
+
229
+ ## 🎓 Model Training
230
+
231
+ ### Training Process
232
+
233
+ 1. **Data Preprocessing**: Feature engineering and normalization
234
+ 2. **Train-Test Split**: 80/20 split with stratification
235
+ 3. **Hyperparameter Tuning**: GridSearchCV with 5-fold cross-validation
236
+ 4. **Model Selection**: Best parameters based on ROC-AUC score
237
+ 5. **Evaluation**: Comprehensive metrics on held-out test set
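
Step 2 can be pictured with a small pure-Python sketch of a stratified 80/20 split. This is not the training script; scikit-learn offers `train_test_split(..., stratify=y)` for the same purpose. The label counts below are hypothetical, since the real class distribution is not reported here, and `stratified_split` is an illustrative name.

```python
import random
from collections import defaultdict

def stratified_split(labels: list, test_fraction: float = 0.2, seed: int = 42):
    """Split indices so each class keeps about the same share in train and test."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical labels: 5,000 accounts, 40% bots (1) and 60% humans (0).
labels = [1] * 2000 + [0] * 3000
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 5,000 samples this yields the 4,000/1,000 split sizes listed under Training Configuration, while keeping the bot share identical in both parts.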
238
+
239
+ ### Cross-Validation
240
+
241
+ - **Mean ROC-AUC:** 0.9988
242
+ - **Folds:** 5
243
+ - **Strategy:** Stratified K-Fold
244
+
245
+ ---
246
+
247
+ ## ⚠️ Limitations
248
+
249
+ 1. **Data Dependency**: Model performance depends on feature quality and data accuracy
250
+ 2. **Feature Availability**: All 10 features must be available for prediction
251
+ 3. **Temporal Drift**: Instagram's platform and bot behavior may change over time
252
+ 4. **Privacy**: Ensure compliance with Instagram's terms of service when collecting data
253
+ 5. **Threshold Sensitivity**: Default threshold is 0.5; may need adjustment based on use case
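
To illustrate limitation 5, a decision threshold other than 0.5 can be applied to the bot probability returned by `predict_proba`. The function name and the 0.8 threshold below are illustrative assumptions; a suitable value should be chosen from validation data, for example with the help of `threshold_analysis.png`.

```python
def classify_with_threshold(bot_probability: float, threshold: float = 0.5) -> bool:
    """Flag an account as a bot only if its probability reaches the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return bot_probability >= threshold

# A stricter threshold trades recall for precision (fewer false bot flags).
default_flag = classify_with_threshold(0.65)                # default 0.5 threshold
strict_flag = classify_with_threshold(0.65, threshold=0.8)  # hypothetical stricter value
print(default_flag, strict_flag)
```

A higher threshold suits cases where wrongly flagging a real user is costly; a lower one suits cases where missing a bot is the bigger risk.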
254
+
255
+ ---
256
+
257
+ ## 📝 License
258
+
259
+ This model is released under the **Apache License 2.0**.
260
+
261
+ ---
262
+
263
+ ## 🔄 Version History
264
+
265
+ - **v2** (2025-11-27): Current version with hyperparameter tuning
266
+ - ROC-AUC: 0.9990
267
+ - Accuracy: 98.60%
268
+ - 10 features
269
+
270
+ ---
271
+
272
+ ## 📧 Contact & Citation
273
+
274
+ If you use this model in your research or application, please cite:
275
+
276
+ ```bibtex
+ @misc{instagram-bot-detection-v2,
+   title={Instagram Bot Detection Model v2},
+   author={Nahiar},
+   year={2025},
+   month={November},
+   publisher={Hugging Face},
+   howpublished={\url{https://huggingface.co/nahiar/instagram-bot-detection}}
+ }
+ ```
286
+
287
+ ---
288
+
289
+ ## 🤝 Contributing
290
+
291
+ For issues, improvements, or questions, please contact the model maintainer.