nahiar committed on
Commit
df39e77
·
verified ·
1 Parent(s): 8255c74

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -1 +1,11 @@
1
  *.pkl filter=lfs diff=lfs merge=lfs -text
2
+ images/01_class_distribution.png filter=lfs diff=lfs merge=lfs -text
3
+ images/02_feature_correlation.png filter=lfs diff=lfs merge=lfs -text
4
+ images/03_correlation_matrix.png filter=lfs diff=lfs merge=lfs -text
5
+ images/05_baseline_roc_curve.png filter=lfs diff=lfs merge=lfs -text
6
+ images/07_baseline_feature_importance.png filter=lfs diff=lfs merge=lfs -text
7
+ images/08_cross_validation.png filter=lfs diff=lfs merge=lfs -text
8
+ images/10_tuned_roc_curve.png filter=lfs diff=lfs merge=lfs -text
9
+ images/11_tuned_precision_recall.png filter=lfs diff=lfs merge=lfs -text
10
+ images/12_tuned_feature_importance.png filter=lfs diff=lfs merge=lfs -text
11
+ images/13_model_comparison.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,315 +1,273 @@
1
  ---
2
- language: en
3
- license: mit
 
4
  tags:
5
- - bot-detection
6
- - twitter
7
- - random-forest
8
- - sklearn
9
- - social-media
10
- - classification
11
- metrics:
12
- - accuracy
13
- - precision
14
- - recall
15
- - f1
16
- - roc-auc
17
- library_name: scikit-learn
18
  ---
19
 
20
- # Twitter Bot Detection Model
21
 
22
- ## Model Description
23
 
24
- This Random Forest classifier is designed to detect bot accounts on Twitter/X based on profile features and behavioral patterns. The model analyzes various account characteristics to determine whether an account is likely automated (bot) or genuine (human).
25
 
26
- ## Model Details
 
 
 
27
 
28
- - **Model Type**: Random Forest Classifier
29
- - **Framework**: scikit-learn
30
- - **Task**: Binary Classification (Bot vs Human)
31
- - **Language**: Python
32
- - **License**: MIT
33
 
34
- ## Performance Metrics
35
 
36
- The model achieves strong performance on the test dataset with optimized hyperparameters:
37
 
38
- - **High Accuracy**: Excellent accuracy in distinguishing bots from legitimate accounts
39
- - **Robust Classification**: Trained with cross-validation for reliable performance
40
- - **Version**: v2 (improved and optimized)
 
 
 
 
 
41
 
42
- The model has been fine-tuned specifically for Twitter's unique features and bot patterns.
43
 
44
- ## Features Used
 
 
45
 
46
- The model uses the following features for prediction:
47
 
48
- 1. **IsPrivate** - Whether the account is protected/private
49
- 2. **IsVerified** - Whether the account has a verification badge (blue checkmark)
50
- 3. **HasProfilePic** - Whether the account has a profile picture
51
- 4. **FollowingCount** - Number of accounts being followed
52
- 5. **FollowerCount** - Number of followers
53
- 6. **HasLocation** - Whether location information is provided
54
- 7. **HasDescription** - Whether the account has a bio/description
55
- 8. **TweetsCount** - Total number of tweets posted
56
- 9. **FollowToFollowerRatio** - Ratio of following to followers
57
- 10. **AccountAge** - Age of the account (if available)
58
- 11. **HasUrl** - Whether there's a URL in the profile
59
- 12. **DefaultProfileImage** - Whether using default profile image
60
 
61
- ## Intended Use
 
 
 
 
 
 
 
62
 
63
- ### Primary Uses
64
 
65
- - Identifying potential bot accounts on Twitter/X
66
- - Content moderation and platform integrity
67
- - Research on social media bot behavior and misinformation campaigns
68
- - Automated account screening for spam detection
69
- - Election integrity and political bot detection
70
 
71
- ### Out-of-Scope Uses
72
 
73
- - This model is specifically trained for Twitter/X and should not be used for other platforms without retraining
74
- - Should not be the sole basis for account suspension decisions
75
- - Not designed for real-time detection without proper infrastructure
76
- - Not suitable for detecting state-sponsored advanced persistent threats without additional features
77
- - Should not be used to target legitimate users based on behavior patterns
 
78
 
79
- ## How to Use
80
 
81
- ### Installation
82
 
83
- ```bash
84
- pip install scikit-learn pandas numpy joblib
85
- ```
86
 
87
- ### Loading the Model
88
 
89
- ```python
90
- import joblib
91
- import pandas as pd
92
- import numpy as np
93
- from sklearn.preprocessing import MinMaxScaler
94
-
95
- # Load the model
96
- model = joblib.load('Twitter_BOT_Detection_Model_v1.pkl')
97
-
98
- # Prepare your data
99
- features = ['IsPrivate', 'IsVerified', 'HasProfilePic', 'FollowingCount',
100
- 'FollowerCount', 'HasLocation', 'HasDescription', 'TweetsCount',
101
- 'FollowToFollowerRatio', 'AccountAge', 'HasUrl', 'DefaultProfileImage']
102
-
103
- # Example account data
104
- account_data = {
105
- 'IsPrivate': 0,
106
- 'IsVerified': 0,
107
- 'HasProfilePic': 1,
108
- 'FollowingCount': 5000,
109
- 'FollowerCount': 50,
110
- 'HasLocation': 0,
111
- 'HasDescription': 0,
112
- 'TweetsCount': 10000,
113
- 'FollowToFollowerRatio': 100.0,
114
- 'AccountAge': 30, # days
115
- 'HasUrl': 1,
116
- 'DefaultProfileImage': 0
117
- }
118
 
119
- # Create DataFrame
120
- df = pd.DataFrame([account_data])
121
 
122
- # Scale features (use the same scaler as training)
123
- scaler = MinMaxScaler()
124
- # Note: In production, you should save and load the scaler from training
125
- df_scaled = scaler.fit_transform(df[features])
 
 
 
 
 
 
 
 
 
 
126
 
127
- # Make prediction
128
- prediction = model.predict(df_scaled)
129
- probability = model.predict_proba(df_scaled)
130
 
131
- print(f"Prediction: {'Bot' if prediction[0] == 1 else 'Human'}")
132
- print(f"Confidence - Human: {probability[0][0]:.2%}, Bot: {probability[0][1]:.2%}")
133
- ```
134
 
135
- ### Batch Prediction with Threshold
 
 
 
 
136
 
137
- ```python
138
- # For multiple accounts
139
- accounts_df = pd.read_csv('twitter_accounts_to_check.csv')
140
- accounts_scaled = scaler.transform(accounts_df[features])
141
 
142
- predictions = model.predict(accounts_scaled)
143
- probabilities = model.predict_proba(accounts_scaled)
144
 
145
- # Add results to DataFrame
146
- accounts_df['is_bot'] = predictions
147
- accounts_df['bot_probability'] = probabilities[:, 1]
148
 
149
- # Filter by confidence threshold
150
- high_confidence_bots = accounts_df[accounts_df['bot_probability'] > 0.9]
151
- suspected_bots = accounts_df[(accounts_df['bot_probability'] > 0.7) &
152
- (accounts_df['bot_probability'] <= 0.9)]
153
- ```
 
154
 
155
- ### Integration Example
156
 
157
- ```python
158
- class TwitterBotDetector:
159
- def __init__(self, model_path):
160
- self.model = joblib.load(model_path)
161
- self.scaler = MinMaxScaler()
162
- self.features = ['IsPrivate', 'IsVerified', 'HasProfilePic',
163
- 'FollowingCount', 'FollowerCount', 'HasLocation',
164
- 'HasDescription', 'TweetsCount', 'FollowToFollowerRatio',
165
- 'AccountAge', 'HasUrl', 'DefaultProfileImage']
166
-
167
- def predict(self, account_features):
168
- """Predict if an account is a bot"""
169
- df = pd.DataFrame([account_features])
170
- df_scaled = self.scaler.fit_transform(df[self.features])
171
- prediction = self.model.predict(df_scaled)[0]
172
- probability = self.model.predict_proba(df_scaled)[0]
173
-
174
- return {
175
- 'is_bot': bool(prediction),
176
- 'bot_probability': float(probability[1]),
177
- 'human_probability': float(probability[0])
178
- }
179
-
180
- # Usage
181
- detector = TwitterBotDetector('Twitter_BOT_Detection_Model_v1.pkl')
182
- result = detector.predict(account_data)
183
- print(result)
184
- ```
185
 
186
- ## Training Data
187
 
188
- The model was trained on a comprehensive dataset of Twitter accounts with labeled bot/human classifications. The dataset includes:
189
 
190
- - Balanced distribution of bot and human accounts
191
- - Various bot types (spam bots, political bots, engagement bots, etc.)
192
- - Diverse account types, ages, and activity levels
193
- - Features extracted from public profile information
194
 
195
- **Note**: The training data is proprietary and not included in this repository.
196
 
197
- ## Training Procedure
 
 
 
 
198
 
199
- ### Preprocessing
200
 
201
- 1. Feature extraction from Twitter account profiles via API
202
- 2. Calculation of derived features (FollowToFollowerRatio, AccountAge)
203
- 3. Handling of missing values and outliers
204
- 4. MinMax normalization of all features to [0, 1] range
205
- 5. Train-test split with stratification to maintain class balance
206
 
207
- ### Hyperparameters
208
 
209
- - **Algorithm**: Random Forest Classifier
210
- - **Version**: v2 (optimized)
211
- - **Normalization**: MinMaxScaler
212
- - **Cross-validation**: Stratified K-Fold
213
- - **Feature Selection**: Based on domain knowledge and feature importance analysis
 
 
 
 
 
 
 
 
214
 
215
- The model was trained using scikit-learn's RandomForestClassifier with optimized hyperparameters selected through extensive cross-validation and grid search.
216
 
217
- ## Limitations and Bias
218
 
219
- ### Limitations
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
220
 
221
- - Model performance depends on the quality and accuracy of input features
222
- - May not generalize to new bot patterns not seen during training
223
- - Requires access to Twitter API for feature extraction
224
- - Performance may degrade over time as bot behaviors evolve rapidly
225
- - Limited to profile-level features; does not analyze tweet content deeply
226
- - May struggle with sophisticated bots that mimic human behavior closely
227
- - Requires regular updates due to platform changes (Twitter → X)
228
 
229
- ### Potential Biases
 
 
 
 
 
 
 
230
 
231
- - May be biased toward bot patterns present in the training data
232
- - Could have temporal biases based on when training data was collected
233
- - May misclassify legitimate accounts with unusual behavior patterns
234
- - Potential bias against new accounts or accounts with low activity
235
- - Could reflect biases in the original labeling process
236
- - May have difficulty with non-English accounts if training data is primarily English
237
 
238
- ### Recommendations
239
 
240
- - Regularly retrain the model with new data to capture evolving bot patterns
241
- - Use as part of a multi-layered detection system including content analysis
242
- - Implement human review for high-stakes decisions
243
- - Monitor for false positives and adjust classification thresholds based on use case
244
- - Combine with tweet content analysis, network analysis, and temporal patterns
245
- - Consider context (political events, trending topics) when interpreting results
246
- - Validate performance across different account types and languages
247
 
248
- ## Ethical Considerations
 
 
 
 
 
249
 
250
- - This model should be used responsibly and not for harassment, doxxing, or targeting
251
- - Consider privacy implications when analyzing user accounts
252
- - Ensure compliance with Twitter/X's terms of service and relevant privacy laws (GDPR, CCPA, etc.)
253
- - Implement appropriate safeguards against misuse
254
- - Provide transparency to users about automated detection systems
255
- - Allow for appeals and manual review processes
256
- - Be aware of potential for false accusations
257
- - Consider impact on freedom of speech and legitimate automated accounts (news bots, etc.)
258
- - Monitor for discriminatory outcomes across different user groups
259
 
260
- ## Known Issues
261
 
262
- - Twitter's API changes may affect feature availability
263
- - Platform rebranding (Twitter → X) may introduce new bot patterns
264
- - Changes in verification system may affect IsVerified feature utility
265
 
266
- ## Model Card Authors
267
 
268
- This model card was created as part of the Bot Detection project for social media platforms.
 
 
269
 
270
- ## Citation
271
 
272
- If you use this model in your research, please cite:
 
 
273
 
274
- ```bibtex
275
- @misc{twitter_bot_detection_2024,
276
- title={Twitter Bot Detection Model v2},
277
- author={Your Name/Organization},
278
- year={2024},
279
- publisher={Hugging Face},
280
- howpublished={\url{https://huggingface.co/your-username/twitter-bot-detection}}
281
- }
282
- ```
283
 
284
- ## Related Models
285
 
286
- - [TikTok Bot Detection](https://huggingface.co/your-username/tiktok-bot-detection)
287
- - [Instagram Bot Detection](https://huggingface.co/your-username/instagram-bot-detection)
 
 
288
 
289
- ## Contact
290
 
291
- For questions or feedback about this model, please open an issue in the repository or contact the maintainers.
292
 
293
- ## Updates and Maintenance
294
 
295
- - **Version**: 2.0
296
- - **Last Updated**: November 2024
297
- - **Status**: Active
298
 
299
- ### Changelog
300
 
301
- - **v2.0**: Improved hyperparameters, better cross-validation, optimized for current Twitter/X platform
302
- - **v1.0**: Initial release
303
 
304
- ### Future Updates
305
 
306
- Future updates may include:
307
 
308
- - Improved feature engineering based on new platform features
309
- - Additional training data with recent bot patterns
310
- - Deep learning approaches for complex bot detection
311
- - Integration of tweet content analysis (NLP features)
312
- - Network graph analysis for coordinated bot detection
313
- - Temporal pattern analysis
314
- - Support for multilingual accounts
315
- - Real-time feature extraction pipeline
 
1
  ---
2
+ language: "en"
3
+ license: "apache-2.0"
4
+ library_name: "scikit-learn"
5
  tags:
6
+ - "bot-detection"
7
+ - "twitter"
8
+ - "classification"
9
+ - "scikit-learn"
10
+ - "random-forest"
 
 
 
 
 
 
 
 
11
  ---
12
 
13
+ # TWITTER Bot Detection Model
14
 
15
+ ## Overview
16
 
17
+ This directory contains a trained Random Forest classifier for detecting bot accounts on Twitter.
18
 
19
+ **Model Version:** v2
20
+ **Training Date:** 2025-11-27 12:08:54
21
+ **Framework:** scikit-learn 1.5.2
22
+ **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
23
 
24
+ ---
 
 
 
 
25
 
26
+ ## 📊 Model Performance
27
 
28
+ ### Final Metrics (Test Set)
29
 
30
+ | Metric | Score |
31
+ | --------------------- | --------------- |
32
+ | **Accuracy** | 0.8771 (87.71%) |
33
+ | **Precision** | 0.8595 (85.95%) |
34
+ | **Recall** | 0.7558 (75.58%) |
35
+ | **F1-Score** | 0.8043 (80.43%) |
36
+ | **ROC-AUC** | 0.9354 (93.54%) |
37
+ | **Average Precision** | 0.9008 (90.08%) |
38
 
39
+ ### Model Improvement
40
 
41
+ - **Baseline ROC-AUC:** 0.9314
42
+ - **Tuned ROC-AUC:** 0.9354
43
+ - **Improvement:** 0.0040 (0.43%)
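
The improvement figures above can be reproduced from the unrounded ROC-AUC values in `twitter_model_comparison.csv`. A minimal sketch follows; the variable names are illustrative and are not taken from the training notebook.

```python
# ROC-AUC values copied from twitter_model_comparison.csv (Baseline, Tuned).
baseline_auc = 0.9314009975570197
tuned_auc = 0.9353899828983353

# Absolute gain in ROC-AUC, and the same gain relative to the baseline.
absolute_gain = tuned_auc - baseline_auc
relative_gain_pct = absolute_gain / baseline_auc * 100

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain_pct:.2f}%")
```

Read this way, the 0.43% is a change relative to the baseline score, not a gain of 0.43 percentage points.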
44
 
45
+ ---
46
 
47
+ ## 🗂️ Files
 
 
 
 
 
 
 
 
 
 
 
48
 
49
+ | File | Description |
50
+ | ------------------------------ | -------------------------------------- |
51
+ | `twitter_bot_detection_v2.pkl` | Trained Random Forest model |
52
+ | `twitter_scaler_v2.pkl` | MinMaxScaler for feature normalization |
53
+ | `twitter_features_v2.json` | List of features used by the model |
54
+ | `twitter_metrics_v2.txt` | Detailed performance metrics report |
55
+ | `images/` | All visualization plots (13 images) |
56
+ | `README.md` | This file |
57
 
58
+ ---
59
 
60
+ ## 🎯 Dataset Information
 
 
 
 
61
 
62
+ ### Training Configuration
63
 
64
+ - **Training Samples:** 29,951
65
+ - **Test Samples:** 7,487
66
+ - **Total Samples:** 37,438
67
+ - **Number of Features:** 12
68
+ - **Cross-Validation Folds:** 5
69
+ - **Random State:** 42
70
 
71
+ ### Class Distribution
72
 
73
+ **Training Set:**
74
 
75
+ - Human (0): 20,028 (66.87%)
76
+ - Bot (1): 9,923 (33.13%)
 
77
 
78
+ **Test Set:**
79
 
80
+ - Human (0): 4,985 (66.58%)
81
+ - Bot (1): 2,502 (33.42%)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
82
 
83
+ ---
 
84
 
85
+ ## 🔧 Features (12)
86
+
87
+ 1. `has_custom_cover_image`
88
+ 2. `description_length`
89
+ 3. `favourites_count`
90
+ 4. `followers_count`
91
+ 5. `friends_count`
92
+ 6. `followers_to_friends_ratio`
93
+ 7. `has_location`
94
+ 8. `username_digit_count`
95
+ 9. `username_length`
96
+ 10. `statuses_count`
97
+ 11. `is_verified`
98
+ 12. `account_age_days`
99
 
100
+ ---
 
 
101
 
102
+ ## 🏆 Top 5 Most Important Features
 
 
103
 
104
+ 1. **followers_count** - 0.1895
105
+ 2. **favourites_count** - 0.1813
106
+ 3. **friends_count** - 0.1494
107
+ 4. **statuses_count** - 0.1244
108
+ 5. **account_age_days** - 0.1010
109
 
110
+ ---
 
 
 
111
 
112
+ ## ⚙️ Hyperparameters
 
113
 
114
+ ### Best Parameters (from GridSearchCV)
 
 
115
 
116
+ - **class_weight:** balanced
117
+ - **max_depth:** 20
118
+ - **max_features:** sqrt
119
+ - **min_samples_leaf:** 1
120
+ - **min_samples_split:** 2
121
+ - **n_estimators:** 300
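
With `class_weight` set to `balanced`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training-set class counts reported in this README. The helper name `balanced_class_weights` is hypothetical and only illustrates the arithmetic; it is not part of the training notebook.

```python
def balanced_class_weights(class_counts):
    """Return per-class weights using the 'balanced' heuristic:
    n_samples / (n_classes * count_for_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Training-set class counts from the "Class Distribution" section.
train_counts = {0: 20028, 1: 9923}  # 0 = human, 1 = bot
weights = balanced_class_weights(train_counts)

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.4f}")
```

The minority bot class gets roughly twice the weight of the human class, which offsets the 67/33 split during training.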
122
 
123
+ ### Parameter Search Space
124
 
125
+ - **n_estimators:** [100, 200, 300]
126
+ - **max_depth:** [10, 15, 20, None]
127
+ - **min_samples_split:** [2, 5, 10]
128
+ - **min_samples_leaf:** [1, 2, 4]
129
+ - **max_features:** ['sqrt', 'log2']
130
+ - **bootstrap:** [True, False]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
131
 
132
+ **Total combinations tested:** 540
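
The number of candidates in a grid is the product of the option counts for each parameter. The sketch below enumerates the search space exactly as it is listed above. As listed it yields 432 candidates rather than 540, so the grid used in the notebook may have differed from this list; `class_weight`, for instance, appears among the best parameters but not in the listed space. That is an observation about the numbers only, not a claim about what the notebook ran.

```python
from itertools import product

# Search space copied from the "Parameter Search Space" list.
search_space = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# Every candidate is one choice per parameter.
combinations = list(product(*search_space.values()))
print(f"Candidates in the listed grid: {len(combinations)}")
```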
133
 
134
+ ---
135
 
136
+ ## 📈 Cross-Validation Results
 
 
 
137
 
138
+ ### Mean Scores (5-Fold Stratified CV)
139
 
140
+ - **Accuracy:** 0.8750 (±0.0053)
141
+ - **Precision:** 0.8658 (±0.0089)
142
+ - **Recall:** 0.7368 (±0.0113)
143
+ - **F1-Score:** 0.7961 (±0.0092)
144
+ - **ROC-AUC:** 0.9325 (±0.0037)
145
 
146
+ ---
147
 
148
+ ## 🖼️ Visualizations
 
 
 
 
149
 
150
+ All visualizations are saved in the `images/` directory:
151
 
152
+ 1. **01_class_distribution.png** - Training/Test set class distribution
153
+ 2. **02_feature_correlation.png** - Feature correlation with target variable
154
+ 3. **03_correlation_matrix.png** - Feature correlation heatmap
155
+ 4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
156
+ 5. **05_baseline_roc_curve.png** - Baseline ROC curve
157
+ 6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
158
+ 7. **07_baseline_feature_importance.png** - Baseline feature importance
159
+ 8. **08_cross_validation.png** - Cross-validation score distribution
160
+ 9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
161
+ 10. **10_tuned_roc_curve.png** - Tuned ROC curve
162
+ 11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
163
+ 12. **12_tuned_feature_importance.png** - Tuned feature importance
164
+ 13. **13_model_comparison.png** - Baseline vs Tuned comparison
165
 
166
+ ---
167
 
168
+ ## 🚀 Usage Example
169
 
170
+ ```python
171
+ import joblib
172
+ import pandas as pd
173
+ import numpy as np
174
+
175
+ # Load model and scaler
176
+ model = joblib.load('twitter_bot_detection_v2.pkl')
177
+ scaler = joblib.load('twitter_scaler_v2.pkl')
178
+
179
+ # Prepare your data: raw (unscaled) feature values; the 0.5 entries are placeholders only
180
+ data = {
181
+ 'has_custom_cover_image': 0.5,
182
+ 'description_length': 0.5,
183
+ 'favourites_count': 0.5,
184
+ 'followers_count': 0.5,
185
+ 'friends_count': 0.5,
186
+ 'followers_to_friends_ratio': 0.5,
187
+ 'has_location': 0.5,
188
+ 'username_digit_count': 0.5,
189
+ 'username_length': 0.5,
190
+ 'statuses_count': 0.5,
191
+ 'is_verified': 0.5,
192
+ 'account_age_days': 0.5,
193
+ }
194
+
195
+ # Create DataFrame
196
+ df = pd.DataFrame([data])
197
 
198
+ # Scale features
199
+ df_scaled = scaler.transform(df)
 
 
 
 
 
200
 
201
+ # Predict
202
+ prediction = model.predict(df_scaled)[0]
203
+ probability = model.predict_proba(df_scaled)[0]
204
+
205
+ print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
206
+ print(f"Bot Probability: {probability[1]:.4f}")
207
+ print(f"Human Probability: {probability[0]:.4f}")
208
+ ```
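
The scaler and model expect the 12 features in the order stored in `twitter_features_v2.json`. The sketch below shows one way to enforce that order before scaling. The helper `to_feature_row` and the sample account values are hypothetical, and the JSON is inlined here so the snippet runs without the file.

```python
import json

# Contents of twitter_features_v2.json, inlined for a self-contained example.
FEATURES_JSON = """[
  "has_custom_cover_image", "description_length", "favourites_count",
  "followers_count", "friends_count", "followers_to_friends_ratio",
  "has_location", "username_digit_count", "username_length",
  "statuses_count", "is_verified", "account_age_days"
]"""
feature_names = json.loads(FEATURES_JSON)


def to_feature_row(raw_account, feature_names):
    """Return the account's raw values as floats, in training feature order."""
    missing = [name for name in feature_names if name not in raw_account]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [float(raw_account[name]) for name in feature_names]


# Illustrative raw values, deliberately given out of order.
raw_account = {
    "followers_count": 150, "friends_count": 300,
    "followers_to_friends_ratio": 0.5, "statuses_count": 1200,
    "favourites_count": 310, "account_age_days": 900,
    "username_length": 11, "username_digit_count": 2,
    "description_length": 42, "has_location": 1,
    "has_custom_cover_image": 1, "is_verified": 0,
}

row = to_feature_row(raw_account, feature_names)
print(row)
```

The resulting row can be wrapped as `pd.DataFrame([row], columns=feature_names)` and passed to `scaler.transform` as in the usage example.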
209
 
210
+ ---
 
 
 
 
 
211
 
212
+ ## 📋 Confusion Matrix Breakdown
213
 
214
+ ### Tuned Model (Test Set)
 
 
 
 
 
 
215
 
216
+ ```
217
+ Predicted
218
+ Human Bot
219
+ Actual Human 4676 309
220
+ Bot 611 1891
221
+ ```
222
 
223
+ - **True Negatives (TN):** 4,676 (Correctly identified humans)
224
+ - **False Positives (FP):** 309 (Humans incorrectly classified as bots)
225
+ - **False Negatives (FN):** 611 (Bots incorrectly classified as humans)
226
+ - **True Positives (TP):** 1,891 (Correctly identified bots)
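
The test-set metrics reported earlier follow directly from these four counts. A short sketch of the arithmetic, with the bot class treated as positive:

```python
# Confusion-matrix counts for the tuned model on the test set.
tn, fp, fn, tp = 4676, 309, 611, 1891

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # share of all accounts classified correctly
precision = tp / (tp + fp)            # share of predicted bots that are bots
recall = tp / (tp + fn)               # share of actual bots that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```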
 
 
 
 
 
227
 
228
+ ---
229
 
230
+ ## 🔍 Model Interpretation
 
 
231
 
232
+ ### Strengths
233
 
234
+ - High ROC-AUC score (0.9354) indicates excellent discrimination capability
235
+ - Precision (0.8595) is higher than recall (0.7558) for the bot class: flagged accounts are usually bots, but roughly one in four bots is missed
236
+ - Robust cross-validation performance
237
 
238
+ ### Key Insights
239
 
240
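+ 1. The five most important features (follower, favourite, friend, and status counts, plus account age) account for about 75% of total feature importance
241
+ 2. GridSearchCV tuning raised test ROC-AUC from 0.9314 to 0.9354, a 0.43% relative gain; precision dropped slightly (0.8679 to 0.8595) while recall rose (0.7350 to 0.7558)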
242
+ 3. Model generalizes well on unseen test data
243
 
244
+ ---
 
 
 
 
 
 
 
 
245
 
246
+ ## 📝 Notes
247
 
248
+ - **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
249
+ - **Missing Values:** Filled with 0 during preprocessing
250
+ - **Class Balance:** Imbalanced dataset (about 67% human, 33% bot); the tuned model uses `class_weight='balanced'`
251
+ - **Model Type:** Random Forest, an ensemble of decision trees that is generally less prone to overfitting than a single tree
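
The scaling and missing-value notes above can be illustrated with a simplified, pure-Python sketch of min-max scaling. It is not the saved `twitter_scaler_v2.pkl`; the function names and sample column are hypothetical. Like scikit-learn's `MinMaxScaler` with default settings, it does not clip, so values outside the training range fall outside [0, 1].

```python
def fit_min_max(train_column):
    """Learn the minimum and maximum from training values only."""
    return min(train_column), max(train_column)


def min_max_scale(value, lo, hi):
    """Scale one value with training min/max; missing values become 0 first."""
    if value is None:
        value = 0.0  # mirrors "Missing Values: Filled with 0"
    if hi == lo:
        return 0.0  # constant training column: avoid division by zero
    return (value - lo) / (hi - lo)


# Illustrative training values for a single feature.
lo, hi = fit_min_max([0, 100, 50])

scaled_mid = min_max_scale(50, lo, hi)        # inside the training range
scaled_missing = min_max_scale(None, lo, hi)  # missing value filled with 0
scaled_large = min_max_scale(200, lo, hi)     # above the training maximum

print(scaled_mid, scaled_missing, scaled_large)
```

Accounts with counts far above anything seen in training will therefore produce scaled values greater than 1, which is worth checking before trusting a prediction.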
252
 
253
+ ---
254
 
255
+ ## 🔄 Model Updates
256
 
257
+ To retrain the model:
258
 
259
+ 1. Place new training data in `../data/train_twitter.csv`
260
+ 2. Run the training notebook: `5_enhanced_training.ipynb`
261
+ 3. Update this README with new metrics
262
 
263
+ ---
264
 
265
+ ## 📧 Contact & Support
 
266
 
267
+ For questions or issues regarding this model, please refer to the main project documentation.
268
 
269
+ ---
270
 
271
+ **Generated:** 2025-11-27 12:08:54
272
+ **Notebook:** `5_enhanced_training.ipynb`
273
+ **Platform:** Twitter
 
 
 
 
 
images/01_class_distribution.png ADDED

Git LFS Details

  • SHA256: 28e8b369a102f23ebc8a38d7a7750ee3cf228891f50fb6ee27ff63137f23e189
  • Pointer size: 131 Bytes
  • Size of remote file: 126 kB
images/02_feature_correlation.png ADDED

Git LFS Details

  • SHA256: 8a63a4b708b69990192735645821748b2193a0c78c387919b879f8ea56fa04c3
  • Pointer size: 131 Bytes
  • Size of remote file: 149 kB
images/03_correlation_matrix.png ADDED

Git LFS Details

  • SHA256: db5060fb105b50996edecf5e7580bb0bc61e358ba36a5882cfed9cc9fbe96d31
  • Pointer size: 131 Bytes
  • Size of remote file: 417 kB
images/04_baseline_confusion_matrix.png ADDED
images/05_baseline_roc_curve.png ADDED

Git LFS Details

  • SHA256: b47274a0ea64edd968129d481608ec407bff807ed8699d45c808d6db8a4599f0
  • Pointer size: 131 Bytes
  • Size of remote file: 149 kB
images/06_baseline_precision_recall.png ADDED
images/07_baseline_feature_importance.png ADDED

Git LFS Details

  • SHA256: 085137d8353282ddc379f6a332f3e86c416e481dc44f018f70de213f731d4a54
  • Pointer size: 131 Bytes
  • Size of remote file: 150 kB
images/08_cross_validation.png ADDED

Git LFS Details

  • SHA256: f3146b19219aec9729fc14fdf7e3c32dc03cee0cfad38fec09d373088fe6d1e0
  • Pointer size: 131 Bytes
  • Size of remote file: 109 kB
images/09_tuned_confusion_matrix.png ADDED
images/10_tuned_roc_curve.png ADDED

Git LFS Details

  • SHA256: c272ba78d06a22f2854fcd106311301508821fb89747a921d0b2b817e89a2d01
  • Pointer size: 131 Bytes
  • Size of remote file: 150 kB
images/11_tuned_precision_recall.png ADDED

Git LFS Details

  • SHA256: 3219b7fc470ed3a16c4c0ce46c6666ab61fd08ba84c24d09ce4330c034c97f48
  • Pointer size: 131 Bytes
  • Size of remote file: 101 kB
images/12_tuned_feature_importance.png ADDED

Git LFS Details

  • SHA256: 8a0e401621b1cf2f79783ec0e158dbf80515d7befd29c2dc3e79b606e6f3ad86
  • Pointer size: 131 Bytes
  • Size of remote file: 147 kB
images/13_model_comparison.png ADDED

Git LFS Details

  • SHA256: c64122d691fb9949705ecba91504d49d1e07a8dbb35611ef3322aa8f870dbcac
  • Pointer size: 131 Bytes
  • Size of remote file: 137 kB
twitter_bot_detection_v2.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8a77a8bb9af5ce0909ae5dd72f18176d12795e3f2ccc658fb4e519d325db19b2
3
+ size 144062585
twitter_features_v2.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ "has_custom_cover_image",
3
+ "description_length",
4
+ "favourites_count",
5
+ "followers_count",
6
+ "friends_count",
7
+ "followers_to_friends_ratio",
8
+ "has_location",
9
+ "username_digit_count",
10
+ "username_length",
11
+ "statuses_count",
12
+ "is_verified",
13
+ "account_age_days"
14
+ ]
twitter_metrics_v2.txt ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ======================================================================
2
+ TWITTER Bot Detection Model - Performance Report
3
+ ======================================================================
4
+
5
+ Date: 2025-11-27 12:08:54.462391
6
+
7
+ Training Configuration:
8
+ - Platform: twitter
9
+ - Train samples: 29951
10
+ - Test samples: 7487
11
+ - Features: 12
12
+ - CV Folds: 5
13
+ - Random State: 42
14
+
15
+ Best Hyperparameters:
16
+ - class_weight: balanced
17
+ - max_depth: 20
18
+ - max_features: sqrt
19
+ - min_samples_leaf: 1
20
+ - min_samples_split: 2
21
+ - n_estimators: 300
22
+
23
+ Performance Metrics (Test Set):
24
+ - Accuracy: 0.8771
25
+ - Precision: 0.8595
26
+ - Recall: 0.7558
27
+ - F1: 0.8043
28
+ - Roc_auc: 0.9354
29
+ - Avg_precision: 0.9008
30
+
31
+ Cross-Validation Results:
32
+ - Mean ROC-AUC: 0.9352
33
+
34
+ Feature Importance (Top 5):
35
+ - followers_count: 0.1895
36
+ - favourites_count: 0.1813
37
+ - friends_count: 0.1494
38
+ - statuses_count: 0.1244
39
+ - account_age_days: 0.1010
twitter_model_comparison.csv ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ Metric,Baseline,Tuned,Improvement,Improvement %
2
+ Accuracy,0.8740483504741552,0.8771203419260051,0.003071991451849887,0.3514669926650382
3
+ Precision,0.8678621991505427,0.8595454545454545,-0.008316744605088244,-0.9583024370952686
4
+ Recall,0.7350119904076738,0.7557953637090328,0.02078337330135893,2.827623708537251
5
+ F1-Score,0.7959316165332179,0.8043385793279455,0.008406962794727635,1.0562418454169766
6
+ ROC-AUC,0.9314009975570197,0.9353899828983353,0.003988985341315643,0.4282779760574006
7
+ Avg Precision,0.8949209792592716,0.9007641701676647,0.005843190908393137,0.6529281404520834
twitter_scaler_v2.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c8725c0a395abc30b368dfe7a64051f0095fc7acfb4cfb1ac4229ffe73f02a32
3
+ size 1623