nahiar committed · Commit 8255c74 · verified · 1 Parent(s): f029883

Upload twitter-bot-detection model

Files changed (3)
  1. .gitattributes +0 -34
  2. README.md +315 -0
  3. requirements.txt +4 -0
.gitattributes CHANGED
@@ -1,35 +1 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
  *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,315 @@
+ ---
+ language: en
+ license: mit
+ tags:
+ - bot-detection
+ - twitter
+ - random-forest
+ - sklearn
+ - social-media
+ - classification
+ metrics:
+ - accuracy
+ - precision
+ - recall
+ - f1
+ - roc-auc
+ library_name: scikit-learn
+ ---
+
+ # Twitter Bot Detection Model
+
+ ## Model Description
+
+ This Random Forest classifier detects likely bot accounts on Twitter/X from profile features and aggregate activity counts. It uses these account-level characteristics to estimate whether an account is automated (bot) or genuine (human).
+
+ ## Model Details
+
+ - **Model Type**: Random Forest Classifier
+ - **Framework**: scikit-learn
+ - **Task**: Binary Classification (Bot vs Human)
+ - **Language**: Python
+ - **License**: MIT
+
+ ## Performance Metrics
+
+ The model is reported to perform well on the held-out test set with tuned hyperparameters; this card does not list numeric scores:
+
+ - **High Accuracy**: Reported to distinguish bots from legitimate accounts accurately
+ - **Robust Classification**: Trained with cross-validation for more reliable performance estimates
+ - **Version**: v2 (improved and optimized)
+
+ The model has been tuned specifically for Twitter/X account features and bot patterns.
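The front matter names accuracy, precision, recall, F1, and ROC AUC as the evaluation metrics. A minimal sketch of computing them on a held-out split follows; the synthetic data from `make_classification` and the forest settings are stand-ins chosen for illustration, not the project's dataset or configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the proprietary data: 12 features, label 1 = bot.
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Illustrative settings only; the card does not publish its hyperparameters.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # hard labels for threshold metrics
y_score = clf.predict_proba(X_test)[:, 1]  # bot probability for ROC AUC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

ROC AUC is computed from the predicted probabilities rather than the hard labels, which is why `predict_proba` is called separately.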
+
+ ## Features Used
+
+ The model uses the following features for prediction:
+
+ 1. **IsPrivate** - Whether the account is protected/private
+ 2. **IsVerified** - Whether the account has a verification badge (blue checkmark)
+ 3. **HasProfilePic** - Whether the account has a profile picture
+ 4. **FollowingCount** - Number of accounts being followed
+ 5. **FollowerCount** - Number of followers
+ 6. **HasLocation** - Whether location information is provided
+ 7. **HasDescription** - Whether the account has a bio/description
+ 8. **TweetsCount** - Total number of tweets posted
+ 9. **FollowToFollowerRatio** - Following count divided by follower count
+ 10. **AccountAge** - Age of the account in days (if available)
+ 11. **HasUrl** - Whether the profile includes a URL
+ 12. **DefaultProfileImage** - Whether the account still uses the default profile image
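The two derived features (FollowToFollowerRatio and AccountAge) can be computed from raw profile fields roughly as follows. This is a sketch under assumptions: the `created_at` column name, the `as_of` reference date, and the choice to treat zero followers as one (to avoid division by zero) are illustrative and not confirmed by this card.

```python
import pandas as pd


def add_derived_features(raw: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Return a copy of `raw` with the two derived columns added."""
    out = raw.copy()
    # Assumption: an account with zero followers is treated as having one,
    # so the ratio stays finite.
    followers = out["FollowerCount"].clip(lower=1)
    out["FollowToFollowerRatio"] = out["FollowingCount"] / followers
    # Account age in whole days, measured against the reference date.
    created = pd.to_datetime(out["created_at"])  # hypothetical raw column
    out["AccountAge"] = (as_of - created).dt.days
    return out


raw = pd.DataFrame({
    "FollowingCount": [5000, 120],
    "FollowerCount": [50, 0],
    "created_at": ["2024-10-02", "2020-11-01"],
})
derived = add_derived_features(raw, as_of=pd.Timestamp("2024-11-01"))
print(derived[["FollowToFollowerRatio", "AccountAge"]])
```

Whatever convention is used for zero followers at inference time should match the one used when the training data was built.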
+
+ ## Intended Use
+
+ ### Primary Uses
+
+ - Identifying potential bot accounts on Twitter/X
+ - Content moderation and platform integrity
+ - Research on social media bot behavior and misinformation campaigns
+ - Automated account screening for spam detection
+ - Election integrity and political bot detection
+
+ ### Out-of-Scope Uses
+
+ - Use on other platforms without retraining; the model is trained specifically for Twitter/X
+ - Use as the sole basis for account suspension decisions
+ - Real-time detection without proper infrastructure
+ - Detection of state-sponsored advanced persistent threats without additional features
+ - Targeting of legitimate users based on behavior patterns
+
+ ## How to Use
+
+ ### Installation
+
+ ```bash
+ pip install scikit-learn pandas numpy joblib
+ ```
+
+ ### Loading the Model
+
+ ```python
+ import joblib
+ import pandas as pd
+ import numpy as np
+ from sklearn.preprocessing import MinMaxScaler
+
+ # Load the model
+ model = joblib.load('Twitter_BOT_Detection_Model_v1.pkl')
+
+ # Prepare your data
+ features = ['IsPrivate', 'IsVerified', 'HasProfilePic', 'FollowingCount',
+             'FollowerCount', 'HasLocation', 'HasDescription', 'TweetsCount',
+             'FollowToFollowerRatio', 'AccountAge', 'HasUrl', 'DefaultProfileImage']
+
+ # Example account data
+ account_data = {
+     'IsPrivate': 0,
+     'IsVerified': 0,
+     'HasProfilePic': 1,
+     'FollowingCount': 5000,
+     'FollowerCount': 50,
+     'HasLocation': 0,
+     'HasDescription': 0,
+     'TweetsCount': 10000,
+     'FollowToFollowerRatio': 100.0,
+     'AccountAge': 30,  # days
+     'HasUrl': 1,
+     'DefaultProfileImage': 0
+ }
+
+ # Create DataFrame
+ df = pd.DataFrame([account_data])
+
+ # Scale with the scaler fitted during training. Fitting a fresh MinMaxScaler
+ # on this single row would map every feature to 0.
+ scaler = joblib.load('scaler.pkl')  # hypothetical filename; no scaler file is part of this commit
+ df_scaled = scaler.transform(df[features])
+
+ # Make prediction
+ prediction = model.predict(df_scaled)
+ probability = model.predict_proba(df_scaled)
+
+ print(f"Prediction: {'Bot' if prediction[0] == 1 else 'Human'}")
+ print(f"Confidence - Human: {probability[0][0]:.2%}, Bot: {probability[0][1]:.2%}")
+ ```
+
+ ### Batch Prediction with Threshold
+
+ ```python
+ # For multiple accounts (reuses `model`, `scaler`, and `features` from the previous snippet)
+ accounts_df = pd.read_csv('twitter_accounts_to_check.csv')
+ accounts_scaled = scaler.transform(accounts_df[features])
+
+ predictions = model.predict(accounts_scaled)
+ probabilities = model.predict_proba(accounts_scaled)
+
+ # Add results to DataFrame
+ accounts_df['is_bot'] = predictions
+ accounts_df['bot_probability'] = probabilities[:, 1]
+
+ # Filter by confidence threshold
+ high_confidence_bots = accounts_df[accounts_df['bot_probability'] > 0.9]
+ suspected_bots = accounts_df[(accounts_df['bot_probability'] > 0.7) &
+                              (accounts_df['bot_probability'] <= 0.9)]
+ ```
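The two cut-offs used in the batch snippet (0.7 and 0.9) can also be expressed as a single banding step. The sketch below uses `pandas.cut`; the band labels are hypothetical names introduced here, and the probabilities are made-up sample values.

```python
import pandas as pd


def triage(bot_probability: pd.Series) -> pd.Series:
    """Assign each probability to a band using the 0.7 and 0.9 cut-offs."""
    # Right-closed bins: [0, 0.7], (0.7, 0.9], (0.9, 1.0], which matches
    # the `> 0.7` / `<= 0.9` / `> 0.9` comparisons in the batch snippet.
    bins = [0.0, 0.7, 0.9, 1.0]
    labels = ["below_threshold", "suspected_bot", "high_confidence_bot"]
    return pd.cut(bot_probability, bins=bins, labels=labels, include_lowest=True)


sample = pd.Series([0.05, 0.7, 0.85, 0.9, 0.97])
bands = triage(sample)
print(bands.value_counts())
```

Keeping the bands in one column makes it easier to route each group differently, for example sending only the middle band to human review.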
+
+ ### Integration Example
+
+ ```python
+ class TwitterBotDetector:
+     def __init__(self, model_path, scaler_path):
+         self.model = joblib.load(model_path)
+         self.scaler = joblib.load(scaler_path)  # scaler fitted at training time
+         self.features = ['IsPrivate', 'IsVerified', 'HasProfilePic',
+                          'FollowingCount', 'FollowerCount', 'HasLocation',
+                          'HasDescription', 'TweetsCount', 'FollowToFollowerRatio',
+                          'AccountAge', 'HasUrl', 'DefaultProfileImage']
+
+     def predict(self, account_features):
+         """Predict if an account is a bot"""
+         df = pd.DataFrame([account_features])
+         df_scaled = self.scaler.transform(df[self.features])
+         prediction = self.model.predict(df_scaled)[0]
+         probability = self.model.predict_proba(df_scaled)[0]
+
+         return {
+             'is_bot': bool(prediction),
+             'bot_probability': float(probability[1]),
+             'human_probability': float(probability[0])
+         }
+
+ # Usage ('scaler.pkl' is a hypothetical path to the scaler saved at training time)
+ detector = TwitterBotDetector('Twitter_BOT_Detection_Model_v1.pkl', 'scaler.pkl')
+ result = detector.predict(account_data)
+ print(result)
+ ```
+
+ ## Training Data
+
+ The model was trained on a dataset of Twitter accounts labeled as bot or human. The dataset includes:
+
+ - Balanced distribution of bot and human accounts
+ - Various bot types (spam bots, political bots, engagement bots, etc.)
+ - Diverse account types, ages, and activity levels
+ - Features extracted from public profile information
+
+ **Note**: The training data is proprietary and not included in this repository.
+
+ ## Training Procedure
+
+ ### Preprocessing
+
+ 1. Feature extraction from Twitter account profiles via API
+ 2. Calculation of derived features (FollowToFollowerRatio, AccountAge)
+ 3. Handling of missing values and outliers
+ 4. Min-max normalization of all features to the [0, 1] range
+ 5. Stratified train-test split to preserve the class balance
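Steps 4 and 5 might look like the sketch below, which also persists the fitted scaler so that inference can reuse it. Assumptions: the data is synthetic, the file name `minmax_scaler.pkl` is hypothetical, and the scaler is fitted on the training split only (a common precaution against leakage; the card lists normalization before the split, so the original order may differ).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 12 profile features; label 1 = bot.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10_000, size=(200, 12))
y = np.repeat([0, 1], 100)  # balanced classes

# Stratified split keeps the 50/50 class balance in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training split, then apply it to both splits.
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Persist the fitted scaler next to the model (hypothetical file name).
scaler_path = os.path.join(tempfile.gettempdir(), "minmax_scaler.pkl")
joblib.dump(scaler, scaler_path)
restored = joblib.load(scaler_path)
```

Because the scaler only sees the training split, test values can fall slightly outside [0, 1]; that is expected and mirrors what happens with unseen accounts at inference time.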
+
+ ### Hyperparameters
+
+ - **Algorithm**: Random Forest Classifier
+ - **Version**: v2 (optimized)
+ - **Normalization**: MinMaxScaler
+ - **Cross-validation**: Stratified K-Fold
+ - **Feature Selection**: Based on domain knowledge and feature importance analysis
+
+ The model was trained using scikit-learn's RandomForestClassifier, with hyperparameters selected by grid search under stratified k-fold cross-validation.
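A grid search under stratified k-fold cross-validation can be sketched as follows. The parameter grid, the F1 scoring choice, the fold count, and the synthetic data are all assumptions made for illustration; the card does not publish the values that were actually searched or selected.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the proprietary training set (12 features).
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           random_state=0)

# Hypothetical search space; the real grid is not documented.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 8]}

# Stratified folds keep the bot/human ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(f"mean cross-validated F1: {search.best_score_:.3f}")

# Impurity-based importances of the refitted best model, one per feature.
importances = search.best_estimator_.feature_importances_
```

The `feature_importances_` array is the kind of signal the feature importance analysis mentioned above could draw on.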
+
+ ## Limitations and Bias
+
+ ### Limitations
+
+ - Model performance depends on the quality and accuracy of input features
+ - May not generalize to new bot patterns not seen during training
+ - Requires access to the Twitter/X API for feature extraction
+ - Performance may degrade over time as bot behaviors evolve
+ - Limited to profile-level features; does not analyze tweet content
+ - May struggle with sophisticated bots that closely mimic human behavior
+ - Requires regular updates due to platform changes (Twitter → X)
+
+ ### Potential Biases
+
+ - May be biased toward bot patterns present in the training data
+ - Could have temporal biases based on when training data was collected
+ - May misclassify legitimate accounts with unusual behavior patterns
+ - Potential bias against new accounts or accounts with low activity
+ - Could reflect biases in the original labeling process
+ - May have difficulty with non-English accounts if training data is primarily English
+
+ ### Recommendations
+
+ - Regularly retrain the model with new data to capture evolving bot patterns
+ - Use as part of a multi-layered detection system including content analysis
+ - Implement human review for high-stakes decisions
+ - Monitor for false positives and adjust classification thresholds based on use case
+ - Combine with tweet content analysis, network analysis, and temporal patterns
+ - Consider context (political events, trending topics) when interpreting results
+ - Validate performance across different account types and languages
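One way to adjust the classification threshold for a given use case is to pick the lowest bot-probability cut-off that reaches a target precision on labeled validation data. The sketch below uses scikit-learn's `precision_recall_curve`; the helper name, the target value, and the toy labels and scores are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_precision(y_true, bot_probability, target_precision):
    """Lowest threshold whose precision meets the target, or None."""
    precision, recall, thresholds = precision_recall_curve(y_true, bot_probability)
    # `precision` has one more entry than `thresholds`; the final point
    # (precision 1, recall 0) has no threshold, so it is dropped.
    meets_target = np.where(precision[:-1] >= target_precision)[0]
    if meets_target.size == 0:
        return None
    # Thresholds are sorted ascending, so the first hit keeps recall highest.
    return float(thresholds[meets_target[0]])


# Toy validation data: 1 = bot.
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
cutoff = threshold_for_precision(y_val, p_val, target_precision=0.9)
print(cutoff)
```

An account is then flagged when its bot probability is greater than or equal to the returned cut-off. A stricter target trades recall for fewer false positives, which matters when flags feed into enforcement.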
+
+ ## Ethical Considerations
+
+ - This model should be used responsibly and not for harassment, doxxing, or targeting
+ - Consider privacy implications when analyzing user accounts
+ - Ensure compliance with Twitter/X's terms of service and relevant privacy laws (GDPR, CCPA, etc.)
+ - Implement appropriate safeguards against misuse
+ - Provide transparency to users about automated detection systems
+ - Allow for appeals and manual review processes
+ - Be aware of potential for false accusations
+ - Consider impact on freedom of speech and legitimate automated accounts (news bots, etc.)
+ - Monitor for discriminatory outcomes across different user groups
+
+ ## Known Issues
+
+ - Twitter's API changes may affect feature availability
+ - Platform rebranding (Twitter → X) may introduce new bot patterns
+ - Changes in verification system may affect IsVerified feature utility
+
+ ## Model Card Authors
+
+ This model card was created as part of the Bot Detection project for social media platforms.
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{twitter_bot_detection_2024,
+   title={Twitter Bot Detection Model v2},
+   author={Your Name/Organization},
+   year={2024},
+   publisher={Hugging Face},
+   howpublished={\url{https://huggingface.co/your-username/twitter-bot-detection}}
+ }
+ ```
+
+ ## Related Models
+
+ - [TikTok Bot Detection](https://huggingface.co/your-username/tiktok-bot-detection)
+ - [Instagram Bot Detection](https://huggingface.co/your-username/instagram-bot-detection)
+
+ ## Contact
+
+ For questions or feedback about this model, please open an issue in the repository or contact the maintainers.
+
+ ## Updates and Maintenance
+
+ - **Version**: 2.0
+ - **Last Updated**: November 2024
+ - **Status**: Active
+
+ ### Changelog
+
+ - **v2.0**: Improved hyperparameters, better cross-validation, optimized for current Twitter/X platform
+ - **v1.0**: Initial release
+
+ ### Future Updates
+
+ Future updates may include:
+
+ - Improved feature engineering based on new platform features
+ - Additional training data with recent bot patterns
+ - Deep learning approaches for complex bot detection
+ - Integration of tweet content analysis (NLP features)
+ - Network graph analysis for coordinated bot detection
+ - Temporal pattern analysis
+ - Support for multilingual accounts
+ - Real-time feature extraction pipeline
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ scikit-learn>=1.3.0
+ pandas>=2.0.0
+ numpy>=1.24.0
+ joblib>=1.3.0