nahiar committed
Commit 6d16d09 · verified
1 Parent(s): c03aefb

Initial upload (auto-create if missing)

README.md ADDED
@@ -0,0 +1,275 @@
1
+ ---
2
+ language: "en"
3
+ license: "apache-2.0"
4
+ library_name: "scikit-learn"
5
+ tags:
6
+ - "bot-detection"
7
+ - "tiktok"
8
+ - "classification"
9
+ - "scikit-learn"
10
+ - "random-forest"
11
+ ---
12
+
13
+ # TikTok Bot Detection Model
14
+
15
+ ## Overview
16
+
17
+ This directory contains a trained Random Forest classifier for detecting bot accounts on TikTok.
18
+
19
+ **Model Version:** v2
20
+ **Training Date:** 2025-12-30 11:38:35
21
+ **Framework:** scikit-learn 1.5.2
22
+ **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
23
+
24
+ ---
25
+
26
+ ## 📊 Model Performance
27
+
28
+ ### Final Metrics (Test Set)
29
+
30
+ | Metric | Score |
31
+ | --------------------- | --------------- |
32
+ | **Accuracy** | 0.9224 (92.24%) |
33
+ | **Precision** | 0.9596 (95.96%) |
34
+ | **Recall** | 0.9094 (90.94%) |
35
+ | **F1-Score** | 0.9338 (93.38%) |
36
+ | **ROC-AUC** | 0.9773 (97.73%) |
37
+ | **Average Precision** | 0.9844 (98.44%) |
38
+
39
+ ### Model Improvement
40
+
41
+ - **Baseline ROC-AUC:** 0.9759
42
+ - **Tuned ROC-AUC:** 0.9773
43
+ - **Improvement:** 0.0014 (0.14%)
44
+
45
+ ---
46
+
47
+ ## 🗂️ Files
48
+
49
+ | File | Description |
+ | -------------------------- | -------------------------------------- |
+ | `tiktok_bot_detection.pkl` | Trained Random Forest model |
+ | `tiktok_scaler.pkl` | MinMaxScaler for feature normalization |
+ | `tiktok_features.json` | List of features used by the model |
+ | `tiktok_metrics.txt` | Detailed performance metrics report |
+ | `images/` | All visualization plots (13 images) |
+ | `README.md` | This file |
57
+
58
+ ---
59
+
60
+ ## 🎯 Dataset Information
61
+
62
+ ### Training Configuration
63
+
64
+ - **Training Samples:** 2,385
65
+ - **Test Samples:** 596
66
+ - **Total Samples:** 2,981
67
+ - **Number of Features:** 13
68
+ - **Cross-Validation Folds:** 5
69
+ - **Random State:** 42
70
+
71
+ ### Class Distribution
72
+
73
+ **Training Set:**
74
+
75
+ - Human (0): 951 (39.87%)
76
+ - Bot (1): 1,434 (60.13%)
77
+
78
+ **Test Set:**
79
+
80
+ - Human (0): 244 (40.94%)
81
+ - Bot (1): 352 (59.06%)
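
The percentages above follow directly from the counts. A minimal sketch that recomputes them, using only the numbers listed in this section (the helper name `class_shares` is illustrative and not part of the repository):

```python
def class_shares(human, bot):
    """Return (human %, bot %) rounded to two decimals."""
    total = human + bot
    return round(100 * human / total, 2), round(100 * bot / total, 2)


# Counts copied from the Class Distribution section
train = class_shares(951, 1434)
test = class_shares(244, 352)

print("train:", train)
print("test:", test)
```

Both splits come out close to a 40/60 human-to-bot ratio, which is why accuracy alone is not reported as the only metric.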
82
+
83
+ ---
84
+
85
+ ## 🔧 Features (13)
86
+
87
+ 1. `IsPrivate`
88
+ 2. `IsVerified`
89
+ 3. `HasProfilePic`
90
+ 4. `FollowingCount`
91
+ 5. `FollowerCount`
92
+ 6. `LikesCount`
93
+ 7. `HasInstagram`
94
+ 8. `HasYoutube`
95
+ 9. `HasBio`
96
+ 10. `HasLinkInBio`
97
+ 11. `HasPosts`
98
+ 12. `PostsCount`
99
+ 13. `FollowToFollowerRatio`
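
`FollowToFollowerRatio` is a derived feature. The example accounts in `inference_example.py` (for instance 5000 following and 100 followers giving 50.0) suggest it is `FollowingCount / FollowerCount`. A minimal sketch under that assumption; the function name and the zero-follower fallback are illustrative choices, not taken from the training notebook:

```python
def follow_to_follower_ratio(following, followers):
    """FollowingCount divided by FollowerCount (assumed definition)."""
    if followers == 0:
        # Assumption: fall back to the raw following count to avoid a
        # division by zero; the training notebook may handle this differently.
        return float(following)
    return following / followers


# The three example accounts from inference_example.py
print(follow_to_follower_ratio(5000, 100))    # 50.0
print(follow_to_follower_ratio(500, 10000))   # 0.05
print(follow_to_follower_ratio(8000, 50))     # 160.0
```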
100
+
101
+ ---
102
+
103
+ ## 🏆 Top 5 Most Important Features
104
+
105
+ 1. **FollowToFollowerRatio** - 0.2330
+ 2. **LikesCount** - 0.1771
+ 3. **HasInstagram** - 0.1395
+ 4. **FollowingCount** - 0.1349
+ 5. **FollowerCount** - 0.1055
110
+
111
+ ---
112
+
113
+ ## ⚙️ Hyperparameters
114
+
115
+ ### Best Parameters (from GridSearchCV)
116
+
117
+ - **class_weight:** None
118
+ - **max_depth:** 13
119
+ - **max_features:** sqrt
120
+ - **min_samples_leaf:** 2
121
+ - **min_samples_split:** 10
122
+ - **n_estimators:** 100
123
+
124
+ ### Parameter Search Space
125
+
126
+ - **n_estimators:** [100, 200, 300]
127
+ - **max_depth:** [10, 15, 20, None]
128
+ - **min_samples_split:** [2, 5, 10]
129
+ - **min_samples_leaf:** [1, 2, 4]
130
+ - **max_features:** ['sqrt', 'log2']
131
+ - **bootstrap:** [True, False]
132
+
133
+ **Total combinations tested:** 540
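
A grid search evaluates the Cartesian product of the listed values, once per cross-validation fold. The sketch below enumerates only the six lists printed in this section, using the standard library rather than scikit-learn. The total a notebook reports depends on exactly which values were searched, so a count derived from these lists may differ from it (for example if further parameters such as `class_weight` were also part of the search).

```python
from itertools import product

# Search space exactly as listed in this section
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# Each candidate picks one value per parameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)  # 3 * 4 * 3 * 3 * 2 * 2 = 432 for these lists
n_fits = n_candidates * 5       # one model fit per candidate per CV fold
print(n_candidates, n_fits)
```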
134
+
135
+ ---
136
+
137
+ ## 📈 Cross-Validation Results
138
+
139
+ ### Mean Scores (5-Fold Stratified CV)
140
+
141
+ - **Accuracy:** 0.9191 (±0.0097)
142
+ - **Precision:** 0.9326 (±0.0115)
143
+ - **Recall:** 0.9331 (±0.0166)
144
+ - **F1-Score:** 0.9327 (±0.0083)
145
+ - **ROC-AUC:** 0.9744 (±0.0055)
146
+
147
+ ---
148
+
149
+ ## 🖼️ Visualizations
150
+
151
+ All visualizations are saved in the `images/` directory:
152
+
153
+ 1. **01_class_distribution.png** - Training/Test set class distribution
154
+ 2. **02_feature_correlation.png** - Feature correlation with target variable
155
+ 3. **03_correlation_matrix.png** - Feature correlation heatmap
156
+ 4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
157
+ 5. **05_baseline_roc_curve.png** - Baseline ROC curve
158
+ 6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
159
+ 7. **07_baseline_feature_importance.png** - Baseline feature importance
160
+ 8. **08_cross_validation.png** - Cross-validation score distribution
161
+ 9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
162
+ 10. **10_tuned_roc_curve.png** - Tuned ROC curve
163
+ 11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
164
+ 12. **12_tuned_feature_importance.png** - Tuned feature importance
165
+ 13. **13_model_comparison.png** - Baseline vs Tuned comparison
166
+
167
+ ---
168
+
169
+ ## 🚀 Usage Example
170
+
171
+ ```python
+ import joblib
+ import pandas as pd
+
+ # Load model and scaler
+ model = joblib.load('tiktok_bot_detection.pkl')
+ scaler = joblib.load('tiktok_scaler.pkl')
+
+ # Prepare your data: raw, unscaled values (illustrative account)
+ data = {
+     'IsPrivate': 0,
+     'IsVerified': 0,
+     'HasProfilePic': 1,
+     'FollowingCount': 5000,
+     'FollowerCount': 100,
+     'LikesCount': 20,
+     'HasInstagram': 0,
+     'HasYoutube': 0,
+     'HasBio': 0,
+     'HasLinkInBio': 1,
+     'HasPosts': 1,
+     'PostsCount': 50,
+     'FollowToFollowerRatio': 50.0,
+ }
+
+ # Create DataFrame (column order follows the feature list)
+ df = pd.DataFrame([data])
+
+ # Scale features with the scaler fitted during training
+ df_scaled = scaler.transform(df)
+
+ # Predict
+ prediction = model.predict(df_scaled)[0]
+ probability = model.predict_proba(df_scaled)[0]
+
+ print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
+ print(f"Bot Probability: {probability[1]:.4f}")
+ print(f"Human Probability: {probability[0]:.4f}")
+ ```
211
+
212
+ ---
213
+
214
+ ## 📋 Confusion Matrix Breakdown
215
+
216
+ ### Tuned Model (Test Set)
217
+
218
+ ```
219
+ Predicted
220
+ Human Bot
221
+ Actual Human 220 24
222
+ Bot 18 334
223
+ ```
224
+
225
+ - **True Negatives (TN):** 220 (Correctly identified humans)
226
+ - **False Positives (FP):** 24 (Humans incorrectly classified as bots)
227
+ - **False Negatives (FN):** 18 (Bots incorrectly classified as humans)
228
+ - **True Positives (TP):** 334 (Correctly identified bots)
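
The standard metrics can be derived directly from these four counts. A minimal sketch using only the numbers in the matrix above; computed this way, the counts reproduce the `Tuned` column of `tiktok_model_comparison.csv`:

```python
# Counts from the tuned-model confusion matrix
tn, fp, fn, tp = 220, 24, 18, 334

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # share of all accounts classified correctly
precision = tp / (tp + fp)            # share of predicted bots that are bots
recall = tp / (tp + fn)               # share of actual bots that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```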
229
+
230
+ ---
231
+
232
+ ## 🔍 Model Interpretation
233
+
234
+ ### Strengths
235
+
236
+ - High ROC-AUC score (0.9773) indicates strong discrimination between bot and human accounts
237
+ - Balanced precision and recall for both classes
238
+ - Robust cross-validation performance
239
+
240
+ ### Key Insights
241
+
242
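+ 1. The five most important features, led by `FollowToFollowerRatio` (0.2330), account for about 79% of total feature importance
+ 2. GridSearchCV tuning raised test ROC-AUC from 0.9759 to 0.9773 (+0.0014, 0.14%)
+ 3. Test ROC-AUC (0.9773) is close to the 5-fold cross-validation mean (0.9744 ±0.0055), which suggests the model generalizes to held-out data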
245
+
246
+ ---
247
+
248
+ ## 📝 Notes
249
+
250
+ - **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
251
+ - **Missing Values:** Filled with 0 during preprocessing
252
+ - **Class Balance:** Moderately imbalanced (about 60% bot, 40% human)
+ - **Model Type:** Random Forest ensemble; averaging many trees reduces, but does not eliminate, overfitting
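
Min-max scaling maps a raw value `x` to `(x - min) / (max - min)`, where `min` and `max` come from the training data. This is why inference should reuse the saved scaler instead of fitting a new one. The sketch below is a plain-Python illustration of the formula, not the scikit-learn implementation, and the training range in it is hypothetical:

```python
def min_max_scale(value, train_min, train_max):
    """Scale one raw value using the min/max observed during training."""
    span = train_max - train_min
    if span == 0:
        return 0.0  # constant feature: there is no spread to scale by
    return (value - train_min) / span


# Hypothetical training range for FollowingCount (not read from the real scaler)
train_min, train_max = 0, 10_000

scaled = min_max_scale(5000, train_min, train_max)
print(scaled)  # 0.5

# Re-fitting on the single row being scored makes min == max == value,
# so the feature collapses to 0.0 whatever its raw value was.
refit = min_max_scale(5000, 5000, 5000)
print(refit)  # 0.0
```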
254
+
255
+ ---
256
+
257
+ ## 🔄 Model Updates
258
+
259
+ To retrain the model:
260
+
261
+ 1. Place new training data in `../data/train_tiktok.csv`
262
+ 2. Run the training notebook: `5_enhanced_training.ipynb`
263
+ 3. Update this README with new metrics
264
+
265
+ ---
266
+
267
+ ## 📧 Contact & Support
268
+
269
+ For questions or issues regarding this model, please refer to the main project documentation.
270
+
271
+ ---
272
+
273
+ **Generated:** 2025-12-30 11:38:35
274
+ **Notebook:** `5_enhanced_training.ipynb`
275
+ **Platform:** TikTok
images/01_class_distribution.png ADDED
images/02_future_correlation.png ADDED
images/03_correlation_matrix.png ADDED
images/04_baseline_confussion_matrix.png ADDED
images/05_baseline_roc_curve.png ADDED
images/06_baseline_precision_recall.png ADDED
images/07_baseline_feture_important.png ADDED
images/08_cross_validation.png ADDED
images/09_tuned_confussion_matrix.png ADDED
images/10_tuned_roc_curve.png ADDED
images/11_tuned_precision_recall.png ADDED
images/12_model_comparison.png ADDED
inference_example.py ADDED
@@ -0,0 +1,195 @@
1
+ """
2
+ Example inference script for TikTok Bot Detection Model
3
+ """
4
+
5
+ import joblib
6
+ import pandas as pd
7
+ from sklearn.preprocessing import MinMaxScaler
8
+
9
+
10
+ def load_model(model_path="tiktok_bot_detection.pkl"):
11
+ """Load the trained bot detection model"""
12
+ return joblib.load(model_path)
13
+
14
+
15
+ def prepare_features(account_data, scaler_path="tiktok_scaler.pkl"):
+     """
+     Prepare account features for prediction
+
+     Args:
+         account_data (dict): Dictionary containing account features
+         scaler_path (str): Path to the MinMaxScaler saved during training
+
+     Returns:
+         numpy.ndarray: Scaled features ready for prediction
+     """
+     # Same 13 features, in the same order, as tiktok_features.json
+     features = [
+         "IsPrivate",
+         "IsVerified",
+         "HasProfilePic",
+         "FollowingCount",
+         "FollowerCount",
+         "LikesCount",
+         "HasInstagram",
+         "HasYoutube",
+         "HasBio",
+         "HasLinkInBio",
+         "HasPosts",
+         "PostsCount",
+         "FollowToFollowerRatio",
+     ]
+
+     # Features absent from the input are filled with 0, as in preprocessing
+     df = pd.DataFrame([account_data]).reindex(columns=features, fill_value=0)
+
+     # Reuse the scaler fitted at training time; fitting a new MinMaxScaler
+     # on a single row would map every feature to 0
+     scaler = joblib.load(scaler_path)
+     df_scaled = scaler.transform(df)
+
+     return df_scaled
47
+
48
+
49
+ def predict_single_account(model, account_data):
50
+ """
51
+ Predict if a single account is a bot
52
+
53
+ Args:
54
+ model: Trained sklearn model
55
+ account_data (dict): Account features
56
+
57
+ Returns:
58
+ dict: Prediction results with probabilities
59
+ """
60
+ features_scaled = prepare_features(account_data)
61
+
62
+ prediction = model.predict(features_scaled)[0]
63
+ probability = model.predict_proba(features_scaled)[0]
64
+
65
+ return {
66
+ "is_bot": bool(prediction),
67
+ "bot_probability": float(probability[1]),
68
+ "human_probability": float(probability[0]),
69
+ "confidence": float(max(probability)),
70
+ }
71
+
72
+
73
+ def predict_batch(model, accounts_df, scaler_path="tiktok_scaler.pkl"):
+     """
+     Predict for multiple accounts at once
+
+     Args:
+         model: Trained sklearn model
+         accounts_df (pd.DataFrame): DataFrame with account features
+         scaler_path (str): Path to the MinMaxScaler saved during training
+
+     Returns:
+         pd.DataFrame: Original data with predictions added
+     """
+     # Same 13 features, in the same order, as tiktok_features.json
+     features = [
+         "IsPrivate",
+         "IsVerified",
+         "HasProfilePic",
+         "FollowingCount",
+         "FollowerCount",
+         "LikesCount",
+         "HasInstagram",
+         "HasYoutube",
+         "HasBio",
+         "HasLinkInBio",
+         "HasPosts",
+         "PostsCount",
+         "FollowToFollowerRatio",
+     ]
+
+     # Reuse the training-time scaler instead of fitting on the batch,
+     # so scaled values do not depend on which accounts are in the batch
+     scaler = joblib.load(scaler_path)
+     batch = accounts_df.reindex(columns=features, fill_value=0)
+     features_scaled = scaler.transform(batch)
+
+     predictions = model.predict(features_scaled)
+     probabilities = model.predict_proba(features_scaled)
+
+     accounts_df["is_bot"] = predictions
+     accounts_df["bot_probability"] = probabilities[:, 1]
+     accounts_df["human_probability"] = probabilities[:, 0]
+
+     return accounts_df
110
+
111
+
112
+ # Example usage
113
+ if __name__ == "__main__":
114
+ # Load model
115
+ print("Loading TikTok bot detection model...")
116
+ model = load_model()
117
+ print("✓ Model loaded successfully!\n")
118
+
119
+ # Example 1: Single account prediction
120
+ print("=" * 60)
121
+ print("Example 1: Single Account Prediction")
122
+ print("=" * 60)
123
+
124
+ suspicious_account = {
125
+ "IsPrivate": 0,
126
+ "IsVerified": 0,
127
+ "HasProfilePic": 1,
128
+ "FollowingCount": 5000,
129
+ "FollowerCount": 100,
+ "LikesCount": 20,  # illustrative value
130
+ "HasInstagram": 0,
131
+ "HasYoutube": 0,
132
+ "HasBio": 0,
133
+ "HasLinkInBio": 1,
134
+ "HasPosts": 1,
135
+ "PostsCount": 50,
136
+ "FollowToFollowerRatio": 50.0,
137
+ }
138
+
139
+ result = predict_single_account(model, suspicious_account)
140
+
141
+ print(f"Account Analysis:")
142
+ print(f" Following: {suspicious_account['FollowingCount']}")
143
+ print(f" Followers: {suspicious_account['FollowerCount']}")
144
+ print(f" Posts: {suspicious_account['PostsCount']}")
145
+ print(f"\nPrediction:")
146
+ print(f" Is Bot: {result['is_bot']}")
147
+ print(f" Bot Probability: {result['bot_probability']:.2%}")
148
+ print(f" Confidence: {result['confidence']:.2%}")
149
+
150
+ # Example 2: Batch prediction
151
+ print(f"\n{'='*60}")
152
+ print("Example 2: Batch Prediction")
153
+ print("=" * 60)
154
+
155
+ accounts = pd.DataFrame(
156
+ [
157
+ {
158
+ "IsPrivate": 0,
159
+ "IsVerified": 1,
160
+ "HasProfilePic": 1,
161
+ "FollowingCount": 500,
162
+ "FollowerCount": 10000,
+ "LikesCount": 150000,  # illustrative value
163
+ "HasInstagram": 1,
164
+ "HasYoutube": 1,
165
+ "HasBio": 1,
166
+ "HasLinkInBio": 1,
167
+ "HasPosts": 1,
168
+ "PostsCount": 200,
169
+ "FollowToFollowerRatio": 0.05,
170
+ },
171
+ {
172
+ "IsPrivate": 0,
173
+ "IsVerified": 0,
174
+ "HasProfilePic": 0,
175
+ "FollowingCount": 8000,
176
+ "FollowerCount": 50,
+ "LikesCount": 5,  # illustrative value
177
+ "HasInstagram": 0,
178
+ "HasYoutube": 0,
179
+ "HasBio": 0,
180
+ "HasLinkInBio": 1,
181
+ "HasPosts": 1,
182
+ "PostsCount": 10,
183
+ "FollowToFollowerRatio": 160.0,
184
+ },
185
+ ]
186
+ )
187
+
188
+ results = predict_batch(model, accounts.copy())
189
+
190
+ print("\nResults:")
191
+ for idx, row in results.iterrows():
192
+ print(f"\nAccount {idx + 1}:")
193
+ print(f" Followers: {row['FollowerCount']}")
194
+ print(f" Is Bot: {bool(row['is_bot'])}")
195
+ print(f" Bot Probability: {row['bot_probability']:.2%}")
requirements.txt ADDED
@@ -0,0 +1,4 @@
1
+ scikit-learn>=1.7.2
2
+ pandas>=2.0.0
3
+ numpy>=1.24.0
4
+ joblib>=1.3.0
tiktok_bot_detection.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b8b70c4f9f2da43c1a55cf5beaa7f412133c53a994f0f5e2ae555fec43657b5b
3
+ size 5917753
tiktok_features.json ADDED
@@ -0,0 +1,15 @@
1
+ [
2
+ "IsPrivate",
3
+ "IsVerified",
4
+ "HasProfilePic",
5
+ "FollowingCount",
6
+ "FollowerCount",
7
+ "LikesCount",
8
+ "HasInstagram",
9
+ "HasYoutube",
10
+ "HasBio",
11
+ "HasLinkInBio",
12
+ "HasPosts",
13
+ "PostsCount",
14
+ "FollowToFollowerRatio"
15
+ ]
tiktok_features.txt ADDED
@@ -0,0 +1,13 @@
1
+ IsPrivate
2
+ IsVerified
3
+ HasProfilePic
4
+ FollowingCount
5
+ FollowerCount
6
+ LikesCount
7
+ HasInstagram
8
+ HasYoutube
9
+ HasBio
10
+ HasLinkInBio
11
+ HasPosts
12
+ PostsCount
13
+ FollowToFollowerRatio
tiktok_metrics.txt ADDED
@@ -0,0 +1,23 @@
1
+ Model: Tiktok Bot Detection
2
+ Date: 2026-01-07 14:47:41.866486
3
+
4
+ ============================================================
5
+ Performance Metrics
6
+ ============================================================
7
+
8
+ Accuracy: 0.9224
9
+ Precision: 0.9596
10
+ Recall: 0.9094
11
+ F1: 0.9338
12
+ Roc_auc: 0.9773
13
+ Avg_precision: 0.9844
14
+
15
+ Best Parameters:
16
+ class_weight: balanced
17
+ max_depth: 30
18
+ max_features: sqrt
19
+ min_samples_leaf: 1
20
+ min_samples_split: 5
21
+ n_estimators: 300
22
+
23
+ Cross-Validation ROC-AUC: 0.9751 (+/- 0.0205)
tiktok_model_comparison.csv ADDED
@@ -0,0 +1,7 @@
1
+ Metric,Baseline,Tuned,Improvement,Improvement %
2
+ Accuracy,0.924496644295302,0.9295302013422819,0.005033557046979942,0.5444646098003711
3
+ Precision,0.9299719887955182,0.9329608938547486,0.002988905059230329,0.32139732112808056
4
+ Recall,0.9431818181818182,0.9488636363636364,0.005681818181818121,0.6024096385542104
5
+ F1-Score,0.9365303244005642,0.9408450704225352,0.0043147460219710165,0.46071610385202577
6
+ ROC-AUC,0.9729531482861401,0.9753807283904621,0.0024275801043219802,0.24950637228505504
7
+ Avg Precision,0.9811620379557393,0.9820393653516898,0.0008773273959504779,0.08941717698112313
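
The `Improvement %` column appears to be the improvement relative to the baseline, that is `(Tuned - Baseline) / Baseline * 100`. A minimal sketch that recomputes both improvement columns for two rows copied from the CSV above, using only the standard library:

```python
import csv
import io

# Two rows copied from tiktok_model_comparison.csv
raw = """Metric,Baseline,Tuned
Accuracy,0.924496644295302,0.9295302013422819
ROC-AUC,0.9729531482861401,0.9753807283904621
"""

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    baseline, tuned = float(row["Baseline"]), float(row["Tuned"])
    improvement = tuned - baseline                   # absolute difference
    improvement_pct = 100 * improvement / baseline   # relative to the baseline
    rows.append((row["Metric"], improvement, improvement_pct))
    print(f"{row['Metric']}: {improvement:+.4f} ({improvement_pct:.2f}%)")
```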
tiktok_scaler.pkl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e7768bfc30d959fda3f4cea858c95e8896d035dd77847ee96dc1e582a36d5a4e
3
+ size 1631