---
language: "en"
license: "apache-2.0"
library_name: "scikit-learn"
tags:
- "bot-detection"
- "tiktok"
- "classification"
- "scikit-learn"
- "random-forest"
---

# TikTok Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on TikTok.

**Model Version:** v2
**Training Date:** 2025-11-27 11:38:35
**Framework:** scikit-learn 1.5.2
**Algorithm:** Random Forest Classifier with GridSearchCV hyperparameter tuning


---

## Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.9295 (92.95%) |
| **Precision**         | 0.9330 (93.30%) |
| **Recall**            | 0.9489 (94.89%) |
| **F1-Score**          | 0.9408 (94.08%) |
| **ROC-AUC**           | 0.9754 (97.54%) |
| **Average Precision** | 0.9820 (98.20%) |

### Model Improvement

- **Baseline ROC-AUC:** 0.9730
- **Tuned ROC-AUC:** 0.9754
- **Improvement:** 0.0024 absolute (0.25% relative to the baseline)

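
The improvement figure combines an absolute and a relative change. A minimal sketch of that arithmetic, using only the two ROC-AUC values reported above (the variable names are illustrative):

```python
# ROC-AUC values reported in this README
baseline_auc = 0.9730
tuned_auc = 0.9754

# Absolute gain in ROC-AUC points
absolute_gain = tuned_auc - baseline_auc

# Relative gain, expressed as a percentage of the baseline
relative_gain_pct = absolute_gain / baseline_auc * 100

print(f"Absolute gain: {absolute_gain:.4f}")      # 0.0024
print(f"Relative gain: {relative_gain_pct:.2f}%")  # 0.25%
```
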

---

## Files

| File                          | Description                            |
| ----------------------------- | -------------------------------------- |
| `tiktok_bot_detection_v2.pkl` | Trained Random Forest model            |
| `tiktok_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `tiktok_features_v2.json`     | List of features used by the model     |
| `tiktok_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                     | All visualization plots (13 images)    |
| `README.md`                   | This file                              |


---

## Dataset Information

### Training Configuration

- **Training Samples:** 2,385
- **Test Samples:** 596
- **Total Samples:** 2,981
- **Number of Features:** 12
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

**Training Set:**

- Human (0): 951 (39.87%)
- Bot (1): 1,434 (60.13%)

**Test Set:**

- Human (0): 244 (40.94%)
- Bot (1): 352 (59.06%)


---

## Features (12)

1. `IsPrivate`
2. `IsVerified`
3. `HasProfilePic`
4. `FollowingCount`
5. `FollowerCount`
6. `HasInstagram`
7. `HasYoutube`
8. `HasBio`
9. `HasLinkInBio`
10. `HasPosts`
11. `PostsCount`
12. `FollowToFollowerRatio`

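
The list above fixes the order in which the model expects its inputs. The sketch below shows one way to assemble a row in that order from a raw profile. It assumes that `FollowToFollowerRatio` is `FollowingCount / FollowerCount` (with 0 when there are no followers) and that absent fields default to 0. This README does not spell out either rule for the derived feature, so check the training notebook before relying on them. `build_feature_row` and the example profile are hypothetical.

```python
# Feature order as listed in this README
FEATURES = [
    "IsPrivate", "IsVerified", "HasProfilePic", "FollowingCount",
    "FollowerCount", "HasInstagram", "HasYoutube", "HasBio",
    "HasLinkInBio", "HasPosts", "PostsCount", "FollowToFollowerRatio",
]


def build_feature_row(profile: dict) -> list:
    """Turn a raw profile dict into a list ordered like FEATURES.

    ASSUMPTION: the ratio is FollowingCount / FollowerCount, with 0 used
    when the account has no followers. The training notebook may define
    it differently.
    """
    # Fields missing from the profile default to 0
    row = {name: profile.get(name, 0) for name in FEATURES[:-1]}
    followers = row["FollowerCount"]
    row["FollowToFollowerRatio"] = (
        row["FollowingCount"] / followers if followers else 0.0
    )
    return [row[name] for name in FEATURES]


# Hypothetical account used only for illustration
example = {"HasProfilePic": 1, "FollowingCount": 900, "FollowerCount": 30,
           "HasPosts": 1, "PostsCount": 4}
print(build_feature_row(example))
# [0, 0, 1, 900, 30, 0, 0, 0, 0, 1, 4, 30.0]
```
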

---

## Top 5 Most Important Features

1. **FollowToFollowerRatio** - 0.2693
2. **FollowerCount** - 0.1753
3. **HasInstagram** - 0.1499
4. **FollowingCount** - 0.1236
5. **PostsCount** - 0.1174

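
To see how much of the total importance these five features carry, the values can be ranked and accumulated. This is a small sketch over the numbers reported above. With a fitted scikit-learn forest the same values would come from `model.feature_importances_`, which sums to 1 across all features.

```python
# Importances reported above; the remaining seven features share the rest
top_importances = {
    "FollowToFollowerRatio": 0.2693,
    "FollowerCount": 0.1753,
    "HasInstagram": 0.1499,
    "FollowingCount": 0.1236,
    "PostsCount": 0.1174,
}

# Rank by importance, highest first
ranked = sorted(top_importances.items(), key=lambda kv: kv[1], reverse=True)

# Print each feature with the running total of importance
cumulative = 0.0
for rank, (name, score) in enumerate(ranked, start=1):
    cumulative += score
    print(f"{rank}. {name}: {score:.4f} (cumulative {cumulative:.4f})")
```

The running total ends at 0.8355, so the top five features account for roughly 84% of the total importance.
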

---

## Hyperparameters

### Best Parameters (from GridSearchCV)

- **class_weight:** None
- **max_depth:** 13
- **max_features:** sqrt
- **min_samples_leaf:** 2
- **min_samples_split:** 10
- **n_estimators:** 100

### Parameter Search Space

- **n_estimators:** [100, 200, 300]
- **max_depth:** [10, 15, 20, None]
- **min_samples_split:** [2, 5, 10]
- **min_samples_leaf:** [1, 2, 4]
- **max_features:** ['sqrt', 'log2']
- **bootstrap:** [True, False]

**Total combinations tested:** 540

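
The size of an exhaustive grid is the product of the number of values per parameter. The sketch below counts the combinations for the search space exactly as listed. That grid multiplies out to 432, which differs from the 540 reported above, and the selected `max_depth` of 13 is not among the listed values, so the grid actually searched may have differed from this listing.

```python
from math import prod

# Search space exactly as listed in this README
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# One candidate per element of the Cartesian product of all value lists
n_combinations = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold under 5-fold cross-validation
n_fits = n_combinations * 5

print(n_combinations)  # 432
print(n_fits)          # 2160
```
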

---

## Cross-Validation Results

### Mean Scores (5-Fold Stratified CV)

- **Accuracy:** 0.9191 (±0.0097)
- **Precision:** 0.9326 (±0.0115)
- **Recall:** 0.9331 (±0.0166)
- **F1-Score:** 0.9327 (±0.0083)
- **ROC-AUC:** 0.9744 (±0.0055)

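
Scores of this kind can be produced with scikit-learn's `StratifiedKFold` and `cross_validate`. The sketch below runs on synthetic data from `make_classification` rather than the real training set, so its numbers will not match the ones above. It also assumes that the ± values are the standard deviation across folds, which this README does not state.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the training data (12 features, about 60% positives)
X, y = make_classification(
    n_samples=500, n_features=12, weights=[0.4, 0.6], random_state=42
)

model = RandomForestClassifier(n_estimators=50, random_state=42)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

results = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Report the mean and standard deviation of each metric across folds
for metric in scoring:
    scores = results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.4f} (±{scores.std():.4f})")
```
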

---

## Visualizations

All visualizations are saved in the `images/` directory:

1. **01_class_distribution.png** - Training/Test set class distribution
2. **02_feature_correlation.png** - Feature correlation with target variable
3. **03_correlation_matrix.png** - Feature correlation heatmap
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
5. **05_baseline_roc_curve.png** - Baseline ROC curve
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
7. **07_baseline_feature_importance.png** - Baseline feature importance
8. **08_cross_validation.png** - Cross-validation score distribution
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
10. **10_tuned_roc_curve.png** - Tuned ROC curve
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
12. **12_tuned_feature_importance.png** - Tuned feature importance
13. **13_model_comparison.png** - Baseline vs Tuned comparison


---

## Usage Example

```python
import joblib
import pandas as pd

# Load the trained model and the scaler fitted during training
model = joblib.load('tiktok_bot_detection_v2.pkl')
scaler = joblib.load('tiktok_scaler_v2.pkl')

# Prepare one account's raw (unscaled) features.
# The 0.5 values are placeholders; replace them with real values.
# Keep the columns in the training order (see tiktok_features_v2.json).
data = {
    'IsPrivate': 0.5,
    'IsVerified': 0.5,
    'HasProfilePic': 0.5,
    'FollowingCount': 0.5,
    'FollowerCount': 0.5,
    'HasInstagram': 0.5,
    'HasYoutube': 0.5,
    'HasBio': 0.5,
    'HasLinkInBio': 0.5,
    'HasPosts': 0.5,
    'PostsCount': 0.5,
    'FollowToFollowerRatio': 0.5,
}

# Create a single-row DataFrame
df = pd.DataFrame([data])

# Scale features with the training-time scaler
df_scaled = scaler.transform(df)

# Predict the class label and the per-class probabilities
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```

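
Because the scaler and model depend on column order, it can help to order each sample explicitly from the saved feature list before scaling. The sketch below assumes `tiktok_features_v2.json` is a flat JSON list of feature names, which this README does not confirm. It uses a temporary three-name stand-in file so that it runs on its own, and `order_features` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def order_features(sample: dict, feature_names: list) -> list:
    """Return the sample's values in training order, failing on missing keys."""
    missing = [name for name in feature_names if name not in sample]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return [sample[name] for name in feature_names]


# Stand-in for tiktok_features_v2.json (ASSUMED to be a flat JSON list;
# shortened to three names for brevity)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "features.json"
    path.write_text(json.dumps(["IsPrivate", "FollowingCount", "FollowerCount"]))
    feature_names = json.loads(path.read_text())

# Keys arrive in arbitrary order; the helper restores the training order
sample = {"FollowerCount": 30, "IsPrivate": 0, "FollowingCount": 900}
print(order_features(sample, feature_names))  # [0, 900, 30]
```

Raising on a missing key is a deliberate choice here: silently defaulting a feature would still produce a prediction, but one based on an input the caller never supplied.
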

---

## Confusion Matrix Breakdown

### Tuned Model (Test Set)

```
                Predicted
              Human    Bot
Actual Human   220      24
       Bot      18     334
```

- **True Negatives (TN):** 220 (Correctly identified humans)
- **False Positives (FP):** 24 (Humans incorrectly classified as bots)
- **False Negatives (FN):** 18 (Bots incorrectly classified as humans)
- **True Positives (TP):** 334 (Correctly identified bots)

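
The headline test-set metrics follow directly from these four counts. As a worked check, treating Bot as the positive class:

```python
# Counts from the tuned model's test-set confusion matrix
tn, fp, fn, tp = 220, 24, 18, 334

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted bots that are bots
recall = tp / (tp + fn)      # share of actual bots that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.9295
print(f"Precision: {precision:.4f}")  # 0.9330
print(f"Recall:    {recall:.4f}")     # 0.9489
print(f"F1-Score:  {f1:.4f}")         # 0.9408
```
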
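
---

## Model Interpretation

### Strengths

- A test-set ROC-AUC of 0.9754 indicates strong separation between bot and human accounts
- Precision and recall are reasonably balanced for both classes (bot: 0.9330 / 0.9489; human: 0.9244 / 0.9016, derived from the test-set confusion matrix)
- Stable cross-validation performance (ROC-AUC 0.9744 ± 0.0055 across 5 folds)

### Key Insights

1. The top five features account for about 84% of total feature importance (0.8355), led by `FollowToFollowerRatio` (0.2693)
2. GridSearchCV tuning raised ROC-AUC from 0.9730 to 0.9754, a 0.25% relative gain over the baseline
3. Test-set scores (accuracy 0.9295, ROC-AUC 0.9754) are in line with the cross-validation means (0.9191, 0.9744), which suggests the model generalizes to unseen data
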

---

## Notes

- **Feature Scaling:** All features are scaled to the [0, 1] range with MinMaxScaler
- **Missing Values:** Filled with 0 during preprocessing
- **Class Balance:** Moderately imbalanced (about 60% bot, 40% human)
- **Model Type:** Random Forest, an ensemble method that is less prone to overfitting than a single decision tree

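
The first two notes can be illustrated with a small plain-Python sketch of min-max scaling, $x' = (x - x_{min}) / (x_{max} - x_{min})$, applied after filling missing values with 0. The training range used here is hypothetical, and `min_max_scale` is an illustrative helper rather than the saved scaler. Returning 0.0 for a constant feature is a simplification made for this sketch.

```python
def min_max_scale(value, train_min, train_max):
    """Map a raw value to the [0, 1] scale defined by the training range."""
    if value is None:           # missing values are filled with 0 first
        value = 0
    if train_max == train_min:  # constant feature: avoid division by zero
        return 0.0
    return (value - train_min) / (train_max - train_min)


# Hypothetical training range for FollowerCount, for illustration only
follower_min, follower_max = 0, 10_000

print(min_max_scale(2_500, follower_min, follower_max))   # 0.25
print(min_max_scale(None, follower_min, follower_max))    # 0.0
print(min_max_scale(20_000, follower_min, follower_max))  # 2.0
```

The last call shows that a value outside the training range lands outside [0, 1]. scikit-learn's `MinMaxScaler` behaves the same way unless it was created with `clip=True`.
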

---

## Model Updates

To retrain the model:

1. Place new training data in `../data/train_tiktok.csv`
2. Run the training notebook: `5_enhanced_training.ipynb`
3. Update this README with new metrics


---

## Contact & Support

For questions or issues regarding this model, please refer to the main project documentation.


---

**Generated:** 2025-11-27 11:38:35
**Notebook:** `5_enhanced_training.ipynb`
**Platform:** TikTok