---
language: "en"
license: "apache-2.0"
library_name: "scikit-learn"
tags:
- "bot-detection"
- "tiktok"
- "classification"
- "scikit-learn"
- "random-forest"
---
# TikTok Bot Detection Model
## Overview
This directory contains a trained Random Forest classifier for detecting bot accounts on TikTok.
**Model Version:** v2
**Training Date:** 2025-11-27 11:38:35
**Framework:** scikit-learn 1.5.2
**Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
---
## 📊 Model Performance
### Final Metrics (Test Set)
| Metric | Score |
| --------------------- | --------------- |
| **Accuracy** | 0.9295 (92.95%) |
| **Precision** | 0.9330 (93.30%) |
| **Recall** | 0.9489 (94.89%) |
| **F1-Score** | 0.9408 (94.08%) |
| **ROC-AUC** | 0.9754 (97.54%) |
| **Average Precision** | 0.9820 (98.20%) |
### Model Improvement
- **Baseline ROC-AUC:** 0.9730
- **Tuned ROC-AUC:** 0.9754
- **Improvement:** +0.0024 absolute (about 0.25% relative to the baseline)
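The two improvement figures can be reproduced from the ROC-AUC values above. This is a small worked check, not part of the training code; the variable names are illustrative.

```python
# ROC-AUC values reported in this README
baseline_auc = 0.9730
tuned_auc = 0.9754

# Absolute gain is the plain difference; relative gain divides by the baseline
absolute_gain = tuned_auc - baseline_auc
relative_gain = absolute_gain / baseline_auc

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain:.2%}")
```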
---
## 🗂️ Files
| File | Description |
| ----------------------------- | -------------------------------------- |
| `tiktok_bot_detection_v2.pkl` | Trained Random Forest model |
| `tiktok_scaler_v2.pkl` | MinMaxScaler for feature normalization |
| `tiktok_features_v2.json` | List of features used by the model |
| `tiktok_metrics_v2.txt` | Detailed performance metrics report |
| `images/` | All visualization plots (13 images) |
| `README.md` | This file |
---
## 🎯 Dataset Information
### Training Configuration
- **Training Samples:** 2,385
- **Test Samples:** 596
- **Total Samples:** 2,981
- **Number of Features:** 12
- **Cross-Validation Folds:** 5
- **Random State:** 42
### Class Distribution
**Training Set:**
- Human (0): 951 (39.87%)
- Bot (1): 1,434 (60.13%)
**Test Set:**
- Human (0): 244 (40.94%)
- Bot (1): 352 (59.06%)
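For context on the scores reported in this README, the counts above imply roughly an 80/20 train/test split and a majority-class baseline of about 59% accuracy on the test set (always predicting "Bot"). A minimal sketch of that arithmetic, with illustrative names:

```python
# Class counts reported in this README
train_counts = {"human": 951, "bot": 1434}
test_counts = {"human": 244, "bot": 352}


def class_shares(counts):
    """Return each class's share of the total sample count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_total = sum(train_counts.values())
test_total = sum(test_counts.values())

# Fraction of all samples used for training
train_fraction = train_total / (train_total + test_total)

# Accuracy of a classifier that always predicts the most common test class
majority_baseline = max(class_shares(test_counts).values())

print(f"Train fraction: {train_fraction:.2%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```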
---
## 🔧 Features (12)
1. `IsPrivate`
2. `IsVerified`
3. `HasProfilePic`
4. `FollowingCount`
5. `FollowerCount`
6. `HasInstagram`
7. `HasYoutube`
8. `HasBio`
9. `HasLinkInBio`
10. `HasPosts`
11. `PostsCount`
12. `FollowToFollowerRatio`
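Input columns presumably need to be supplied in the order listed above, since the scaler and model operate on positional columns. The helper below is a hypothetical sketch (`to_feature_row` is not part of this repository) that arranges a record in that order and rejects missing or unexpected keys. The names could also be read from `tiktok_features_v2.json`, assuming that file holds them as a flat JSON array, which this README does not confirm.

```python
# Feature order as listed in this README (assumed to match the training order)
FEATURES = [
    "IsPrivate", "IsVerified", "HasProfilePic", "FollowingCount",
    "FollowerCount", "HasInstagram", "HasYoutube", "HasBio",
    "HasLinkInBio", "HasPosts", "PostsCount", "FollowToFollowerRatio",
]


def to_feature_row(record, feature_names=FEATURES):
    """Return the record's values in model order (hypothetical helper)."""
    missing = [name for name in feature_names if name not in record]
    extra = [name for name in record if name not in feature_names]
    if missing or extra:
        raise ValueError(f"missing={missing}, unexpected={extra}")
    return [record[name] for name in feature_names]


# A record whose keys arrive in reverse order is still emitted in model order
record = {name: 0 for name in reversed(FEATURES)}
record["FollowerCount"] = 120
row = to_feature_row(record)
print(row)
```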
---
## 🏆 Top 5 Most Important Features
1. **FollowToFollowerRatio** - 0.2693
2. **FollowerCount** - 0.1753
3. **HasInstagram** - 0.1499
4. **FollowingCount** - 0.1236
5. **PostsCount** - 0.1174
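scikit-learn normalizes Random Forest feature importances so that they sum to 1 across all features, so adding up the values above gives the share of total importance carried by the top five. A short worked sum using only the numbers from this README:

```python
# Importances of the five top-ranked features, as reported above
top_features = {
    "FollowToFollowerRatio": 0.2693,
    "FollowerCount": 0.1753,
    "HasInstagram": 0.1499,
    "FollowingCount": 0.1236,
    "PostsCount": 0.1174,
}

# Running total shows how quickly importance accumulates
cumulative = 0.0
for name, importance in top_features.items():
    cumulative += importance
    print(f"{name:<24} {importance:.4f}  cumulative={cumulative:.4f}")

# Whatever is left is shared by the remaining seven features
remaining = 1.0 - cumulative
print(f"Remaining 7 features: {remaining:.4f}")
```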
---
## ⚙️ Hyperparameters
### Best Parameters (from GridSearchCV)
- **class_weight:** None
- **max_depth:** 13
- **max_features:** sqrt
- **min_samples_leaf:** 2
- **min_samples_split:** 10
- **n_estimators:** 100
### Parameter Search Space
- **n_estimators:** [100, 200, 300]
- **max_depth:** [10, 15, 20, None]
- **min_samples_split:** [2, 5, 10]
- **min_samples_leaf:** [1, 2, 4]
- **max_features:** ['sqrt', 'log2']
- **bootstrap:** [True, False]
**Total combinations tested:** 540
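As a cross-check, the size of a full grid over the search space exactly as listed above can be counted directly. The listed values give 432 candidates rather than 540, and `max_depth: 13` and `class_weight` appear among the best parameters but not in the list, so the grid actually used presumably differed from the one shown here. The sketch below only counts the listed values; it does not reproduce the original search.

```python
from math import prod

# Search space exactly as listed in this README
search_space = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# A full grid evaluates every combination of the listed values
n_candidates = prod(len(values) for values in search_space.values())

# With 5-fold cross-validation, each candidate is fitted once per fold
n_fits = n_candidates * 5

print(f"Candidates in listed grid: {n_candidates}")
print(f"Model fits with 5 folds:   {n_fits}")
```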
---
## 📈 Cross-Validation Results
### Mean Scores (5-Fold Stratified CV)
- **Accuracy:** 0.9191 (±0.0097)
- **Precision:** 0.9326 (±0.0115)
- **Recall:** 0.9331 (±0.0166)
- **F1-Score:** 0.9327 (±0.0083)
- **ROC-AUC:** 0.9744 (±0.0055)
---
## 🖼️ Visualizations
All visualizations are saved in the `images/` directory:
1. **01_class_distribution.png** - Training/Test set class distribution
2. **02_feature_correlation.png** - Feature correlation with target variable
3. **03_correlation_matrix.png** - Feature correlation heatmap
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
5. **05_baseline_roc_curve.png** - Baseline ROC curve
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
7. **07_baseline_feature_importance.png** - Baseline feature importance
8. **08_cross_validation.png** - Cross-validation score distribution
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
10. **10_tuned_roc_curve.png** - Tuned ROC curve
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
12. **12_tuned_feature_importance.png** - Tuned feature importance
13. **13_model_comparison.png** - Baseline vs Tuned comparison
---
## 🚀 Usage Example
```python
import joblib
import pandas as pd

# Load the trained model and the fitted MinMaxScaler
model = joblib.load('tiktok_bot_detection_v2.pkl')
scaler = joblib.load('tiktok_scaler_v2.pkl')

# Raw (unscaled) feature values for one account; the numbers are illustrative only.
# Keys follow the feature order listed in this README.
data = {
    'IsPrivate': 0,
    'IsVerified': 0,
    'HasProfilePic': 1,
    'FollowingCount': 350,
    'FollowerCount': 120,
    'HasInstagram': 0,
    'HasYoutube': 0,
    'HasBio': 1,
    'HasLinkInBio': 0,
    'HasPosts': 1,
    'PostsCount': 14,
    # Assumed to be FollowingCount / FollowerCount; check the training notebook
    'FollowToFollowerRatio': 350 / 120,
}

# Build a single-row DataFrame
df = pd.DataFrame([data])

# Scale features with the scaler fitted during training
df_scaled = scaler.transform(df)

# Predict the class label (1 = Bot, 0 = Human) and the class probabilities
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```
---
## 📋 Confusion Matrix Breakdown
### Tuned Model (Test Set)
```
                 Predicted
                 Human    Bot
Actual  Human      220     24
        Bot         18    334
```
- **True Negatives (TN):** 220 (Correctly identified humans)
- **False Positives (FP):** 24 (Humans incorrectly classified as bots)
- **False Negatives (FN):** 18 (Bots incorrectly classified as humans)
- **True Positives (TP):** 334 (Correctly identified bots)
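The test-set metrics reported under Model Performance follow from these four counts. A minimal sketch of the standard formulas, treating "Bot" as the positive class:

```python
# Confusion matrix counts for the tuned model on the test set
tn, fp, fn, tp = 220, 24, 18, 334

# Share of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Of the accounts flagged as bots, how many really are bots
precision = tp / (tp + fp)

# Of the real bots, how many were flagged
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Of the real humans, how many were correctly left unflagged
specificity = tn / (tn + fp)

print(f"Accuracy:    {accuracy:.4f}")
print(f"Precision:   {precision:.4f}")
print(f"Recall:      {recall:.4f}")
print(f"F1-Score:    {f1:.4f}")
print(f"Specificity: {specificity:.4f}")
```

Specificity is not listed in the metrics table; it is included here because it describes how often human accounts are correctly left unflagged.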
---
## 🔍 Model Interpretation
### Strengths
- High ROC-AUC score (0.9754) indicates excellent discrimination capability
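- Precision and recall are close for both classes on the test set (bot: 0.9330 / 0.9489; human: 0.9244 / 0.9016, derived from the confusion matrix)
- Stable cross-validation performance (ROC-AUC 0.9744 ±0.0055 across 5 folds)
### Key Insights
1. The five most important features account for about 84% of total feature importance, led by `FollowToFollowerRatio` (0.2693)
2. GridSearchCV improved ROC-AUC over the baseline by 0.0024 (about 0.25% relative)
3. Test ROC-AUC (0.9754) is in line with the cross-validated mean (0.9744), which suggests the model generalizes to held-out data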
---
## 📝 Notes
- **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
- **Missing Values:** Filled with 0 during preprocessing
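- **Class Balance:** Moderately imbalanced dataset (about 60% bot, 40% human); the selected `class_weight` is None
- **Model Type:** Random Forest, an ensemble of decision trees that is generally less prone to overfitting than a single tree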
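The preprocessing described in these notes can be illustrated with a plain-Python version of min-max scaling. This is a sketch of the formula only, not the fitted `tiktok_scaler_v2.pkl`; the training minimum and maximum used below are hypothetical. Note that with scikit-learn's default `MinMaxScaler` settings, values outside the range seen during training map outside [0, 1].

```python
def min_max_scale(value, train_min, train_max):
    """Scale one raw value using the min and max observed during training."""
    # Missing values are filled with 0 before scaling, as stated in the notes
    if value is None:
        value = 0.0
    # Guard against a constant feature; scikit-learn treats a zero range as 1
    value_range = (train_max - train_min) or 1.0
    return (value - train_min) / value_range


# Hypothetical training range for FollowerCount
follower_min, follower_max = 0, 10_000

print(min_max_scale(2_500, follower_min, follower_max))   # inside the training range
print(min_max_scale(20_000, follower_min, follower_max))  # above the training maximum
print(min_max_scale(None, follower_min, follower_max))    # missing value
```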
---
## 🔄 Model Updates
To retrain the model:
1. Place new training data in `../data/train_tiktok.csv`
2. Run the training notebook: `5_enhanced_training.ipynb`
3. Update this README with new metrics
---
## 📧 Contact & Support
For questions or issues regarding this model, please refer to the main project documentation.
---
**Generated:** 2025-11-27 11:38:35
**Notebook:** `5_enhanced_training.ipynb`
**Platform:** TikTok