---
language: "en"
license: "apache-2.0"
library_name: "scikit-learn"
tags:
  - "bot-detection"
  - "tiktok"
  - "classification"
  - "scikit-learn"
  - "random-forest"
---

# TikTok Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on TikTok.

- **Model Version:** v2
- **Training Date:** 2025-11-27 11:38:35
- **Framework:** scikit-learn 1.5.2
- **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning

---

## 📊 Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.9295 (92.95%) |
| **Precision**         | 0.9330 (93.30%) |
| **Recall**            | 0.9489 (94.89%) |
| **F1-Score**          | 0.9408 (94.08%) |
| **ROC-AUC**           | 0.9754 (97.54%) |
| **Average Precision** | 0.9820 (98.20%) |

Precision, recall, and F1-score are reported for the bot class (label 1).

### Model Improvement

- **Baseline ROC-AUC:** 0.9730
- **Tuned ROC-AUC:** 0.9754
- **Improvement:** +0.0024 (about 0.25% relative)

---

## 🗂️ Files

| File                          | Description                            |
| ----------------------------- | -------------------------------------- |
| `tiktok_bot_detection_v2.pkl` | Trained Random Forest model            |
| `tiktok_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `tiktok_features_v2.json`     | List of features used by the model     |
| `tiktok_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                     | All visualization plots (13 images)    |
| `README.md`                   | This file                              |

---

## 🎯 Dataset Information

### Training Configuration

- **Training Samples:** 2,385
- **Test Samples:** 596
- **Total Samples:** 2,981
- **Number of Features:** 12
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

**Training Set:**

- Human (0): 951 (39.87%)
- Bot (1): 1,434 (60.13%)

**Test Set:**

- Human (0): 244 (40.94%)
- Bot (1): 352 (59.06%)

---

## 🔧 Features (12)

1. `IsPrivate`
2. `IsVerified`
3. `HasProfilePic`
4. `FollowingCount`
5. `FollowerCount`
6. `HasInstagram`
7. `HasYoutube`
8. `HasBio`
9. `HasLinkInBio`
10. `HasPosts`
11. `PostsCount`
12. `FollowToFollowerRatio`

---

## 🏆 Top 5 Most Important Features

1. **FollowToFollowerRatio** - 0.2693
2. **FollowerCount** - 0.1753
3. **HasInstagram** - 0.1499
4. **FollowingCount** - 0.1236
5. **PostsCount** - 0.1174

---

## ⚙️ Hyperparameters

### Best Parameters (from GridSearchCV)

- **class_weight:** None
- **max_depth:** 13
- **max_features:** sqrt
- **min_samples_leaf:** 2
- **min_samples_split:** 10
- **n_estimators:** 100

### Parameter Search Space

- **n_estimators:** [100, 200, 300]
- **max_depth:** [10, 15, 20, None]
- **min_samples_split:** [2, 5, 10]
- **min_samples_leaf:** [1, 2, 4]
- **max_features:** ['sqrt', 'log2']
- **bootstrap:** [True, False]

**Total combinations tested:** 540

---

## 📈 Cross-Validation Results

### Mean Scores (5-Fold Stratified CV)

- **Accuracy:** 0.9191 (±0.0097)
- **Precision:** 0.9326 (±0.0115)
- **Recall:** 0.9331 (±0.0166)
- **F1-Score:** 0.9327 (±0.0083)
- **ROC-AUC:** 0.9744 (±0.0055)

---

## 🖼️ Visualizations

All visualizations are saved in the `images/` directory:

1. **01_class_distribution.png** - Training/Test set class distribution
2. **02_feature_correlation.png** - Feature correlation with target variable
3. **03_correlation_matrix.png** - Feature correlation heatmap
4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
5. **05_baseline_roc_curve.png** - Baseline ROC curve
6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
7. **07_baseline_feature_importance.png** - Baseline feature importance
8. **08_cross_validation.png** - Cross-validation score distribution
9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
10. **10_tuned_roc_curve.png** - Tuned ROC curve
11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
12. **12_tuned_feature_importance.png** - Tuned feature importance
13. **13_model_comparison.png** - Baseline vs Tuned comparison

---

## 🚀 Usage Example

```python
import joblib
import pandas as pd

# Load model and scaler
model = joblib.load('tiktok_bot_detection_v2.pkl')
scaler = joblib.load('tiktok_scaler_v2.pkl')

# Prepare your data: raw (unscaled) values for one account.
# The numbers below are illustrative only; keys follow the feature order above.
data = {
    'IsPrivate': 0,
    'IsVerified': 0,
    'HasProfilePic': 1,
    'FollowingCount': 1500,
    'FollowerCount': 50,
    'HasInstagram': 0,
    'HasYoutube': 0,
    'HasBio': 1,
    'HasLinkInBio': 0,
    'HasPosts': 1,
    'PostsCount': 3,
    # Assumed here to be FollowingCount / FollowerCount;
    # use the same definition as in training.
    'FollowToFollowerRatio': 30.0,
}

# Create DataFrame (one row, columns in training order)
df = pd.DataFrame([data])

# Scale features with the fitted MinMaxScaler
df_scaled = scaler.transform(df)

# Predict class label and class probabilities
prediction = model.predict(df_scaled)[0]
probability = model.predict_proba(df_scaled)[0]

print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
print(f"Bot Probability: {probability[1]:.4f}")
print(f"Human Probability: {probability[0]:.4f}")
```

---

## 📋 Confusion Matrix Breakdown

### Tuned Model (Test Set)

```
                Predicted
              Human    Bot
Actual Human    220     24
       Bot       18    334
```

- **True Negatives (TN):** 220 (Correctly identified humans)
- **False Positives (FP):** 24 (Humans incorrectly classified as bots)
- **False Negatives (FN):** 18 (Bots incorrectly classified as humans)
- **True Positives (TP):** 334 (Correctly identified bots)

---

## 🔍 Model Interpretation

### Strengths

- High test ROC-AUC (0.9754) indicates strong separation between bots and humans
- Precision (0.9330) and recall (0.9489) on the bot class are close, so neither error type dominates
- Stable cross-validation performance (ROC-AUC 0.9744 ±0.0055 across 5 folds)

### Key Insights

1. The top five features account for about 84% of total feature importance, led by `FollowToFollowerRatio` (0.2693)
2. GridSearchCV raised test ROC-AUC from 0.9730 to 0.9754 (+0.0024), a marginal gain over the baseline
3. Test metrics are in line with the cross-validation means (for example, ROC-AUC 0.9754 vs 0.9744), which suggests the model generalizes to held-out data

---

## 📝 Notes

- **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
- **Missing Values:** Filled with 0 during preprocessing
- **Class Balance:** Moderately imbalanced dataset (about 60% bot, 40% human)
- **Model Type:** Random Forest, an ensemble method that is relatively resistant to overfitting

---

## 🔄 Model Updates

To retrain the model:

1. Place new training data in `../data/train_tiktok.csv`
2. Run the training notebook: `5_enhanced_training.ipynb`
3. Update this README with new metrics

---

## 📧 Contact & Support

For questions or issues regarding this model, please refer to the main project documentation.

---

- **Generated:** 2025-11-27 11:38:35
- **Notebook:** `5_enhanced_training.ipynb`
- **Platform:** TikTok
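
---

## 🧮 Recomputing the Reported Numbers

The test-set metrics and the grid size in this README can be rechecked from the figures it already contains. The snippet below is a minimal sketch using only the standard library: it derives accuracy, precision, recall, and F1 for the bot class from the tuned-model confusion matrix, and counts the candidates in the parameter search space as listed. The helper names `metrics_from_confusion` and `grid_size` are hypothetical and not part of the training notebook.

```python
from math import prod


def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 for the positive (bot) class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def grid_size(param_grid):
    """Number of candidates in an exhaustive grid (product of the list lengths)."""
    return prod(len(values) for values in param_grid.values())


# Counts copied from the tuned-model confusion matrix in this README.
test_metrics = metrics_from_confusion(tn=220, fp=24, fn=18, tp=334)

# Search space exactly as listed in this README.
listed_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
print("grid size:", grid_size(listed_grid))
```

The four metrics round to the values in the Final Metrics table (0.9295, 0.9330, 0.9489, 0.9408). The search space as listed yields 432 candidates rather than the reported 540. A grid with five `max_depth` values would give 540 and would be consistent with the best `max_depth` of 13, but this README does not state which grid was actually searched.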