nahiar committed on Jan 7

Commit

6d16d09

verified ·

1 Parent(s): c03aefb

Initial upload (auto-create if missing)

Files changed (21)

README.md +275 -0
images/01_class_distribution.png +0 -0
images/02_future_correlation.png +0 -0
images/03_correlation_matrix.png +0 -0
images/04_baseline_confussion_matrix.png +0 -0
images/05_baseline_roc_curve.png +0 -0
images/06_baseline_precision_recall.png +0 -0
images/07_baseline_feture_important.png +0 -0
images/08_cross_validation.png +0 -0
images/09_tuned_confussion_matrix.png +0 -0
images/10_tuned_roc_curve.png +0 -0
images/11_tuned_precision_recall.png +0 -0
images/12_model_comparison.png +0 -0
inference_example.py +195 -0
requirements.txt +4 -0
tiktok_bot_detection.pkl +3 -0
tiktok_features.json +15 -0
tiktok_features.txt +13 -0
tiktok_metrics.txt +23 -0
tiktok_model_comparison.csv +7 -0
tiktok_scaler.pkl +3 -0

README.md ADDED Viewed

	@@ -0,0 +1,275 @@

+---
+language: "en"
+license: "apache-2.0"
+library_name: "scikit-learn"
+tags:
+  - "bot-detection"
+  - "tiktok"
+  - "classification"
+  - "scikit-learn"
+  - "random-forest"
+---
+# TikTok Bot Detection Model
+## Overview
+This directory contains a trained Random Forest classifier for detecting bot accounts on TikTok.
+- **Model Version:** v2
+- **Training Date:** 2025-12-30 11:38:35
+- **Framework:** scikit-learn 1.5.2
+- **Algorithm:** Random Forest Classifier with GridSearchCV hyperparameter tuning
+---
+## 📊 Model Performance
+### Final Metrics (Test Set)
+| Metric                | Score           |
+| --------------------- | --------------- |
+| **Accuracy**          | 0.9224 (92.24%) |
+| **Precision**         | 0.9596 (95.96%) |
+| **Recall**            | 0.9094 (90.94%) |
+| **F1-Score**          | 0.9338 (93.38%) |
+| **ROC-AUC**           | 0.9773 (97.73%) |
+| **Average Precision** | 0.9844 (98.44%) |
+### Model Improvement
+- **Baseline ROC-AUC:** 0.9759
+- **Tuned ROC-AUC:** 0.9773
+- **Improvement:** 0.0014 (0.14%)
+---
+## 🗂️ Files
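+| File                          | Description                               |
+| ----------------------------- | ----------------------------------------- |
+| `tiktok_bot_detection.pkl`    | Trained Random Forest model               |
+| `tiktok_scaler.pkl`           | MinMaxScaler for feature normalization    |
+| `tiktok_features.json`        | List of features used by the model (JSON) |
+| `tiktok_features.txt`         | List of features, one name per line       |
+| `tiktok_metrics.txt`          | Detailed performance metrics report       |
+| `tiktok_model_comparison.csv` | Baseline vs. tuned metric comparison      |
+| `inference_example.py`        | Example inference script                  |
+| `requirements.txt`            | Python dependencies                       |
+| `images/`                     | All visualization plots (12 images)       |
+| `README.md`                   | This file                                 |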
+---
+## 🎯 Dataset Information
+### Training Configuration
+- **Training Samples:** 2,385
+- **Test Samples:** 596
+- **Total Samples:** 2,981
+- **Number of Features:** 13
+- **Cross-Validation Folds:** 5
+- **Random State:** 42
+### Class Distribution
+**Training Set:**
+- Human (0): 951 (39.87%)
+- Bot (1): 1,434 (60.13%)
+**Test Set:**
+- Human (0): 244 (40.94%)
+- Bot (1): 352 (59.06%)
+---
+## 🔧 Features (13)
+1. `IsPrivate`
+2. `IsVerified`
+3. `HasProfilePic`
+4. `FollowingCount`
+5. `FollowerCount`
+6. `LikesCount`
+7. `HasInstagram`
+8. `HasYoutube`
+9. `HasBio`
+10. `HasLinkInBio`
+11. `HasPosts`
+12. `PostsCount`
+13. `FollowToFollowerRatio`
+---
+## 🏆 Top 5 Most Important Features
+1. **FollowToFollowerRatio** - 0.2330
+2. **LikesCount** - 0.1771
+3. **HasInstagram** - 0.1395
+4. **FollowingCount** - 0.1349
+5. **FollowerCount** - 0.1055
+---
+## ⚙️ Hyperparameters
+### Best Parameters (from GridSearchCV)
+- **class_weight:** None
+- **max_depth:** 13
+- **max_features:** sqrt
+- **min_samples_leaf:** 2
+- **min_samples_split:** 10
+- **n_estimators:** 100
+### Parameter Search Space
+- **n_estimators:** [100, 200, 300]
+- **max_depth:** [10, 15, 20, None]
+- **min_samples_split:** [2, 5, 10]
+- **min_samples_leaf:** [1, 2, 4]
+- **max_features:** ['sqrt', 'log2']
+- **bootstrap:** [True, False]
+**Total combinations tested:** 540
+---
+## 📈 Cross-Validation Results
+### Mean Scores (5-Fold Stratified CV)
+- **Accuracy:** 0.9191 (±0.0097)
+- **Precision:** 0.9326 (±0.0115)
+- **Recall:** 0.9331 (±0.0166)
+- **F1-Score:** 0.9327 (±0.0083)
+- **ROC-AUC:** 0.9744 (±0.0055)
+---
+## 🖼️ Visualizations
+All visualizations are saved in the `images/` directory:
+1. **01_class_distribution.png** - Training/Test set class distribution
+2. **02_future_correlation.png** - Feature correlation with target variable
+3. **03_correlation_matrix.png** - Feature correlation heatmap
+4. **04_baseline_confussion_matrix.png** - Baseline model confusion matrix
+5. **05_baseline_roc_curve.png** - Baseline ROC curve
+6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
+7. **07_baseline_feture_important.png** - Baseline feature importance
+8. **08_cross_validation.png** - Cross-validation score distribution
+9. **09_tuned_confussion_matrix.png** - Tuned model confusion matrix
+10. **10_tuned_roc_curve.png** - Tuned ROC curve
+11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
+12. **12_model_comparison.png** - Baseline vs Tuned comparison
+---
+## 🚀 Usage Example
+```python
+import joblib
+import pandas as pd
+# Load the model and the scaler that was fitted on the training data
+model = joblib.load('tiktok_bot_detection.pkl')
+scaler = joblib.load('tiktok_scaler.pkl')
+# Raw (unscaled) account features; the values below are illustrative only
+data = {
+    'IsPrivate': 0,
+    'IsVerified': 0,
+    'HasProfilePic': 1,
+    'FollowingCount': 5000,
+    'FollowerCount': 100,
+    'LikesCount': 200,
+    'HasInstagram': 0,
+    'HasYoutube': 0,
+    'HasBio': 0,
+    'HasLinkInBio': 1,
+    'HasPosts': 1,
+    'PostsCount': 50,
+    'FollowToFollowerRatio': 50.0,
+}
+# Create a one-row DataFrame; keys follow the order in tiktok_features.json
+df = pd.DataFrame([data])
+# Scale features with the fitted scaler (transform only, never re-fit)
+df_scaled = scaler.transform(df)
+# Predict the class label and the per-class probabilities
+prediction = model.predict(df_scaled)[0]
+probability = model.predict_proba(df_scaled)[0]
+print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
+print(f"Bot Probability: {probability[1]:.4f}")
+print(f"Human Probability: {probability[0]:.4f}")
+```
+---
+## 📋 Confusion Matrix Breakdown
+### Tuned Model (Test Set)
+```
+                Predicted
+              Human    Bot
+Actual Human      220      24
+       Bot         18     334
+```
+- **True Negatives (TN):** 220 (Correctly identified humans)
+- **False Positives (FP):** 24 (Humans incorrectly classified as bots)
+- **False Negatives (FN):** 18 (Bots incorrectly classified as humans)
+- **True Positives (TP):** 334 (Correctly identified bots)
+---
+## 🔍 Model Interpretation
+### Strengths
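+- High ROC-AUC score (0.9773) indicates excellent discrimination capability
+- Precision (0.9596) and recall (0.9094) both exceed 0.90 on the test set
+- Stable cross-validation performance (ROC-AUC 0.9744 ± 0.0055 across 5 folds)
+### Key Insights
+1. The five most important features account for about 79% of total feature importance, led by `FollowToFollowerRatio` (0.2330)
+2. GridSearchCV tuning raised test ROC-AUC from 0.9759 to 0.9773 (+0.0014)
+3. Test ROC-AUC (0.9773) is in line with the cross-validation mean (0.9744), which suggests the model generalizes to unseen data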
+---
+## 📝 Notes
+- **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
+- **Missing Values:** Filled with 0 during preprocessing
+- **Class Balance:** Moderately imbalanced (about 60% bot, 40% human)
+- **Model Type:** Random Forest, a bagged ensemble of decision trees that is less prone to overfitting than a single tree
+---
+## 🔄 Model Updates
+To retrain the model:
+1. Place new training data in `../data/train_tiktok.csv`
+2. Run the training notebook: `5_enhanced_training.ipynb`
+3. Update this README with new metrics
+---
+## 📧 Contact & Support
+For questions or issues regarding this model, please refer to the main project documentation.
+---
+**Generated:** 2025-12-30 11:38:35
+**Notebook:** `5_enhanced_training.ipynb`
+**Platform:** TikTok
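
The confusion-matrix breakdown in the README can be turned into the usual metrics by hand. The sketch below does that with the four counts from the tuned-model matrix (TN 220, FP 24, FN 18, TP 334), treating the bot class as positive. The function name `classification_metrics` is hypothetical. The results line up with the Tuned column of `tiktok_model_comparison.csv`, not with the README's "Final Metrics" table, so the two appear to come from different runs.

```python
# Counts taken from the README's tuned-model confusion matrix (test set).
TN, FP, FN, TP = 220, 24, 18, 334


def classification_metrics(tn, fp, fn, tp):
    """Derive standard binary metrics with the bot class (1) as positive."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)  # share of predicted bots that are bots
    recall = tp / (tp + fn)     # share of actual bots that were caught
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = classification_metrics(TN, FP, FN, TP)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```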

images/01_class_distribution.png ADDED Viewed

images/02_future_correlation.png ADDED Viewed

images/03_correlation_matrix.png ADDED Viewed

images/04_baseline_confussion_matrix.png ADDED Viewed

images/05_baseline_roc_curve.png ADDED Viewed

images/06_baseline_precision_recall.png ADDED Viewed

images/07_baseline_feture_important.png ADDED Viewed

images/08_cross_validation.png ADDED Viewed

images/09_tuned_confussion_matrix.png ADDED Viewed

images/10_tuned_roc_curve.png ADDED Viewed

images/11_tuned_precision_recall.png ADDED Viewed

images/12_model_comparison.png ADDED Viewed

inference_example.py ADDED Viewed

	@@ -0,0 +1,195 @@

+"""
+Example inference script for TikTok Bot Detection Model
+"""
+import joblib
+import pandas as pd
+# Feature order mirrors tiktok_features.json (13 features, including LikesCount)
+FEATURES = [
+    "IsPrivate",
+    "IsVerified",
+    "HasProfilePic",
+    "FollowingCount",
+    "FollowerCount",
+    "LikesCount",
+    "HasInstagram",
+    "HasYoutube",
+    "HasBio",
+    "HasLinkInBio",
+    "HasPosts",
+    "PostsCount",
+    "FollowToFollowerRatio",
+]
+def load_model(model_path="tiktok_bot_detection.pkl"):
+    """Load the trained bot detection model"""
+    return joblib.load(model_path)
+def load_scaler(scaler_path="tiktok_scaler.pkl"):
+    """Load the MinMaxScaler that was fitted on the training data"""
+    return joblib.load(scaler_path)
+def prepare_features(scaler, accounts_df):
+    """
+    Scale account features with the fitted scaler
+    Args:
+        scaler: Fitted MinMaxScaler loaded from disk
+        accounts_df (pd.DataFrame): DataFrame with one row per account
+    Returns:
+        numpy.ndarray: Scaled features ready for prediction
+    """
+    # transform() reuses the training min/max. Re-fitting a new scaler on
+    # inference data would map a single account to all zeros.
+    return scaler.transform(accounts_df[FEATURES])
+def predict_single_account(model, scaler, account_data):
+    """
+    Predict if a single account is a bot
+    Args:
+        model: Trained sklearn model
+        scaler: Fitted MinMaxScaler loaded from disk
+        account_data (dict): Account features
+    Returns:
+        dict: Prediction results with probabilities
+    """
+    features_scaled = prepare_features(scaler, pd.DataFrame([account_data]))
+    prediction = model.predict(features_scaled)[0]
+    probability = model.predict_proba(features_scaled)[0]
+    return {
+        "is_bot": bool(prediction),
+        "bot_probability": float(probability[1]),
+        "human_probability": float(probability[0]),
+        "confidence": float(max(probability)),
+    }
+def predict_batch(model, scaler, accounts_df):
+    """
+    Predict for multiple accounts at once
+    Args:
+        model: Trained sklearn model
+        scaler: Fitted MinMaxScaler loaded from disk
+        accounts_df (pd.DataFrame): DataFrame with account features
+    Returns:
+        pd.DataFrame: Copy of the original data with predictions added
+    """
+    features_scaled = prepare_features(scaler, accounts_df)
+    predictions = model.predict(features_scaled)
+    probabilities = model.predict_proba(features_scaled)
+    # Work on a copy so the caller's DataFrame is left untouched
+    results = accounts_df.copy()
+    results["is_bot"] = predictions
+    results["bot_probability"] = probabilities[:, 1]
+    results["human_probability"] = probabilities[:, 0]
+    return results
+# Example usage
+if __name__ == "__main__":
+    # Load model and scaler
+    print("Loading TikTok bot detection model...")
+    model = load_model()
+    scaler = load_scaler()
+    print("✓ Model loaded successfully!\n")
+    # Example 1: Single account prediction
+    print("=" * 60)
+    print("Example 1: Single Account Prediction")
+    print("=" * 60)
+    # LikesCount values in these examples are illustrative only
+    suspicious_account = {
+        "IsPrivate": 0,
+        "IsVerified": 0,
+        "HasProfilePic": 1,
+        "FollowingCount": 5000,
+        "FollowerCount": 100,
+        "LikesCount": 200,
+        "HasInstagram": 0,
+        "HasYoutube": 0,
+        "HasBio": 0,
+        "HasLinkInBio": 1,
+        "HasPosts": 1,
+        "PostsCount": 50,
+        "FollowToFollowerRatio": 50.0,
+    }
+    result = predict_single_account(model, scaler, suspicious_account)
+    print("Account Analysis:")
+    print(f"  Following: {suspicious_account['FollowingCount']}")
+    print(f"  Followers: {suspicious_account['FollowerCount']}")
+    print(f"  Posts: {suspicious_account['PostsCount']}")
+    print("\nPrediction:")
+    print(f"  Is Bot: {result['is_bot']}")
+    print(f"  Bot Probability: {result['bot_probability']:.2%}")
+    print(f"  Confidence: {result['confidence']:.2%}")
+    # Example 2: Batch prediction
+    print(f"\n{'='*60}")
+    print("Example 2: Batch Prediction")
+    print("=" * 60)
+    accounts = pd.DataFrame(
+        [
+            {
+                "IsPrivate": 0,
+                "IsVerified": 1,
+                "HasProfilePic": 1,
+                "FollowingCount": 500,
+                "FollowerCount": 10000,
+                "LikesCount": 150000,
+                "HasInstagram": 1,
+                "HasYoutube": 1,
+                "HasBio": 1,
+                "HasLinkInBio": 1,
+                "HasPosts": 1,
+                "PostsCount": 200,
+                "FollowToFollowerRatio": 0.05,
+            },
+            {
+                "IsPrivate": 0,
+                "IsVerified": 0,
+                "HasProfilePic": 0,
+                "FollowingCount": 8000,
+                "FollowerCount": 50,
+                "LikesCount": 30,
+                "HasInstagram": 0,
+                "HasYoutube": 0,
+                "HasBio": 0,
+                "HasLinkInBio": 1,
+                "HasPosts": 1,
+                "PostsCount": 10,
+                "FollowToFollowerRatio": 160.0,
+            },
+        ]
+    )
+    results = predict_batch(model, scaler, accounts)
+    print("\nResults:")
+    for idx, row in results.iterrows():
+        print(f"\nAccount {idx + 1}:")
+        print(f"  Followers: {row['FollowerCount']}")
+        print(f"  Is Bot: {bool(row['is_bot'])}")
+        print(f"  Bot Probability: {row['bot_probability']:.2%}")
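
The example accounts in this script pass `FollowToFollowerRatio` as a precomputed value, and in all three cases it equals `FollowingCount / FollowerCount` (5000/100 = 50, 500/10000 = 0.05, 8000/50 = 160). A minimal sketch of deriving it follows. The helper name is hypothetical, and returning 0.0 for accounts with zero followers is an assumption, since the diff does not show how the training pipeline handled that case.

```python
def follow_to_follower_ratio(following_count, follower_count):
    """Following divided by followers.

    Assumption: accounts with zero followers map to 0.0. The training
    notebook is not part of this commit, so its handling is unknown.
    """
    if follower_count == 0:
        return 0.0
    return following_count / follower_count


# The three example accounts from inference_example.py
for following, followers in [(5000, 100), (500, 10000), (8000, 50)]:
    ratio = follow_to_follower_ratio(following, followers)
    print(f"{following} / {followers} = {ratio}")
```

Whatever rule is chosen for zero followers should match the one used at training time, otherwise the model's most important feature is computed differently at inference.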

requirements.txt ADDED Viewed

	@@ -0,0 +1,4 @@

+scikit-learn>=1.7.2
+pandas>=2.0.0
+numpy>=1.24.0
+joblib>=1.3.0

tiktok_bot_detection.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b8b70c4f9f2da43c1a55cf5beaa7f412133c53a994f0f5e2ae555fec43657b5b
+size 5917753

tiktok_features.json ADDED Viewed

	@@ -0,0 +1,15 @@

+[
+  "IsPrivate",
+  "IsVerified",
+  "HasProfilePic",
+  "FollowingCount",
+  "FollowerCount",
+  "LikesCount",
+  "HasInstagram",
+  "HasYoutube",
+  "HasBio",
+  "HasLinkInBio",
+  "HasPosts",
+  "PostsCount",
+  "FollowToFollowerRatio"
+]
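
Because the scaler and model depend on column order, it can help to build each input row directly from this feature list and fail loudly when a key is missing or unexpected. The sketch below is one way to do that. `to_feature_row` is a hypothetical helper, and the list is inlined here only so the block runs on its own; in practice it could be read from `tiktok_features.json` with `json.load`.

```python
import json

# Inlined copy of tiktok_features.json so the example is self-contained.
FEATURE_NAMES = json.loads("""
[
  "IsPrivate", "IsVerified", "HasProfilePic", "FollowingCount",
  "FollowerCount", "LikesCount", "HasInstagram", "HasYoutube",
  "HasBio", "HasLinkInBio", "HasPosts", "PostsCount",
  "FollowToFollowerRatio"
]
""")


def to_feature_row(account, feature_names=FEATURE_NAMES):
    """Return the account's values in training order, or raise on mismatch."""
    missing = [name for name in feature_names if name not in account]
    unexpected = [name for name in account if name not in feature_names]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return [account[name] for name in feature_names]


# Keys given in reverse order still come out in training order.
account = {name: index for index, name in enumerate(reversed(FEATURE_NAMES))}
row = to_feature_row(account)
print(len(row), row[0], row[-1])
```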

tiktok_features.txt ADDED Viewed

	@@ -0,0 +1,13 @@

+IsPrivate
+IsVerified
+HasProfilePic
+FollowingCount
+FollowerCount
+LikesCount
+HasInstagram
+HasYoutube
+HasBio
+HasLinkInBio
+HasPosts
+PostsCount
+FollowToFollowerRatio

tiktok_metrics.txt ADDED Viewed

	@@ -0,0 +1,23 @@

+Model: Tiktok Bot Detection
+Date: 2026-01-07 14:47:41.866486
+============================================================
+Performance Metrics
+============================================================
+Accuracy: 0.9224
+Precision: 0.9596
+Recall: 0.9094
+F1: 0.9338
+Roc_auc: 0.9773
+Avg_precision: 0.9844
+Best Parameters:
+  class_weight: balanced
+  max_depth: 30
+  max_features: sqrt
+  min_samples_leaf: 1
+  min_samples_split: 5
+  n_estimators: 300
+Cross-Validation ROC-AUC: 0.9751 (+/- 0.0205)
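
As a quick consistency check on this report, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two values above. The sketch below uses the rounded figures from this file; `f1_from` is a hypothetical helper name.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values as printed in tiktok_metrics.txt
reported_precision = 0.9596
reported_recall = 0.9094
f1 = f1_from(reported_precision, reported_recall)
print(f"F1 recomputed from precision and recall: {f1:.4f}")
```

The recomputed value rounds to 0.9338, which agrees with the `F1` line in the report.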

tiktok_model_comparison.csv ADDED Viewed

	@@ -0,0 +1,7 @@

+Metric,Baseline,Tuned,Improvement,Improvement %
+Accuracy,0.924496644295302,0.9295302013422819,0.005033557046979942,0.5444646098003711
+Precision,0.9299719887955182,0.9329608938547486,0.002988905059230329,0.32139732112808056
+Recall,0.9431818181818182,0.9488636363636364,0.005681818181818121,0.6024096385542104
+F1-Score,0.9365303244005642,0.9408450704225352,0.0043147460219710165,0.46071610385202577
+ROC-AUC,0.9729531482861401,0.9753807283904621,0.0024275801043219802,0.24950637228505504
+Avg Precision,0.9811620379557393,0.9820393653516898,0.0008773273959504779,0.08941717698112313
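
The `Improvement` and `Improvement %` columns in this file are consistent with an absolute difference (`Tuned - Baseline`) and a change relative to the baseline (`Improvement / Baseline * 100`). The sketch below recomputes both for two of the rows. The function name is hypothetical, and the formula is inferred from the numbers, not taken from the training notebook.

```python
import csv
import io

# Two rows copied from tiktok_model_comparison.csv (Baseline and Tuned only)
RAW = """Metric,Baseline,Tuned
Accuracy,0.924496644295302,0.9295302013422819
ROC-AUC,0.9729531482861401,0.9753807283904621
"""


def with_improvement(csv_text):
    """Add absolute and baseline-relative improvement to each metric row."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        baseline = float(row["Baseline"])
        tuned = float(row["Tuned"])
        improvement = tuned - baseline
        rows.append({
            "Metric": row["Metric"],
            "Improvement": improvement,
            "Improvement %": improvement / baseline * 100,
        })
    return rows


comparison = with_improvement(RAW)
for entry in comparison:
    print(f"{entry['Metric']}: {entry['Improvement']:+.4f} "
          f"({entry['Improvement %']:.2f}%)")
```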

tiktok_scaler.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e7768bfc30d959fda3f4cea858c95e8896d035dd77847ee96dc1e582a36d5a4e
+size 1631
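
The `.pkl` entries in this commit are Git LFS pointers, so the diff shows only the SHA-256 digest and byte size of each binary. Since unpickling runs arbitrary code, it can be worth checking a downloaded file against its pointer before calling `joblib.load`. The sketch below shows one way to do that with the standard library. Both function names are hypothetical, and the pointer text is the one shown for `tiktok_scaler.pkl`.

```python
import hashlib

# Pointer contents as shown in the diff for tiktok_scaler.pkl
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:e7768bfc30d959fda3f4cea858c95e8896d035dd77847ee96dc1e582a36d5a4e
size 1631
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into hash algorithm, hex digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer):
    """Return True if the file's size and digest agree with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large model files are not loaded at once
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

A call such as `matches_pointer("tiktok_scaler.pkl", pointer)` would then be made on the downloaded file before loading it.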