	---
	language: "en"
	license: "apache-2.0"
	library_name: "scikit-learn"
	tags:
	- "bot-detection"
	- "tiktok"
	- "classification"
	- "scikit-learn"
	- "random-forest"
	---

	# TikTok Bot Detection Model

	## Overview

	This directory contains a trained Random Forest classifier for detecting bot accounts on TikTok.

	- Model Version: v2
	- Training Date: 2025-11-27 11:38:35
	- Framework: scikit-learn 1.5.2
	- Algorithm: Random Forest Classifier with GridSearchCV hyperparameter tuning

	---

	## 📊 Model Performance

	### Final Metrics (Test Set)

	| Metric            | Score           |
	| ----------------- | --------------- |
	| Accuracy          | 0.9295 (92.95%) |
	| Precision         | 0.9330 (93.30%) |
	| Recall            | 0.9489 (94.89%) |
	| F1-Score          | 0.9408 (94.08%) |
	| ROC-AUC           | 0.9754 (97.54%) |
	| Average Precision | 0.9820 (98.20%) |

	### Model Improvement

	- Baseline ROC-AUC: 0.9730
	- Tuned ROC-AUC: 0.9754
	- Improvement: 0.0024 (about 0.25% relative to the baseline)
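
The percentage is the gain relative to the baseline score, not a gain in percentage points. A minimal sketch of the arithmetic, using only the two ROC-AUC values reported here (the variable names are illustrative):

```python
# ROC-AUC values as reported in this README
baseline_auc = 0.9730
tuned_auc = 0.9754

# Absolute gain in ROC-AUC
absolute_gain = tuned_auc - baseline_auc

# Gain relative to the baseline, expressed as a percentage
relative_gain_pct = absolute_gain / baseline_auc * 100

print(f"Absolute gain: {absolute_gain:.4f}")
print(f"Relative gain: {relative_gain_pct:.2f}%")
```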

	---

	## 🗂️ Files

	| File                          | Description                            |
	| ----------------------------- | -------------------------------------- |
	| `tiktok_bot_detection_v2.pkl` | Trained Random Forest model            |
	| `tiktok_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
	| `tiktok_features_v2.json`     | List of features used by the model     |
	| `tiktok_metrics_v2.txt`       | Detailed performance metrics report    |
	| `images/`                     | All visualization plots (13 images)    |
	| `README.md`                   | This file                              |

	---

	## 🎯 Dataset Information

	### Training Configuration

	- Training Samples: 2,385
	- Test Samples: 596
	- Total Samples: 2,981
	- Number of Features: 12
	- Cross-Validation Folds: 5
	- Random State: 42

	### Class Distribution

	Training Set:

	- Human (0): 951 (39.87%)
	- Bot (1): 1,434 (60.13%)

	Test Set:

	- Human (0): 244 (40.94%)
	- Bot (1): 352 (59.06%)

	---

	## 🔧 Features (12)

	1. `IsPrivate`
	2. `IsVerified`
	3. `HasProfilePic`
	4. `FollowingCount`
	5. `FollowerCount`
	6. `HasInstagram`
	7. `HasYoutube`
	8. `HasBio`
	9. `HasLinkInBio`
	10. `HasPosts`
	11. `PostsCount`
	12. `FollowToFollowerRatio`
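
`FollowToFollowerRatio` is the only derived feature in this list, and its exact definition is not given here. The sketch below assumes it is `FollowingCount / FollowerCount` and that accounts with zero followers get a ratio of 0. Both choices, and the helper name, are assumptions; check the training notebook before relying on them.

```python
def follow_to_follower_ratio(following_count: int, follower_count: int) -> float:
    """Hypothetical helper: following / followers, or 0.0 when there are no followers.

    ASSUMPTION: neither the formula nor the zero-follower fallback is confirmed
    by this README; they are one plausible reading of the feature name.
    """
    if follower_count <= 0:
        return 0.0
    return following_count / follower_count


print(follow_to_follower_ratio(1500, 120))  # 12.5
print(follow_to_follower_ratio(300, 0))     # 0.0
```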

	---

	## 🏆 Top 5 Most Important Features

	1. `FollowToFollowerRatio` - 0.2693
	2. `FollowerCount` - 0.1753
	3. `HasInstagram` - 0.1499
	4. `FollowingCount` - 0.1236
	5. `PostsCount` - 0.1174
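
scikit-learn normalizes Random Forest importances so that they sum to 1 across all features, which means the scores above can be read as shares of the total. A small sketch that adds up the five reported values to show how much of the total they cover (the dictionary only restates the list above):

```python
# Importances of the top five features, as reported above
top_importances = {
    "FollowToFollowerRatio": 0.2693,
    "FollowerCount": 0.1753,
    "HasInstagram": 0.1499,
    "FollowingCount": 0.1236,
    "PostsCount": 0.1174,
}

# Running total, from most to least important
cumulative = 0.0
for name, score in top_importances.items():
    cumulative += score
    print(f"{name:<24}{score:.4f}  cumulative: {cumulative:.4f}")

# Share left over for the remaining seven features
remaining = 1.0 - cumulative
print(f"Remaining 7 features: {remaining:.4f}")
```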

	---

	## ⚙️ Hyperparameters

	### Best Parameters (from GridSearchCV)

	- class_weight: None
	- max_depth: 13
	- max_features: sqrt
	- min_samples_leaf: 2
	- min_samples_split: 10
	- n_estimators: 100

	### Parameter Search Space

	- n_estimators: [100, 200, 300]
	- max_depth: [10, 15, 20, None]
	- min_samples_split: [2, 5, 10]
	- min_samples_leaf: [1, 2, 4]
	- max_features: ['sqrt', 'log2']
	- bootstrap: [True, False]

	Total combinations tested: 540
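
The number of candidates in an exhaustive grid search is the product of the lengths of the per-parameter value lists. The sketch below counts candidates for the search space exactly as listed above. It gives 432 rather than 540, and the best `max_depth` of 13 is not among the listed values, so the grid that was actually searched may differ from this listing.

```python
from math import prod

# Search space exactly as listed in this README
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# One candidate per combination of values
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print(f"Candidates: {n_candidates}")
print(f"Fits with {n_folds}-fold CV: {n_fits}")
```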

	---

	## 📈 Cross-Validation Results

	### Mean Scores (5-Fold Stratified CV)

	- Accuracy: 0.9191 (±0.0097)
	- Precision: 0.9326 (±0.0115)
	- Recall: 0.9331 (±0.0166)
	- F1-Score: 0.9327 (±0.0083)
	- ROC-AUC: 0.9744 (±0.0055)
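
One quick way to read these numbers is to check whether each test-set score falls within the reported spread around its cross-validation mean. A minimal sketch using only values from this README (the ± values are assumed to be standard deviations across folds):

```python
# (cv_mean, cv_spread, test_score) per metric, all taken from this README
scores = {
    "Accuracy": (0.9191, 0.0097, 0.9295),
    "Precision": (0.9326, 0.0115, 0.9330),
    "Recall": (0.9331, 0.0166, 0.9489),
    "F1-Score": (0.9327, 0.0083, 0.9408),
    "ROC-AUC": (0.9744, 0.0055, 0.9754),
}

within_one_spread = {}
for metric, (cv_mean, cv_spread, test_score) in scores.items():
    # Positive gap means the test score is above the CV mean
    gap = test_score - cv_mean
    within_one_spread[metric] = abs(gap) <= cv_spread
    print(f"{metric:<10} gap: {gap:+.4f}  within spread: {within_one_spread[metric]}")
```

Only accuracy lands slightly outside its band, and every gap is positive, so none of the test scores fall below the cross-validation estimate.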

	---

	## 🖼️ Visualizations

	All visualizations are saved in the `images/` directory:

	1. `01_class_distribution.png` - Training/Test set class distribution
	2. `02_feature_correlation.png` - Feature correlation with target variable
	3. `03_correlation_matrix.png` - Feature correlation heatmap
	4. `04_baseline_confusion_matrix.png` - Baseline model confusion matrix
	5. `05_baseline_roc_curve.png` - Baseline ROC curve
	6. `06_baseline_precision_recall.png` - Baseline Precision-Recall curve
	7. `07_baseline_feature_importance.png` - Baseline feature importance
	8. `08_cross_validation.png` - Cross-validation score distribution
	9. `09_tuned_confusion_matrix.png` - Tuned model confusion matrix
	10. `10_tuned_roc_curve.png` - Tuned ROC curve
	11. `11_tuned_precision_recall.png` - Tuned Precision-Recall curve
	12. `12_tuned_feature_importance.png` - Tuned feature importance
	13. `13_model_comparison.png` - Baseline vs Tuned comparison

	---

	## 🚀 Usage Example

	```python
	import joblib
	import pandas as pd

	# Load the trained model and the fitted scaler
	model = joblib.load('tiktok_bot_detection_v2.pkl')
	scaler = joblib.load('tiktok_scaler_v2.pkl')

	# One account's raw (unscaled) feature values; the scaler handles normalization.
	# Values are illustrative. The Is*/Has* flags are assumed to be 0 or 1.
	data = {
	    'IsPrivate': 0,
	    'IsVerified': 0,
	    'HasProfilePic': 1,
	    'FollowingCount': 1500,
	    'FollowerCount': 120,
	    'HasInstagram': 0,
	    'HasYoutube': 0,
	    'HasBio': 1,
	    'HasLinkInBio': 0,
	    'HasPosts': 1,
	    'PostsCount': 8,
	    'FollowToFollowerRatio': 12.5,  # assumed: FollowingCount / FollowerCount
	}

	# Build a single-row DataFrame; keys follow the feature order listed above
	df = pd.DataFrame([data])

	# Scale features with the scaler fitted during training
	df_scaled = scaler.transform(df)

	# Predict the class label and the per-class probabilities
	prediction = model.predict(df_scaled)[0]
	probability = model.predict_proba(df_scaled)[0]

	print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
	print(f"Bot Probability: {probability[1]:.4f}")
	print(f"Human Probability: {probability[0]:.4f}")
	```
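
The scaler and model expect all 12 features, and the sketch below assumes the order in the Features section is the column order used during training. A small guard like this can catch missing or misspelled keys before they reach `scaler.transform`. The name `to_feature_row` is illustrative and not part of the published files.

```python
# Feature names in the order given in the Features section
# (ASSUMPTION: this is also the column order used during training)
FEATURES = [
    "IsPrivate", "IsVerified", "HasProfilePic", "FollowingCount",
    "FollowerCount", "HasInstagram", "HasYoutube", "HasBio",
    "HasLinkInBio", "HasPosts", "PostsCount", "FollowToFollowerRatio",
]


def to_feature_row(record: dict) -> list:
    """Return the record's values in FEATURES order, or raise on bad keys."""
    missing = [name for name in FEATURES if name not in record]
    unexpected = [name for name in record if name not in FEATURES]
    if missing or unexpected:
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return [record[name] for name in FEATURES]


# Keys may arrive in any order; the output always follows FEATURES
record = {name: 0 for name in reversed(FEATURES)}
record["FollowerCount"] = 120
row = to_feature_row(record)
print(row)
```

The returned list can be wrapped as `pd.DataFrame([row], columns=FEATURES)` before scaling.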

	---

	## 📋 Confusion Matrix Breakdown

	### Tuned Model (Test Set)

	```
	                Predicted
	                Human   Bot
	Actual  Human     220    24
	        Bot        18   334
	```

	- True Negatives (TN): 220 (Correctly identified humans)
	- False Positives (FP): 24 (Humans incorrectly classified as bots)
	- False Negatives (FN): 18 (Bots incorrectly classified as humans)
	- True Positives (TP): 334 (Correctly identified bots)
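
The headline test metrics follow directly from these four counts. A short sketch that recomputes them, treating the bot class as the positive class:

```python
# Counts from the tuned model's test-set confusion matrix
tn, fp, fn, tp = 220, 24, 18, 334

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of accounts flagged as bots, the share that are bots
recall = tp / (tp + fn)     # of actual bots, the share that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
```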

	---

	## 🔍 Model Interpretation

	### Strengths

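	- High ROC-AUC (0.9754) indicates strong separation between bot and human accounts
	- Precision and recall are both above 0.90 for each class (bot: 0.9330 / 0.9489; human: 0.9244 / 0.9016, derived from the confusion matrix)
	- Stable cross-validation performance: ROC-AUC of 0.9744 (±0.0055) across 5 folds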

	### Key Insights

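	1. The five most important features account for about 84% of total feature importance, led by `FollowToFollowerRatio` (0.2693)
	2. GridSearchCV raised ROC-AUC from 0.9730 to 0.9754, a gain of 0.0024 (about 0.25% relative to the baseline)
	3. Test-set ROC-AUC (0.9754) is in line with the cross-validation mean (0.9744), which suggests the model generalizes to unseen data from the same distribution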

	---

	## 📝 Notes

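	- Feature Scaling: All features are scaled to the [0, 1] range with MinMaxScaler; apply the saved scaler to raw values before prediction
	- Missing Values: Filled with 0 during preprocessing
	- Class Balance: Moderately imbalanced (about 60% bot, 40% human); the selected model uses `class_weight: None`
	- Model Type: Random Forest, an ensemble of decision trees that is less prone to overfitting than a single tree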
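
Min-max scaling maps each feature with x' = (x - min) / (max - min), where min and max come from the training data. The pure-Python sketch below illustrates the formula on made-up numbers; it is not the saved scaler. scikit-learn's `MinMaxScaler` does not clip by default, so a value beyond the training range lands outside [0, 1].

```python
def min_max_scale(value: float, train_min: float, train_max: float) -> float:
    """Scale one value using the min and max observed during training."""
    if train_max == train_min:
        # Constant feature: avoid division by zero (simplification for this sketch)
        return 0.0
    return (value - train_min) / (train_max - train_min)


# Made-up training range for a count feature (illustrative only)
train_min, train_max = 0, 10_000

print(min_max_scale(2_500, train_min, train_max))   # inside the training range
print(min_max_scale(25_000, train_min, train_max))  # beyond the training max
```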

	---

	## 🔄 Model Updates

	To retrain the model:

	1. Place new training data in `../data/train_tiktok.csv`
	2. Run the training notebook: `5_enhanced_training.ipynb`
	3. Update this README with new metrics

	---

	## 📧 Contact & Support

	For questions or issues regarding this model, please refer to the main project documentation.

	---

	Generated: 2025-11-27 11:38:35
	Notebook: `5_enhanced_training.ipynb`
	Platform: TikTok