nahiar committed on Nov 27, 2025

Commit 35d1a1d (verified)

1 Parent(s): 76eb959

Upload folder using huggingface_hub

Files changed (20)

.gitattributes +9 -0
README.md +184 -196
images/01_class_distribution.png +3 -0
images/02_feature_correlation.png +3 -0
images/03_correlation_matrix.png +3 -0
images/04_baseline_confusion_matrix.png +0 -0
images/05_baseline_roc_curve.png +3 -0
images/06_baseline_precision_recall.png +0 -0
images/07_baseline_feature_importance.png +3 -0
images/08_cross_validation.png +3 -0
images/09_tuned_confusion_matrix.png +0 -0
images/10_tuned_roc_curve.png +3 -0
images/11_tuned_precision_recall.png +0 -0
images/12_tuned_feature_importance.png +3 -0
images/13_model_comparison.png +3 -0
instagram_bot_detection_v2.pkl +3 -0
instagram_features_v2.json +12 -0
instagram_metrics_v2.txt +39 -0
instagram_model_comparison.csv +7 -0
instagram_scaler_v2.pkl +3 -0

.gitattributes CHANGED

	@@ -1 +1,10 @@
 *.pkl filter=lfs diff=lfs merge=lfs -text
+images/01_class_distribution.png filter=lfs diff=lfs merge=lfs -text
+images/02_feature_correlation.png filter=lfs diff=lfs merge=lfs -text
+images/03_correlation_matrix.png filter=lfs diff=lfs merge=lfs -text
+images/05_baseline_roc_curve.png filter=lfs diff=lfs merge=lfs -text
+images/07_baseline_feature_importance.png filter=lfs diff=lfs merge=lfs -text
+images/08_cross_validation.png filter=lfs diff=lfs merge=lfs -text
+images/10_tuned_roc_curve.png filter=lfs diff=lfs merge=lfs -text
+images/12_tuned_feature_importance.png filter=lfs diff=lfs merge=lfs -text
+images/13_model_comparison.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,255 +1,243 @@
----
-language: en
-license: mit
-tags:
-  - bot-detection
-  - instagram
-  - random-forest
-  - sklearn
-  - social-media
-  - classification
-metrics:
-  - accuracy
-  - precision
-  - recall
-  - f1
-  - roc-auc
-library_name: scikit-learn
----
-# Instagram Bot Detection Model
-## Model Description
-This Random Forest classifier is designed to detect bot accounts on Instagram based on profile features and behavioral patterns. The model analyzes various account characteristics to determine whether an account is likely automated (bot) or genuine (human).
-## Model Details
-- **Model Type**: Random Forest Classifier
-- **Framework**: scikit-learn
-- **Task**: Binary Classification (Bot vs Human)
-- **Language**: Python
-- **License**: MIT
-## Performance Metrics
-The model achieves exceptional performance on the test dataset:
-- **ROC-AUC Score**: 0.9988
-- **Accuracy**: Near-perfect accuracy in distinguishing bots from legitimate accounts
-The ROC curve demonstrates outstanding discriminative ability with an AUC of 0.9988, indicating near-perfect model performance.
-## Features Used
-The model uses the following 12 features for prediction:
-1. **IsPrivate** - Whether the account is set to private
-2. **IsVerified** - Whether the account has a verification badge
-3. **HasProfilePic** - Whether the account has a profile picture
-4. **FollowingCount** - Number of accounts being followed
-5. **FollowerCount** - Number of followers
-6. **HasExternalUrl** - Whether there's an external URL in the profile
-7. **HasBio** - Whether the account has a bio description
-8. **HasPosts** - Whether the account has made any posts
-9. **PostsCount** - Total number of posts
-10. **FollowToFollowerRatio** - Ratio of following to followers
-11. **IsBusinessAccount** - Whether the account is a business account
-12. **HasHighlights** - Whether the account has story highlights
-## Intended Use
-### Primary Uses
-- Identifying potential bot accounts on Instagram
-- Content moderation and platform integrity
-- Research on social media bot behavior
-- Automated account screening
-- Spam detection systems
-### Out-of-Scope Uses
-- This model is specifically trained for Instagram and should not be used for other platforms without retraining
-- Should not be the sole basis for account suspension decisions
-- Not designed for real-time detection without proper infrastructure
-- Not suitable for detecting sophisticated bot networks without additional features
-## How to Use
-### Installation
-```bash
-pip install scikit-learn pandas numpy joblib
-```
-### Loading the Model
-```python
-import joblib
-import pandas as pd
-import numpy as np
-from sklearn.preprocessing import MinMaxScaler
-# Load the model
-model = joblib.load('IG_BOT_Detection_Model_v1.pkl')
-# Prepare your data
-features = ['IsPrivate', 'IsVerified', 'HasProfilePic', 'FollowingCount',
-            'FollowerCount', 'HasExternalUrl', 'HasBio', 'HasPosts',
-            'PostsCount', 'FollowToFollowerRatio', 'IsBusinessAccount',
-            'HasHighlights']
-# Example account data
-account_data = {
-    'IsPrivate': 0,
-    'IsVerified': 0,
-    'HasProfilePic': 1,
-    'FollowingCount': 7500,
-    'FollowerCount': 150,
-    'HasExternalUrl': 1,
-    'HasBio': 0,
-    'HasPosts': 1,
-    'PostsCount': 20,
-    'FollowToFollowerRatio': 50.0,
-    'IsBusinessAccount': 0,
-    'HasHighlights': 0
-}
-# Create DataFrame
-df = pd.DataFrame([account_data])
-# Scale features (use the same scaler as training)
-scaler = MinMaxScaler()
-# Note: In production, you should save and load the scaler from training
-df_scaled = scaler.fit_transform(df[features])
-# Make prediction
-prediction = model.predict(df_scaled)
-probability = model.predict_proba(df_scaled)
-print(f"Prediction: {'Bot' if prediction[0] == 1 else 'Human'}")
-print(f"Confidence - Human: {probability[0][0]:.2%}, Bot: {probability[0][1]:.2%}")
-```
-### Batch Prediction
-```python
-# For multiple accounts
-accounts_df = pd.read_csv('instagram_accounts_to_check.csv')
-accounts_scaled = scaler.transform(accounts_df[features])
-predictions = model.predict(accounts_scaled)
-probabilities = model.predict_proba(accounts_scaled)
-# Add results to DataFrame
-accounts_df['is_bot'] = predictions
-accounts_df['bot_probability'] = probabilities[:, 1]
-# Filter likely bots (e.g., probability > 0.8)
-likely_bots = accounts_df[accounts_df['bot_probability'] > 0.8]
-```
-## Training Data
-The model was trained on a curated dataset of Instagram accounts with labeled bot/human classifications. The dataset includes:
-- Balanced distribution of bot and human accounts
-- Various account types and behavioral patterns
-- Features extracted from public profile information
-- Diverse account ages and activity levels
-**Note**: The training data is proprietary and not included in this repository.
-## Training Procedure
-### Preprocessing
-1. Feature extraction from Instagram account profiles
-2. Calculation of derived features (e.g., FollowToFollowerRatio)
-3. MinMax normalization of all features to [0, 1] range
-4. Train-test split with stratification to maintain class balance
-### Hyperparameters
-- **Algorithm**: Random Forest Classifier
-- **Normalization**: MinMaxScaler
-- **Cross-validation**: Stratified K-Fold
-- **Feature Selection**: Based on domain knowledge and feature importance
-The model was trained using scikit-learn's RandomForestClassifier with optimized hyperparameters selected through cross-validation.
-## Limitations and Bias
-### Limitations
-- Model performance depends on the quality and accuracy of input features
-- May not generalize to new bot patterns not seen during training
-- Requires accurate feature extraction from Instagram profiles
-- Performance may degrade over time as bot behaviors evolve
-- Limited to profile-level features; does not analyze content or engagement patterns
-### Potential Biases
-- May be biased toward bot patterns present in the training data
-- Could have regional or cultural biases depending on training data composition
-- May misclassify legitimate accounts with unusual behavior patterns
-- Potential bias against new accounts or accounts with low activity
-### Recommendations
-- Regularly retrain the model with new data to capture evolving bot patterns
-- Use as part of a multi-layered detection system
-- Implement human review for high-stakes decisions
-- Monitor for false positives and adjust classification thresholds accordingly
-- Combine with content analysis and engagement pattern detection
-## Ethical Considerations
-- This model should be used responsibly and not for harassment or stalking
-- Consider privacy implications when analyzing user accounts
-- Ensure compliance with Instagram's terms of service and relevant privacy laws (GDPR, CCPA, etc.)
-- Implement appropriate safeguards against misuse
-- Provide transparency to users about automated detection systems
-- Allow for appeals and manual review processes
-## Model Card Authors
-This model card was created as part of the Bot Detection project for social media platforms.
-## Citation
-If you use this model in your research, please cite:
-```bibtex
-@misc{instagram_bot_detection_2024,
-  title={Instagram Bot Detection Model},
-  author={Your Name/Organization},
-  year={2024},
-  publisher={Hugging Face},
-  howpublished={\url{https://huggingface.co/your-username/instagram-bot-detection}}
-}
-```
-## Related Models
-- [TikTok Bot Detection](https://huggingface.co/your-username/tiktok-bot-detection)
-- [Twitter Bot Detection](https://huggingface.co/your-username/twitter-bot-detection)
-## Contact
-For questions or feedback about this model, please open an issue in the repository or contact the maintainers.
-## Updates and Maintenance
-- **Version**: 1.0
-- **Last Updated**: November 2024
-- **Status**: Active
-Future updates may include:
-- Improved feature engineering
-- Additional training data with recent bot patterns
-- Hyperparameter optimization
-- Support for new Instagram features (Reels, etc.)
-- Integration of content-based features
-- Multi-model ensemble approach

+# INSTAGRAM Bot Detection Model
+## Overview
+This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.
+- **Model Version:** v2
+- **Training Date:** 2025-11-27 11:38:28
+- **Framework:** scikit-learn 1.5.2
+- **Algorithm:** Random Forest Classifier with GridSearchCV hyperparameter tuning
+---
+## 📊 Model Performance
+### Final Metrics (Test Set)
+| Metric | Score |
+|--------|-------|
+| **Accuracy** | 0.9860 (98.60%) |
+| **Precision** | 0.9918 (99.18%) |
+| **Recall** | 0.9796 (97.96%) |
+| **F1-Score** | 0.9857 (98.57%) |
+| **ROC-AUC** | 0.9990 (99.90%) |
+| **Average Precision** | 0.9990 (99.90%) |
+### Model Improvement
+- **Baseline ROC-AUC:** 0.9988
+- **Tuned ROC-AUC:** 0.9990
+- **Improvement:** 0.0002 (0.02%)
+---
+## 🗂️ Files
+| File | Description |
+|------|-------------|
+| `instagram_bot_detection_v2.pkl` | Trained Random Forest model |
+| `instagram_scaler_v2.pkl` | MinMaxScaler for feature normalization |
+| `instagram_features_v2.json` | List of features used by the model |
+| `instagram_metrics_v2.txt` | Detailed performance metrics report |
+| `images/` | All visualization plots (13 images) |
+| `README.md` | This file |
+---
+## 🎯 Dataset Information
+### Training Configuration
+- **Training Samples:** 4,000
+- **Test Samples:** 1,000
+- **Total Samples:** 5,000
+- **Number of Features:** 10
+- **Cross-Validation Folds:** 5
+- **Random State:** 42
+### Class Distribution
+**Training Set:**
+- Human (0): 1,991 (49.78%)
+- Bot (1): 2,009 (50.22%)
+**Test Set:**
+- Human (0): 509 (50.90%)
+- Bot (1): 491 (49.10%)
+---
+## 🔧 Features (10)
+1. `profile_pic`
+2. `username_num_ratio`
+3. `username_is_numeric`
+4. `fullname_words`
+5. `fullname_num_ratio`
+6. `is_name_number_only`
+7. `name_equals_username`
+8. `followers`
+9. `follows`
+10. `followers_to_follows_ratio`
+---
+## 🏆 Top 5 Most Important Features
+1. **profile_pic** - 0.3314
+2. **followers** - 0.2313
+3. **username_num_ratio** - 0.1665
+4. **followers_to_follows_ratio** - 0.1308
+5. **follows** - 0.0923
+---
+## ⚙️ Hyperparameters
+### Best Parameters (from GridSearchCV)
+- **class_weight:** balanced
+- **max_depth:** 15
+- **max_features:** sqrt
+- **min_samples_leaf:** 1
+- **min_samples_split:** 2
+- **n_estimators:** 100
+### Parameter Search Space
+- **n_estimators:** [100, 200, 300]
+- **max_depth:** [10, 15, 20, None]
+- **min_samples_split:** [2, 5, 10]
+- **min_samples_leaf:** [1, 2, 4]
+- **max_features:** ['sqrt', 'log2']
+- **bootstrap:** [True, False]
+**Total combinations tested:** 540
+---
+## 📈 Cross-Validation Results
+### Mean Scores (5-Fold Stratified CV)
+- **Accuracy:** 0.9848 (±0.0051)
+- **Precision:** 0.9900 (±0.0066)
+- **Recall:** 0.9796 (±0.0081)
+- **F1-Score:** 0.9847 (±0.0051)
+- **ROC-AUC:** 0.9986 (±0.0011)
+---
+## 🖼️ Visualizations
+All visualizations are saved in the `images/` directory:
+1. **01_class_distribution.png** - Training/Test set class distribution
+2. **02_feature_correlation.png** - Feature correlation with target variable
+3. **03_correlation_matrix.png** - Feature correlation heatmap
+4. **04_baseline_confusion_matrix.png** - Baseline model confusion matrix
+5. **05_baseline_roc_curve.png** - Baseline ROC curve
+6. **06_baseline_precision_recall.png** - Baseline Precision-Recall curve
+7. **07_baseline_feature_importance.png** - Baseline feature importance
+8. **08_cross_validation.png** - Cross-validation score distribution
+9. **09_tuned_confusion_matrix.png** - Tuned model confusion matrix
+10. **10_tuned_roc_curve.png** - Tuned ROC curve
+11. **11_tuned_precision_recall.png** - Tuned Precision-Recall curve
+12. **12_tuned_feature_importance.png** - Tuned feature importance
+13. **13_model_comparison.png** - Baseline vs Tuned comparison
+---
+## 🚀 Usage Example
+```python
+import joblib
+import pandas as pd
+import numpy as np
+# Load model and scaler
+model = joblib.load('instagram_bot_detection_v2.pkl')
+scaler = joblib.load('instagram_scaler_v2.pkl')
+# Prepare raw, unscaled inputs (illustrative values; flag features assumed 0/1)
+data = {
+    'profile_pic': 1,
+    'username_num_ratio': 0.27,
+    'username_is_numeric': 0,
+    'fullname_words': 2,
+    'fullname_num_ratio': 0.0,
+    'is_name_number_only': 0,
+    'name_equals_username': 0,
+    'followers': 150,
+    'follows': 7500,
+    'followers_to_follows_ratio': 0.02,
+}
+# Create DataFrame
+df = pd.DataFrame([data])
+# Scale features
+df_scaled = scaler.transform(df)
+# Predict
+prediction = model.predict(df_scaled)[0]
+probability = model.predict_proba(df_scaled)[0]
+print(f"Prediction: {'Bot' if prediction == 1 else 'Human'}")
+print(f"Bot Probability: {probability[1]:.4f}")
+print(f"Human Probability: {probability[0]:.4f}")
+```
+---
+## 📋 Confusion Matrix Breakdown
+### Tuned Model (Test Set)
+```
+                Predicted
+              Human    Bot
+Actual Human      505       4
+       Bot         10     481
+```
+- **True Negatives (TN):** 505 (Correctly identified humans)
+- **False Positives (FP):** 4 (Humans incorrectly classified as bots)
+- **False Negatives (FN):** 10 (Bots incorrectly classified as humans)
+- **True Positives (TP):** 481 (Correctly identified bots)
+---
+## 🔍 Model Interpretation
+### Strengths
+- High ROC-AUC score (0.9990) indicates excellent discrimination capability
+- Precision (0.9918) and recall (0.9796) are both high for the bot class
+- Stable cross-validation performance (standard deviation at or below 0.0081 on every metric)
+### Key Insights
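+1. The top five features account for about 95% of total feature importance, led by `profile_pic` (0.3314)
+2. GridSearchCV raised ROC-AUC over the baseline by 0.0002 (0.02%); accuracy, precision, recall, and F1 were unchanged
+3. Test-set accuracy (0.9860) is consistent with the 5-fold cross-validation mean (0.9848 ±0.0051)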
+---
+## 📝 Notes
+- **Feature Scaling:** All features are scaled using MinMaxScaler to [0, 1] range
+- **Missing Values:** Filled with 0 during preprocessing
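+- **Class Balance:** Near-balanced (training set: 50.22% bot, 49.78% human)
+- **Model Type:** Random Forest, a bagged ensemble of decision trees that is typically less prone to overfitting than a single tree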
+---
+## 🔄 Model Updates
+To retrain the model:
+1. Place new training data in `../data/train_instagram.csv`
+2. Run the training notebook: `5_enhanced_training.ipynb`
+3. Update this README with new metrics
+---
+## 📧 Contact & Support
+For questions or issues regarding this model, please refer to the main project documentation.
+---
+**Generated:** 2025-11-27 11:38:28
+**Notebook:** `5_enhanced_training.ipynb`
+**Platform:** Instagram
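
The parameter search space listed in the README above can be enumerated to count how many candidates a grid search would visit. The following is a minimal sketch using only the standard library; the dictionary is copied from the README, while `n_candidates` and `n_fits` are illustrative names. As listed, the grid yields 432 candidates (2,160 fits with 5-fold CV) and contains no `class_weight` entry, so the grid actually searched may have differed from the one documented.

```python
import math

# Search space as listed in README.md ("Parameter Search Space")
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 15, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}
cv_folds = 5  # "Cross-Validation Folds: 5" in the README

# A full grid search tries every combination, so the candidate count
# is the product of the number of options per parameter.
n_candidates = math.prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * cv_folds

print(f"candidates: {n_candidates}, fits: {n_fits}")
```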

images/01_class_distribution.png ADDED

Git LFS Details

SHA256: 5d6aba9af735cf0fc01dfe94dae16ca999eea115d620b8eaca71cee6fed078de
Pointer size: 131 Bytes
Size of remote file: 115 kB

images/02_feature_correlation.png ADDED

Git LFS Details

SHA256: e45af57e954ff7ce5db00689bc4c507ff8434db53b56cd7db88ad376b2a75e6c
Pointer size: 131 Bytes
Size of remote file: 136 kB

images/03_correlation_matrix.png ADDED

Git LFS Details

SHA256: eb938d6809f07ecd1a2f65861826b2d0d4d6ea8ebbbe9a6e459ccc29403aae10
Pointer size: 131 Bytes
Size of remote file: 306 kB

images/04_baseline_confusion_matrix.png ADDED

images/05_baseline_roc_curve.png ADDED

Git LFS Details

SHA256: 718d6e50c27dd37bb0879b9a3fb77a119e08e0d3cf2707f3cbab251df8e9584d
Pointer size: 131 Bytes
Size of remote file: 138 kB

images/06_baseline_precision_recall.png ADDED

images/07_baseline_feature_importance.png ADDED

Git LFS Details

SHA256: a23c568c3aa732c289d1c78d203d8266b7f8544d2396d0851e513d65bf9c69b5
Pointer size: 131 Bytes
Size of remote file: 132 kB

images/08_cross_validation.png ADDED

Git LFS Details

SHA256: 45a825c1046aa6f9119d9a43410a057e688e04bd7d88aeca526340c74cc5fc88
Pointer size: 131 Bytes
Size of remote file: 125 kB

images/09_tuned_confusion_matrix.png ADDED

images/10_tuned_roc_curve.png ADDED

Git LFS Details

SHA256: 215cd578229dbc1877b333b42991113b126414610dc89f54f6490a1697a3dc11
Pointer size: 131 Bytes
Size of remote file: 135 kB

images/11_tuned_precision_recall.png ADDED

images/12_tuned_feature_importance.png ADDED

Git LFS Details

SHA256: 52a78a7fe1eb004fc9fb3285702c1bd1cc4cb89235aeb6f4047d1f73e7688260
Pointer size: 131 Bytes
Size of remote file: 129 kB

images/13_model_comparison.png ADDED

Git LFS Details

SHA256: 2efffce38be5af7a09d76c9e08f0b3c8e5d20a037f9620633c14f81e255e746d
Pointer size: 131 Bytes
Size of remote file: 120 kB

instagram_bot_detection_v2.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:754546a9111c11b3b9e5ebe0cab64613ac81fa25043da96875a13ab47b39060c
+size 1994105

instagram_features_v2.json ADDED

	@@ -0,0 +1,12 @@

+[
+  "profile_pic",
+  "username_num_ratio",
+  "username_is_numeric",
+  "fullname_words",
+  "fullname_num_ratio",
+  "is_name_number_only",
+  "name_equals_username",
+  "followers",
+  "follows",
+  "followers_to_follows_ratio"
+]
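
Because the scaler and model expect inputs in the training column order, the feature list above can be used to order a record before scaling. The sketch below inlines the JSON so it is self-contained; `to_feature_row` is a hypothetical helper, and the record values are illustrative only.

```python
import json

# Contents of instagram_features_v2.json, inlined for a self-contained sketch
FEATURES_JSON = """
[
  "profile_pic",
  "username_num_ratio",
  "username_is_numeric",
  "fullname_words",
  "fullname_num_ratio",
  "is_name_number_only",
  "name_equals_username",
  "followers",
  "follows",
  "followers_to_follows_ratio"
]
"""
feature_names = json.loads(FEATURES_JSON)


def to_feature_row(record, names):
    """Return the record's values as floats, ordered like `names`.

    Hypothetical helper: raises KeyError when a feature is absent so a
    misnamed column is caught before it reaches the scaler.
    """
    missing = [name for name in names if name not in record]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(record[name]) for name in names]


# Illustrative record whose keys are NOT in training order
record = {
    "followers": 150,
    "follows": 7500,
    "followers_to_follows_ratio": 0.02,
    "profile_pic": 1,
    "username_num_ratio": 0.27,
    "username_is_numeric": 0,
    "fullname_words": 2,
    "fullname_num_ratio": 0.0,
    "is_name_number_only": 0,
    "name_equals_username": 0,
}

row = to_feature_row(record, feature_names)
print(row)
```

Raising on a missing feature is a design choice for this sketch. The README states that missing values were filled with 0 during preprocessing, so defaulting to 0 instead would be closer to the training pipeline.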

instagram_metrics_v2.txt ADDED

	@@ -0,0 +1,39 @@

+======================================================================
+INSTAGRAM Bot Detection Model - Performance Report
+======================================================================
+Date: 2025-11-27 11:38:28.480839
+Training Configuration:
+  - Platform: instagram
+  - Train samples: 4000
+  - Test samples: 1000
+  - Features: 10
+  - CV Folds: 5
+  - Random State: 42
+Best Hyperparameters:
+  - class_weight: balanced
+  - max_depth: 15
+  - max_features: sqrt
+  - min_samples_leaf: 1
+  - min_samples_split: 2
+  - n_estimators: 100
+Performance Metrics (Test Set):
+  - Accuracy: 0.9860
+  - Precision: 0.9918
+  - Recall: 0.9796
+  - F1: 0.9857
+  - Roc_auc: 0.9990
+  - Avg_precision: 0.9990
+Cross-Validation Results:
+  - Mean ROC-AUC: 0.9988
+Feature Importance (Top 5):
+  - profile_pic: 0.3314
+  - followers: 0.2313
+  - username_num_ratio: 0.1665
+  - followers_to_follows_ratio: 0.1308
+  - follows: 0.0923
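
The top-5 importances in the report above can be summed to see how much of the model's total importance they carry. This is a small worked example; it relies on scikit-learn normalizing a Random Forest's impurity-based importances to sum to 1, and the variable names are illustrative.

```python
# Top-5 feature importances as reported in instagram_metrics_v2.txt
top5 = {
    "profile_pic": 0.3314,
    "followers": 0.2313,
    "username_num_ratio": 0.1665,
    "followers_to_follows_ratio": 0.1308,
    "follows": 0.0923,
}

# Share of total importance carried by the top five features
top5_share = sum(top5.values())

# The other five features split whatever is left, since importances sum to 1
remaining_share = 1.0 - top5_share

# Running total, to show how quickly importance accumulates
cumulative = 0.0
for name, importance in top5.items():
    cumulative += importance
    print(f"{name:<28} {importance:.4f}  cumulative {cumulative:.4f}")

print(f"top-5 share: {top5_share:.4f}, remaining: {remaining_share:.4f}")
```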

instagram_model_comparison.csv ADDED

	@@ -0,0 +1,7 @@

+Metric,Baseline,Tuned,Improvement,Improvement %
+Accuracy,0.986,0.986,0.0,0.0
+Precision,0.9917525773195877,0.9917525773195877,0.0,0.0
+Recall,0.9796334012219959,0.9796334012219959,0.0,0.0
+F1-Score,0.985655737704918,0.985655737704918,0.0,0.0
+ROC-AUC,0.998803612370408,0.9989916733021498,0.00018806093174172922,0.018828619501626963
+Avg Precision,0.9988880328453673,0.9990380665868485,0.00015003374148114812,0.015020075979263841
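
The threshold-based values in this CSV can be cross-checked against the tuned-model confusion matrix reported in README.md (TN = 505, FP = 4, FN = 10, TP = 481). The sketch below applies the standard metric definitions with the bot class as positive.

```python
# Confusion matrix counts from README.md (tuned model, test set)
tn, fp, fn, tp = 505, 4, 10, 481

total = tn + fp + fn + tp

accuracy = (tp + tn) / total           # fraction of all accounts classified correctly
precision = tp / (tp + fp)             # of predicted bots, how many are bots
recall = tp / (tp + fn)                # of actual bots, how many were caught
f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and recall

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The results agree with the Accuracy, Precision, Recall, and F1-Score rows above. ROC-AUC and Average Precision depend on the predicted probabilities and cannot be recovered from the confusion matrix alone.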

instagram_scaler_v2.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b3969a962396dc6ff11bc9767e60de5e75c8b5a07a156b069723e80f49dc48ed
+size 1511
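
The `.pkl` entries in this commit are Git LFS pointer files rather than the binaries themselves: each records a spec version, a SHA-256 object ID, and the size in bytes. A minimal sketch of parsing such a pointer and checking downloaded bytes against it follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any library.

```python
import hashlib

# Pointer text for instagram_scaler_v2.pkl, as shown in the diff above
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:b3969a962396dc6ff11bc9767e60de5e75c8b5a07a156b069723e80f49dc48ed
size 1511
"""


def parse_lfs_pointer(text):
    """Parse 'key value' lines of an LFS pointer into a dict."""
    fields = dict(
        line.split(" ", 1) for line in text.splitlines() if line.strip()
    )
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True if `data` has the size and SHA-256 digest in the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Checking a downloaded artifact this way before calling `joblib.load` is a cheap integrity check, although it does not make unpickling files from untrusted sources safe.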