---
language: "en"
license: "apache-2.0"
created: "2025-11-27T05:32:51.193018Z"
tags:
  - "bot-detection"
  - "instagram"
  - "classification"
---

# INSTAGRAM Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.

- **Model Version:** v2
- **Training Date:** 2025-11-27 11:38:28
- **Framework:** scikit-learn 1.5.2
- **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning

---

## 📊 Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.9860 (98.60%) |
| **Precision**         | 0.9918 (99.18%) |
| **Recall**            | 0.9796 (97.96%) |
| **F1-Score**          | 0.9857 (98.57%) |
| **ROC-AUC**           | 0.9990 (99.90%) |
| **Average Precision** | 0.9990 (99.90%) |

### Model Improvement

- **Baseline ROC-AUC:** 0.9988
- **Tuned ROC-AUC:** 0.9990
- **Improvement:** 0.0002 (0.02%)

---

## 🗂️ Files

| File                             | Description                            |
| -------------------------------- | -------------------------------------- |
| `instagram_bot_detection_v2.pkl` | Trained Random Forest model            |
| `instagram_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `instagram_features_v2.json`     | List of features used by the model     |
| `instagram_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                        | All visualization plots (13 images)    |
| `README.md`                      | This file                              |

---

## 🎯 Dataset Information

### Training Configuration

- **Training Samples:** 4,000
- **Test Samples:** 1,000
- **Total Samples:** 5,000
- **Features:** 10
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

The model was trained with balanced class weights to handle any class imbalance in the dataset.

---

## 🔧 Model Architecture

### Algorithm

**Random Forest Classifier** - An ensemble learning method that operates by constructing multiple decision trees during training and outputting the class that is the mode of the classes (classification) of the individual trees.
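
The voting idea described above can be illustrated with a small, dependency-free sketch. It is not the model's own code: the per-tree outputs are made-up values, and `hard_vote` and `soft_vote` are hypothetical helper names. The sketch contrasts taking the mode of per-tree labels with averaging per-tree probabilities, which is how scikit-learn's `RandomForestClassifier` is documented to combine its trees.

```python
from collections import Counter


def hard_vote(tree_labels):
    """Return the most common label among per-tree predictions (the mode)."""
    return Counter(tree_labels).most_common(1)[0][0]


def soft_vote(tree_bot_probabilities, threshold=0.5):
    """Average per-tree bot probabilities, then threshold the mean."""
    mean_probability = sum(tree_bot_probabilities) / len(tree_bot_probabilities)
    return int(mean_probability >= threshold), mean_probability


# Hypothetical bot probabilities from five trees for a single account.
tree_bot_probabilities = [0.55, 0.55, 0.55, 0.1, 0.1]

# Each tree's own label at a 0.5 cut-off (1 = bot, 0 = not a bot).
tree_labels = [int(p >= 0.5) for p in tree_bot_probabilities]

majority_label = hard_vote(tree_labels)
averaged_label, mean_probability = soft_vote(tree_bot_probabilities)

print(f"Hard vote: {majority_label}")
print(f"Soft vote: {averaged_label} (mean probability {mean_probability:.2f})")
```

With these illustrative numbers the two rules disagree: three weakly confident trees win the hard vote, while the averaged probability stays below 0.5. This is one reason the output of `predict_proba` is more informative than the predicted label alone.
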

### Hyperparameters (Tuned via GridSearchCV)

| Parameter           | Value    |
| ------------------- | -------- |
| `n_estimators`      | 100      |
| `max_depth`         | 15       |
| `min_samples_split` | 2        |
| `min_samples_leaf`  | 1        |
| `max_features`      | sqrt     |
| `class_weight`      | balanced |

### Feature Preprocessing

- **Scaler:** MinMaxScaler (normalizes features to [0, 1] range)
- **Missing Values:** Handled during data preprocessing
- **Feature Engineering:** Custom features derived from account metadata

---

## 📈 Feature Importance

The model uses 10 features to detect bot accounts. Top 5 most important features:

| Rank | Feature                      | Importance | Description                                |
| ---- | ---------------------------- | ---------- | ------------------------------------------ |
| 1    | `profile_pic`                | 0.3314     | Indicates if account has a profile picture |
| 2    | `followers`                  | 0.2313     | Number of followers                        |
| 3    | `username_num_ratio`         | 0.1665     | Ratio of numbers in username               |
| 4    | `followers_to_follows_ratio` | 0.1308     | Ratio of followers to following count      |
| 5    | `follows`                    | 0.0923     | Number of accounts followed                |

### All Features

1. `profile_pic` - Profile picture presence
2. `username_num_ratio` - Numeric character ratio in username
3. `username_is_numeric` - Username is entirely numeric
4. `fullname_words` - Number of words in full name
5. `fullname_num_ratio` - Numeric character ratio in full name
6. `is_name_number_only` - Full name contains only numbers
7. `name_equals_username` - Full name matches username
8. `followers` - Follower count
9. `follows` - Following count
10. `followers_to_follows_ratio` - Follower/following ratio

---

## 🚀 Usage

### Prerequisites

```bash
pip install scikit-learn joblib numpy
```

### Loading the Model

```python
import joblib
import numpy as np

# Load model and scaler
model = joblib.load('instagram_bot_detection_v2.pkl')
scaler = joblib.load('instagram_scaler_v2.pkl')

# Example prediction (values listed in the same order as "All Features" above)
features = np.array([[
    1,      # profile_pic
    0.15,   # username_num_ratio
    0,      # username_is_numeric
    2,      # fullname_words
    0.0,    # fullname_num_ratio
    0,      # is_name_number_only
    0,      # name_equals_username
    1200,   # followers
    300,    # follows
    4.0     # followers_to_follows_ratio
]])

# Scale features
features_scaled = scaler.transform(features)

# Make prediction
prediction = model.predict(features_scaled)
probability = model.predict_proba(features_scaled)

print(f"Bot: {prediction[0] == 1}")
print(f"Probability: {probability[0][1]:.4f}")
```

### API Integration

```python
import numpy as np

# Assumes `model` and `scaler` have been loaded as shown in "Loading the Model".


def predict_instagram_bot(account_data: dict) -> dict:
    """
    Predict if an Instagram account is a bot.

    Args:
        account_data: Dictionary with account features

    Returns:
        Dictionary with prediction and probability
    """
    features = np.array([[
        account_data['profile_pic'],
        account_data['username_num_ratio'],
        account_data['username_is_numeric'],
        account_data['fullname_words'],
        account_data['fullname_num_ratio'],
        account_data['is_name_number_only'],
        account_data['name_equals_username'],
        account_data['followers'],
        account_data['follows'],
        account_data['followers_to_follows_ratio']
    ]])

    features_scaled = scaler.transform(features)
    prediction = model.predict(features_scaled)[0]
    probability = model.predict_proba(features_scaled)[0]

    return {
        'is_bot': bool(prediction),
        'bot_probability': float(probability[1]),
        'confidence': float(max(probability))
    }
```

---

## 📊 Visualization

The `images/` directory contains 13 visualization plots:

1. **confusion_matrix.png** - Classification confusion matrix
2. **roc_curve.png** - ROC curve with AUC score
3. **precision_recall_curve.png** - Precision-recall trade-off
4. **feature_importance.png** - Feature importance ranking
5. **learning_curve.png** - Model learning curve
6. **class_distribution.png** - Training data class distribution
7. **prediction_distribution.png** - Prediction score distribution
8. **calibration_curve.png** - Probability calibration
9. **cv_scores.png** - Cross-validation scores
10. **top_features.png** - Top 10 features
11. **correlation_matrix.png** - Feature correlation heatmap
12. **threshold_analysis.png** - Classification threshold analysis
13. **model_comparison.png** - Baseline vs tuned model comparison

---

## 🎓 Model Training

### Training Process

1. **Data Preprocessing**: Feature engineering and normalization
2. **Train-Test Split**: 80/20 split with stratification
3. **Hyperparameter Tuning**: GridSearchCV with 5-fold cross-validation
4. **Model Selection**: Best parameters based on ROC-AUC score
5. **Evaluation**: Comprehensive metrics on held-out test set

### Cross-Validation

- **Mean ROC-AUC:** 0.9988
- **Folds:** 5
- **Strategy:** Stratified K-Fold

---

## ⚠️ Limitations

1. **Data Dependency**: Model performance depends on feature quality and data accuracy
2. **Feature Availability**: All 10 features must be available for prediction
3. **Temporal Drift**: Instagram's platform and bot behavior may change over time
4. **Privacy**: Ensure compliance with Instagram's terms of service when collecting data
5. **Threshold Sensitivity**: Default threshold is 0.5; may need adjustment based on use case

---

## 📝 License

This model is released under the **Apache License 2.0**.

---

## 🔄 Version History

- **v2** (2025-11-27): Current version with hyperparameter tuning
  - ROC-AUC: 0.9990
  - Accuracy: 98.60%
  - 10 features

---

## 📧 Contact & Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{instagram-bot-detection-v2,
  title={Instagram Bot Detection Model v2},
  author={Nahiar},
  year={2025},
  month={November},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nahiar/instagram-bot-detection}}
}
```

---

## 🤝 Contributing

For issues, improvements, or questions, please contact the model maintainer.