---
language: "en"
license: "apache-2.0"
created: "2025-11-27T05:32:51.193018Z"
tags:
- "bot-detection"
- "instagram"
- "classification"
---
# Instagram Bot Detection Model
## Overview
This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.
**Model Version:** v2
**Training Date:** 2025-11-27 11:38:28
**Framework:** scikit-learn 1.5.2
**Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning
---
## 📊 Model Performance
### Final Metrics (Test Set)
| Metric | Score |
| --------------------- | --------------- |
| **Accuracy** | 0.9860 (98.60%) |
| **Precision** | 0.9918 (99.18%) |
| **Recall** | 0.9796 (97.96%) |
| **F1-Score** | 0.9857 (98.57%) |
| **ROC-AUC** | 0.9990 (99.90%) |
| **Average Precision** | 0.9990 (99.90%) |
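As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the two rows above. This sketch uses only the rounded figures from the table:

```python
# Recompute F1 from the rounded precision and recall reported in the table.
precision = 0.9918
recall = 0.9796

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.9857
```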
### Model Improvement
- **Baseline ROC-AUC:** 0.9988
- **Tuned ROC-AUC:** 0.9990
- **Improvement:** +0.0002 (0.02 percentage points)
---
## 🗂️ Files
| File | Description |
| -------------------------------- | -------------------------------------- |
| `instagram_bot_detection_v2.pkl` | Trained Random Forest model |
| `instagram_scaler_v2.pkl` | MinMaxScaler for feature normalization |
| `instagram_features_v2.json` | List of features used by the model |
| `instagram_metrics_v2.txt` | Detailed performance metrics report |
| `images/` | All visualization plots (13 images) |
| `README.md` | This file |
---
## 🎯 Dataset Information
### Training Configuration
- **Training Samples:** 4,000
- **Test Samples:** 1,000
- **Total Samples:** 5,000
- **Features:** 10
- **Cross-Validation Folds:** 5
- **Random State:** 42
### Class Distribution
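The model was trained with `class_weight='balanced'`, which weights each class inversely to its frequency in the training labels, so a minority class is not under-weighted if the dataset is imbalanced.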
---
## 🔧 Model Architecture
### Algorithm
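**Random Forest Classifier** - An ensemble learning method that builds many decision trees during training, each on a bootstrap sample of the data, and combines their outputs at prediction time. In scikit-learn the forest averages the class probabilities predicted by the individual trees and returns the class with the highest mean probability.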
### Hyperparameters (Tuned via GridSearchCV)
| Parameter | Value |
| ------------------- | -------- |
| `n_estimators` | 100 |
| `max_depth` | 15 |
| `min_samples_split` | 2 |
| `min_samples_leaf` | 1 |
| `max_features` | sqrt |
| `class_weight` | balanced |
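The table above maps directly onto `RandomForestClassifier` arguments. The following is a minimal sketch, not the original training script: it fits a forest with these hyperparameters on synthetic data from `make_classification` (an assumption standing in for the real dataset) and checks that the forest's probability is the mean of its trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real training data (assumption: 10 features).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hyperparameters copied from the table above.
forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
)
forest.fit(X, y)

# Average the per-tree probabilities by hand and compare with the forest.
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
forest_proba = forest.predict_proba(X[:5])
print(np.allclose(manual_mean, forest_proba))
```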
### Feature Preprocessing
- **Scaler:** MinMaxScaler (normalizes features to [0, 1] range)
- **Missing Values:** Handled during data preprocessing
- **Feature Engineering:** Custom features derived from account metadata
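To illustrate what the scaler does, the sketch below applies `MinMaxScaler` to a few hypothetical follower counts (the numbers are made up for the example). Each value is mapped to `(x - min) / (max - min)` using the minimum and maximum seen during fitting, so inputs outside that range land outside [0, 1] unless clipping is enabled.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical follower counts used only to illustrate the scaling.
train_followers = np.array([[0.0], [300.0], [1200.0]])

scaler_demo = MinMaxScaler()  # default feature_range=(0, 1)
scaled = scaler_demo.fit_transform(train_followers)
print(scaled.ravel())  # 0.0, 0.25, 1.0

# Values outside the fitted range are not clipped by default.
out_of_range = scaler_demo.transform(np.array([[2400.0]]))
print(out_of_range.ravel())  # 2.0
```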
---
## 📈 Feature Importance
The model uses 10 features to detect bot accounts. Top 5 most important features:
| Rank | Feature | Importance | Description |
| ---- | ---------------------------- | ---------- | ------------------------------------------ |
| 1 | `profile_pic` | 0.3314 | Indicates if account has a profile picture |
| 2 | `followers` | 0.2313 | Number of followers |
| 3 | `username_num_ratio` | 0.1665 | Ratio of numeric characters in username |
| 4 | `followers_to_follows_ratio` | 0.1308 | Ratio of followers to following count |
| 5 | `follows` | 0.0923 | Number of accounts followed |
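A ranking like the one above can be read from a fitted forest's `feature_importances_` attribute. This is a hedged sketch on synthetic data, so the printed numbers will not match the table; it also assumes the model's column order follows the feature list in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumption: columns follow the order of the feature list in this README.
feature_names = [
    "profile_pic", "username_num_ratio", "username_is_numeric",
    "fullname_words", "fullname_num_ratio", "is_name_number_only",
    "name_equals_username", "followers", "follows",
    "followers_to_follows_ratio",
]

# Synthetic stand-in data; importances here are illustrative only.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Pair each name with its importance and sort in descending order.
ranked = sorted(
    zip(feature_names, demo_model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for rank, (name, importance) in enumerate(ranked[:5], start=1):
    print(f"{rank}. {name}: {importance:.4f}")
```

Impurity-based importances sum to 1 across all features, which is why the five values in the table add up to slightly less than 1.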
### All Features
1. `profile_pic` - Profile picture presence
2. `username_num_ratio` - Numeric character ratio in username
3. `username_is_numeric` - Username is entirely numeric
4. `fullname_words` - Number of words in full name
5. `fullname_num_ratio` - Numeric character ratio in full name
6. `is_name_number_only` - Full name contains only numbers
7. `name_equals_username` - Full name matches username
8. `followers` - Follower count
9. `follows` - Following count
10. `followers_to_follows_ratio` - Follower/following ratio
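The feature list can be turned into a small helper that derives all 10 values from raw account fields. This is only a sketch: the function name `derive_features`, its arguments, and the exact definitions (case-insensitive name comparison, a ratio of 0.0 when `follows` is zero) are assumptions, and the original feature engineering may differ.

```python
def derive_features(username: str, fullname: str, has_profile_pic: bool,
                    followers: int, follows: int) -> dict:
    """Derive the 10 model features from raw account fields (hypothetical definitions)."""

    def digit_ratio(text: str) -> float:
        # Share of characters that are digits; 0.0 for an empty string.
        return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0

    return {
        "profile_pic": int(has_profile_pic),
        "username_num_ratio": digit_ratio(username),
        "username_is_numeric": int(username.isdigit()),
        "fullname_words": len(fullname.split()),
        "fullname_num_ratio": digit_ratio(fullname),
        "is_name_number_only": int(fullname.replace(" ", "").isdigit()),
        "name_equals_username": int(fullname.strip().lower() == username.strip().lower()),
        "followers": followers,
        "follows": follows,
        # Assumption: the ratio is 0.0 when the account follows nobody.
        "followers_to_follows_ratio": followers / follows if follows else 0.0,
    }


example = derive_features("jane_doe42", "Jane Doe", True, 1200, 300)
print(example["username_num_ratio"], example["followers_to_follows_ratio"])
```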
---
## 🚀 Usage
### Prerequisites
```bash
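# Pin scikit-learn to the version used for training; pickled models may not load correctly under other versions
pip install scikit-learn==1.5.2 joblib numpy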
```
### Loading the Model
```python
import joblib
import numpy as np
# Load model and scaler
model = joblib.load('instagram_bot_detection_v2.pkl')
scaler = joblib.load('instagram_scaler_v2.pkl')
# Example prediction
features = np.array([[
    1,     # profile_pic
    0.15,  # username_num_ratio
    0,     # username_is_numeric
    2,     # fullname_words
    0.0,   # fullname_num_ratio
    0,     # is_name_number_only
    0,     # name_equals_username
    1200,  # followers
    300,   # follows
    4.0    # followers_to_follows_ratio
]])
# Scale features
features_scaled = scaler.transform(features)
# Make prediction
prediction = model.predict(features_scaled)
probability = model.predict_proba(features_scaled)
print(f"Bot: {prediction[0] == 1}")
print(f"Probability: {probability[0][1]:.4f}")
```
### API Integration
```python
# Relies on `np`, `model`, and `scaler` loaded as shown in "Loading the Model".
def predict_instagram_bot(account_data: dict) -> dict:
    """
    Predict if an Instagram account is a bot.

    Args:
        account_data: Dictionary with account features

    Returns:
        Dictionary with prediction and probability
    """
    # Build the row in the same feature order used in the example above
    features = np.array([[
        account_data['profile_pic'],
        account_data['username_num_ratio'],
        account_data['username_is_numeric'],
        account_data['fullname_words'],
        account_data['fullname_num_ratio'],
        account_data['is_name_number_only'],
        account_data['name_equals_username'],
        account_data['followers'],
        account_data['follows'],
        account_data['followers_to_follows_ratio']
    ]])

    features_scaled = scaler.transform(features)
    prediction = model.predict(features_scaled)[0]
    probability = model.predict_proba(features_scaled)[0]

    return {
        'is_bot': bool(prediction),
        'bot_probability': float(probability[1]),  # probability of class 1 (bot)
        'confidence': float(max(probability))      # probability of the predicted class
    }
```
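Because a missing key in `account_data` would surface only as a bare `KeyError`, it can help to validate the input before building the feature row. The helper below is a hypothetical sketch: `to_feature_row` and `FEATURE_ORDER` are names introduced here, and the order is assumed to match the feature list in this README. The order could instead be read from `instagram_features_v2.json`, whose layout is not documented here.

```python
import numpy as np

# Assumption: this order matches the one used at training time.
FEATURE_ORDER = [
    "profile_pic", "username_num_ratio", "username_is_numeric",
    "fullname_words", "fullname_num_ratio", "is_name_number_only",
    "name_equals_username", "followers", "follows",
    "followers_to_follows_ratio",
]


def to_feature_row(account_data: dict, feature_order=FEATURE_ORDER) -> np.ndarray:
    """Return a (1, n_features) array, raising if any feature is missing."""
    missing = [name for name in feature_order if name not in account_data]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return np.array([[float(account_data[name]) for name in feature_order]])


# Hypothetical account matching the example values used earlier.
account = dict(zip(FEATURE_ORDER, [1, 0.15, 0, 2, 0.0, 0, 0, 1200, 300, 4.0]))
row = to_feature_row(account)
print(row.shape)  # (1, 10)
```

The resulting `row` can then be passed to `scaler.transform` in place of the hand-built array.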
---
## 📊 Visualization
The `images/` directory contains 13 visualization plots:
1. **confusion_matrix.png** - Classification confusion matrix
2. **roc_curve.png** - ROC curve with AUC score
3. **precision_recall_curve.png** - Precision-recall trade-off
4. **feature_importance.png** - Feature importance ranking
5. **learning_curve.png** - Model learning curve
6. **class_distribution.png** - Training data class distribution
7. **prediction_distribution.png** - Prediction score distribution
8. **calibration_curve.png** - Probability calibration
9. **cv_scores.png** - Cross-validation scores
10. **top_features.png** - Top 10 features
11. **correlation_matrix.png** - Feature correlation heatmap
12. **threshold_analysis.png** - Classification threshold analysis
13. **model_comparison.png** - Baseline vs tuned model comparison
---
## 🎓 Model Training
### Training Process
1. **Data Preprocessing**: Feature engineering and normalization
2. **Train-Test Split**: 80/20 split with stratification
3. **Hyperparameter Tuning**: GridSearchCV with 5-fold cross-validation
4. **Model Selection**: Best parameters based on ROC-AUC score
5. **Evaluation**: Comprehensive metrics on held-out test set
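The steps above can be sketched end to end as follows. This is not the original training script: the data is synthetic, the parameter grid is a small hypothetical one, and fitting the scaler on the training split only is an assumption about how preprocessing was done.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real dataset (assumption: 10 features).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 80/20 stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
demo_scaler = MinMaxScaler().fit(X_train)
X_train_s = demo_scaler.transform(X_train)
X_test_s = demo_scaler.transform(X_test)

# Hypothetical, deliberately small grid; the real search space is not documented.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 15]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train_s, y_train)

# Evaluate the refitted best model on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test_s)[:, 1])
print(search.best_params_, f"test ROC-AUC = {test_auc:.4f}")
```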
### Cross-Validation
- **Mean ROC-AUC:** 0.9988
- **Folds:** 5
- **Strategy:** Stratified K-Fold
---
## ⚠️ Limitations
1. **Data Dependency**: Model performance depends on feature quality and data accuracy
2. **Feature Availability**: All 10 features must be available for prediction
3. **Temporal Drift**: Instagram's platform and bot behavior may change over time
4. **Privacy**: Ensure compliance with Instagram's terms of service when collecting data
5. **Threshold Sensitivity**: Default threshold is 0.5; may need adjustment based on use case
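For the threshold point above, a custom cutoff can be applied to the bot-class probability returned by `predict_proba` instead of calling `predict`. A minimal sketch, with a hypothetical helper name and made-up probabilities:

```python
import numpy as np


def classify_with_threshold(bot_probabilities, threshold=0.5):
    """Label an account as bot (1) when its bot probability reaches the threshold."""
    return (np.asarray(bot_probabilities) >= threshold).astype(int)


# Hypothetical bot probabilities, e.g. model.predict_proba(X)[:, 1].
probs = np.array([0.12, 0.48, 0.55, 0.91])

print(classify_with_threshold(probs))       # [0 0 1 1]
print(classify_with_threshold(probs, 0.9))  # [0 0 0 1]
```

Raising the threshold trades recall for precision, which may suit use cases where wrongly flagging a real account is costly.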
---
## 📝 License
This model is released under the **Apache License 2.0**.
---
## 🔄 Version History
- **v2** (2025-11-27): Current version with hyperparameter tuning
  - ROC-AUC: 0.9990
  - Accuracy: 98.60%
  - 10 features
---
## 📧 Contact & Citation
If you use this model in your research or application, please cite:
```bibtex
@misc{instagram-bot-detection-v2,
title={Instagram Bot Detection Model v2},
author={Nahiar},
year={2025},
month={November},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/nahiar/instagram-bot-detection}}
}
```
---
## 🀝 Contributing
For issues, improvements, or questions, please contact the model maintainer.