---
language: "en"
license: "apache-2.0"
created: "2025-11-27T05:32:51.193018Z"
tags:
- "bot-detection"
- "instagram"
- "classification"
---

# Instagram Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.

**Model Version:** v2
**Training Date:** 2025-11-27 11:38:28
**Framework:** scikit-learn 1.5.2
**Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning

---

## Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.9860 (98.60%) |
| **Precision**         | 0.9918 (99.18%) |
| **Recall**            | 0.9796 (97.96%) |
| **F1-Score**          | 0.9857 (98.57%) |
| **ROC-AUC**           | 0.9990 (99.90%) |
| **Average Precision** | 0.9990 (99.90%) |

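
As a quick consistency check, the F1-score in the table can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from` is illustrative and not part of the released files.

```python
def f1_from(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the metrics table above.
precision, recall = 0.9918, 0.9796
f1 = f1_from(precision, recall)
print(f"F1 = {f1:.4f}")
```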
### Model Improvement

- **Baseline ROC-AUC:** 0.9988
- **Tuned ROC-AUC:** 0.9990
- **Improvement:** 0.0002 (0.02 percentage points)

---

## Files

| File                             | Description                            |
| -------------------------------- | -------------------------------------- |
| `instagram_bot_detection_v2.pkl` | Trained Random Forest model            |
| `instagram_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `instagram_features_v2.json`     | List of features used by the model     |
| `instagram_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                        | All visualization plots (13 images)    |
| `README.md`                      | This file                              |

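
A minimal sketch of how the feature list could be used to build model input in a fixed order. It assumes `instagram_features_v2.json` holds a plain JSON array of feature names, which this README does not confirm. The inline string stands in for the file, and `to_feature_row` is a hypothetical helper.

```python
import json

# Stand-in for the contents of instagram_features_v2.json
# (assumption: the file is a JSON array of feature names).
FEATURES_JSON = """[
  "profile_pic", "username_num_ratio", "username_is_numeric",
  "fullname_words", "fullname_num_ratio", "is_name_number_only",
  "name_equals_username", "followers", "follows",
  "followers_to_follows_ratio"
]"""

feature_names = json.loads(FEATURES_JSON)


def to_feature_row(account: dict, names: list) -> list:
    """Return feature values ordered as in `names`, failing loudly on gaps."""
    missing = [name for name in names if name not in account]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(account[name]) for name in names]


account = {
    "profile_pic": 1, "username_num_ratio": 0.15, "username_is_numeric": 0,
    "fullname_words": 2, "fullname_num_ratio": 0.0, "is_name_number_only": 0,
    "name_equals_username": 0, "followers": 1200, "follows": 300,
    "followers_to_follows_ratio": 4.0,
}
row = to_feature_row(account, feature_names)
print(row)
```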

---

## Dataset Information

### Training Configuration

- **Training Samples:** 4,000
- **Test Samples:** 1,000
- **Total Samples:** 5,000
- **Features:** 10
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

The model was trained with `class_weight="balanced"`, which reweights classes inversely to their frequency to offset any class imbalance in the dataset.

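
For reference, scikit-learn's `balanced` mode weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python. The toy labels are made up for illustration and are not the actual class counts of this dataset.

```python
from collections import Counter


def balanced_class_weights(labels: list) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 6 genuine accounts (0) and 4 bots (1).
labels = [0] * 6 + [1] * 4
weights = balanced_class_weights(labels)
print(weights)  # the rarer class receives the larger weight
```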

---

## Model Architecture

### Algorithm

**Random Forest Classifier** - An ensemble method that trains many decision trees on bootstrap samples of the training data and combines their outputs into a single prediction. scikit-learn's implementation averages the trees' predicted class probabilities rather than taking a hard majority vote.

### Hyperparameters (Tuned via GridSearchCV)

| Parameter           | Value    |
| ------------------- | -------- |
| `n_estimators`      | 100      |
| `max_depth`         | 15       |
| `min_samples_split` | 2        |
| `min_samples_leaf`  | 1        |
| `max_features`      | sqrt     |
| `class_weight`      | balanced |

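
The table maps directly onto the `RandomForestClassifier` constructor. The sketch below builds an untrained estimator with those values. Passing `random_state=42` (taken from the training configuration) to the estimator is an assumption about how the original training script used it.

```python
from sklearn.ensemble import RandomForestClassifier

# Untrained estimator configured with the tuned values from the table.
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,  # assumption: the seed from the training configuration
)
print(clf.get_params()["max_depth"])
```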
### Feature Preprocessing

- **Scaler:** MinMaxScaler (normalizes features to [0, 1] range)
- **Missing Values:** Handled during data preprocessing
- **Feature Engineering:** Custom features derived from account metadata

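
Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the training data. The plain-Python sketch below illustrates the arithmetic on a made-up follower column. The numbers are hypothetical and the helper names are not part of the released files. Note that a value outside the training range lands outside [0, 1], which is also how `MinMaxScaler` behaves with its default settings.

```python
def fit_min_max(column: list) -> tuple:
    """Return the (min, max) observed in a training column."""
    return min(column), max(column)


def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Scale a value with the training range; assumes hi > lo."""
    return (value - lo) / (hi - lo)


# Hypothetical follower counts seen during training.
train_followers = [0, 500, 2000]
lo, hi = fit_min_max(train_followers)

scaled = [min_max_scale(v, lo, hi) for v in train_followers]
unseen = min_max_scale(3000, lo, hi)  # larger than anything seen in training
print(scaled, unseen)
```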

---

## Feature Importance

The model uses 10 features to detect bot accounts. The five most important are:

| Rank | Feature                      | Importance | Description                                |
| ---- | ---------------------------- | ---------- | ------------------------------------------ |
| 1    | `profile_pic`                | 0.3314     | Indicates if account has a profile picture |
| 2    | `followers`                  | 0.2313     | Number of followers                        |
| 3    | `username_num_ratio`         | 0.1665     | Ratio of numeric characters in username    |
| 4    | `followers_to_follows_ratio` | 0.1308     | Ratio of followers to following count      |
| 5    | `follows`                    | 0.0923     | Number of accounts followed                |

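
Assuming these values come from the forest's `feature_importances_` attribute (which scikit-learn normalizes to sum to 1), a short calculation shows how much of the total the top five account for and how little is left for the other five features.

```python
# Importances copied from the table above.
top_five = {
    "profile_pic": 0.3314,
    "followers": 0.2313,
    "username_num_ratio": 0.1665,
    "followers_to_follows_ratio": 0.1308,
    "follows": 0.0923,
}

top_five_share = sum(top_five.values())
# Holds only under the assumption that all ten importances sum to 1.
remaining_share = 1.0 - top_five_share
print(f"top five: {top_five_share:.4f}, other five: {remaining_share:.4f}")
```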
### All Features

1. `profile_pic` - Profile picture presence
2. `username_num_ratio` - Numeric character ratio in username
3. `username_is_numeric` - Username is entirely numeric
4. `fullname_words` - Number of words in full name
5. `fullname_num_ratio` - Numeric character ratio in full name
6. `is_name_number_only` - Full name contains only numbers
7. `name_equals_username` - Full name matches username
8. `followers` - Follower count
9. `follows` - Following count
10. `followers_to_follows_ratio` - Follower/following ratio

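
The list above can be sketched as a derivation from raw account fields. The exact definitions used in training are not documented here, so everything below is an assumption: the `derive_features` helper, the case-insensitive name comparison, ignoring spaces in the number-only check, and treating a zero following count as 1 when computing the ratio.

```python
def digit_ratio(text: str) -> float:
    """Share of characters in `text` that are digits (0.0 for empty text)."""
    return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0


def derive_features(username: str, fullname: str, has_profile_pic: bool,
                    followers: int, follows: int) -> dict:
    """Build the 10 model features from raw fields (assumed definitions)."""
    return {
        "profile_pic": int(has_profile_pic),
        "username_num_ratio": digit_ratio(username),
        "username_is_numeric": int(username.isdigit()),
        "fullname_words": len(fullname.split()),
        "fullname_num_ratio": digit_ratio(fullname),
        # Assumption: spaces are ignored when checking for a number-only name.
        "is_name_number_only": int(fullname.replace(" ", "").isdigit()),
        # Assumption: comparison is case-insensitive.
        "name_equals_username": int(fullname.lower() == username.lower()),
        "followers": followers,
        "follows": follows,
        # Assumption: avoid division by zero by treating 0 follows as 1.
        "followers_to_follows_ratio": followers / max(follows, 1),
    }


features = derive_features("jane_doe42", "Jane Doe", True, 1200, 300)
print(features)
```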

---

## Usage

### Prerequisites

```bash
pip install scikit-learn joblib numpy
```

### Loading the Model

```python
import joblib
import numpy as np

# Load model and scaler
model = joblib.load('instagram_bot_detection_v2.pkl')
scaler = joblib.load('instagram_scaler_v2.pkl')

# Example prediction
features = np.array([[
    1,      # profile_pic
    0.15,   # username_num_ratio
    0,      # username_is_numeric
    2,      # fullname_words
    0.0,    # fullname_num_ratio
    0,      # is_name_number_only
    0,      # name_equals_username
    1200,   # followers
    300,    # follows
    4.0     # followers_to_follows_ratio
]])

# Scale features
features_scaled = scaler.transform(features)

# Make prediction
prediction = model.predict(features_scaled)
probability = model.predict_proba(features_scaled)

print(f"Bot: {prediction[0] == 1}")
print(f"Probability: {probability[0][1]:.4f}")
```

### API Integration

```python
import joblib
import numpy as np

# Load the artifacts once at import time rather than on every call
model = joblib.load('instagram_bot_detection_v2.pkl')
scaler = joblib.load('instagram_scaler_v2.pkl')


def predict_instagram_bot(account_data: dict) -> dict:
    """
    Predict if an Instagram account is a bot.

    Args:
        account_data: Dictionary with account features

    Returns:
        Dictionary with prediction and probability
    """
    # Feature order must match the order used during training
    features = np.array([[
        account_data['profile_pic'],
        account_data['username_num_ratio'],
        account_data['username_is_numeric'],
        account_data['fullname_words'],
        account_data['fullname_num_ratio'],
        account_data['is_name_number_only'],
        account_data['name_equals_username'],
        account_data['followers'],
        account_data['follows'],
        account_data['followers_to_follows_ratio']
    ]])

    features_scaled = scaler.transform(features)
    prediction = model.predict(features_scaled)[0]
    probability = model.predict_proba(features_scaled)[0]

    return {
        'is_bot': bool(prediction),
        'bot_probability': float(probability[1]),
        'confidence': float(max(probability))
    }
```


---

## Visualization

The `images/` directory contains 13 visualization plots:

1. **confusion_matrix.png** - Classification confusion matrix
2. **roc_curve.png** - ROC curve with AUC score
3. **precision_recall_curve.png** - Precision-recall trade-off
4. **feature_importance.png** - Feature importance ranking
5. **learning_curve.png** - Model learning curve
6. **class_distribution.png** - Training data class distribution
7. **prediction_distribution.png** - Prediction score distribution
8. **calibration_curve.png** - Probability calibration
9. **cv_scores.png** - Cross-validation scores
10. **top_features.png** - Top 10 features
11. **correlation_matrix.png** - Feature correlation heatmap
12. **threshold_analysis.png** - Classification threshold analysis
13. **model_comparison.png** - Baseline vs tuned model comparison

---

## Model Training

### Training Process

1. **Data Preprocessing**: Feature engineering and normalization
2. **Train-Test Split**: 80/20 split with stratification
3. **Hyperparameter Tuning**: GridSearchCV with 5-fold cross-validation
4. **Model Selection**: Best parameters based on ROC-AUC score
5. **Evaluation**: Comprehensive metrics on held-out test set

### Cross-Validation

- **Mean ROC-AUC:** 0.9988
- **Folds:** 5
- **Strategy:** Stratified K-Fold

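
The steps above can be sketched end to end on synthetic data. This is not the original training script: the data comes from `make_classification`, the parameter grid is a small hypothetical one, and placing the scaler inside a `Pipeline` (so it is fit only on the training folds) is a design choice of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 10-feature dataset (not the real data).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 80/20 stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("rf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

# Hypothetical grid, kept small so the sketch runs quickly.
param_grid = {"rf__n_estimators": [25, 50], "rf__max_depth": [5, 15]}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="roc_auc",  # model selection by ROC-AUC
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, f"test ROC-AUC = {test_auc:.4f}")
```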

---

## Limitations

1. **Data Dependency**: Model performance depends on feature quality and data accuracy
2. **Feature Availability**: All 10 features must be available for prediction
3. **Temporal Drift**: Instagram's platform and bot behavior may change over time
4. **Privacy**: Ensure compliance with Instagram's terms of service when collecting data
5. **Threshold Sensitivity**: Default threshold is 0.5; may need adjustment based on use case

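
One way to act on the threshold point is to apply a custom cutoff to the bot probability returned by `predict_proba` instead of relying on `predict`. The sketch below uses made-up probabilities, and the `flag_bots` helper is hypothetical. A higher threshold trades recall for precision.

```python
def flag_bots(bot_probabilities: list, threshold: float = 0.5) -> list:
    """Flag accounts whose bot probability is at or above the threshold."""
    return [p >= threshold for p in bot_probabilities]


# Hypothetical bot probabilities for three accounts.
probs = [0.35, 0.62, 0.91]

default_flags = flag_bots(probs)                 # cutoff at 0.5
strict_flags = flag_bots(probs, threshold=0.8)   # fewer false positives
print(default_flags, strict_flags)
```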

---

## License

This model is released under the **Apache License 2.0**.

---

## Version History

- **v2** (2025-11-27): Current version with hyperparameter tuning
  - ROC-AUC: 0.9990
  - Accuracy: 98.60%
  - 10 features

---

## Contact & Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{instagram-bot-detection-v2,
  title={Instagram Bot Detection Model v2},
  author={Nahiar},
  year={2025},
  month={November},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nahiar/instagram-bot-detection}}
}
```

---

## Contributing

For issues, improvements, or questions, please contact the model maintainer.