---
language: "en"
license: "apache-2.0"
created: "2025-11-27T05:32:51.193018Z"
tags:
  - "bot-detection"
  - "instagram"
  - "classification"
---

# Instagram Bot Detection Model

## Overview

This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.

- **Model Version:** v2
- **Training Date:** 2025-11-27 11:38:28
- **Framework:** scikit-learn 1.5.2
- **Algorithm:** Random Forest Classifier with GridSearchCV Hyperparameter Tuning

---

## 📊 Model Performance

### Final Metrics (Test Set)

| Metric                | Score           |
| --------------------- | --------------- |
| **Accuracy**          | 0.9860 (98.60%) |
| **Precision**         | 0.9918 (99.18%) |
| **Recall**            | 0.9796 (97.96%) |
| **F1-Score**          | 0.9857 (98.57%) |
| **ROC-AUC**           | 0.9990 (99.90%) |
| **Average Precision** | 0.9990 (99.90%) |

### Model Improvement

- **Baseline ROC-AUC:** 0.9988
- **Tuned ROC-AUC:** 0.9990
- **Improvement:** +0.0002 (0.02 percentage points)
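
The reported scores can be cross-checked with a little arithmetic. The sketch below uses only the values from the tables above: it recomputes F1 as the harmonic mean of precision and recall, and expresses the tuning gain in percentage points. The variable names are illustrative.

```python
# Values copied from the metrics reported above.
precision = 0.9918
recall = 0.9796
baseline_auc = 0.9988
tuned_auc = 0.9990

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Absolute ROC-AUC gain from tuning, expressed in percentage points.
auc_gain_points = (tuned_auc - baseline_auc) * 100

print(f"F1 from precision/recall: {f1:.4f}")
print(f"ROC-AUC gain: {auc_gain_points:.2f} percentage points")
```

The recomputed F1 rounds to 0.9857, which agrees with the reported F1-Score.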

---

## 🗂️ Files

| File                             | Description                            |
| -------------------------------- | -------------------------------------- |
| `instagram_bot_detection_v2.pkl` | Trained Random Forest model            |
| `instagram_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
| `instagram_features_v2.json`     | List of features used by the model     |
| `instagram_metrics_v2.txt`       | Detailed performance metrics report    |
| `images/`                        | All visualization plots (13 images)    |
| `README.md`                      | This file                              |

---

## 🎯 Dataset Information

### Training Configuration

- **Training Samples:** 4,000
- **Test Samples:** 1,000
- **Total Samples:** 5,000
- **Features:** 10
- **Cross-Validation Folds:** 5
- **Random State:** 42

### Class Distribution

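The model was trained with `class_weight="balanced"`, which weights each class inversely to its frequency in the training data so that any imbalance between bot and non-bot accounts does not bias the trees. The class counts themselves are not listed here; see `images/class_distribution.png`.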
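
To make the balanced weighting concrete, here is a minimal sketch of how scikit-learn derives the weights. The label array is hypothetical (it is not the real class distribution), and the encoding 0 = human, 1 = bot follows the usage example in this README.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels for illustration only: 0 = human, 1 = bot (6 vs 2).
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y)

# The "balanced" heuristic: n_samples / (n_classes * count_per_class).
manual_weights = len(y) / (len(classes) * np.bincount(y))

# The same computation through scikit-learn's helper.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

for label, weight in zip(classes, weights):
    print(f"class {label}: weight {weight:.4f}")
```

The rarer class receives the larger weight, so errors on it cost more during training.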

---

## 🔧 Model Architecture

### Algorithm

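**Random Forest Classifier** - An ensemble learning method that trains many decision trees (by default, each on a bootstrap sample of the training data) and combines their outputs. The classic formulation takes a majority vote across the trees; scikit-learn's implementation instead averages the trees' predicted class probabilities and returns the class with the highest mean probability.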

### Hyperparameters (Tuned via GridSearchCV)

| Parameter           | Value    |
| ------------------- | -------- |
| `n_estimators`      | 100      |
| `max_depth`         | 15       |
| `min_samples_split` | 2        |
| `min_samples_leaf`  | 1        |
| `max_features`      | sqrt     |
| `class_weight`      | balanced |
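
The table maps directly onto constructor arguments. The sketch below builds a classifier with those values and fits it on synthetic data, since the real training set is not part of this repository; `make_classification` is only a stand-in. It also checks that the forest's probability output is the mean of the per-tree probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 10 features, binary target.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
)
forest.fit(X, y)

# Average the class probabilities predicted by each individual tree.
tree_mean = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)
forest_proba = forest.predict_proba(X)

print("Forest output equals mean of trees:", np.allclose(tree_mean, forest_proba))
```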

### Feature Preprocessing

- **Scaler:** MinMaxScaler (normalizes features to [0, 1] range)
- **Missing Values:** Handled during data preprocessing
- **Feature Engineering:** Custom features derived from account metadata
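
As an illustration of what the scaler does, the following sketch fits a `MinMaxScaler` on a few hypothetical follower counts and compares the result with the formula `(x - min) / (max - min)`. The numbers are made up; the saved `instagram_scaler_v2.pkl` carries the real training minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical follower counts seen at training time.
train_followers = np.array([[0.0], [500.0], [1000.0]])

demo_scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled_train = demo_scaler.fit_transform(train_followers)

# Same result by hand, using the training minimum and maximum.
manual = (train_followers - train_followers.min()) / (
    train_followers.max() - train_followers.min()
)

# A value above the training maximum is not clipped by default.
scaled_new = demo_scaler.transform(np.array([[2000.0]]))

print(scaled_train.ravel(), scaled_new.ravel())
```

Accounts with values outside the training range therefore produce scaled features outside [0, 1], which is worth keeping in mind for very large accounts.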

---

## 📈 Feature Importance

The model uses 10 features to detect bot accounts. The top five by importance are:

| Rank | Feature                      | Importance | Description                                |
| ---- | ---------------------------- | ---------- | ------------------------------------------ |
| 1    | `profile_pic`                | 0.3314     | Indicates if account has a profile picture |
| 2    | `followers`                  | 0.2313     | Number of followers                        |
| 3    | `username_num_ratio`         | 0.1665     | Ratio of numbers in username               |
| 4    | `followers_to_follows_ratio` | 0.1308     | Ratio of followers to following count      |
| 5    | `follows`                    | 0.0923     | Number of accounts followed                |
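
A ranking like the one above can be produced from a fitted forest's `feature_importances_` attribute. The sketch below is an assumption about how the table was generated, not the author's script: it trains a small forest on synthetic data, so the feature names are merely attached to random columns and the printed scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Feature names in the order given under "All Features".
FEATURE_NAMES = [
    "profile_pic", "username_num_ratio", "username_is_numeric",
    "fullname_words", "fullname_num_ratio", "is_name_number_only",
    "name_equals_username", "followers", "follows",
    "followers_to_follows_ratio",
]

# Synthetic stand-in data (assumption); the columns are not real features.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = forest.feature_importances_
ranking = sorted(
    zip(FEATURE_NAMES, importances), key=lambda pair: pair[1], reverse=True
)

for rank, (name, score) in enumerate(ranking[:5], start=1):
    print(f"{rank}. {name}: {score:.4f}")
```

If the table's values come from the same attribute, the five listed features sum to 0.9523, leaving roughly 0.048 for the remaining five.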

### All Features

1. `profile_pic` - Profile picture presence
2. `username_num_ratio` - Numeric character ratio in username
3. `username_is_numeric` - Username is entirely numeric
4. `fullname_words` - Number of words in full name
5. `fullname_num_ratio` - Numeric character ratio in full name
6. `is_name_number_only` - Full name contains only numbers
7. `name_equals_username` - Full name matches username
8. `followers` - Follower count
9. `follows` - Following count
10. `followers_to_follows_ratio` - Follower/following ratio
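
The exact feature definitions are not documented here, so the following is only a plausible sketch of how the ten features might be derived from raw account metadata. The helper names (`digit_ratio`, `build_features`), the case-insensitive name comparison, and the zero-follows fallback are all assumptions; check them against the original training code before relying on them.

```python
def digit_ratio(text: str) -> float:
    """Fraction of characters in `text` that are digits (0.0 if empty)."""
    if not text:
        return 0.0
    return sum(ch.isdigit() for ch in text) / len(text)


def build_features(username: str, fullname: str, has_profile_pic: bool,
                   followers: int, follows: int) -> dict:
    """Derive the ten model features from raw metadata (assumed definitions)."""
    return {
        "profile_pic": int(has_profile_pic),
        "username_num_ratio": digit_ratio(username),
        "username_is_numeric": int(username.isdigit()),
        "fullname_words": len(fullname.split()),
        "fullname_num_ratio": digit_ratio(fullname),
        "is_name_number_only": int(fullname.replace(" ", "").isdigit()),
        "name_equals_username": int(
            fullname.strip().lower() == username.strip().lower()
        ),
        "followers": followers,
        "follows": follows,
        # Assumption: fall back to 0.0 when the account follows nobody.
        "followers_to_follows_ratio": followers / follows if follows else 0.0,
    }


example = build_features("jane_doe42", "Jane Doe", True, 1200, 300)
print(example)
```

The dictionary keys match the feature names above, so the result can be passed to a prediction helper that expects them.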

---

## 🚀 Usage

### Prerequisites

```bash
# Pin scikit-learn to the version the model was trained with (1.5.2)
pip install scikit-learn==1.5.2 joblib numpy
```

### Loading the Model

```python
import joblib
import numpy as np

# Load model and scaler
model = joblib.load('instagram_bot_detection_v2.pkl')
scaler = joblib.load('instagram_scaler_v2.pkl')

# Example prediction
features = np.array([[
    1,      # profile_pic
    0.15,   # username_num_ratio
    0,      # username_is_numeric
    2,      # fullname_words
    0.0,    # fullname_num_ratio
    0,      # is_name_number_only
    0,      # name_equals_username
    1200,   # followers
    300,    # follows
    4.0     # followers_to_follows_ratio
]])

# Scale features
features_scaled = scaler.transform(features)

# Make prediction
prediction = model.predict(features_scaled)
probability = model.predict_proba(features_scaled)

print(f"Bot: {prediction[0] == 1}")
print(f"Probability: {probability[0][1]:.4f}")
```

### API Integration

```python
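import joblib
import numpy as np

# Load the artifacts once at import time, not on every call
model = joblib.load('instagram_bot_detection_v2.pkl')
scaler = joblib.load('instagram_scaler_v2.pkl')


def predict_instagram_bot(account_data: dict) -> dict: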
    """
    Predict if an Instagram account is a bot.

    Args:
        account_data: Dictionary with account features

    Returns:
        Dictionary with prediction and probability
    """
    features = np.array([[
        account_data['profile_pic'],
        account_data['username_num_ratio'],
        account_data['username_is_numeric'],
        account_data['fullname_words'],
        account_data['fullname_num_ratio'],
        account_data['is_name_number_only'],
        account_data['name_equals_username'],
        account_data['followers'],
        account_data['follows'],
        account_data['followers_to_follows_ratio']
    ]])

    features_scaled = scaler.transform(features)
    prediction = model.predict(features_scaled)[0]
    probability = model.predict_proba(features_scaled)[0]

    return {
        'is_bot': bool(prediction),
        'bot_probability': float(probability[1]),
        'confidence': float(max(probability))
    }
```

---

## 📊 Visualization

The `images/` directory contains 13 visualization plots:

1. **confusion_matrix.png** - Classification confusion matrix
2. **roc_curve.png** - ROC curve with AUC score
3. **precision_recall_curve.png** - Precision-recall trade-off
4. **feature_importance.png** - Feature importance ranking
5. **learning_curve.png** - Model learning curve
6. **class_distribution.png** - Training data class distribution
7. **prediction_distribution.png** - Prediction score distribution
8. **calibration_curve.png** - Probability calibration
9. **cv_scores.png** - Cross-validation scores
10. **top_features.png** - Top 10 features
11. **correlation_matrix.png** - Feature correlation heatmap
12. **threshold_analysis.png** - Classification threshold analysis
13. **model_comparison.png** - Baseline vs tuned model comparison

---

## 🎓 Model Training

### Training Process

1. **Data Preprocessing**: Feature engineering and normalization
2. **Train-Test Split**: 80/20 split with stratification
3. **Hyperparameter Tuning**: GridSearchCV with 5-fold cross-validation
4. **Model Selection**: Best parameters based on ROC-AUC score
5. **Evaluation**: Comprehensive metrics on held-out test set
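
The steps above can be sketched end to end as follows. This is not the original training script: the data is synthetic, the parameter grid is a hypothetical two-by-two example, and the scaler is fitted on the training split only, which is an assumption about the ordering of steps 1 and 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 80/20 stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Normalize using statistics from the training split only.
demo_scaler = MinMaxScaler().fit(X_train)
X_train_s = demo_scaler.transform(X_train)
X_test_s = demo_scaler.transform(X_test)

# Hypothetical grid; the grid actually searched is not documented.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 15]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train_s, y_train)

# Evaluate the refitted best model on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test_s)[:, 1])

print("Best parameters:", search.best_params_)
print(f"Mean CV ROC-AUC: {search.best_score_:.4f}")
print(f"Test ROC-AUC: {test_auc:.4f}")
```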

### Cross-Validation

- **Mean ROC-AUC:** 0.9988
- **Folds:** 5
- **Strategy:** Stratified K-Fold

---

## ⚠️ Limitations

1. **Data Dependency**: Model performance depends on feature quality and data accuracy
2. **Feature Availability**: All 10 features must be available for prediction
3. **Temporal Drift**: Instagram's platform and bot behavior may change over time
4. **Privacy**: Ensure compliance with Instagram's terms of service when collecting data
5. **Threshold Sensitivity**: Default threshold is 0.5; may need adjustment based on use case
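
Two of these limitations, feature availability and threshold sensitivity, can be handled with small guards around the model call. The sketch below is one possible approach; `FEATURE_ORDER`, `to_feature_row`, and `is_bot` are hypothetical helpers, not part of the released files.

```python
# Feature names in the order the usage example passes them to the model.
FEATURE_ORDER = [
    "profile_pic", "username_num_ratio", "username_is_numeric",
    "fullname_words", "fullname_num_ratio", "is_name_number_only",
    "name_equals_username", "followers", "follows",
    "followers_to_follows_ratio",
]


def to_feature_row(account_data: dict) -> list:
    """Return the features in model order, failing loudly if any is missing."""
    missing = [name for name in FEATURE_ORDER if name not in account_data]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    return [float(account_data[name]) for name in FEATURE_ORDER]


def is_bot(bot_probability: float, threshold: float = 0.5) -> bool:
    """Flag an account when its bot probability is strictly above the threshold."""
    return bot_probability > threshold


# Example: a stricter threshold trades recall for fewer false positives.
print(is_bot(0.62))                 # default threshold of 0.5
print(is_bot(0.62, threshold=0.9))  # stricter, hypothetical threshold
```

The `bot_probability` argument would be the second column of `model.predict_proba`, as in the usage examples earlier in this README.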

---

## 📝 License

This model is released under the **Apache License 2.0**.

---

## 🔄 Version History

- **v2** (2025-11-27): Current version with hyperparameter tuning
  - ROC-AUC: 0.9990
  - Accuracy: 98.60%
  - 10 features

---

## 📧 Contact & Citation

If you use this model in your research or application, please cite:

```bibtex
@misc{instagram-bot-detection-v2,
  title={Instagram Bot Detection Model v2},
  author={Nahiar},
  year={2025},
  month={November},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nahiar/instagram-bot-detection}}
}
```

---

## 🤝 Contributing

For issues, improvements, or questions, please contact the model maintainer.