	---
	language: "en"
	license: "apache-2.0"
	created: "2025-11-27T05:32:51.193018Z"
	tags:
	- "bot-detection"
	- "instagram"
	- "classification"
	---

	# Instagram Bot Detection Model

	## Overview

	This directory contains a trained Random Forest classifier for detecting bot accounts on Instagram.

	- Model Version: v2
	- Training Date: 2025-11-27 11:38:28
	- Framework: scikit-learn 1.5.2
	- Algorithm: Random Forest Classifier with GridSearchCV hyperparameter tuning

	---

	## 📊 Model Performance

	### Final Metrics (Test Set)

	| Metric            | Score           |
	| ----------------- | --------------- |
	| Accuracy          | 0.9860 (98.60%) |
	| Precision         | 0.9918 (99.18%) |
	| Recall            | 0.9796 (97.96%) |
	| F1-Score          | 0.9857 (98.57%) |
	| ROC-AUC           | 0.9990 (99.90%) |
	| Average Precision | 0.9990 (99.90%) |
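
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below uses only numbers from the table; the variable names are illustrative.

```python
# Reported test-set values from the metrics table.
precision = 0.9918
recall = 0.9796

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed: {f1:.4f}")  # prints "F1 recomputed: 0.9857"
```

The recomputed value agrees with the reported F1-Score to four decimal places.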

	### Model Improvement

	- Baseline ROC-AUC: 0.9988
	- Tuned ROC-AUC: 0.9990
	- Improvement: +0.0002 (0.02 percentage points)

	---

	## 🗂️ Files

	| File                             | Description                            |
	| -------------------------------- | -------------------------------------- |
	| `instagram_bot_detection_v2.pkl` | Trained Random Forest model            |
	| `instagram_scaler_v2.pkl`        | MinMaxScaler for feature normalization |
	| `instagram_features_v2.json`     | List of features used by the model     |
	| `instagram_metrics_v2.txt`       | Detailed performance metrics report    |
	| `images/`                        | All visualization plots (13 images)    |
	| `README.md`                      | This file                              |

	---

	## 🎯 Dataset Information

	### Training Configuration

	- Training Samples: 4,000
	- Test Samples: 1,000
	- Total Samples: 5,000
	- Features: 10
	- Cross-Validation Folds: 5
	- Random State: 42

	### Class Distribution

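	The model was trained with `class_weight='balanced'`, which weights each class inversely to its frequency in the training data to compensate for any class imbalance. The class counts are not listed here; see `class_distribution.png` in `images/`.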
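
For reference, scikit-learn's `balanced` mode weights each class by `n_samples / (n_classes * count(class))`. The sketch below illustrates this on made-up labels. The 70/30 split is purely hypothetical and is not the class distribution of this dataset, which this README does not report.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 70 human accounts (0) and 30 bots (1).
y = np.array([0] * 70 + [1] * 30)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weights = dict(zip([0, 1], weights))

# The minority class receives the larger weight:
# 100 / (2 * 70) for class 0 and 100 / (2 * 30) for class 1.
print(class_weights)
```

On a perfectly balanced dataset both weights come out as 1.0, so the setting has no effect there.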

	---

	## 🔧 Model Architecture

	### Algorithm

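	Random Forest Classifier - An ensemble learning method that trains many decision trees (by default, each on a bootstrap sample of the training data) and combines their outputs. In scikit-learn, the forest averages the class probabilities predicted by the individual trees and returns the class with the highest average, rather than taking a majority vote over the trees' predicted classes.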
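
How scikit-learn combines the trees can be checked directly: its `RandomForestClassifier` averages the per-tree class probabilities and predicts the class with the highest average. The sketch below demonstrates this on synthetic stand-in data (not the Instagram dataset) with a small forest; the sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 10 features; not the Instagram dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

forest = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# Average the class probabilities predicted by each individual tree.
tree_mean = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# The forest's probabilities match the per-tree average, and the
# predicted class is the argmax of that average.
same_proba = np.allclose(forest.predict_proba(X), tree_mean)
same_label = np.array_equal(forest.predict(X), forest.classes_[tree_mean.argmax(axis=1)])
print(same_proba, same_label)
```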

	### Hyperparameters (Tuned via GridSearchCV)

	| Parameter           | Value    |
	| ------------------- | -------- |
	| `n_estimators`      | 100      |
	| `max_depth`         | 15       |
	| `min_samples_split` | 2        |
	| `min_samples_leaf`  | 1        |
	| `max_features`      | sqrt     |
	| `class_weight`      | balanced |
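
As a minimal sketch, an untrained classifier with the same settings could be constructed as follows. The values are copied from the table and `random_state=42` from the training configuration; leaving every other argument at its scikit-learn default is an assumption about the original training setup.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters from the table above. Arguments not listed there
# are left at their defaults (an assumption).
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    class_weight="balanced",
    random_state=42,
)

params = clf.get_params()
print(params["n_estimators"], params["max_depth"], params["max_features"])
```

Calling `get_params()` on the model loaded from `instagram_bot_detection_v2.pkl` should give a way to compare the shipped settings against this table.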

	### Feature Preprocessing

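	- Scaler: MinMaxScaler (rescales each feature to [0, 1] using the minimum and maximum seen during training); inputs must be passed through the saved scaler before prediction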
	- Missing Values: Handled during data preprocessing
	- Feature Engineering: Custom features derived from account metadata
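
To illustrate what the scaler does, the sketch below fits a `MinMaxScaler` on a few made-up follower counts. The numbers are hypothetical, and whether the shipped scaler was created with `clip=True` is not stated in this README.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical follower counts used to fit a demo scaler.
train = np.array([[0.0], [500.0], [2000.0]])
scaler_demo = MinMaxScaler().fit(train)

# Each value becomes (x - min) / (max - min), using the min and max
# seen during fit: 0, 500 and 2000 map to 0, 0.25 and 1.
scaled_train = scaler_demo.transform(train).ravel()
print(scaled_train)

# A value above the fitted maximum maps above 1.0; MinMaxScaler does
# not clip unless it is constructed with clip=True.
outside = scaler_demo.transform(np.array([[4000.0]]))[0, 0]
print(outside)
```

Accounts with follower counts far beyond the training range will therefore produce scaled values outside [0, 1] under the default settings.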

	---

	## 📈 Feature Importance

	The model uses 10 features to detect bot accounts. Top 5 most important features:

	| Rank | Feature                      | Importance | Description                                |
	| ---- | ---------------------------- | ---------- | ------------------------------------------ |
	| 1    | `profile_pic`                | 0.3314     | Indicates if account has a profile picture |
	| 2    | `followers`                  | 0.2313     | Number of followers                        |
	| 3    | `username_num_ratio`         | 0.1665     | Ratio of numbers in username               |
	| 4    | `followers_to_follows_ratio` | 0.1308     | Ratio of followers to following count      |
	| 5    | `follows`                    | 0.0923     | Number of accounts followed                |

	### All Features

	1. `profile_pic` - Profile picture presence
	2. `username_num_ratio` - Numeric character ratio in username
	3. `username_is_numeric` - Username is entirely numeric
	4. `fullname_words` - Number of words in full name
	5. `fullname_num_ratio` - Numeric character ratio in full name
	6. `is_name_number_only` - Full name contains only numbers
	7. `name_equals_username` - Full name matches username
	8. `followers` - Follower count
	9. `follows` - Following count
	10. `followers_to_follows_ratio` - Follower/following ratio
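
The features above are derived from raw account metadata. The sketch below shows one plausible way to compute them. The function `extract_features`, its arguments, and every definition in it (how digits are counted, case handling, the ratio when `follows` is zero) are assumptions for illustration; the definitions actually used in training are not documented here and should be checked before relying on this.

```python
def extract_features(username: str, fullname: str, has_profile_pic: bool,
                     followers: int, follows: int) -> dict:
    """Hypothetical feature extraction; all definitions are assumptions."""

    def digit_ratio(text: str) -> float:
        # Share of characters that are digits; 0.0 for an empty string.
        return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0

    return {
        "profile_pic": int(has_profile_pic),
        "username_num_ratio": digit_ratio(username),
        "username_is_numeric": int(username.isdigit()),
        "fullname_words": len(fullname.split()),
        "fullname_num_ratio": digit_ratio(fullname),
        "is_name_number_only": int(fullname.replace(" ", "").isdigit()),
        "name_equals_username": int(fullname.strip().lower() == username.lower()),
        "followers": followers,
        "follows": follows,
        # Assumption: the ratio is 0.0 when the account follows nobody.
        "followers_to_follows_ratio": followers / follows if follows else 0.0,
    }


# Made-up account for illustration.
row = extract_features("jane_doe42", "Jane Doe", True, 1200, 300)
print(row)
```

The dictionary keys are inserted in the same order as the feature list, which is the order the usage examples in this README pass to the model.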

	---

	## 🚀 Usage

	### Prerequisites

	```bash
	pip install scikit-learn==1.5.2 joblib numpy
	```

	### Loading the Model

	```python
	import joblib
	import numpy as np

	# Load model and scaler
	model = joblib.load('instagram_bot_detection_v2.pkl')
	scaler = joblib.load('instagram_scaler_v2.pkl')

	# Example prediction
	features = np.array([[
	    1,      # profile_pic
	    0.15,   # username_num_ratio
	    0,      # username_is_numeric
	    2,      # fullname_words
	    0.0,    # fullname_num_ratio
	    0,      # is_name_number_only
	    0,      # name_equals_username
	    1200,   # followers
	    300,    # follows
	    4.0     # followers_to_follows_ratio
	]])

	# Scale features
	features_scaled = scaler.transform(features)

	# Make prediction
	prediction = model.predict(features_scaled)
	probability = model.predict_proba(features_scaled)

	print(f"Bot: {prediction[0] == 1}")
	print(f"Probability: {probability[0][1]:.4f}")
	```

	### API Integration

	```python
	import numpy as np

	# Assumes `model` and `scaler` are already loaded as shown in "Loading the Model".
	def predict_instagram_bot(account_data: dict) -> dict:
	    """
	    Predict if an Instagram account is a bot.

	    Args:
	        account_data: Dictionary with account features

	    Returns:
	        Dictionary with prediction and probability
	    """
	    # Features are passed positionally, so the order must match training.
	    features = np.array([[
	        account_data['profile_pic'],
	        account_data['username_num_ratio'],
	        account_data['username_is_numeric'],
	        account_data['fullname_words'],
	        account_data['fullname_num_ratio'],
	        account_data['is_name_number_only'],
	        account_data['name_equals_username'],
	        account_data['followers'],
	        account_data['follows'],
	        account_data['followers_to_follows_ratio']
	    ]])

	    features_scaled = scaler.transform(features)
	    prediction = model.predict(features_scaled)[0]
	    probability = model.predict_proba(features_scaled)[0]

	    return {
	        'is_bot': bool(prediction),
	        'bot_probability': float(probability[1]),  # probability of class 1 (bot)
	        'confidence': float(max(probability))      # probability of the predicted class
	    }
	```
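
Because the model receives features positionally, a small helper that orders and validates the input can guard against missing or misordered keys. The sketch below is one way to do that. The helper names `load_feature_names` and `build_feature_row` are hypothetical, and the assumption that `instagram_features_v2.json` holds a flat JSON list of feature names has not been verified against the file.

```python
import json

import numpy as np


def load_feature_names(path: str) -> list:
    # Assumption: the JSON file is a flat list of feature names.
    with open(path) as fh:
        return json.load(fh)


def build_feature_row(account_data: dict, feature_names: list) -> np.ndarray:
    """Return a (1, n_features) array ordered like feature_names."""
    missing = [name for name in feature_names if name not in account_data]
    if missing:
        raise KeyError(f"Missing features: {missing}")
    return np.array([[float(account_data[name]) for name in feature_names]])


# Feature order as listed in this README; in practice it could come from
# load_feature_names('instagram_features_v2.json') if the file has that shape.
feature_names = [
    "profile_pic", "username_num_ratio", "username_is_numeric", "fullname_words",
    "fullname_num_ratio", "is_name_number_only", "name_equals_username",
    "followers", "follows", "followers_to_follows_ratio",
]
account = dict(zip(feature_names, [1, 0.15, 0, 2, 0.0, 0, 0, 1200, 300, 4.0]))

features = build_feature_row(account, feature_names)
print(features.shape)  # (1, 10)
```

The resulting array can be passed to `scaler.transform` in place of the hand-built array in the examples above.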

	---

	## 📊 Visualization

	The `images/` directory contains 13 visualization plots:

	1. confusion_matrix.png - Classification confusion matrix
	2. roc_curve.png - ROC curve with AUC score
	3. precision_recall_curve.png - Precision-recall trade-off
	4. feature_importance.png - Feature importance ranking
	5. learning_curve.png - Model learning curve
	6. class_distribution.png - Training data class distribution
	7. prediction_distribution.png - Prediction score distribution
	8. calibration_curve.png - Probability calibration
	9. cv_scores.png - Cross-validation scores
	10. top_features.png - Top 10 features
	11. correlation_matrix.png - Feature correlation heatmap
	12. threshold_analysis.png - Classification threshold analysis
	13. model_comparison.png - Baseline vs tuned model comparison

	---

	## 🎓 Model Training

	### Training Process

	1. Data Preprocessing: Feature engineering and normalization
	2. Train-Test Split: 80/20 split with stratification
	3. Hyperparameter Tuning: GridSearchCV with 5-fold cross-validation
	4. Model Selection: Best parameters based on ROC-AUC score
	5. Evaluation: Comprehensive metrics on held-out test set
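
The steps above can be sketched end to end as follows. This is an illustration on synthetic data, not the original training script: the parameter grid is hypothetical (only the winning values are reported in this README), and fitting the scaler on the training split only is an assumption about how normalization was done.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 10-feature dataset (smaller, for speed).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Step 2: stratified 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 1 (normalization): fit the scaler on the training split only (assumption).
scaler_demo = MinMaxScaler().fit(X_train)
X_train_s = scaler_demo.transform(X_train)
X_test_s = scaler_demo.transform(X_test)

# Steps 3-4: grid search scored by ROC-AUC over 5 stratified folds.
# The grid below is hypothetical.
param_grid = {"n_estimators": [50, 100], "max_depth": [10, 15]}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train_s, y_train)

# Step 5: evaluate the refit best model on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test_s)[:, 1])
print(search.best_params_, round(test_auc, 4))
```

`search.best_score_` holds the mean cross-validated ROC-AUC of the best parameter combination, which is the kind of figure reported under Cross-Validation.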

	### Cross-Validation

	- Mean ROC-AUC: 0.9988
	- Folds: 5
	- Strategy: Stratified K-Fold

	---

	## ⚠️ Limitations

	1. Data Dependency: Model performance depends on feature quality and data accuracy
	2. Feature Availability: All 10 features must be available for prediction
	3. Temporal Drift: Instagram's platform and bot behavior may change over time
	4. Privacy: Ensure compliance with Instagram's terms of service when collecting data
	5. Threshold Sensitivity: Default threshold is 0.5; may need adjustment based on use case
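
Regarding threshold sensitivity, a custom threshold can be applied to the bot probability instead of calling `predict`. The sketch below uses made-up probabilities; the helper name `classify_with_threshold` is hypothetical, and treating class 1 as "bot" follows the usage examples in this README.

```python
import numpy as np


def classify_with_threshold(bot_probability, threshold: float = 0.5) -> np.ndarray:
    """Label an account as a bot (1) when its probability exceeds the threshold."""
    return (np.asarray(bot_probability) > threshold).astype(int)


# Hypothetical probabilities, e.g. model.predict_proba(X)[:, 1].
probs = np.array([0.20, 0.55, 0.80, 0.95])

default_labels = classify_with_threshold(probs)       # threshold 0.5
strict_labels = classify_with_threshold(probs, 0.9)   # fewer accounts flagged

print(default_labels)  # [0 1 1 1]
print(strict_labels)   # [0 0 0 1]
```

Raising the threshold trades recall for precision, which may suit use cases where wrongly flagging a real account is costly.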

	---

	## 📝 License

	This model is released under the Apache License 2.0.

	---

	## 🔄 Version History

	- v2 (2025-11-27): Current version with hyperparameter tuning
	  - ROC-AUC: 0.9990
	  - Accuracy: 98.60%
	  - 10 features

	---

	## 📧 Contact & Citation

	If you use this model in your research or application, please cite:

	```bibtex
	@misc{instagram-bot-detection-v2,
	  title={Instagram Bot Detection Model v2},
	  author={Nahiar},
	  year={2025},
	  month={November},
	  publisher={Hugging Face},
	  howpublished={\url{https://huggingface.co/nahiar/instagram-bot-detection}}
	}
	```

	---

	## 🤝 Contributing

	For issues, improvements, or questions, please contact the model maintainer.