Upload twitter-bot-detection model

Files changed (3)

.gitattributes +0 -34
README.md +315 -0
requirements.txt +4 -0

.gitattributes CHANGED

@@ -1,35 +1 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,315 @@

+---
+language: en
+license: mit
+tags:
+  - bot-detection
+  - twitter
+  - random-forest
+  - sklearn
+  - social-media
+  - classification
+metrics:
+  - accuracy
+  - precision
+  - recall
+  - f1
+  - roc-auc
+library_name: scikit-learn
+---
+# Twitter Bot Detection Model
+## Model Description
+This Random Forest classifier is designed to detect bot accounts on Twitter/X based on profile features and behavioral patterns. The model analyzes various account characteristics to determine whether an account is likely automated (bot) or genuine (human).
+## Model Details
+- **Model Type**: Random Forest Classifier
+- **Framework**: scikit-learn
+- **Task**: Binary Classification (Bot vs Human)
+- **Language**: Python
+- **License**: MIT
+## Performance Metrics
+The model was evaluated on a held-out test set after hyperparameter tuning. Numeric scores for the metrics listed in the metadata (accuracy, precision, recall, F1, ROC-AUC) are not reported in this card.
+- **Evaluation**: Held-out test split, with cross-validation used during training
+- **Version**: v2
+The hyperparameters were tuned for Twitter/X account features and bot patterns.
+## Features Used
+The model uses the following features for prediction:
+1. **IsPrivate** - Whether the account is protected/private
+2. **IsVerified** - Whether the account has a verification badge (blue checkmark)
+3. **HasProfilePic** - Whether the account has a profile picture
+4. **FollowingCount** - Number of accounts being followed
+5. **FollowerCount** - Number of followers
+6. **HasLocation** - Whether location information is provided
+7. **HasDescription** - Whether the account has a bio/description
+8. **TweetsCount** - Total number of tweets posted
+9. **FollowToFollowerRatio** - FollowingCount divided by FollowerCount
+10. **AccountAge** - Age of the account in days (if available)
+11. **HasUrl** - Whether the profile contains a URL
+12. **DefaultProfileImage** - Whether the account uses the default profile image
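The two derived features in the list above can be computed from raw profile fields. The following is a minimal sketch, not the authors' extraction code: the helper names are hypothetical, and the zero-follower guard is an assumption, since the card does not say how that case is handled.

```python
from datetime import datetime, timezone


def follow_to_follower_ratio(following_count, follower_count):
    """Return FollowingCount / FollowerCount."""
    # Assumption: treat zero followers as one to avoid division by zero.
    return following_count / max(follower_count, 1)


def account_age_days(created_at, now=None):
    """Return whole days between account creation and `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return (now - created_at).days


ratio = follow_to_follower_ratio(5000, 50)
age = account_age_days(
    datetime(2024, 10, 1, tzinfo=timezone.utc),
    now=datetime(2024, 10, 31, tzinfo=timezone.utc),
)
print(ratio, age)
```

With the values from the card's example account (5000 following, 50 followers) the ratio is 100.0, matching the `FollowToFollowerRatio` shown there.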
+## Intended Use
+### Primary Uses
+- Identifying potential bot accounts on Twitter/X
+- Content moderation and platform integrity
+- Research on social media bot behavior and misinformation campaigns
+- Automated account screening for spam detection
+- Election integrity and political bot detection
+### Out-of-Scope Uses
+- This model is specifically trained for Twitter/X and should not be used for other platforms without retraining
+- Should not be the sole basis for account suspension decisions
+- Not designed for real-time detection without proper infrastructure
+- Not suitable for detecting state-sponsored advanced persistent threats without additional features
+- Should not be used to target legitimate users based on behavior patterns
+## How to Use
+### Installation
+```bash
+pip install scikit-learn pandas numpy joblib
+```
+### Loading the Model
+```python
+import joblib
+import pandas as pd
+import numpy as np
+from sklearn.preprocessing import MinMaxScaler
+# Load the model
+model = joblib.load('Twitter_BOT_Detection_Model_v1.pkl')
+# Prepare your data
+features = ['IsPrivate', 'IsVerified', 'HasProfilePic', 'FollowingCount',
+            'FollowerCount', 'HasLocation', 'HasDescription', 'TweetsCount',
+            'FollowToFollowerRatio', 'AccountAge', 'HasUrl', 'DefaultProfileImage']
+# Example account data
+account_data = {
+    'IsPrivate': 0,
+    'IsVerified': 0,
+    'HasProfilePic': 1,
+    'FollowingCount': 5000,
+    'FollowerCount': 50,
+    'HasLocation': 0,
+    'HasDescription': 0,
+    'TweetsCount': 10000,
+    'FollowToFollowerRatio': 100.0,
+    'AccountAge': 30,  # days
+    'HasUrl': 1,
+    'DefaultProfileImage': 0
+}
+# Create DataFrame
+df = pd.DataFrame([account_data])
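+# Scale features with the scaler fitted during training. Fitting a new
+# MinMaxScaler on this single row would map every feature to 0.
+# 'scaler.pkl' is a hypothetical filename; this repository does not document one.
+scaler = joblib.load('scaler.pkl')
+df_scaled = scaler.transform(df[features])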
+# Make prediction
+prediction = model.predict(df_scaled)
+probability = model.predict_proba(df_scaled)
+print(f"Prediction: {'Bot' if prediction[0] == 1 else 'Human'}")
+print(f"Confidence - Human: {probability[0][0]:.2%}, Bot: {probability[0][1]:.2%}")
+```
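The note about saving and loading the training scaler can be illustrated as follows. This is a sketch under stated assumptions: the three-column matrix is made-up illustrative data, and `scaler.pkl` is a hypothetical filename that does not appear in this repository.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative training rows: FollowingCount, FollowerCount, TweetsCount.
X_train = np.array([
    [10.0, 200.0, 50.0],
    [5000.0, 50.0, 10000.0],
    [300.0, 300.0, 1200.0],
])

# Fit once on the training data and persist alongside the model.
scaler = MinMaxScaler().fit(X_train)
scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.pkl")  # hypothetical name
joblib.dump(scaler, scaler_path)

# At inference time, load the saved scaler and only call transform().
loaded_scaler = joblib.load(scaler_path)
new_account = np.array([[5000.0, 50.0, 10000.0]])
scaled = loaded_scaler.transform(new_account)

# For comparison: refitting on the single row collapses every feature to 0.
refit = MinMaxScaler().fit_transform(new_account)
print(scaled, refit)
```

The comparison at the end shows why a scaler refit at prediction time gives the model inputs that carry no information about the account.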
+### Batch Prediction with Threshold
+```python
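+# For multiple accounts; `model`, `scaler`, and `features` come from the previous example
+accounts_df = pd.read_csv('twitter_accounts_to_check.csv')
+# `scaler` must be the one fitted on the training data; do not refit it here
+accounts_scaled = scaler.transform(accounts_df[features])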
+predictions = model.predict(accounts_scaled)
+probabilities = model.predict_proba(accounts_scaled)
+# Add results to DataFrame
+accounts_df['is_bot'] = predictions
+accounts_df['bot_probability'] = probabilities[:, 1]
+# Filter by confidence threshold
+high_confidence_bots = accounts_df[accounts_df['bot_probability'] > 0.9]
+suspected_bots = accounts_df[(accounts_df['bot_probability'] > 0.7) &
+                              (accounts_df['bot_probability'] <= 0.9)]
+```
+### Integration Example
+```python
+import joblib
+import pandas as pd
+
+class TwitterBotDetector:
+    def __init__(self, model_path, scaler_path):
+        self.model = joblib.load(model_path)
+        # Load the scaler fitted during training; refitting on a single
+        # account would map every feature to 0.
+        self.scaler = joblib.load(scaler_path)
+        self.features = ['IsPrivate', 'IsVerified', 'HasProfilePic',
+                        'FollowingCount', 'FollowerCount', 'HasLocation',
+                        'HasDescription', 'TweetsCount', 'FollowToFollowerRatio',
+                        'AccountAge', 'HasUrl', 'DefaultProfileImage']
+    def predict(self, account_features):
+        """Predict whether an account is a bot."""
+        df = pd.DataFrame([account_features])
+        df_scaled = self.scaler.transform(df[self.features])
+        prediction = self.model.predict(df_scaled)[0]
+        probability = self.model.predict_proba(df_scaled)[0]
+        return {
+            'is_bot': bool(prediction),
+            'bot_probability': float(probability[1]),
+            'human_probability': float(probability[0])
+        }
+# Usage ('scaler.pkl' is a hypothetical filename for the saved training scaler)
+detector = TwitterBotDetector('Twitter_BOT_Detection_Model_v1.pkl', 'scaler.pkl')
+result = detector.predict(account_data)
+print(result)
+```
+## Training Data
+The model was trained on a comprehensive dataset of Twitter accounts with labeled bot/human classifications. The dataset includes:
+- Balanced distribution of bot and human accounts
+- Various bot types (spam bots, political bots, engagement bots, etc.)
+- Diverse account types, ages, and activity levels
+- Features extracted from public profile information
+**Note**: The training data is proprietary and not included in this repository.
+## Training Procedure
+### Preprocessing
+1. Feature extraction from Twitter account profiles via API
+2. Calculation of derived features (FollowToFollowerRatio, AccountAge)
+3. Handling of missing values and outliers
+4. MinMax normalization of all features to [0, 1] range
+5. Train-test split with stratification to maintain class balance
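The normalization and stratified split steps above might look roughly like this. It is a sketch on synthetic data, since the training set is proprietary. The card lists normalization before the split; this sketch instead fits the scaler on the training split only, which is an assumption about good practice and not a description of the authors' pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the proprietary training data (illustrative only).
n = 200
data = pd.DataFrame({
    "FollowingCount": rng.integers(0, 5000, n),
    "FollowerCount": rng.integers(0, 5000, n),
    "TweetsCount": rng.integers(0, 20000, n),
})
labels = np.array([0, 1] * (n // 2))  # balanced human (0) / bot (1) labels

# Stratify on the labels so both splits keep the 50/50 class balance.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit MinMax scaling on the training split, then reuse it for the test split.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape, y_test.mean())
```

Test-split values can fall slightly outside [0, 1] when a test account exceeds the range seen in training; that is expected with this ordering.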
+### Hyperparameters
+- **Algorithm**: Random Forest Classifier
+- **Version**: v2 (optimized)
+- **Normalization**: MinMaxScaler
+- **Cross-validation**: Stratified K-Fold
+- **Feature Selection**: Based on domain knowledge and feature importance analysis
+The model was trained using scikit-learn's RandomForestClassifier with optimized hyperparameters selected through extensive cross-validation and grid search.
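A grid search with stratified k-fold cross-validation, as described above, can be sketched as follows. The parameter grid, fold count, scoring metric, and synthetic 12-feature dataset are all assumptions for illustration; the card does not list the actual values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic binary dataset with 12 features, mirroring the feature count only.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=6, random_state=0
)

# Hypothetical search space; the real grid is not given in the card.
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="f1",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refit on the full data passed to `fit`, so it can be saved with `joblib.dump` once a grid has been chosen.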
+## Limitations and Bias
+### Limitations
+- Model performance depends on the quality and accuracy of input features
+- May not generalize to new bot patterns not seen during training
+- Requires access to Twitter API for feature extraction
+- Performance may degrade over time as bot behaviors evolve rapidly
+- Limited to profile-level features; does not analyze tweet content deeply
+- May struggle with sophisticated bots that mimic human behavior closely
+- Requires regular updates due to platform changes (Twitter → X)
+### Potential Biases
+- May be biased toward bot patterns present in the training data
+- Could have temporal biases based on when training data was collected
+- May misclassify legitimate accounts with unusual behavior patterns
+- Potential bias against new accounts or accounts with low activity
+- Could reflect biases in the original labeling process
+- May have difficulty with non-English accounts if training data is primarily English
+### Recommendations
+- Regularly retrain the model with new data to capture evolving bot patterns
+- Use as part of a multi-layered detection system including content analysis
+- Implement human review for high-stakes decisions
+- Monitor for false positives and adjust classification thresholds based on use case
+- Combine with tweet content analysis, network analysis, and temporal patterns
+- Consider context (political events, trending topics) when interpreting results
+- Validate performance across different account types and languages
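The recommendation to adjust classification thresholds can be made concrete by measuring precision and recall at each candidate threshold on labeled data. The sketch below uses made-up labels and probabilities; the helper name is hypothetical, and the 0.9 and 0.7 cut-offs are taken from the batch prediction example.

```python
def precision_recall_at(y_true, bot_probability, threshold):
    """Precision and recall when accounts above `threshold` are flagged as bots."""
    predicted = [p > threshold for p in bot_probability]
    tp = sum(1 for t, flag in zip(y_true, predicted) if t == 1 and flag)
    fp = sum(1 for t, flag in zip(y_true, predicted) if t == 0 and flag)
    fn = sum(1 for t, flag in zip(y_true, predicted) if t == 1 and not flag)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative labels (1 = bot) and predicted bot probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
bot_probability = [0.95, 0.85, 0.60, 0.75, 0.20, 0.10, 0.92, 0.40]

for threshold in (0.9, 0.7):
    precision, recall = precision_recall_at(y_true, bot_probability, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data the stricter threshold flags no humans but misses half of the bots, while the looser one catches more bots at the cost of one false positive. Which trade-off is acceptable depends on the use case.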
+## Ethical Considerations
+- This model should be used responsibly and not for harassment, doxxing, or targeting
+- Consider privacy implications when analyzing user accounts
+- Ensure compliance with Twitter/X's terms of service and relevant privacy laws (GDPR, CCPA, etc.)
+- Implement appropriate safeguards against misuse
+- Provide transparency to users about automated detection systems
+- Allow for appeals and manual review processes
+- Be aware of potential for false accusations
+- Consider impact on freedom of speech and legitimate automated accounts (news bots, etc.)
+- Monitor for discriminatory outcomes across different user groups
+## Known Issues
+- Twitter's API changes may affect feature availability
+- Platform rebranding (Twitter → X) may introduce new bot patterns
+- Changes in verification system may affect IsVerified feature utility
+## Model Card Authors
+This model card was created as part of the Bot Detection project for social media platforms.
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{twitter_bot_detection_2024,
+  title={Twitter Bot Detection Model v2},
+  author={Your Name/Organization},
+  year={2024},
+  publisher={Hugging Face},
+  howpublished={\url{https://huggingface.co/your-username/twitter-bot-detection}}
+}
+```
+## Related Models
+- [TikTok Bot Detection](https://huggingface.co/your-username/tiktok-bot-detection)
+- [Instagram Bot Detection](https://huggingface.co/your-username/instagram-bot-detection)
+## Contact
+For questions or feedback about this model, please open an issue in the repository or contact the maintainers.
+## Updates and Maintenance
+- **Version**: 2.0
+- **Last Updated**: November 2024
+- **Status**: Active
+### Changelog
+- **v2.0**: Improved hyperparameters, better cross-validation, optimized for current Twitter/X platform
+- **v1.0**: Initial release
+### Future Updates
+Future updates may include:
+- Improved feature engineering based on new platform features
+- Additional training data with recent bot patterns
+- Deep learning approaches for complex bot detection
+- Integration of tweet content analysis (NLP features)
+- Network graph analysis for coordinated bot detection
+- Temporal pattern analysis
+- Support for multilingual accounts
+- Real-time feature extraction pipeline

requirements.txt ADDED

	@@ -0,0 +1,4 @@

+scikit-learn>=1.3.0
+pandas>=2.0.0
+numpy>=1.24.0
+joblib>=1.3.0