Spaces: ASHUT0SH-SiNGH/BotDetection

ASHUT0SH-SiNGH committed 12 days ago

Commit 0fc348c (1 parent: 1e25c66)

Update bot detection model and features

Files changed (9)

BotDetectionEDA.ipynb +0 -0
Dataset/Readme.md +46 -0
Dataset/bot_detection_data.csv +0 -0
Dataset/testCLICK.csv +31 -0
Dataset/training_data.csv +0 -0
app.py +264 -182
bot-detection-model.ipynb +314 -1
bot_detector_model.pkl → bot_model.joblib +2 -2
requirements.txt +3 -7

BotDetectionEDA.ipynb ADDED

The diff for this file is too large to render.

Dataset/Readme.md ADDED

	@@ -0,0 +1,46 @@

+# Bot Detection Dataset 🤖🔍
+Welcome to the Bot Detection Dataset! This dataset is designed to facilitate the analysis and detection of bot accounts on Twitter. It contains a collection of user profiles and associated tweet data, along with a binary label indicating whether each user is a bot or not.
+## Dataset Information 📊
+The dataset is provided in a CSV file format named 'bot_detection_dataset.csv'. It includes the following columns:
+- User ID: Unique identifier for each user in the dataset.
+- Username: The username associated with the user.
+- Tweet: The text content of the tweet.
+- Retweet Count: The number of times the tweet has been retweeted.
+- Mention Count: The number of mentions in the tweet.
+- Follower Count: The number of followers the user has.
+- Verified: A boolean value indicating whether the user is verified or not.
+- Bot Label: A label indicating whether the user is a bot (1) or not (0).
+- Location: The location associated with the user.
+- Created At: The date and time when the tweet was created.
+- Hashtags: The hashtags associated with the tweet.
+## How to Use 📝
+1. Load the dataset: Read the 'bot_detection_dataset.csv' file into your preferred data analysis or machine learning tool/library.
+2. Preprocess the data: Perform any necessary data cleaning, handling missing values, and feature engineering.
+3. Split the data: Divide the dataset into training and testing sets.
+4. Choose a Machine Learning Algorithm: Select one or more algorithms suitable for binary classification, such as Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machines, or Neural Networks.
+5. Train the model: Train the chosen algorithm(s) on the training data.
+6. Evaluate the model: Evaluate the model's performance using appropriate evaluation metrics.
+7. Predict Bot or Not: Apply the trained model to new data to predict whether a user is a bot or not.
+## ML Algorithms for Bot Detection 🧠💡
+Several machine learning algorithms can be applied to predict bot accounts using this dataset. Some commonly used algorithms include:
+- Logistic Regression
+- Random Forest
+- Gradient Boosting (XGBoost, LightGBM)
+- Support Vector Machines (SVM)
+- Neural Networks (MLPs, CNNs)
+Experiment with different algorithms and consider performing hyperparameter tuning to optimize the model's performance.
+Remember to acknowledge the dataset source and provide appropriate citations if you use this dataset for research or analysis.
+Enjoy exploring the Bot Detection Dataset and discovering insights into Twitter bot accounts! 🚀🔍
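
The "How to Use" steps in the README above can be sketched as follows. This is a minimal illustration, not the author's training code: it assumes scikit-learn and pandas are installed, replaces the real CSV with synthetic rows (so the scores carry no meaning), and uses only four numeric columns named in the README. The random seed, row count, and hyperparameters are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Step 1 (stand-in): synthetic rows using four numeric README columns.
# Real use would call pd.read_csv on the dataset file instead.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Retweet Count": rng.integers(0, 100, n),
    "Mention Count": rng.integers(0, 6, n),
    "Follower Count": rng.integers(0, 10_000, n),
    "Verified": rng.integers(0, 2, n).astype(bool),
    "Bot Label": rng.integers(0, 2, n),  # random labels: no real signal
})

# Step 2: minimal preprocessing, cast the boolean flag to 0/1.
X = df.drop(columns="Bot Label").astype({"Verified": int})
y = df["Bot Label"]

# Step 3: hold out 25% for testing, stratified on the label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 4-5: pick and fit one of the listed algorithms.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Step 6: evaluate on the held-out rows.
report = classification_report(y_test, model.predict(X_test))
print(report)

# Step 7: predict for one new (made-up) account.
new_account = pd.DataFrame([{
    "Retweet Count": 3, "Mention Count": 1,
    "Follower Count": 250, "Verified": 0,
}])
prediction = int(model.predict(new_account)[0])
```

With the real dataset, the text columns (Tweet, Hashtags, Location) would need their own encoding before they could be passed to the classifier.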

Dataset/bot_detection_data.csv ADDED

The diff for this file is too large to render.

Dataset/testCLICK.csv ADDED

	@@ -0,0 +1,31 @@

+username,followers_count,friends_count,listed_count,favorites_count,statuses_count,description,location,verified,default_profile,default_profile_image,account_age (days),tweet_content
+0918Bask,10,1000,1,20,21,15years ago X.Lines24,Tokyo .Japan .,0,0,0,20,Exploring the latest in cybersecurity trends! 🔒
+1120Roll,330,485,5,3972,2660,保守見習い地元大好き人間。 経済学、電工、仏教を勉強中、ちなDeではいかんのか？ (*^◯^*),神奈川県横浜市,0,1,0,234,Just finished a deep dive into penetration testing. Exciting stuff!
+14KBBrown,166,177,0,1185,1254,Let me see what your best move is!,,0,0,0,56,Learning Bash scripting for automation. Any tips?
+wadespeters,2248,981,101,60304,202968,20. menna: #farida #nyc and the 80s actually you | Dragana,#freePalestine - rip paul,0,0,0,357,Networking basics are essential for security professionals. Stay sharp!
+191a5bd05da04dc,21,79,0,5,82,Cosmetologist,Wichita KS,0,1,0,343,TryHackMe challenges keep me engaged and learning!
+19_Joanne_87,641,1066,7,1568,12915,CHRISTIAN -Communication degree -graphic designer- makeup artist-pianist- FANGIRL: #Castle #EDBWK2MnNAA #AgentsOfShield #ESDLC #ChasingLife #SavingHope,,0,0,0,45,Reading about blockchain security. The future is decentralized!
+1Dniallprincess,1042,2000,7,19012,13676,"Live, Young, Wild and Free #crazymofo",Alaska XD,0,1,0,34,CTF challenges really test your problem-solving skills. Love it!
+1GisellePizarro,561,118,4,590,61294,Hey what's up guys? This is Giselle. I'm 21. College student and FanFiction writer all in one. :) #Rusher #Maslover and more. KENLOS/4 (07/22 and 11/11),"Antofagasta, Chile",0,0,0,567,Wireshark is a powerful tool for network analysis!
+1Nicoleromany,337,256,4,1407,4854,,,0,1,0,786,Nmap scripting is something I want to master next!
+1_DErika,421,338,5,2227,2408,I am not a perfect angel that you think you see. #Directioner #KatyCat #Mixer,,0,1,0,45,Application security is a critical skill for ethical hackers.
+29PurpleDragons,335,276,4,10570,24581,John 5: 28-29,"Apia, Samoa",0,0,0,67,Metasploit is a great tool for penetration testing!
+2cdevelopment,232,225,56,101,2132,"The 2C Digital Agency is determined to make a business in your city successful. Our only question is, will it be yours?",Midwest USA,0,0,0,30,SQL injection vulnerabilities are more common than you think!
+2hip4tv,1948,2096,88,3,10354,KTVU Photojournalist looking for the scoop.  News is in the eye of the beholder.  I hope you like what you see here on my twitter.,"Bay Area, Ca.",0,0,0,2,Red teaming vs. blue teaming – both sides are fascinating!
+3shaa_,271,216,0,8487,18484,"Jeddy, coffee & cheese",,0,0,0,43,Bug bounty hunting requires patience and skill. Respect to all hunters!
+510Daniel,16,74,0,397,88,Oakland born and raised!! SJSU Graduate #ServingAndProtecting My 510 ....Interesting Random Fact: I enjoy meteorology and science!!,"Bay Area, California",0,0,0,23,"Cybersecurity is a journey, not a destination. Keep learning!"
+davideb66,22,40,0,1,1299,,,0,1,1,90,Exploring the latest in cybersecurity trends! 🔒
+ElisaDospina,12561,3442,110,16358,18665,"Autrice del libro #unavitatuttacurve dal 9 aprile in tutte le librerie.Top model #curvy, su @Raidue tutor di #moda per @dettofattorai2",Italy,0,0,0,34,Just finished a deep dive into penetration testing. Exciting stuff!
+Vladimir65,600,755,6,14,22987,[Live Long and Prosper],"iPhone: 45.471680,9.192429",0,0,0,21,Learning Bash scripting for automation. Any tips?
+RafielaMorales,398,350,2,11,7975,"Cuasi Odontologa*♥,#Bipolar, #Sarcastica & Some might say im a BiTch but I'm just a Free beast in a Wild life.- #1God'sFan, Dreamer & Music Believer~","ÜT: 18.4698712,-69.9327525",0,0,0,25,Networking basics are essential for security professionals. Stay sharp!
+FabrizioC_c,413,405,8,162,20218,"I shall rise from my own death, to avenge hers with all the powers of darkness.",Firenze,0,0,0,896,TryHackMe challenges keep me engaged and learning!
+Marianocrt,134,401,1,55,15259,O scrivi Italia o scrivi libertà. Due termini distanti come la Costituzione formale e quella materiale!,,0,0,0,45,Reading about blockchain security. The future is decentralized!
+marzia_hayley,337,630,1,655,9551,paramore 10/06/13 ♥♥ - Tonight Alive - TVD -TO - OUAT- Revenge - TW - SPN ecc.,roma,0,0,0,14,CTF challenges really test your problem-solving skills. Love it!
+RobertoBoscaini,28,105,0,38,206,"Appassionato di manga, anime, cimema, serie tv, wrestling, sport, Giappone.",Cave (RM),0,0,0,147,Wireshark is a powerful tool for network analysis!
+ilsaggiolibro,2617,52,28,0,93793,,,0,0,0,236,Nmap scripting is something I want to master next!
+RosannaPilano,1561,2001,0,0,490,"Focosa, onesta, sincera. Mai tradire.",Milano,0,0,0,46,Application security is a critical skill for ethical hackers.
+Camillesr78,2355,2074,3,0,450,"I love to travel, go on long walks on a gorgeous day, grill out with friends, read a good book, anything involving the water, see live music or an occasional",San Francisco,0,0,0,81,Metasploit is a great tool for penetration testing!
+Esteryr81,4772,5167,5,0,507,"La mia vita è una festa, ma anche quella di una donna riflessiva. Scegliete la parte che vi piace di più.",Cagliari,0,0,0,29,SQL injection vulnerabilities are more common than you think!
+Moniqueeo84,5772,6022,7,0,513,"Molto socievole, amo la cucina, il vino, il calcio e gli amici.",Emilia Romagna,0,0,0,51,Red teaming vs. blue teaming – both sides are fascinating!
+EsterWalshgm75,124,0,0,0,311,"I've been described as the life of the party as well as a deep thinker. I like to have fun, laugh, be ridiculous, or sit around with a drink and talk about th",San Jose,0,0,0,21,Bug bounty hunting requires patience and skill. Respect to all hunters!
+Adelabx71,4375,4777,5,0,471,L'apparenza non è importante.,Roma,0,0,0,24,"Cybersecurity is a journey, not a destination. Keep learning!"
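
The column names in testCLICK.csv differ from the feature names the updated app.py passes to the model (`listed_count` vs `listedcount`, `favorites_count` vs `favourites_count`, `account_age (days)` vs `account_age_days`). A minimal sketch of the row-level mapping, using only the standard library, might look like the following. The helper `row_to_features` is hypothetical and not part of the commit, and because the CSV has no `has_extended_profile` column, the default of 0 used here is an assumption.

```python
import csv
import io

# Column order the updated app.py expects (copied from MODEL_FEATURES in this commit).
MODEL_FEATURES = [
    "followers_count", "friends_count", "listedcount", "favourites_count",
    "statuses_count", "verified", "default_profile", "default_profile_image",
    "has_extended_profile", "follow_ratio", "account_age_days",
]

# Header plus two rows copied from testCLICK.csv (trailing emoji dropped).
SAMPLE = """username,followers_count,friends_count,listed_count,favorites_count,statuses_count,description,location,verified,default_profile,default_profile_image,account_age (days),tweet_content
0918Bask,10,1000,1,20,21,15years ago X.Lines24,Tokyo .Japan .,0,0,0,20,Exploring the latest in cybersecurity trends!
davideb66,22,40,0,1,1299,,,0,1,1,90,Exploring the latest in cybersecurity trends!
"""


def row_to_features(row, has_extended_profile=0):
    """Map one CSV row (dict of strings) to the 11 model features, in order."""
    followers = int(row["followers_count"])
    friends = int(row["friends_count"])
    values = {
        "followers_count": followers,
        "friends_count": friends,
        "listedcount": int(row["listed_count"]),
        "favourites_count": int(row["favorites_count"]),
        "statuses_count": int(row["statuses_count"]),
        "verified": int(row["verified"]),
        "default_profile": int(row["default_profile"]),
        "default_profile_image": int(row["default_profile_image"]),
        # Not a CSV column; the default of 0 is an assumption.
        "has_extended_profile": int(has_extended_profile),
        # Same +1 smoothing as build_model_features_from_ui in app.py.
        "follow_ratio": followers / (friends + 1),
        "account_age_days": int(row["account_age (days)"]),
    }
    return [values[name] for name in MODEL_FEATURES]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
feature_rows = [row_to_features(r) for r in rows]
```

Each entry in `feature_rows` is a plain list in `MODEL_FEATURES` order, so it can be wrapped in a DataFrame with those column names before calling the model.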

Dataset/training_data.csv ADDED

The diff for this file is too large to render.

app.py CHANGED

@@ -1,6 +1,5 @@
 import streamlit as st
 import pandas as pd
-import pickle
 import re
 import numpy as np
 import plotly.express as px
@@ -8,10 +7,13 @@ import plotly.graph_objects as go
 from datetime import datetime
 import time
 import base64
 def get_default_robot_icon():
     return "https://raw.githubusercontent.com/FortAwesome/Font-Awesome/master/svgs/solid/robot.svg"
 # Set page configuration
 st.set_page_config(
     page_title="Twitter Bot Detector",
@@ -62,40 +64,56 @@ st.markdown("""
     </style>
     """, unsafe_allow_html=True)
 @st.cache_resource
-def load_model(model_path='bot_detector_model.pkl'):
     try:
-        with open(model_path, 'rb') as f:
-            model_components = pickle.load(f)
-        return model_components
     except FileNotFoundError:
-        st.error("Model file not found. Please ensure the model is trained and saved.")
         return None
-def make_prediction(features, tweet_content, model_components):
-    features_scaled = model_components['scaler'].transform(features)
-    behavioral_probs = model_components['behavioral_model'].predict_proba(features_scaled)[0]
-    if tweet_content and tweet_content.strip():
-        tweet_features = model_components['tweet_vectorizer'].transform([tweet_content])
-        tweet_probs = model_components['tweet_model'].predict_proba(tweet_features)[0]
-        final_probs = 0.8 * behavioral_probs + 0.2 * tweet_probs
-    else:
-        final_probs = behavioral_probs
-    prediction = (final_probs[1] > 0.5)
-    confidence = final_probs[1] if prediction else final_probs[0]
-    return prediction, confidence, final_probs
-def create_gauge_chart(confidence, prediction):
     fig = go.Figure(go.Indicator(
-        mode = "gauge+number",
-        value = confidence * 100,
-        domain = {'x': [0, 1], 'y': [0, 1]},
-        title = {'text': "Confidence Score"},
-        gauge = {
             'axis': {'range': [None, 100]},
-            'bar': {'color': "darkred" if prediction else "darkgreen"},
             'steps': [
                 {'range': [0, 33], 'color': 'lightgray'},
                 {'range': [33, 66], 'color': 'gray'},
@@ -111,11 +129,12 @@ def create_gauge_chart(confidence, prediction):
     fig.update_layout(height=300)
     return fig
 def create_probability_chart(probs):
     labels = ['Human', 'Bot']
     fig = go.Figure(data=[go.Pie(
         labels=labels,
-        values=[probs[0]*100, probs[1]*100],
         hole=.3,
         marker_colors=['#00CC96', '#EF553B']
     )])
@@ -125,12 +144,57 @@ def create_probability_chart(probs):
     )
     return fig
 def main():
     # Sidebar with extended navigation
     st.sidebar.image("piclumen-1739279351872.png", width=100)  # Replace with your logo
     st.sidebar.title("Navigation")
     page = st.sidebar.radio("Go to", ["Bot Detection", "CSV Analysis", "About", "Statistics"])
     if page == "Bot Detection":
         st.title("🤖 Twitter Bot Detection System")
         st.markdown("""
@@ -140,202 +204,213 @@ def main():
         Our system uses multiple features and sophisticated algorithms to provide accurate detection results.</p>
         </div>
         """, unsafe_allow_html=True)
-        # Load model components
-        model_components = load_model()
-        if model_components is None:
             st.stop()
         # Create tabs for individual account analysis
         tab1, tab2 = st.tabs(["📝 Input Details", "📊 Analysis Results"])
         with tab1:
             st.markdown("### Account Information")
-            col1, col2, col3 = st.columns([1,1,1])
             with col1:
                 name = st.text_input("Account Name", placeholder="@username")
                 followers_count = st.number_input("Followers Count", min_value=0)
                 friends_count = st.number_input("Friends Count", min_value=0)
                 listed_count = st.number_input("Listed Count", min_value=0)
             with col2:
                 favorites_count = st.number_input("Favorites Count", min_value=0)
                 statuses_count = st.number_input("Statuses Count", min_value=0)
                 account_age = st.number_input("Account Age (days)", min_value=0)
             with col3:
                 description = st.text_area("Profile Description")
                 location = st.text_input("Location")
             st.markdown("### Account Properties")
             prop_col1, prop_col2, prop_col3 = st.columns(3)
             with prop_col1:
                 verified = st.checkbox("Verified Account")
             with prop_col2:
                 default_profile = st.checkbox("Default Profile")
             with prop_col3:
                 default_profile_image = st.checkbox("Default Profile Image")
-            # These can be fixed or computed; here we assume True as default
             has_extended_profile = True
             has_url = True
             st.markdown("### Tweet Content")
-            tweet_content = st.text_area("Sample Tweet", height=100)
             if st.button("🔍 Analyze Account"):
                 with st.spinner('Analyzing account characteristics...'):
-                    # Prepare features for the single account
-                    features = pd.DataFrame([{
-                        'followers_count': followers_count,
-                        'friends_count': friends_count,
-                        'listed_count': listed_count,
-                        'favorites_count': favorites_count,
-                        'statuses_count': statuses_count,
-                        'verified': int(verified),
-                        'followers_friends_ratio': followers_count / (friends_count + 1),
-                        'statuses_per_day': statuses_count / (account_age + 1),
-                        'engagement_ratio': favorites_count / (statuses_count + 1),
-                        'account_age_days': account_age,
-                        'name_length': len(name),
-                        'name_has_digits': int(bool(re.search(r'\d', name))),
-                        'description_length': len(description),
-                        'has_location': int(bool(location.strip())),
-                        'has_url': True,
-                        'default_profile': int(default_profile),
-                        'default_profile_image': int(default_profile_image),
-                        'has_extended_profile': True
-                    }])
-                    # Make prediction
-                    prediction, confidence, probs = make_prediction(features, tweet_content, model_components)
-                    # Switch to results tab
                     time.sleep(1)
                     tab2.markdown("### Analysis Complete!")
                     with tab2:
-                        if prediction:
                             st.error("🤖 Bot Account Detected!")
                         else:
                             st.success("👤 Human Account Detected!")
                         metric_col1, metric_col2 = st.columns(2)
                         with metric_col1:
-                            st.plotly_chart(create_gauge_chart(confidence, prediction), use_container_width=True)
                         with metric_col2:
                             st.plotly_chart(create_probability_chart(probs), use_container_width=True)
                         st.markdown("### Feature Analysis")
-                        feature_importance = pd.DataFrame({
-                            'Feature': model_components['feature_names'],
-                            'Importance': model_components['behavioral_model'].feature_importances_
-                        }).sort_values('Importance', ascending=False)
-                        fig = px.bar(feature_importance,
-                                     x='Importance',
-                                     y='Feature',
-                                     orientation='h',
-                                     title='Feature Importance Analysis')
-                        fig.update_layout(height=400)
-                        st.plotly_chart(fig, use_container_width=True)
                         metrics_data = {
                             'Metric': ['Followers', 'Friends', 'Tweets', 'Favorites'],
                             'Count': [followers_count, friends_count, statuses_count, favorites_count]
                         }
-                        fig = px.bar(metrics_data,
-                                     x='Metric',
-                                     y='Count',
-                                     title='Account Metrics Overview',
-                                     color='Count',
-                                     color_continuous_scale='Viridis')
                         st.plotly_chart(fig, use_container_width=True)
     elif page == "CSV Analysis":
         st.title("CSV Batch Analysis")
-        st.markdown("Upload a CSV file with account data to run batch predictions.")
         uploaded_file = st.file_uploader("Upload CSV", type=["csv"])
         if uploaded_file is not None:
             data = pd.read_csv(uploaded_file)
             st.markdown("### CSV Data Preview")
             st.dataframe(data.head())
-            model_components = load_model()
-            if model_components is None:
                 st.stop()
-            # Get the feature names in the correct order from the scaler
-            feature_names = model_components['scaler'].feature_names_in_
             predictions = []
             confidences = []
-            prediction_labels = []  # New list to store emoji labels
             with st.spinner("Processing accounts..."):
                 for idx, row in data.iterrows():
-                    # Create a dictionary with all features initialized to 0
-                    feature_dict = {
-                        'followers_count': row['followers_count'],
-                        'friends_count': row['friends_count'],
-                        'listed_count': row['listed_count'],
-                        'favorites_count': row['favorites_count'],
-                        'statuses_count': row['statuses_count'],
-                        'verified': int(row['verified']),
-                        'followers_friends_ratio': row['followers_count'] / (row['friends_count'] + 1),
-                        'statuses_per_day': row['statuses_count'] / (row['account_age (days)'] + 1),
-                        'engagement_ratio': row['favorites_count'] / (row['statuses_count'] + 1),
-                        'account_age_days': row['account_age (days)'],
-                        'name_length': len(row['username']),
-                        'name_has_digits': int(bool(re.search(r'\d', row['username']))),
-                        'description_length': len(str(row['description'])),
-                        'has_location': int(bool(str(row['location']).strip())),
-                        'default_profile': int(row['default_profile']),
-                        'default_profile_image': int(row['default_profile_image']),
-                        'has_url': 0,
-                        'has_extended_profile': 0
-                    }
-                    # Create DataFrame with features in the correct order
-                    features = pd.DataFrame([{name: feature_dict.get(name, 0) for name in feature_names}])
-                    tweet_text = row['tweet_content'] if 'tweet_content' in row else ""
-                    pred, conf, _ = make_prediction(features, tweet_text, model_components)
-                    predictions.append(pred)
                     confidences.append(conf)
-                    # Add emoji based on prediction
-                    prediction_labels.append('🤖' if pred == 1 else '👤')
             data['prediction'] = predictions
             data['confidence'] = confidences
-            data['account_type'] = prediction_labels  # Add new column with emojis
             st.markdown("### Batch Prediction Results")
-            # Reorder columns to show the prediction and emoji first
-            cols = ['username', 'account_type', 'prediction', 'confidence'] + [col for col in data.columns if col not in ['username', 'account_type', 'prediction', 'confidence']]
             st.dataframe(data[cols])
-            # If ground truth labels are provided, compute evaluation metrics
             if 'label' in data.columns:
                 y_true = data['label'].tolist()
                 y_pred = [int(p) for p in predictions]
                 from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
                 f1 = f1_score(y_true, y_pred, average='weighted')
                 precision = precision_score(y_true, y_pred, average='weighted')
                 recall = recall_score(y_true, y_pred, average='weighted')
                 report = classification_report(y_true, y_pred)
                 st.markdown("### Evaluation Metrics")
                 st.write("F1 Score:", f1)
                 st.write("Precision:", precision)
                 st.write("Recall:", recall)
                 st.text(report)
     elif page == "About":
         st.title("About the Bot Detection System")
         st.markdown("""
@@ -348,7 +423,7 @@ def main():
         """, unsafe_allow_html=True)
         st.markdown("### 🔑 Key Features Analyzed")
         col1, col2 = st.columns(2)
         with col1:
             st.markdown("""
             #### Account Characteristics
@@ -356,7 +431,7 @@ def main():
             - Account age and verification status
             - Username patterns
             - Profile description analysis
             #### Behavioral Patterns
             - Posting frequency
             - Engagement rates
@@ -369,14 +444,14 @@ def main():
             - Follower-following ratio
             - Friend acquisition rate
             - Network growth patterns
             #### Content Analysis
             - Tweet sentiment
             - Language patterns
             - URL sharing frequency
             - Hashtag usage
             """)
         st.markdown("""
         <div class='info-box'>
         <h3>⚙ Technical Implementation</h3>
@@ -388,10 +463,10 @@ def main():
         </ul>
         </div>
         """, unsafe_allow_html=True)
         st.markdown("### 📊 System Performance")
         metrics_col1, metrics_col2, metrics_col3, metrics_col4 = st.columns(4)
         with metrics_col1:
             st.metric("Accuracy", "87%")
         with metrics_col2:
@@ -400,7 +475,7 @@ def main():
             st.metric("Recall", "83%")
         with metrics_col4:
             st.metric("F1 Score", "86%")
         st.markdown("""
         ### 🎯 Common Use Cases
         - *Social Media Management*: Identify and remove bot accounts
@@ -408,52 +483,58 @@ def main():
         - *Marketing*: Verify authentic engagement
         - *Security*: Protect against automated threats
         """)
     else:  # Statistics page
         st.title("System Statistics")
         col1, col2 = st.columns(2)
         with col1:
             detection_data = {
                 'Category': ['Bots', 'Humans'],
                 'Count': [737, 826]
             }
-            fig = px.pie(detection_data,
-                         values='Count',
-                         names='Category',
-                         title='Detection Distribution',
-                         color_discrete_sequence=['#FF4B4B', '#00CC96'])
             st.plotly_chart(fig, use_container_width=True)
         with col2:
             confidence_data = {
-                'Score': ['90-100%', '80-90%', '70-80%', '60-70%', '50-60%'],
-                'Count': [178, 447, 503, 352, 83]  # Total = 1563
             }
-            fig = px.bar(confidence_data,
-                         x='Score',
-                         y='Count',
-                         title='Confidence Score Distribution',
-                         color='Count',
-                         color_continuous_scale='Viridis')
             st.plotly_chart(fig, use_container_width=True)
         st.markdown("### Monthly Detection Trends")
         monthly_data = {
             'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
             'Bots Detected': [45, 52, 38, 65, 48, 76],
             'Accuracy': [92, 94, 93, 95, 94, 96]
         }
-        fig = px.line(monthly_data,
-                      x='Month',
-                      y=['Bots Detected', 'Accuracy'],
-                      title='Monthly Performance Metrics',
-                      markers=True)
         st.plotly_chart(fig, use_container_width=True)
         st.markdown("### Key System Metrics")
         metric_col1, metric_col2, metric_col3, metric_col4 = st.columns(4)
         with metric_col1:
             st.metric("Total Analyses", "1,000", "+12%")
         with metric_col2:
@@ -463,5 +544,6 @@ def main():
         with metric_col4:
             st.metric("Processing Time", "1.2s", "-0.3s")
 if __name__ == "__main__":
-    main()
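
One change in this diff is the decision rule in `make_prediction`: the removed version flags a bot when `final_probs[1] > 0.5`, while the added version takes `np.argmax(probs)`. Setting aside the tweet-text blending that the old version also performed, the two rules should agree for a two-class model, including the 0.5 tie, which both resolve to "human". The sketch below checks this with a hypothetical `StubModel` in place of the real classifier. It assumes the probability columns are ordered [human, bot], as the comment in the new code states.

```python
import numpy as np


class StubModel:
    """Hypothetical stand-in for the loaded classifier; returns fixed probabilities."""

    def __init__(self, p_bot):
        self.p_bot = p_bot

    def predict_proba(self, X):
        # One row per input, columns ordered [P(human), P(bot)].
        return np.array([[1.0 - self.p_bot, self.p_bot] for _ in X])


def predict_threshold(probs):
    """Old rule: bot only if P(bot) is strictly above 0.5."""
    is_bot = bool(probs[1] > 0.5)
    return int(is_bot), float(probs[1] if is_bot else probs[0])


def predict_argmax(probs):
    """New rule: take the most probable class (ties go to index 0, human)."""
    pred = int(np.argmax(probs))
    return pred, float(probs[pred])


results = []
for p_bot in (0.2, 0.5, 0.7):
    probs = StubModel(p_bot).predict_proba([[0]])[0]
    results.append((predict_threshold(probs), predict_argmax(probs)))
```

Each pair in `results` holds the (class, confidence) tuple from the old rule and from the new rule for the same probabilities.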

 import streamlit as st
 import pandas as pd
 import re
 import numpy as np
 import plotly.express as px
 from datetime import datetime
 import time
 import base64
+import joblib
 def get_default_robot_icon():
     return "https://raw.githubusercontent.com/FortAwesome/Font-Awesome/master/svgs/solid/robot.svg"
 # Set page configuration
 st.set_page_config(
     page_title="Twitter Bot Detector",
     </style>
     """, unsafe_allow_html=True)
+# ✅ Model was trained with these 11 features (confirmed by you)
+MODEL_FEATURES = [
+    "followers_count",
+    "friends_count",
+    "listedcount",
+    "favourites_count",
+    "statuses_count",
+    "verified",
+    "default_profile",
+    "default_profile_image",
+    "has_extended_profile",
+    "follow_ratio",
+    "account_age_days",
+]
 @st.cache_resource
+def load_model(model_path="bot_model.joblib"):
     try:
+        model = joblib.load(model_path)
+        return model
     except FileNotFoundError:
+        st.error("Model file not found. Please ensure 'bot_model.joblib' exists in the project folder.")
         return None
+    except Exception as e:
+        st.error(f"Failed to load model: {e}")
+        return None
+def make_prediction(features_df, model):
+    """
+    Behavioral-only RandomForest prediction.
+    features_df MUST have the same columns used in training.
+    """
+    probs = model.predict_proba(features_df)[0]
+    pred_class = int(np.argmax(probs))  # 0 = Human, 1 = Bot
+    confidence = float(probs[pred_class])
+    return pred_class, confidence, probs
+def create_gauge_chart(confidence, prediction_is_bot):
     fig = go.Figure(go.Indicator(
+        mode="gauge+number",
+        value=confidence * 100,
+        domain={'x': [0, 1], 'y': [0, 1]},
+        title={'text': "Confidence Score"},
+        gauge={
             'axis': {'range': [None, 100]},
+            'bar': {'color': "darkred" if prediction_is_bot else "darkgreen"},
             'steps': [
                 {'range': [0, 33], 'color': 'lightgray'},
                 {'range': [33, 66], 'color': 'gray'},
     fig.update_layout(height=300)
     return fig
 def create_probability_chart(probs):
     labels = ['Human', 'Bot']
     fig = go.Figure(data=[go.Pie(
         labels=labels,
+        values=[probs[0] * 100, probs[1] * 100],
         hole=.3,
         marker_colors=['#00CC96', '#EF553B']
     )])
     )
     return fig
+def build_model_features_from_ui(
+    followers_count: int,
+    friends_count: int,
+    listed_count: int,
+    favorites_count: int,
+    statuses_count: int,
+    verified: bool,
+    default_profile: bool,
+    default_profile_image: bool,
+    has_extended_profile: bool,
+    account_age_days: int
+) -> pd.DataFrame:
+    """
+    Converts UI inputs to the EXACT schema expected by the trained RF model.
+    UI stays same, only feature mapping changes.
+    Mapping:
+    listed_count -> listedcount
+    favorites_count -> favourites_count
+    followers_friends_ratio -> follow_ratio
+    account_age -> account_age_days
+    """
+    follow_ratio = followers_count / (friends_count + 1)
+    features = pd.DataFrame([{
+        "followers_count": followers_count,
+        "friends_count": friends_count,
+        "listedcount": listed_count,
+        "favourites_count": favorites_count,
+        "statuses_count": statuses_count,
+        "verified": int(verified),
+        "default_profile": int(default_profile),
+        "default_profile_image": int(default_profile_image),
+        "has_extended_profile": int(has_extended_profile),
+        "follow_ratio": follow_ratio,
+        "account_age_days": account_age_days,
+    }])
+    # enforce correct order
+    features = features[MODEL_FEATURES]
+    return features
 def main():
     # Sidebar with extended navigation
     st.sidebar.image("piclumen-1739279351872.png", width=100)  # Replace with your logo
     st.sidebar.title("Navigation")
     page = st.sidebar.radio("Go to", ["Bot Detection", "CSV Analysis", "About", "Statistics"])
     if page == "Bot Detection":
         st.title("🤖 Twitter Bot Detection System")
         st.markdown("""
         Our system uses multiple features and sophisticated algorithms to provide accurate detection results.</p>
         </div>
         """, unsafe_allow_html=True)
+        # Load model
+        model = load_model()
+        if model is None:
             st.stop()
         # Create tabs for individual account analysis
         tab1, tab2 = st.tabs(["📝 Input Details", "📊 Analysis Results"])
         with tab1:
             st.markdown("### Account Information")
+            col1, col2, col3 = st.columns([1, 1, 1])
             with col1:
                 name = st.text_input("Account Name", placeholder="@username")
                 followers_count = st.number_input("Followers Count", min_value=0)
                 friends_count = st.number_input("Friends Count", min_value=0)
                 listed_count = st.number_input("Listed Count", min_value=0)
             with col2:
                 favorites_count = st.number_input("Favorites Count", min_value=0)
                 statuses_count = st.number_input("Statuses Count", min_value=0)
                 account_age = st.number_input("Account Age (days)", min_value=0)
             with col3:
                 description = st.text_area("Profile Description")
                 location = st.text_input("Location")
             st.markdown("### Account Properties")
             prop_col1, prop_col2, prop_col3 = st.columns(3)
             with prop_col1:
                 verified = st.checkbox("Verified Account")
             with prop_col2:
                 default_profile = st.checkbox("Default Profile")
             with prop_col3:
                 default_profile_image = st.checkbox("Default Profile Image")
+            # kept same UI logic
             has_extended_profile = True
             has_url = True
             st.markdown("### Tweet Content")
+            tweet_content = st.text_area("Sample Tweet", height=100)  # UI stays, ignored in logic
             if st.button("🔍 Analyze Account"):
                 with st.spinner('Analyzing account characteristics...'):
+                    # ✅ Build ONLY the exact 11 features your RF expects
+                    features = build_model_features_from_ui(
+                        followers_count=followers_count,
+                        friends_count=friends_count,
+                        listed_count=listed_count,
+                        favorites_count=favorites_count,
+                        statuses_count=statuses_count,
+                        verified=verified,
+                        default_profile=default_profile,
+                        default_profile_image=default_profile_image,
+                        has_extended_profile=has_extended_profile,
+                        account_age_days=account_age
+                    )
+                    # ✅ Predict
+                    pred_class, confidence, probs = make_prediction(features, model)
+                    prediction_is_bot = (pred_class == 1)
                     time.sleep(1)
                     tab2.markdown("### Analysis Complete!")
                     with tab2:
+                        if prediction_is_bot:
                             st.error("🤖 Bot Account Detected!")
                         else:
                             st.success("👤 Human Account Detected!")
                         metric_col1, metric_col2 = st.columns(2)
                         with metric_col1:
+                            st.plotly_chart(create_gauge_chart(confidence, prediction_is_bot), use_container_width=True)
                         with metric_col2:
                             st.plotly_chart(create_probability_chart(probs), use_container_width=True)
                         st.markdown("### Feature Analysis")
+                        # Feature importance (RF supports this)
+                        if hasattr(model, "feature_importances_"):
+                            feature_importance = pd.DataFrame({
+                                'Feature': MODEL_FEATURES,
+                                'Importance': model.feature_importances_
+                            }).sort_values('Importance', ascending=False)
+                            fig = px.bar(
+                                feature_importance,
+                                x='Importance',
+                                y='Feature',
+                                orientation='h',
+                                title='Feature Importance Analysis'
+                            )
+                            fig.update_layout(height=400)
+                            st.plotly_chart(fig, use_container_width=True)
+                        else:
+                            st.info("Feature importance is not available for this model type.")
                         metrics_data = {
                             'Metric': ['Followers', 'Friends', 'Tweets', 'Favorites'],
                             'Count': [followers_count, friends_count, statuses_count, favorites_count]
                         }
+                        fig = px.bar(
+                            metrics_data,
+                            x='Metric',
+                            y='Count',
+                            title='Account Metrics Overview',
+                            color='Count',
+                            color_continuous_scale='Viridis'
+                        )
                         st.plotly_chart(fig, use_container_width=True)
     elif page == "CSV Analysis":
         st.title("CSV Batch Analysis")
+        st.markdown("Upload a CSV file with account data to run batch predictions. You can use Dataset/testCLICK.csv from this repository.")
         uploaded_file = st.file_uploader("Upload CSV", type=["csv"])
         if uploaded_file is not None:
             data = pd.read_csv(uploaded_file)
             st.markdown("### CSV Data Preview")
             st.dataframe(data.head())
+            model = load_model()
+            if model is None:
                 st.stop()
             predictions = []
             confidences = []
+            prediction_labels = []
             with st.spinner("Processing accounts..."):
                 for idx, row in data.iterrows():
+                    # flexible column names support
+                    followers = row.get("followers_count", 0)
+                    friends = row.get("friends_count", 0)
+                    statuses = row.get("statuses_count", 0)
+                    # allow either listedcount or listed_count
+                    listed = row.get("listedcount", row.get("listed_count", 0))
+                    # allow either favourites_count or favorites_count
+                    favourites = row.get("favourites_count", row.get("favorites_count", 0))
+                    verified = int(row.get("verified", 0))
+                    default_profile = int(row.get("default_profile", 0))
+                    default_profile_image = int(row.get("default_profile_image", 0))
+                    has_extended_profile = int(row.get("has_extended_profile", 0))
+                    # allow account_age_days or "account_age (days)"
+                    age_days = row.get("account_age_days", row.get("account_age (days)", 0))
+                    # compute follow_ratio if not present
+                    follow_ratio = row.get("follow_ratio", followers / (friends + 1))
+                    features = pd.DataFrame([{
+                        "followers_count": followers,
+                        "friends_count": friends,
+                        "listedcount": listed,
+                        "favourites_count": favourites,
+                        "statuses_count": statuses,
+                        "verified": verified,
+                        "default_profile": default_profile,
+                        "default_profile_image": default_profile_image,
+                        "has_extended_profile": has_extended_profile,
+                        "follow_ratio": follow_ratio,
+                        "account_age_days": age_days,
+                    }])[MODEL_FEATURES]
+                    pred_class, conf, _ = make_prediction(features, model)
+                    predictions.append(pred_class)
                     confidences.append(conf)
+                    prediction_labels.append('🤖' if pred_class == 1 else '👤')
             data['prediction'] = predictions
             data['confidence'] = confidences
+            data['account_type'] = prediction_labels
             st.markdown("### Batch Prediction Results")
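+            # Put the summary columns first, skipping any (such as 'username') that the uploaded CSV lacks
+            lead_cols = [c for c in ['username', 'account_type', 'prediction', 'confidence'] if c in data.columns]
+            cols = lead_cols + [col for col in data.columns if col not in lead_cols]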
             st.dataframe(data[cols])
+            # Optional evaluation if labels exist
             if 'label' in data.columns:
                 y_true = data['label'].tolist()
                 y_pred = [int(p) for p in predictions]
                 from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
                 f1 = f1_score(y_true, y_pred, average='weighted')
                 precision = precision_score(y_true, y_pred, average='weighted')
                 recall = recall_score(y_true, y_pred, average='weighted')
                 report = classification_report(y_true, y_pred)
                 st.markdown("### Evaluation Metrics")
                 st.write("F1 Score:", f1)
                 st.write("Precision:", precision)
                 st.write("Recall:", recall)
                 st.text(report)
     elif page == "About":
         st.title("About the Bot Detection System")
         st.markdown("""
         """, unsafe_allow_html=True)
         st.markdown("### 🔑 Key Features Analyzed")
         col1, col2 = st.columns(2)
         with col1:
             st.markdown("""
             #### Account Characteristics
             - Account age and verification status
             - Username patterns
             - Profile description analysis
             #### Behavioral Patterns
             - Posting frequency
             - Engagement rates
             - Follower-following ratio
             - Friend acquisition rate
             - Network growth patterns
             #### Content Analysis
             - Tweet sentiment
             - Language patterns
             - URL sharing frequency
             - Hashtag usage
             """)
         st.markdown("""
         <div class='info-box'>
         <h3>⚙ Technical Implementation</h3>
         </ul>
         </div>
         """, unsafe_allow_html=True)
         st.markdown("### 📊 System Performance")
         metrics_col1, metrics_col2, metrics_col3, metrics_col4 = st.columns(4)
         with metrics_col1:
             st.metric("Accuracy", "87%")
         with metrics_col2:
             st.metric("Recall", "83%")
         with metrics_col4:
             st.metric("F1 Score", "86%")
         st.markdown("""
         ### 🎯 Common Use Cases
         - *Social Media Management*: Identify and remove bot accounts
         - *Marketing*: Verify authentic engagement
         - *Security*: Protect against automated threats
         """)
     else:  # Statistics page
         st.title("System Statistics")
         col1, col2 = st.columns(2)
         with col1:
             detection_data = {
                 'Category': ['Bots', 'Humans'],
                 'Count': [737, 826]
             }
+            fig = px.pie(
+                detection_data,
+                values='Count',
+                names='Category',
+                title='Detection Distribution',
+                color_discrete_sequence=['#FF4B4B', '#00CC96']
+            )
             st.plotly_chart(fig, use_container_width=True)
         with col2:
             confidence_data = {
+                'Score': ['90-100%', '80-90%', '70-80%', '60-70%', '50-60%'],
+                'Count': [178, 447, 503, 352, 83]
             }
+            fig = px.bar(
+                confidence_data,
+                x='Score',
+                y='Count',
+                title='Confidence Score Distribution',
+                color='Count',
+                color_continuous_scale='Viridis'
+            )
             st.plotly_chart(fig, use_container_width=True)
         st.markdown("### Monthly Detection Trends")
         monthly_data = {
             'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
             'Bots Detected': [45, 52, 38, 65, 48, 76],
             'Accuracy': [92, 94, 93, 95, 94, 96]
         }
+        fig = px.line(
+            monthly_data,
+            x='Month',
+            y=['Bots Detected', 'Accuracy'],
+            title='Monthly Performance Metrics',
+            markers=True
+        )
         st.plotly_chart(fig, use_container_width=True)
         st.markdown("### Key System Metrics")
         metric_col1, metric_col2, metric_col3, metric_col4 = st.columns(4)
         with metric_col1:
             st.metric("Total Analyses", "1,000", "+12%")
         with metric_col2:
         with metric_col4:
             st.metric("Processing Time", "1.2s", "-0.3s")
 if __name__ == "__main__":
+    main()
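
The CSV batch path in app.py resolves several column aliases per row and derives `follow_ratio` when the column is absent. A minimal stdlib-only sketch of that normalization follows. `FEATURE_ORDER`, `ALIASES`, and `row_to_features` are hypothetical names, the feature order is assumed to match the dictionary built in the CSV loop, and mapping missing or NaN cells to 0 is an assumption rather than confirmed app behavior.

```python
import math

# Assumed to mirror the 11-column order built in the CSV loop of app.py.
FEATURE_ORDER = [
    "followers_count", "friends_count", "listedcount", "favourites_count",
    "statuses_count", "verified", "default_profile", "default_profile_image",
    "has_extended_profile", "follow_ratio", "account_age_days",
]

# Canonical feature name -> accepted CSV spellings; the first one present wins.
ALIASES = {
    "listedcount": ("listedcount", "listed_count"),
    "favourites_count": ("favourites_count", "favorites_count"),
    "account_age_days": ("account_age_days", "account_age (days)"),
}

BOOL_FEATURES = {
    "verified", "default_profile", "default_profile_image", "has_extended_profile",
}


def _is_missing(value):
    """True for None and for float NaN (how pandas represents empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def _clean(value):
    """Convert a raw cell to float; missing or unparsable cells become 0.0."""
    if _is_missing(value):
        return 0.0
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "yes"):
            return 1.0
        try:
            return float(text)
        except ValueError:
            return 0.0
    return float(value)


def row_to_features(row):
    """Return one account's feature values as a list ordered like FEATURE_ORDER."""
    out = {}
    for name in FEATURE_ORDER:
        if name == "follow_ratio":
            continue  # derived below, once both counts are known
        raw = next((row[k] for k in ALIASES.get(name, (name,)) if k in row), None)
        value = _clean(raw)
        out[name] = (1.0 if value else 0.0) if name in BOOL_FEATURES else value
    ratio = row.get("follow_ratio")
    if _is_missing(ratio):
        # Same formula as the notebook: followers / (friends + 1)
        ratio = out["followers_count"] / (out["friends_count"] + 1)
    out["follow_ratio"] = _clean(ratio)
    return [out[name] for name in FEATURE_ORDER]


example = {"followers_count": 100, "friends_count": 49, "listed_count": 3, "verified": True}
print(dict(zip(FEATURE_ORDER, row_to_features(example))))
```

A helper like this takes any mapping, so it could be fed `row.to_dict()` from the `iterrows()` loop. Whether zero is the right fill value for a missing count depends on how the training data was prepared.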

bot-detection-model.ipynb CHANGED

	@@ -1 +1,314 @@
1	- {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.12.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":14497523,"sourceType":"datasetVersion","datasetId":9259817}],"dockerImageVersionId":31234,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import pandas as pd\nimport numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.metrics import accuracy_score, classification_report","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:42:30.467530Z","iopub.execute_input":"2026-01-16T03:42:30.469065Z","iopub.status.idle":"2026-01-16T03:42:30.474262Z","shell.execute_reply.started":"2026-01-16T03:42:30.468918Z","shell.execute_reply":"2026-01-16T03:42:30.473090Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"# DATA_PATH = \"/kaggle/input/bot-detection-data/bot_detection_data.csv\"\nDATA_PATH = \"/kaggle/input/bot-detection-data/training_data.csv\"\n\ndf = 
pd.read_csv(DATA_PATH)\nprint(df.shape)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:42:44.598005Z","iopub.execute_input":"2026-01-16T03:42:44.598336Z","iopub.status.idle":"2026-01-16T03:42:44.666341Z","shell.execute_reply.started":"2026-01-16T03:42:44.598308Z","shell.execute_reply":"2026-01-16T03:42:44.665147Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"df.head()","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:42:50.039522Z","iopub.execute_input":"2026-01-16T03:42:50.039918Z","iopub.status.idle":"2026-01-16T03:42:50.059844Z","shell.execute_reply.started":"2026-01-16T03:42:50.039876Z","shell.execute_reply":"2026-01-16T03:42:50.058651Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"FEATURES = [\n \"followers_count\",\n \"friends_count\",\n \"listedcount\",\n \"favourites_count\",\n \"statuses_count\",\n \"verified\",\n \"default_profile\",\n \"default_profile_image\",\n \"has_extended_profile\"\n]\n\nX = df[FEATURES].fillna(0)\ny = df[\"bot\"]","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:43:06.916688Z","iopub.execute_input":"2026-01-16T03:43:06.917403Z","iopub.status.idle":"2026-01-16T03:43:06.924961Z","shell.execute_reply.started":"2026-01-16T03:43:06.917366Z","shell.execute_reply":"2026-01-16T03:43:06.924063Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"bool_cols = [\n \"verified\",\n \"default_profile\",\n \"default_profile_image\",\n \"has_extended_profile\"\n]\n\nfor col in bool_cols:\n X[col] = X[col].astype(int)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:43:16.182880Z","iopub.execute_input":"2026-01-16T03:43:16.183239Z","iopub.status.idle":"2026-01-16T03:43:16.189999Z","shell.execute_reply.started":"2026-01-16T03:43:16.183210Z","shell.execute_reply":"2026-01-16T03:43:16.188760Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"X[\"follow_ratio\"] 
= X[\"followers_count\"] / (X[\"friends_count\"] + 1)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:43:52.115333Z","iopub.execute_input":"2026-01-16T03:43:52.115697Z","iopub.status.idle":"2026-01-16T03:43:52.121777Z","shell.execute_reply.started":"2026-01-16T03:43:52.115666Z","shell.execute_reply":"2026-01-16T03:43:52.120660Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"df[\"created_at\"] = pd.to_datetime(df[\"created_at\"], errors=\"coerce\")\n\nX[\"account_age_days\"] = (\n pd.Timestamp.now() - df[\"created_at\"]\n).dt.days.fillna(0)\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:57.764874Z","iopub.execute_input":"2026-01-16T03:38:57.765197Z","iopub.status.idle":"2026-01-16T03:38:57.794042Z","shell.execute_reply.started":"2026-01-16T03:38:57.765161Z","shell.execute_reply":"2026-01-16T03:38:57.793068Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(\n X,\n y,\n test_size=0.2,\n random_state=42\n)\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:57.795084Z","iopub.execute_input":"2026-01-16T03:38:57.795374Z","iopub.status.idle":"2026-01-16T03:38:57.817354Z","shell.execute_reply.started":"2026-01-16T03:38:57.795348Z","shell.execute_reply":"2026-01-16T03:38:57.816386Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"from sklearn.ensemble import RandomForestClassifier\n\nrf = RandomForestClassifier(\n n_estimators=300,\n max_depth=20,\n min_samples_leaf=2,\n class_weight=\"balanced\",\n random_state=42,\n n_jobs=-1\n)\n\nrf.fit(X_train, 
y_train)\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:57.818519Z","iopub.execute_input":"2026-01-16T03:38:57.818883Z","iopub.status.idle":"2026-01-16T03:38:59.208010Z","shell.execute_reply.started":"2026-01-16T03:38:57.818853Z","shell.execute_reply":"2026-01-16T03:38:59.207044Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"preds = rf.predict(X_test)\n\nprint(\"Accuracy:\", accuracy_score(y_test, preds))\nprint(classification_report(y_test, preds))\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:59.209455Z","iopub.execute_input":"2026-01-16T03:38:59.210120Z","iopub.status.idle":"2026-01-16T03:38:59.361078Z","shell.execute_reply.started":"2026-01-16T03:38:59.210087Z","shell.execute_reply":"2026-01-16T03:38:59.360209Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"imp = pd.DataFrame({\n \"feature\": X.columns,\n \"importance\": rf.feature_importances_\n}).sort_values(by=\"importance\", ascending=False)\n\nprint(imp)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:59.363334Z","iopub.execute_input":"2026-01-16T03:38:59.363663Z","iopub.status.idle":"2026-01-16T03:38:59.445148Z","shell.execute_reply.started":"2026-01-16T03:38:59.363633Z","shell.execute_reply":"2026-01-16T03:38:59.444321Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"","metadata":{"trusted":true},"outputs":[],"execution_count":null}]}

+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:42:30.469065Z",
+     "iopub.status.busy": "2026-01-16T03:42:30.467530Z",
+     "iopub.status.idle": "2026-01-16T03:42:30.474262Z",
+     "shell.execute_reply": "2026-01-16T03:42:30.473090Z",
+     "shell.execute_reply.started": "2026-01-16T03:42:30.468918Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "from sklearn.metrics import accuracy_score, classification_report"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:42:44.598336Z",
+     "iopub.status.busy": "2026-01-16T03:42:44.598005Z",
+     "iopub.status.idle": "2026-01-16T03:42:44.666341Z",
+     "shell.execute_reply": "2026-01-16T03:42:44.665147Z",
+     "shell.execute_reply.started": "2026-01-16T03:42:44.598308Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "# DATA_PATH = \"/kaggle/input/bot-detection-data/bot_detection_data.csv\"\n",
+    "DATA_PATH = \"/kaggle/input/bot-detection-data/training_data.csv\"\n",
+    "\n",
+    "df = pd.read_csv(DATA_PATH)\n",
+    "print(df.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:42:50.039918Z",
+     "iopub.status.busy": "2026-01-16T03:42:50.039522Z",
+     "iopub.status.idle": "2026-01-16T03:42:50.059844Z",
+     "shell.execute_reply": "2026-01-16T03:42:50.058651Z",
+     "shell.execute_reply.started": "2026-01-16T03:42:50.039876Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:43:06.917403Z",
+     "iopub.status.busy": "2026-01-16T03:43:06.916688Z",
+     "iopub.status.idle": "2026-01-16T03:43:06.924961Z",
+     "shell.execute_reply": "2026-01-16T03:43:06.924063Z",
+     "shell.execute_reply.started": "2026-01-16T03:43:06.917366Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "FEATURES = [\n",
+    "    \"followers_count\",\n",
+    "    \"friends_count\",\n",
+    "    \"listedcount\",\n",
+    "    \"favourites_count\",\n",
+    "    \"statuses_count\",\n",
+    "    \"verified\",\n",
+    "    \"default_profile\",\n",
+    "    \"default_profile_image\",\n",
+    "    \"has_extended_profile\"\n",
+    "]\n",
+    "\n",
+    "X = df[FEATURES].fillna(0)\n",
+    "y = df[\"bot\"]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:43:16.183239Z",
+     "iopub.status.busy": "2026-01-16T03:43:16.182880Z",
+     "iopub.status.idle": "2026-01-16T03:43:16.189999Z",
+     "shell.execute_reply": "2026-01-16T03:43:16.188760Z",
+     "shell.execute_reply.started": "2026-01-16T03:43:16.183210Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "bool_cols = [\n",
+    "    \"verified\",\n",
+    "    \"default_profile\",\n",
+    "    \"default_profile_image\",\n",
+    "    \"has_extended_profile\"\n",
+    "]\n",
+    "\n",
+    "for col in bool_cols:\n",
+    "    X[col] = X[col].astype(int)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:43:52.115697Z",
+     "iopub.status.busy": "2026-01-16T03:43:52.115333Z",
+     "iopub.status.idle": "2026-01-16T03:43:52.121777Z",
+     "shell.execute_reply": "2026-01-16T03:43:52.120660Z",
+     "shell.execute_reply.started": "2026-01-16T03:43:52.115666Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "X[\"follow_ratio\"] = X[\"followers_count\"] / (X[\"friends_count\"] + 1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:38:57.765197Z",
+     "iopub.status.busy": "2026-01-16T03:38:57.764874Z",
+     "iopub.status.idle": "2026-01-16T03:38:57.794042Z",
+     "shell.execute_reply": "2026-01-16T03:38:57.793068Z",
+     "shell.execute_reply.started": "2026-01-16T03:38:57.765161Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "df[\"created_at\"] = pd.to_datetime(df[\"created_at\"], errors=\"coerce\")\n",
+    "\n",
+    "X[\"account_age_days\"] = (\n",
+    "    pd.Timestamp.now() - df[\"created_at\"]\n",
+    ").dt.days.fillna(0)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:38:57.795374Z",
+     "iopub.status.busy": "2026-01-16T03:38:57.795084Z",
+     "iopub.status.idle": "2026-01-16T03:38:57.817354Z",
+     "shell.execute_reply": "2026-01-16T03:38:57.816386Z",
+     "shell.execute_reply.started": "2026-01-16T03:38:57.795348Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "X_train, X_test, y_train, y_test = train_test_split(\n",
+    "    X,\n",
+    "    y,\n",
+    "    test_size=0.2,\n",
+    "    random_state=42\n",
+    ")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:38:57.818883Z",
+     "iopub.status.busy": "2026-01-16T03:38:57.818519Z",
+     "iopub.status.idle": "2026-01-16T03:38:59.208010Z",
+     "shell.execute_reply": "2026-01-16T03:38:59.207044Z",
+     "shell.execute_reply.started": "2026-01-16T03:38:57.818853Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.ensemble import RandomForestClassifier\n",
+    "\n",
+    "rf = RandomForestClassifier(\n",
+    "    n_estimators=300,\n",
+    "    max_depth=20,\n",
+    "    min_samples_leaf=2,\n",
+    "    class_weight=\"balanced\",\n",
+    "    random_state=42,\n",
+    "    n_jobs=-1\n",
+    ")\n",
+    "\n",
+    "rf.fit(X_train, y_train)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:38:59.210120Z",
+     "iopub.status.busy": "2026-01-16T03:38:59.209455Z",
+     "iopub.status.idle": "2026-01-16T03:38:59.361078Z",
+     "shell.execute_reply": "2026-01-16T03:38:59.360209Z",
+     "shell.execute_reply.started": "2026-01-16T03:38:59.210087Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "preds = rf.predict(X_test)\n",
+    "\n",
+    "print(\"Accuracy:\", accuracy_score(y_test, preds))\n",
+    "print(classification_report(y_test, preds))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2026-01-16T03:38:59.363663Z",
+     "iopub.status.busy": "2026-01-16T03:38:59.363334Z",
+     "iopub.status.idle": "2026-01-16T03:38:59.445148Z",
+     "shell.execute_reply": "2026-01-16T03:38:59.444321Z",
+     "shell.execute_reply.started": "2026-01-16T03:38:59.363633Z"
+    },
+    "trusted": true
+   },
+   "outputs": [],
+   "source": [
+    "imp = pd.DataFrame({\n",
+    "    \"feature\": X.columns,\n",
+    "    \"importance\": rf.feature_importances_\n",
+    "}).sort_values(by=\"importance\", ascending=False)\n",
+    "\n",
+    "print(imp)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "trusted": true
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kaggle": {
+   "accelerator": "none",
+   "dataSources": [
+    {
+     "datasetId": 9259817,
+     "sourceId": 14497523,
+     "sourceType": "datasetVersion"
+    }
+   ],
+   "dockerImageVersionId": 31234,
+   "isGpuEnabled": false,
+   "isInternetEnabled": true,
+   "language": "python",
+   "sourceType": "notebook"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
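
The notebook derives `account_age_days` from `pd.Timestamp.now()`, so the value of that feature depends on when the notebook is run, while the app asks the user for an age in days directly. The sketch below shows the same day-count computed against an explicit reference date using only the standard library. The function name `account_age_days`, the ISO-8601 input format, and treating naive timestamps as UTC are all assumptions, since the dataset's actual `created_at` format is not shown here.

```python
from datetime import datetime, timezone


def account_age_days(created_at, reference):
    """Whole days between an ISO-8601 creation timestamp and a reference time.

    Missing or unparsable timestamps yield 0, mirroring the notebook's
    errors="coerce" followed by fillna(0).
    """
    try:
        created = datetime.fromisoformat(created_at)
    except (TypeError, ValueError):
        return 0
    if created.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        created = created.replace(tzinfo=timezone.utc)
    return (reference - created).days


# A fixed, timezone-aware reference makes the feature reproducible across runs.
reference = datetime(2026, 1, 16, tzinfo=timezone.utc)
print(account_age_days("2016-01-16T00:00:00", reference))
print(account_age_days("not a date", reference))
```

Pinning the reference date at training time and recording it alongside the model is one way to keep training and inference features comparable. The notebook itself does not do this.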

bot_detector_model.pkl → bot_model.joblib RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4a0e804df2cdf0b0c78673cd4c64e0ea1e0d89f74ef678d672c1ff753cc9c92e
-size 433620210

 version https://git-lfs.github.com/spec/v1
+oid sha256:42ceefedd106c136212ada4eb5cb49325228010ebde56edc0b1379da44d23a95
+size 4234857
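
A Git LFS pointer such as the one above stores only a spec version, a SHA-256 object id, and a byte size; the model file itself lives in LFS storage. The following sketch parses a pointer and checks downloaded bytes against it. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a small dict."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key and value:
            fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """True if the bytes have the size and SHA-256 digest recorded in the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError("unsupported hash algorithm: " + pointer["algo"])
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# Pointer text copied from the new side of the diff.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:42ceefedd106c136212ada4eb5cb49325228010ebde56edc0b1379da44d23a95
size 4234857
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

To verify a local copy, one would read `bot_model.joblib` in binary mode and pass the bytes to `matches_pointer`. A mismatch usually means the file on disk is still the text pointer rather than the fetched object.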

requirements.txt CHANGED

@@ -1,10 +1,6 @@
 streamlit
-scikit-learn
 pandas
-numpy
-seaborn
-matplotlib
-torch
 plotly
-transformers

 streamlit
 pandas
+numpy==1.26.4
+scikit-learn==1.3.2
+joblib==1.3.2
 plotly
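
The new requirements pin numpy, scikit-learn, and joblib to exact versions. That fits a serialized scikit-learn estimator, which is not guaranteed to load correctly under a different library version. The sketch below compares installed versions with those pins at startup. `PINS` and `find_mismatches` are hypothetical names, and the pin values are copied from the diff above.

```python
from importlib import metadata

# Exact pins copied from the updated requirements.txt.
PINS = {"numpy": "1.26.4", "scikit-learn": "1.3.2", "joblib": "1.3.2"}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for each unsatisfied pin."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


# An empty dict means the environment matches the pins.
print(find_mismatches(PINS))
```

The `get_version` parameter exists only so the check can be exercised without touching the real environment. In an app, the default lookup through `importlib.metadata` would be used.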