ASHUT0SH-SiNGH committed on
Commit 0fc348c · 1 Parent(s): 1e25c66

Update bot detection model and features

BotDetectionEDA.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
Dataset/Readme.md ADDED
@@ -0,0 +1,46 @@
+ # Bot Detection Dataset 🤖🔍
+
+ Welcome to the Bot Detection Dataset! This dataset is designed to facilitate the analysis and detection of bot accounts on Twitter. It contains a collection of user profiles and associated tweet data, along with a binary label indicating whether each user is a bot or not.
+
+ ## Dataset Information 📊
+
+ The dataset is provided as a CSV file named 'bot_detection_data.csv'. It includes the following columns:
+
+ - User ID: Unique identifier for each user in the dataset.
+ - Username: The username associated with the user.
+ - Tweet: The text content of the tweet.
+ - Retweet Count: The number of times the tweet has been retweeted.
+ - Mention Count: The number of mentions in the tweet.
+ - Follower Count: The number of followers the user has.
+ - Verified: A boolean value indicating whether the user is verified or not.
+ - Bot Label: A label indicating whether the user is a bot (1) or not (0).
+ - Location: The location associated with the user.
+ - Created At: The date and time when the tweet was created.
+ - Hashtags: The hashtags associated with the tweet.
+
+ ## How to Use 📝
+
+ 1. Load the dataset: Read the 'bot_detection_data.csv' file into your preferred data analysis or machine learning tool/library.
+ 2. Preprocess the data: Clean the data, handle missing values, and engineer features as needed.
+ 3. Split the data: Divide the dataset into training and testing sets.
+ 4. Choose a Machine Learning Algorithm: Select one or more algorithms suitable for binary classification, such as Logistic Regression, Random Forest, Gradient Boosting, Support Vector Machines, or Neural Networks.
+ 5. Train the model: Train the chosen algorithm(s) on the training data.
+ 6. Evaluate the model: Evaluate the model's performance using appropriate evaluation metrics.
+ 7. Predict Bot or Not: Apply the trained model to new data to predict whether a user is a bot or not.
+
+ ## ML Algorithms for Bot Detection 🧠💡
+
+ Several machine learning algorithms can be applied to predict bot accounts using this dataset. Some commonly used algorithms include:
+
+ - Logistic Regression
+ - Random Forest
+ - Gradient Boosting (XGBoost, LightGBM)
+ - Support Vector Machines (SVM)
+ - Neural Networks (MLPs, CNNs)
+
+ Experiment with different algorithms and consider performing hyperparameter tuning to optimize the model's performance.
+
+ Remember to acknowledge the dataset source and provide appropriate citations if you use this dataset for research or analysis.
+
+ Enjoy exploring the Bot Detection Dataset and discovering insights into Twitter bot accounts! 🚀🔍
+
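Note on the README's "How to Use" steps: steps 1 to 3 (load, preprocess, split) can be sketched with the standard library alone. This is a minimal sketch, not the author's pipeline. The header strings are assumed to match the README's column list, and the five sample rows are made up for illustration; neither is taken from the real CSV.

```python
import csv
import io
import random

# Hypothetical sample: header names mirror the README's column list, but the
# exact header strings in the real CSV are an assumption. Rows are invented.
SAMPLE_CSV = """User ID,Username,Tweet,Retweet Count,Mention Count,Follower Count,Verified,Bot Label
1,alice,hello world,3,0,120,False,0
2,bot_42,buy now,85,4,15,False,1
3,carol,,1,1,980,True,0
4,dave,good morning,0,2,43,False,0
5,promo_x,click here,60,5,7,False,1
"""


def load_rows(text):
    """Parse CSV text into dicts, converting numeric and boolean columns."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            "retweets": int(raw["Retweet Count"]),
            "mentions": int(raw["Mention Count"]),
            "followers": int(raw["Follower Count"]),
            "verified": raw["Verified"].strip().lower() == "true",
            "tweet": raw["Tweet"] or "",  # a missing tweet becomes empty text
            "label": int(raw["Bot Label"]),
        })
    return rows


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle deterministically, then hold out the last test_fraction of rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


rows = load_rows(SAMPLE_CSV)
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing the algorithms listed in the README against each other.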
Dataset/bot_detection_data.csv ADDED
The diff for this file is too large to render. See raw diff
 
Dataset/testCLICK.csv ADDED
@@ -0,0 +1,31 @@
+ username,followers_count,friends_count,listed_count,favorites_count,statuses_count,description,location,verified,default_profile,default_profile_image,account_age (days),tweet_content
+ 0918Bask,10,1000,1,20,21,15years ago X.Lines24,Tokyo .Japan .,0,0,0,20,Exploring the latest in cybersecurity trends! 🔒
+ 1120Roll,330,485,5,3972,2660,保守見習い地元大好き人間。 経済学、電工、仏教を勉強中、ちなDeではいかんのか? (*^◯^*),神奈川県横浜市,0,1,0,234,Just finished a deep dive into penetration testing. Exciting stuff!
+ 14KBBrown,166,177,0,1185,1254,Let me see what your best move is!,,0,0,0,56,Learning Bash scripting for automation. Any tips?
+ wadespeters,2248,981,101,60304,202968,20. menna: #farida #nyc and the 80s actually you | Dragana,#freePalestine - rip paul,0,0,0,357,Networking basics are essential for security professionals. Stay sharp!
+ 191a5bd05da04dc,21,79,0,5,82,Cosmetologist,Wichita KS,0,1,0,343,TryHackMe challenges keep me engaged and learning!
+ 19_Joanne_87,641,1066,7,1568,12915,CHRISTIAN -Communication degree -graphic designer- makeup artist-pianist- FANGIRL: #Castle #EDBWK2MnNAA #AgentsOfShield #ESDLC #ChasingLife #SavingHope,,0,0,0,45,Reading about blockchain security. The future is decentralized!
+ 1Dniallprincess,1042,2000,7,19012,13676,"Live, Young, Wild and Free #crazymofo",Alaska XD,0,1,0,34,CTF challenges really test your problem-solving skills. Love it!
+ 1GisellePizarro,561,118,4,590,61294,Hey what's up guys? This is Giselle. I'm 21. College student and FanFiction writer all in one. :) #Rusher #Maslover and more. KENLOS/4 (07/22 and 11/11),"Antofagasta, Chile",0,0,0,567,Wireshark is a powerful tool for network analysis!
+ 1Nicoleromany,337,256,4,1407,4854,,,0,1,0,786,Nmap scripting is something I want to master next!
+ 1_DErika,421,338,5,2227,2408,I am not a perfect angel that you think you see. #Directioner #KatyCat #Mixer,,0,1,0,45,Application security is a critical skill for ethical hackers.
+ 29PurpleDragons,335,276,4,10570,24581,John 5: 28-29,"Apia, Samoa",0,0,0,67,Metasploit is a great tool for penetration testing!
+ 2cdevelopment,232,225,56,101,2132,"The 2C Digital Agency is determined to make a business in your city successful. Our only question is, will it be yours?",Midwest USA,0,0,0,30,SQL injection vulnerabilities are more common than you think!
+ 2hip4tv,1948,2096,88,3,10354,KTVU Photojournalist looking for the scoop. News is in the eye of the beholder. I hope you like what you see here on my twitter.,"Bay Area, Ca.",0,0,0,2,Red teaming vs. blue teaming – both sides are fascinating!
+ 3shaa_,271,216,0,8487,18484,"Jeddy, coffee & cheese",,0,0,0,43,Bug bounty hunting requires patience and skill. Respect to all hunters!
+ 510Daniel,16,74,0,397,88,Oakland born and raised!! SJSU Graduate #ServingAndProtecting My 510 ....Interesting Random Fact: I enjoy meteorology and science!!,"Bay Area, California",0,0,0,23,"Cybersecurity is a journey, not a destination. Keep learning!"
+ davideb66,22,40,0,1,1299,,,0,1,1,90,Exploring the latest in cybersecurity trends! 🔒
+ ElisaDospina,12561,3442,110,16358,18665,"Autrice del libro #unavitatuttacurve dal 9 aprile in tutte le librerie.Top model #curvy, su @Raidue tutor di #moda per @dettofattorai2",Italy,0,0,0,34,Just finished a deep dive into penetration testing. Exciting stuff!
+ Vladimir65,600,755,6,14,22987,[Live Long and Prosper],"iPhone: 45.471680,9.192429",0,0,0,21,Learning Bash scripting for automation. Any tips?
+ RafielaMorales,398,350,2,11,7975,"Cuasi Odontologa*♥,#Bipolar, #Sarcastica & Some might say im a BiTch but I'm just a Free beast in a Wild life.- #1God'sFan, Dreamer & Music Believer~","ÜT: 18.4698712,-69.9327525",0,0,0,25,Networking basics are essential for security professionals. Stay sharp!
+ FabrizioC_c,413,405,8,162,20218,"I shall rise from my own death, to avenge hers with all the powers of darkness.",Firenze,0,0,0,896,TryHackMe challenges keep me engaged and learning!
+ Marianocrt,134,401,1,55,15259,O scrivi Italia o scrivi libertà. Due termini distanti come la Costituzione formale e quella materiale!,,0,0,0,45,Reading about blockchain security. The future is decentralized!
+ marzia_hayley,337,630,1,655,9551,paramore 10/06/13 ♥♥ - Tonight Alive - TVD -TO - OUAT- Revenge - TW - SPN ecc.,roma,0,0,0,14,CTF challenges really test your problem-solving skills. Love it!
+ RobertoBoscaini,28,105,0,38,206,"Appassionato di manga, anime, cimema, serie tv, wrestling, sport, Giappone.",Cave (RM),0,0,0,147,Wireshark is a powerful tool for network analysis!
+ ilsaggiolibro,2617,52,28,0,93793,,,0,0,0,236,Nmap scripting is something I want to master next!
+ RosannaPilano,1561,2001,0,0,490,"Focosa, onesta, sincera. Mai tradire.",Milano,0,0,0,46,Application security is a critical skill for ethical hackers.
+ Camillesr78,2355,2074,3,0,450,"I love to travel, go on long walks on a gorgeous day, grill out with friends, read a good book, anything involving the water, see live music or an occasional",San Francisco,0,0,0,81,Metasploit is a great tool for penetration testing!
+ Esteryr81,4772,5167,5,0,507,"La mia vita è una festa, ma anche quella di una donna riflessiva. Scegliete la parte che vi piace di più.",Cagliari,0,0,0,29,SQL injection vulnerabilities are more common than you think!
+ Moniqueeo84,5772,6022,7,0,513,"Molto socievole, amo la cucina, il vino, il calcio e gli amici.",Emilia Romagna,0,0,0,51,Red teaming vs. blue teaming – both sides are fascinating!
+ EsterWalshgm75,124,0,0,0,311,"I've been described as the life of the party as well as a deep thinker. I like to have fun, laugh, be ridiculous, or sit around with a drink and talk about th",San Jose,0,0,0,21,Bug bounty hunting requires patience and skill. Respect to all hunters!
+ Adelabx71,4375,4777,5,0,471,L'apparenza non è importante.,Roma,0,0,0,24,"Cybersecurity is a journey, not a destination. Keep learning!"
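Note on parsing this file: several rows quote fields that contain commas, some fields are empty, and the age column is literally named `account_age (days)`. The sketch below shows one way to read such rows with the standard `csv` module and derive the follower/friend ratio the way app.py does (`followers / (friends + 1)`). The header and both rows are copied from the diff above; the function name `parse_accounts` and the output keys are illustrative choices, not part of the repository.

```python
import csv
import io

# Header and both data rows are copied verbatim from Dataset/testCLICK.csv.
SAMPLE = '''username,followers_count,friends_count,listed_count,favorites_count,statuses_count,description,location,verified,default_profile,default_profile_image,account_age (days),tweet_content
14KBBrown,166,177,0,1185,1254,Let me see what your best move is!,,0,0,0,56,Learning Bash scripting for automation. Any tips?
510Daniel,16,74,0,397,88,Oakland born and raised!! SJSU Graduate #ServingAndProtecting My 510 ....Interesting Random Fact: I enjoy meteorology and science!!,"Bay Area, California",0,0,0,23,"Cybersecurity is a journey, not a destination. Keep learning!"
'''


def parse_accounts(text):
    """Read account rows and compute the derived ratio feature."""
    accounts = []
    for raw in csv.DictReader(io.StringIO(text)):
        followers = int(raw["followers_count"])
        friends = int(raw["friends_count"])
        accounts.append({
            "username": raw["username"],
            "location": raw["location"],  # empty string when the field is blank
            "account_age_days": int(raw["account_age (days)"]),
            # +1 in the denominator avoids division by zero, as in app.py
            "follow_ratio": followers / (friends + 1),
        })
    return accounts


accounts = parse_accounts(SAMPLE)
for account in accounts:
    print(account["username"], round(account["follow_ratio"], 3))
```

Splitting lines on "," by hand would break the `"Bay Area, California"` field into two columns; `csv.DictReader` handles the quoting.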
Dataset/training_data.csv ADDED
The diff for this file is too large to render. See raw diff
 
app.py CHANGED
@@ -1,6 +1,5 @@
1
  import streamlit as st
2
  import pandas as pd
3
- import pickle
4
  import re
5
  import numpy as np
6
  import plotly.express as px
@@ -8,10 +7,13 @@ import plotly.graph_objects as go
8
  from datetime import datetime
9
  import time
10
  import base64
 
 
11
 
12
  def get_default_robot_icon():
13
  return "https://raw.githubusercontent.com/FortAwesome/Font-Awesome/master/svgs/solid/robot.svg"
14
 
 
15
  # Set page configuration
16
  st.set_page_config(
17
  page_title="Twitter Bot Detector",
@@ -62,40 +64,56 @@ st.markdown("""
62
  </style>
63
  """, unsafe_allow_html=True)
64
 
65
  @st.cache_resource
66
- def load_model(model_path='bot_detector_model.pkl'):
67
  try:
68
- with open(model_path, 'rb') as f:
69
- model_components = pickle.load(f)
70
- return model_components
71
  except FileNotFoundError:
72
- st.error("Model file not found. Please ensure the model is trained and saved.")
73
  return None
74
 
75
- def make_prediction(features, tweet_content, model_components):
76
- features_scaled = model_components['scaler'].transform(features)
77
- behavioral_probs = model_components['behavioral_model'].predict_proba(features_scaled)[0]
78
-
79
- if tweet_content and tweet_content.strip():
80
- tweet_features = model_components['tweet_vectorizer'].transform([tweet_content])
81
- tweet_probs = model_components['tweet_model'].predict_proba(tweet_features)[0]
82
- final_probs = 0.8 * behavioral_probs + 0.2 * tweet_probs
83
- else:
84
- final_probs = behavioral_probs
85
-
86
- prediction = (final_probs[1] > 0.5)
87
- confidence = final_probs[1] if prediction else final_probs[0]
88
- return prediction, confidence, final_probs
89
-
90
- def create_gauge_chart(confidence, prediction):
91
  fig = go.Figure(go.Indicator(
92
- mode = "gauge+number",
93
- value = confidence * 100,
94
- domain = {'x': [0, 1], 'y': [0, 1]},
95
- title = {'text': "Confidence Score"},
96
- gauge = {
97
  'axis': {'range': [None, 100]},
98
- 'bar': {'color': "darkred" if prediction else "darkgreen"},
99
  'steps': [
100
  {'range': [0, 33], 'color': 'lightgray'},
101
  {'range': [33, 66], 'color': 'gray'},
@@ -111,11 +129,12 @@ def create_gauge_chart(confidence, prediction):
111
  fig.update_layout(height=300)
112
  return fig
113
 
 
114
  def create_probability_chart(probs):
115
  labels = ['Human', 'Bot']
116
  fig = go.Figure(data=[go.Pie(
117
  labels=labels,
118
- values=[probs[0]*100, probs[1]*100],
119
  hole=.3,
120
  marker_colors=['#00CC96', '#EF553B']
121
  )])
@@ -125,12 +144,57 @@ def create_probability_chart(probs):
125
  )
126
  return fig
127
 
128
  def main():
129
  # Sidebar with extended navigation
130
  st.sidebar.image("piclumen-1739279351872.png", width=100) # Replace with your logo
131
  st.sidebar.title("Navigation")
132
  page = st.sidebar.radio("Go to", ["Bot Detection", "CSV Analysis", "About", "Statistics"])
133
-
134
  if page == "Bot Detection":
135
  st.title("🤖 Twitter Bot Detection System")
136
  st.markdown("""
@@ -140,202 +204,213 @@ def main():
140
  Our system uses multiple features and sophisticated algorithms to provide accurate detection results.</p>
141
  </div>
142
  """, unsafe_allow_html=True)
143
- # Load model components
144
- model_components = load_model()
145
-
146
- if model_components is None:
147
  st.stop()
148
-
149
  # Create tabs for individual account analysis
150
  tab1, tab2 = st.tabs(["📝 Input Details", "📊 Analysis Results"])
151
-
152
  with tab1:
153
  st.markdown("### Account Information")
154
-
155
- col1, col2, col3 = st.columns([1,1,1])
156
-
157
  with col1:
158
  name = st.text_input("Account Name", placeholder="@username")
159
  followers_count = st.number_input("Followers Count", min_value=0)
160
  friends_count = st.number_input("Friends Count", min_value=0)
161
  listed_count = st.number_input("Listed Count", min_value=0)
162
-
163
  with col2:
164
  favorites_count = st.number_input("Favorites Count", min_value=0)
165
  statuses_count = st.number_input("Statuses Count", min_value=0)
166
  account_age = st.number_input("Account Age (days)", min_value=0)
167
-
168
  with col3:
169
  description = st.text_area("Profile Description")
170
  location = st.text_input("Location")
171
-
172
  st.markdown("### Account Properties")
173
  prop_col1, prop_col2, prop_col3 = st.columns(3)
174
-
175
  with prop_col1:
176
  verified = st.checkbox("Verified Account")
177
  with prop_col2:
178
  default_profile = st.checkbox("Default Profile")
179
  with prop_col3:
180
  default_profile_image = st.checkbox("Default Profile Image")
181
-
182
- # These can be fixed or computed; here we assume True as default
183
  has_extended_profile = True
184
  has_url = True
185
-
186
  st.markdown("### Tweet Content")
187
- tweet_content = st.text_area("Sample Tweet", height=100)
188
-
189
  if st.button("🔍 Analyze Account"):
190
  with st.spinner('Analyzing account characteristics...'):
191
- # Prepare features for the single account
192
- features = pd.DataFrame([{
193
- 'followers_count': followers_count,
194
- 'friends_count': friends_count,
195
- 'listed_count': listed_count,
196
- 'favorites_count': favorites_count,
197
- 'statuses_count': statuses_count,
198
- 'verified': int(verified),
199
- 'followers_friends_ratio': followers_count / (friends_count + 1),
200
- 'statuses_per_day': statuses_count / (account_age + 1),
201
- 'engagement_ratio': favorites_count / (statuses_count + 1),
202
- 'account_age_days': account_age,
203
- 'name_length': len(name),
204
- 'name_has_digits': int(bool(re.search(r'\d', name))),
205
- 'description_length': len(description),
206
- 'has_location': int(bool(location.strip())),
207
- 'has_url': True,
208
- 'default_profile': int(default_profile),
209
- 'default_profile_image': int(default_profile_image),
210
- 'has_extended_profile': True
211
- }])
212
-
213
- # Make prediction
214
- prediction, confidence, probs = make_prediction(features, tweet_content, model_components)
215
-
216
- # Switch to results tab
217
  time.sleep(1)
218
  tab2.markdown("### Analysis Complete!")
219
-
220
  with tab2:
221
- if prediction:
222
  st.error("🤖 Bot Account Detected!")
223
  else:
224
  st.success("👤 Human Account Detected!")
225
-
226
  metric_col1, metric_col2 = st.columns(2)
227
-
228
  with metric_col1:
229
- st.plotly_chart(create_gauge_chart(confidence, prediction), use_container_width=True)
230
  with metric_col2:
231
  st.plotly_chart(create_probability_chart(probs), use_container_width=True)
232
-
233
  st.markdown("### Feature Analysis")
234
- feature_importance = pd.DataFrame({
235
- 'Feature': model_components['feature_names'],
236
- 'Importance': model_components['behavioral_model'].feature_importances_
237
- }).sort_values('Importance', ascending=False)
238
-
239
- fig = px.bar(feature_importance,
240
- x='Importance',
241
- y='Feature',
242
- orientation='h',
243
- title='Feature Importance Analysis')
244
- fig.update_layout(height=400)
245
- st.plotly_chart(fig, use_container_width=True)
246
-
247
  metrics_data = {
248
  'Metric': ['Followers', 'Friends', 'Tweets', 'Favorites'],
249
  'Count': [followers_count, friends_count, statuses_count, favorites_count]
250
  }
251
- fig = px.bar(metrics_data,
252
- x='Metric',
253
- y='Count',
254
- title='Account Metrics Overview',
255
- color='Count',
256
- color_continuous_scale='Viridis')
 
 
257
  st.plotly_chart(fig, use_container_width=True)
258
-
259
  elif page == "CSV Analysis":
260
  st.title("CSV Batch Analysis")
261
- st.markdown("Upload a CSV file with account data to run batch predictions.")
262
  uploaded_file = st.file_uploader("Upload CSV", type=["csv"])
263
-
264
  if uploaded_file is not None:
265
  data = pd.read_csv(uploaded_file)
266
  st.markdown("### CSV Data Preview")
267
  st.dataframe(data.head())
268
-
269
- model_components = load_model()
270
- if model_components is None:
271
  st.stop()
272
-
273
- # Get the feature names in the correct order from the scaler
274
- feature_names = model_components['scaler'].feature_names_in_
275
-
276
  predictions = []
277
  confidences = []
278
- prediction_labels = [] # New list to store emoji labels
279
-
280
  with st.spinner("Processing accounts..."):
281
  for idx, row in data.iterrows():
282
- # Create a dictionary with all features initialized to 0
283
- feature_dict = {
284
- 'followers_count': row['followers_count'],
285
- 'friends_count': row['friends_count'],
286
- 'listed_count': row['listed_count'],
287
- 'favorites_count': row['favorites_count'],
288
- 'statuses_count': row['statuses_count'],
289
- 'verified': int(row['verified']),
290
- 'followers_friends_ratio': row['followers_count'] / (row['friends_count'] + 1),
291
- 'statuses_per_day': row['statuses_count'] / (row['account_age (days)'] + 1),
292
- 'engagement_ratio': row['favorites_count'] / (row['statuses_count'] + 1),
293
- 'account_age_days': row['account_age (days)'],
294
- 'name_length': len(row['username']),
295
- 'name_has_digits': int(bool(re.search(r'\d', row['username']))),
296
- 'description_length': len(str(row['description'])),
297
- 'has_location': int(bool(str(row['location']).strip())),
298
- 'default_profile': int(row['default_profile']),
299
- 'default_profile_image': int(row['default_profile_image']),
300
- 'has_url': 0,
301
- 'has_extended_profile': 0
302
- }
303
-
304
- # Create DataFrame with features in the correct order
305
- features = pd.DataFrame([{name: feature_dict.get(name, 0) for name in feature_names}])
306
-
307
- tweet_text = row['tweet_content'] if 'tweet_content' in row else ""
308
- pred, conf, _ = make_prediction(features, tweet_text, model_components)
309
- predictions.append(pred)
310
  confidences.append(conf)
311
- # Add emoji based on prediction
312
- prediction_labels.append('🤖' if pred == 1 else '👤')
313
-
314
  data['prediction'] = predictions
315
  data['confidence'] = confidences
316
- data['account_type'] = prediction_labels # Add new column with emojis
317
-
318
  st.markdown("### Batch Prediction Results")
319
- # Reorder columns to show the prediction and emoji first
320
- cols = ['username', 'account_type', 'prediction', 'confidence'] + [col for col in data.columns if col not in ['username', 'account_type', 'prediction', 'confidence']]
 
321
  st.dataframe(data[cols])
322
-
323
- # If ground truth labels are provided, compute evaluation metrics
324
  if 'label' in data.columns:
325
  y_true = data['label'].tolist()
326
  y_pred = [int(p) for p in predictions]
 
327
  from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
328
  f1 = f1_score(y_true, y_pred, average='weighted')
329
  precision = precision_score(y_true, y_pred, average='weighted')
330
  recall = recall_score(y_true, y_pred, average='weighted')
331
  report = classification_report(y_true, y_pred)
332
-
333
  st.markdown("### Evaluation Metrics")
334
  st.write("F1 Score:", f1)
335
  st.write("Precision:", precision)
336
  st.write("Recall:", recall)
337
  st.text(report)
338
-
339
  elif page == "About":
340
  st.title("About the Bot Detection System")
341
  st.markdown("""
@@ -348,7 +423,7 @@ def main():
348
  """, unsafe_allow_html=True)
349
  st.markdown("### 🔑 Key Features Analyzed")
350
  col1, col2 = st.columns(2)
351
-
352
  with col1:
353
  st.markdown("""
354
  #### Account Characteristics
@@ -356,7 +431,7 @@ def main():
356
  - Account age and verification status
357
  - Username patterns
358
  - Profile description analysis
359
-
360
  #### Behavioral Patterns
361
  - Posting frequency
362
  - Engagement rates
@@ -369,14 +444,14 @@ def main():
369
  - Follower-following ratio
370
  - Friend acquisition rate
371
  - Network growth patterns
372
-
373
  #### Content Analysis
374
  - Tweet sentiment
375
  - Language patterns
376
  - URL sharing frequency
377
  - Hashtag usage
378
  """)
379
-
380
  st.markdown("""
381
  <div class='info-box'>
382
  <h3>⚙ Technical Implementation</h3>
@@ -388,10 +463,10 @@ def main():
388
  </ul>
389
  </div>
390
  """, unsafe_allow_html=True)
391
-
392
  st.markdown("### 📊 System Performance")
393
  metrics_col1, metrics_col2, metrics_col3, metrics_col4 = st.columns(4)
394
-
395
  with metrics_col1:
396
  st.metric("Accuracy", "87%")
397
  with metrics_col2:
@@ -400,7 +475,7 @@ def main():
400
  st.metric("Recall", "83%")
401
  with metrics_col4:
402
  st.metric("F1 Score", "86%")
403
-
404
  st.markdown("""
405
  ### 🎯 Common Use Cases
406
  - *Social Media Management*: Identify and remove bot accounts
@@ -408,52 +483,58 @@ def main():
408
  - *Marketing*: Verify authentic engagement
409
  - *Security*: Protect against automated threats
410
  """)
411
-
412
  else: # Statistics page
413
  st.title("System Statistics")
414
  col1, col2 = st.columns(2)
415
-
416
  with col1:
417
  detection_data = {
418
  'Category': ['Bots', 'Humans'],
419
  'Count': [737, 826]
420
  }
421
- fig = px.pie(detection_data,
422
- values='Count',
423
- names='Category',
424
- title='Detection Distribution',
425
- color_discrete_sequence=['#FF4B4B', '#00CC96'])
 
 
426
  st.plotly_chart(fig, use_container_width=True)
427
-
428
  with col2:
429
  confidence_data = {
430
- 'Score': ['90-100%', '80-90%', '70-80%', '60-70%', '50-60%'],
431
- 'Count': [178, 447, 503, 352, 83] # Total = 1563
432
  }
433
- fig = px.bar(confidence_data,
434
- x='Score',
435
- y='Count',
436
- title='Confidence Score Distribution',
437
- color='Count',
438
- color_continuous_scale='Viridis')
 
 
439
  st.plotly_chart(fig, use_container_width=True)
440
-
441
  st.markdown("### Monthly Detection Trends")
442
  monthly_data = {
443
  'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
444
  'Bots Detected': [45, 52, 38, 65, 48, 76],
445
  'Accuracy': [92, 94, 93, 95, 94, 96]
446
  }
447
- fig = px.line(monthly_data,
448
- x='Month',
449
- y=['Bots Detected', 'Accuracy'],
450
- title='Monthly Performance Metrics',
451
- markers=True)
 
 
452
  st.plotly_chart(fig, use_container_width=True)
453
-
454
  st.markdown("### Key System Metrics")
455
  metric_col1, metric_col2, metric_col3, metric_col4 = st.columns(4)
456
-
457
  with metric_col1:
458
  st.metric("Total Analyses", "1,000", "+12%")
459
  with metric_col2:
@@ -463,5 +544,6 @@ def main():
463
  with metric_col4:
464
  st.metric("Processing Time", "1.2s", "-0.3s")
465
 
 
466
  if __name__ == "__main__":
467
- main()
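Note on the removed `make_prediction`: the old version blended two probability vectors, `0.8 * behavioral_probs + 0.2 * tweet_probs`, fell back to the behavioral model alone when no tweet text was given, and flagged a bot when the blended bot probability exceeded 0.5. The sketch below restates that logic on plain lists so the arithmetic is easy to check. The function names `blend_probabilities` and `decide` are hypothetical and the input numbers are invented.

```python
def blend_probabilities(behavioral_probs, tweet_probs=None, behavioral_weight=0.8):
    """Return [p_human, p_bot] as a weighted mix of the two models' outputs."""
    if tweet_probs is None:
        # No tweet text: use the behavioral model on its own.
        return list(behavioral_probs)
    w = behavioral_weight
    return [w * b + (1 - w) * t for b, t in zip(behavioral_probs, tweet_probs)]


def decide(final_probs, threshold=0.5):
    """Flag a bot when p_bot exceeds the threshold; confidence is the winning probability."""
    is_bot = final_probs[1] > threshold
    confidence = final_probs[1] if is_bot else final_probs[0]
    return is_bot, confidence


# Invented inputs: behavioral model leans bot, tweet model leans human.
final = blend_probabilities([0.3, 0.7], [0.9, 0.1])
is_bot, confidence = decide(final)
print(final, is_bot, confidence)
```

With the 0.8 weight, the tweet model can only flip the outcome when the behavioral model is close to the threshold, which is consistent with the new version dropping it entirely.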
 
1
  import streamlit as st
2
  import pandas as pd
 
3
  import re
4
  import numpy as np
5
  import plotly.express as px
 
7
  from datetime import datetime
8
  import time
9
  import base64
10
+ import joblib
11
+
12
 
13
  def get_default_robot_icon():
14
  return "https://raw.githubusercontent.com/FortAwesome/Font-Awesome/master/svgs/solid/robot.svg"
15
 
16
+
17
  # Set page configuration
18
  st.set_page_config(
19
  page_title="Twitter Bot Detector",
 
64
  </style>
65
  """, unsafe_allow_html=True)
66
 
67
+
68
+ # The model was trained on these 11 features, in this order
69
+ MODEL_FEATURES = [
70
+ "followers_count",
71
+ "friends_count",
72
+ "listedcount",
73
+ "favourites_count",
74
+ "statuses_count",
75
+ "verified",
76
+ "default_profile",
77
+ "default_profile_image",
78
+ "has_extended_profile",
79
+ "follow_ratio",
80
+ "account_age_days",
81
+ ]
82
+
83
+
84
  @st.cache_resource
85
+ def load_model(model_path="bot_model.joblib"):
86
  try:
87
+ model = joblib.load(model_path)
88
+ return model
 
89
  except FileNotFoundError:
90
+ st.error("Model file not found. Please ensure 'bot_model.joblib' exists in the project folder.")
91
  return None
92
+ except Exception as e:
93
+ st.error(f"Failed to load model: {e}")
94
+ return None
95
+
96
+
97
+ def make_prediction(features_df, model):
98
+ """
99
+ Behavioral-only RandomForest prediction.
100
+ features_df MUST have the same columns used in training.
101
+ """
102
+ probs = model.predict_proba(features_df)[0]
103
+ pred_class = int(np.argmax(probs)) # 0 = Human, 1 = Bot
104
+ confidence = float(probs[pred_class])
105
+ return pred_class, confidence, probs
106
 
107
+
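Note on `int(np.argmax(probs))` in the new `make_prediction`: the result is a column index into `predict_proba`'s output, and scikit-learn orders those columns by the fitted estimator's `classes_` attribute. The "0 = Human, 1 = Bot" comment therefore holds only if the model was trained on integer labels 0 and 1, which this diff does not show. A hedged sketch of a label lookup that does not depend on that assumption, using a stand-in object instead of a real model (`StubModel` and `predict_label` are hypothetical names):

```python
class StubModel:
    """Stand-in for a fitted classifier; hypothetical, for illustration only."""

    # Columns of predict_proba follow this order, as in scikit-learn.
    classes_ = ["bot", "human"]

    def predict_proba(self, rows):
        return [[0.2, 0.8] for _ in rows]


def predict_label(model, rows):
    """Return (label, probability) for the first row, looked up via classes_."""
    probs = model.predict_proba(rows)[0]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return model.classes_[best], probs[best]


label, confidence = predict_label(StubModel(), [[0]])
print(label, confidence)
```

Here index 1 wins, but it maps to "human", not "bot"; reading the index as the class would invert the result for a model trained on string labels.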
108
+ def create_gauge_chart(confidence, prediction_is_bot):
109
  fig = go.Figure(go.Indicator(
110
+ mode="gauge+number",
111
+ value=confidence * 100,
112
+ domain={'x': [0, 1], 'y': [0, 1]},
113
+ title={'text': "Confidence Score"},
114
+ gauge={
115
  'axis': {'range': [None, 100]},
116
+ 'bar': {'color': "darkred" if prediction_is_bot else "darkgreen"},
117
  'steps': [
118
  {'range': [0, 33], 'color': 'lightgray'},
119
  {'range': [33, 66], 'color': 'gray'},
 
129
  fig.update_layout(height=300)
130
  return fig
131
 
132
+
133
  def create_probability_chart(probs):
134
  labels = ['Human', 'Bot']
135
  fig = go.Figure(data=[go.Pie(
136
  labels=labels,
137
+ values=[probs[0] * 100, probs[1] * 100],
138
  hole=.3,
139
  marker_colors=['#00CC96', '#EF553B']
140
  )])
 
144
  )
145
  return fig
146
 
147
+
148
+ def build_model_features_from_ui(
149
+ followers_count: int,
150
+ friends_count: int,
151
+ listed_count: int,
152
+ favorites_count: int,
153
+ statuses_count: int,
154
+ verified: bool,
155
+ default_profile: bool,
156
+ default_profile_image: bool,
157
+ has_extended_profile: bool,
158
+ account_age_days: int
159
+ ) -> pd.DataFrame:
160
+ """
161
+ Converts UI inputs to the EXACT schema expected by the trained RF model.
162
+ The UI is unchanged; only the feature mapping changes.
163
+
164
+ Mapping:
165
+ listed_count -> listedcount
166
+ favorites_count -> favourites_count
167
+ followers_friends_ratio -> follow_ratio
168
+ account_age -> account_age_days
169
+ """
170
+
171
+ follow_ratio = followers_count / (friends_count + 1)
172
+
173
+ features = pd.DataFrame([{
174
+ "followers_count": followers_count,
175
+ "friends_count": friends_count,
176
+ "listedcount": listed_count,
177
+ "favourites_count": favorites_count,
178
+ "statuses_count": statuses_count,
179
+ "verified": int(verified),
180
+ "default_profile": int(default_profile),
181
+ "default_profile_image": int(default_profile_image),
182
+ "has_extended_profile": int(has_extended_profile),
183
+ "follow_ratio": follow_ratio,
184
+ "account_age_days": account_age_days,
185
+ }])
186
+
187
+ # enforce correct order
188
+ features = features[MODEL_FEATURES]
189
+ return features
190
+
191
+
192
  def main():
193
  # Sidebar with extended navigation
194
  st.sidebar.image("piclumen-1739279351872.png", width=100) # Replace with your logo
195
  st.sidebar.title("Navigation")
196
  page = st.sidebar.radio("Go to", ["Bot Detection", "CSV Analysis", "About", "Statistics"])
197
+
198
  if page == "Bot Detection":
199
  st.title("🤖 Twitter Bot Detection System")
200
  st.markdown("""
 
204
  Our system uses multiple features and sophisticated algorithms to provide accurate detection results.</p>
205
  </div>
206
  """, unsafe_allow_html=True)
207
+
208
+ # Load model
209
+ model = load_model()
210
+ if model is None:
211
  st.stop()
212
+
213
  # Create tabs for individual account analysis
214
  tab1, tab2 = st.tabs(["📝 Input Details", "📊 Analysis Results"])
215
+
216
  with tab1:
217
  st.markdown("### Account Information")
218
+
219
+ col1, col2, col3 = st.columns([1, 1, 1])
220
+
221
  with col1:
222
  name = st.text_input("Account Name", placeholder="@username")
223
  followers_count = st.number_input("Followers Count", min_value=0)
224
  friends_count = st.number_input("Friends Count", min_value=0)
225
  listed_count = st.number_input("Listed Count", min_value=0)
226
+
227
  with col2:
228
  favorites_count = st.number_input("Favorites Count", min_value=0)
229
  statuses_count = st.number_input("Statuses Count", min_value=0)
230
  account_age = st.number_input("Account Age (days)", min_value=0)
231
+
232
  with col3:
233
  description = st.text_area("Profile Description")
234
  location = st.text_input("Location")
235
+
236
  st.markdown("### Account Properties")
237
  prop_col1, prop_col2, prop_col3 = st.columns(3)
238
+
239
  with prop_col1:
240
  verified = st.checkbox("Verified Account")
241
  with prop_col2:
242
  default_profile = st.checkbox("Default Profile")
243
  with prop_col3:
244
  default_profile_image = st.checkbox("Default Profile Image")
245
+
246
+ # kept same UI logic
247
  has_extended_profile = True
248
  has_url = True
249
+
250
  st.markdown("### Tweet Content")
251
+ tweet_content = st.text_area("Sample Tweet", height=100) # kept in the UI; not used by the model
252
+
253
  if st.button("🔍 Analyze Account"):
254
  with st.spinner('Analyzing account characteristics...'):
255
+ # Build only the 11 features the RF model expects
256
+ features = build_model_features_from_ui(
257
+ followers_count=followers_count,
258
+ friends_count=friends_count,
259
+ listed_count=listed_count,
260
+ favorites_count=favorites_count,
261
+ statuses_count=statuses_count,
262
+ verified=verified,
263
+ default_profile=default_profile,
264
+ default_profile_image=default_profile_image,
265
+ has_extended_profile=has_extended_profile,
266
+ account_age_days=account_age
267
+ )
268
+
269
+ # ✅ Predict
270
+ pred_class, confidence, probs = make_prediction(features, model)
271
+ prediction_is_bot = (pred_class == 1)
272
+
273
  time.sleep(1)
274
  tab2.markdown("### Analysis Complete!")
275
+
276
  with tab2:
277
+ if prediction_is_bot:
278
  st.error("🤖 Bot Account Detected!")
279
  else:
280
  st.success("👤 Human Account Detected!")
281
+
282
  metric_col1, metric_col2 = st.columns(2)
283
+
284
  with metric_col1:
285
+ st.plotly_chart(create_gauge_chart(confidence, prediction_is_bot), use_container_width=True)
286
  with metric_col2:
287
  st.plotly_chart(create_probability_chart(probs), use_container_width=True)
288
+
289
  st.markdown("### Feature Analysis")
290
+
291
+ # Feature importance (RF supports this)
292
+ if hasattr(model, "feature_importances_"):
293
+ feature_importance = pd.DataFrame({
294
+ 'Feature': MODEL_FEATURES,
295
+ 'Importance': model.feature_importances_
296
+ }).sort_values('Importance', ascending=False)
297
+
298
+ fig = px.bar(
299
+ feature_importance,
300
+ x='Importance',
301
+ y='Feature',
302
+ orientation='h',
303
+ title='Feature Importance Analysis'
304
+ )
305
+ fig.update_layout(height=400)
306
+ st.plotly_chart(fig, use_container_width=True)
307
+ else:
308
+ st.info("Feature importance is not available for this model type.")
309
+
310
  metrics_data = {
311
  'Metric': ['Followers', 'Friends', 'Tweets', 'Favorites'],
312
  'Count': [followers_count, friends_count, statuses_count, favorites_count]
313
  }
314
+ fig = px.bar(
315
+ metrics_data,
316
+ x='Metric',
317
+ y='Count',
318
+ title='Account Metrics Overview',
319
+ color='Count',
320
+ color_continuous_scale='Viridis'
321
+ )
322
  st.plotly_chart(fig, use_container_width=True)
323
+
324
  elif page == "CSV Analysis":
325
  st.title("CSV Batch Analysis")
326
+ st.markdown("Upload a CSV file with account data to run batch predictions. You can use test_Click from the Dataset folder of this repository.")
327
  uploaded_file = st.file_uploader("Upload CSV", type=["csv"])
328
+
329
  if uploaded_file is not None:
330
  data = pd.read_csv(uploaded_file)
331
  st.markdown("### CSV Data Preview")
332
  st.dataframe(data.head())
333
+
334
+ model = load_model()
335
+ if model is None:
336
  st.stop()
337
+
338
  predictions = []
339
  confidences = []
340
+ prediction_labels = []
341
+
342
  with st.spinner("Processing accounts..."):
343
  for idx, row in data.iterrows():
344
+
345
+ # flexible column names support
346
+ followers = row.get("followers_count", 0)
347
+ friends = row.get("friends_count", 0)
348
+ statuses = row.get("statuses_count", 0)
349
+
350
+ # allow either listedcount or listed_count
351
+ listed = row.get("listedcount", row.get("listed_count", 0))
352
+
353
+ # allow either favourites_count or favorites_count
354
+ favourites = row.get("favourites_count", row.get("favorites_count", 0))
355
+
356
+ verified = int(row.get("verified", 0))
357
+ default_profile = int(row.get("default_profile", 0))
358
+ default_profile_image = int(row.get("default_profile_image", 0))
359
+ has_extended_profile = int(row.get("has_extended_profile", 0))
360
+
361
+ # allow account_age_days or "account_age (days)"
362
+ age_days = row.get("account_age_days", row.get("account_age (days)", 0))
363
+
364
+ # compute follow_ratio if not present
365
+ follow_ratio = row.get("follow_ratio", followers / (friends + 1))
366
+
367
+ features = pd.DataFrame([{
368
+ "followers_count": followers,
369
+ "friends_count": friends,
370
+ "listedcount": listed,
371
+ "favourites_count": favourites,
372
+ "statuses_count": statuses,
373
+ "verified": verified,
374
+ "default_profile": default_profile,
375
+ "default_profile_image": default_profile_image,
376
+ "has_extended_profile": has_extended_profile,
377
+ "follow_ratio": follow_ratio,
378
+ "account_age_days": age_days,
379
+ }])[MODEL_FEATURES]
380
+
381
+ pred_class, conf, _ = make_prediction(features, model)
382
+
383
+ predictions.append(pred_class)
384
  confidences.append(conf)
385
+ prediction_labels.append('🤖' if pred_class == 1 else '👤')
386
+
 
387
  data['prediction'] = predictions
388
  data['confidence'] = confidences
389
+ data['account_type'] = prediction_labels
390
+
391
  st.markdown("### Batch Prediction Results")
392
+ lead_cols = [c for c in ['username', 'account_type', 'prediction', 'confidence'] if c in data.columns]
393
+ # 'username' is optional in the upload, so only columns that exist are selected
394
+ cols = lead_cols + [col for col in data.columns if col not in lead_cols]
395
  st.dataframe(data[cols])
396
+
397
+ # Optional evaluation if labels exist
398
  if 'label' in data.columns:
399
  y_true = data['label'].tolist()
400
  y_pred = [int(p) for p in predictions]
401
+
402
  from sklearn.metrics import f1_score, precision_score, recall_score, classification_report
403
  f1 = f1_score(y_true, y_pred, average='weighted')
404
  precision = precision_score(y_true, y_pred, average='weighted')
405
  recall = recall_score(y_true, y_pred, average='weighted')
406
  report = classification_report(y_true, y_pred)
407
+
408
  st.markdown("### Evaluation Metrics")
409
  st.write("F1 Score:", f1)
410
  st.write("Precision:", precision)
411
  st.write("Recall:", recall)
412
  st.text(report)
413
+
414
  elif page == "About":
415
  st.title("About the Bot Detection System")
416
  st.markdown("""
 
423
  """, unsafe_allow_html=True)
424
  st.markdown("### 🔑 Key Features Analyzed")
425
  col1, col2 = st.columns(2)
426
+
427
  with col1:
428
  st.markdown("""
429
  #### Account Characteristics
 
431
  - Account age and verification status
432
  - Username patterns
433
  - Profile description analysis
434
+
435
  #### Behavioral Patterns
436
  - Posting frequency
437
  - Engagement rates
 
444
  - Follower-following ratio
445
  - Friend acquisition rate
446
  - Network growth patterns
447
+
448
  #### Content Analysis
449
  - Tweet sentiment
450
  - Language patterns
451
  - URL sharing frequency
452
  - Hashtag usage
453
  """)
454
+
455
  st.markdown("""
456
  <div class='info-box'>
457
  <h3>⚙ Technical Implementation</h3>
 
463
  </ul>
464
  </div>
465
  """, unsafe_allow_html=True)
466
+
467
  st.markdown("### 📊 System Performance")
468
  metrics_col1, metrics_col2, metrics_col3, metrics_col4 = st.columns(4)
469
+
470
  with metrics_col1:
471
  st.metric("Accuracy", "87%")
472
  with metrics_col2:
 
475
  st.metric("Recall", "83%")
476
  with metrics_col4:
477
  st.metric("F1 Score", "86%")
478
+
479
  st.markdown("""
480
  ### 🎯 Common Use Cases
481
  - *Social Media Management*: Identify and remove bot accounts
 
483
  - *Marketing*: Verify authentic engagement
484
  - *Security*: Protect against automated threats
485
  """)
486
+
487
  else: # Statistics page
488
  st.title("System Statistics")
489
  col1, col2 = st.columns(2)
490
+
491
  with col1:
492
  detection_data = {
493
  'Category': ['Bots', 'Humans'],
494
  'Count': [737, 826]
495
  }
496
+ fig = px.pie(
497
+ detection_data,
498
+ values='Count',
499
+ names='Category',
500
+ title='Detection Distribution',
501
+ color_discrete_sequence=['#FF4B4B', '#00CC96']
502
+ )
503
  st.plotly_chart(fig, use_container_width=True)
504
+
505
  with col2:
506
  confidence_data = {
507
+ 'Score': ['90-100%', '80-90%', '70-80%', '60-70%', '50-60%'],
508
+ 'Count': [178, 447, 503, 352, 83]
509
  }
510
+ fig = px.bar(
511
+ confidence_data,
512
+ x='Score',
513
+ y='Count',
514
+ title='Confidence Score Distribution',
515
+ color='Count',
516
+ color_continuous_scale='Viridis'
517
+ )
518
  st.plotly_chart(fig, use_container_width=True)
519
+
520
  st.markdown("### Monthly Detection Trends")
521
  monthly_data = {
522
  'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
523
  'Bots Detected': [45, 52, 38, 65, 48, 76],
524
  'Accuracy': [92, 94, 93, 95, 94, 96]
525
  }
526
+ fig = px.line(
527
+ monthly_data,
528
+ x='Month',
529
+ y=['Bots Detected', 'Accuracy'],
530
+ title='Monthly Performance Metrics',
531
+ markers=True
532
+ )
533
  st.plotly_chart(fig, use_container_width=True)
534
+
535
  st.markdown("### Key System Metrics")
536
  metric_col1, metric_col2, metric_col3, metric_col4 = st.columns(4)
537
+
538
  with metric_col1:
539
  st.metric("Total Analyses", "1,000", "+12%")
540
  with metric_col2:
 
544
  with metric_col4:
545
  st.metric("Processing Time", "1.2s", "-0.3s")
546
 
547
+
548
  if __name__ == "__main__":
549
+ main()
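The per-row lookups in the CSV Analysis branch above can be gathered into one helper. The following is a minimal sketch, not the committed code: `normalize_row`, `_first_present`, `_is_missing`, and `_as_flag` are hypothetical names, and the helper works on a plain dict so it runs without pandas. It assumes that NaN or empty cells should fall back to 0, which the committed loop does only for missing columns, and that `FEATURE_ORDER` matches `MODEL_FEATURES`.

```python
import math

# Column order mirrors the dict built in the CSV loop; assumed to match MODEL_FEATURES.
FEATURE_ORDER = [
    "followers_count", "friends_count", "listedcount", "favourites_count",
    "statuses_count", "verified", "default_profile", "default_profile_image",
    "has_extended_profile", "follow_ratio", "account_age_days",
]


def _is_missing(value):
    """True for None, NaN, or an empty string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def _first_present(row, names, default=0):
    """Return the first non-missing value among several accepted column names."""
    for name in names:
        if name in row and not _is_missing(row[name]):
            return row[name]
    return default


def _as_flag(value):
    """Coerce booleans, 0/1 numbers, and strings such as 'True' to 0 or 1."""
    if isinstance(value, str):
        return 1 if value.strip().lower() in {"1", "true", "yes"} else 0
    return int(bool(value))


def normalize_row(row):
    """Build the 11 model inputs from one row (hypothetical helper)."""
    followers = float(_first_present(row, ["followers_count"]))
    friends = float(_first_present(row, ["friends_count"]))
    out = {
        "followers_count": followers,
        "friends_count": friends,
        "listedcount": float(_first_present(row, ["listedcount", "listed_count"])),
        "favourites_count": float(_first_present(row, ["favourites_count", "favorites_count"])),
        "statuses_count": float(_first_present(row, ["statuses_count"])),
        "account_age_days": float(_first_present(row, ["account_age_days", "account_age (days)"])),
        # Same formula as the notebook: followers / (friends + 1).
        "follow_ratio": float(_first_present(row, ["follow_ratio"], followers / (friends + 1))),
    }
    for flag in ("verified", "default_profile", "default_profile_image", "has_extended_profile"):
        out[flag] = _as_flag(_first_present(row, [flag]))
    return {name: out[name] for name in FEATURE_ORDER}
```

Inside the loop, a pandas row could be passed as `normalize_row(row.to_dict())` and the result wrapped in a one-row DataFrame. Whether NaN should map to 0 is a modelling choice; it is used here only because the notebook fills missing values with 0 before training.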
bot-detection-model.ipynb CHANGED
@@ -1 +1,314 @@
1
- {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.12.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":14497523,"sourceType":"datasetVersion","datasetId":9259817}],"dockerImageVersionId":31234,"isInternetEnabled":true,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import pandas as pd\nimport numpy as np\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.metrics import accuracy_score, classification_report","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:42:30.467530Z","iopub.execute_input":"2026-01-16T03:42:30.469065Z","iopub.status.idle":"2026-01-16T03:42:30.474262Z","shell.execute_reply.started":"2026-01-16T03:42:30.468918Z","shell.execute_reply":"2026-01-16T03:42:30.473090Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"# DATA_PATH = \"/kaggle/input/bot-detection-data/bot_detection_data.csv\"\nDATA_PATH = \"/kaggle/input/bot-detection-data/training_data.csv\"\n\ndf = 
pd.read_csv(DATA_PATH)\nprint(df.shape)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:42:44.598005Z","iopub.execute_input":"2026-01-16T03:42:44.598336Z","iopub.status.idle":"2026-01-16T03:42:44.666341Z","shell.execute_reply.started":"2026-01-16T03:42:44.598308Z","shell.execute_reply":"2026-01-16T03:42:44.665147Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"df.head()","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:42:50.039522Z","iopub.execute_input":"2026-01-16T03:42:50.039918Z","iopub.status.idle":"2026-01-16T03:42:50.059844Z","shell.execute_reply.started":"2026-01-16T03:42:50.039876Z","shell.execute_reply":"2026-01-16T03:42:50.058651Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"FEATURES = [\n \"followers_count\",\n \"friends_count\",\n \"listedcount\",\n \"favourites_count\",\n \"statuses_count\",\n \"verified\",\n \"default_profile\",\n \"default_profile_image\",\n \"has_extended_profile\"\n]\n\nX = df[FEATURES].fillna(0)\ny = df[\"bot\"]","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:43:06.916688Z","iopub.execute_input":"2026-01-16T03:43:06.917403Z","iopub.status.idle":"2026-01-16T03:43:06.924961Z","shell.execute_reply.started":"2026-01-16T03:43:06.917366Z","shell.execute_reply":"2026-01-16T03:43:06.924063Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"bool_cols = [\n \"verified\",\n \"default_profile\",\n \"default_profile_image\",\n \"has_extended_profile\"\n]\n\nfor col in bool_cols:\n X[col] = X[col].astype(int)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:43:16.182880Z","iopub.execute_input":"2026-01-16T03:43:16.183239Z","iopub.status.idle":"2026-01-16T03:43:16.189999Z","shell.execute_reply.started":"2026-01-16T03:43:16.183210Z","shell.execute_reply":"2026-01-16T03:43:16.188760Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"X[\"follow_ratio\"] 
= X[\"followers_count\"] / (X[\"friends_count\"] + 1)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:43:52.115333Z","iopub.execute_input":"2026-01-16T03:43:52.115697Z","iopub.status.idle":"2026-01-16T03:43:52.121777Z","shell.execute_reply.started":"2026-01-16T03:43:52.115666Z","shell.execute_reply":"2026-01-16T03:43:52.120660Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"df[\"created_at\"] = pd.to_datetime(df[\"created_at\"], errors=\"coerce\")\n\nX[\"account_age_days\"] = (\n pd.Timestamp.now() - df[\"created_at\"]\n).dt.days.fillna(0)\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:57.764874Z","iopub.execute_input":"2026-01-16T03:38:57.765197Z","iopub.status.idle":"2026-01-16T03:38:57.794042Z","shell.execute_reply.started":"2026-01-16T03:38:57.765161Z","shell.execute_reply":"2026-01-16T03:38:57.793068Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"from sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(\n X,\n y,\n test_size=0.2,\n random_state=42\n)\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:57.795084Z","iopub.execute_input":"2026-01-16T03:38:57.795374Z","iopub.status.idle":"2026-01-16T03:38:57.817354Z","shell.execute_reply.started":"2026-01-16T03:38:57.795348Z","shell.execute_reply":"2026-01-16T03:38:57.816386Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"from sklearn.ensemble import RandomForestClassifier\n\nrf = RandomForestClassifier(\n n_estimators=300,\n max_depth=20,\n min_samples_leaf=2,\n class_weight=\"balanced\",\n random_state=42,\n n_jobs=-1\n)\n\nrf.fit(X_train, 
y_train)\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:57.818519Z","iopub.execute_input":"2026-01-16T03:38:57.818883Z","iopub.status.idle":"2026-01-16T03:38:59.208010Z","shell.execute_reply.started":"2026-01-16T03:38:57.818853Z","shell.execute_reply":"2026-01-16T03:38:59.207044Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"preds = rf.predict(X_test)\n\nprint(\"Accuracy:\", accuracy_score(y_test, preds))\nprint(classification_report(y_test, preds))\n","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:59.209455Z","iopub.execute_input":"2026-01-16T03:38:59.210120Z","iopub.status.idle":"2026-01-16T03:38:59.361078Z","shell.execute_reply.started":"2026-01-16T03:38:59.210087Z","shell.execute_reply":"2026-01-16T03:38:59.360209Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"imp = pd.DataFrame({\n \"feature\": X.columns,\n \"importance\": rf.feature_importances_\n}).sort_values(by=\"importance\", ascending=False)\n\nprint(imp)","metadata":{"trusted":true,"execution":{"iopub.status.busy":"2026-01-16T03:38:59.363334Z","iopub.execute_input":"2026-01-16T03:38:59.363663Z","iopub.status.idle":"2026-01-16T03:38:59.445148Z","shell.execute_reply.started":"2026-01-16T03:38:59.363633Z","shell.execute_reply":"2026-01-16T03:38:59.444321Z"}},"outputs":[],"execution_count":null},{"cell_type":"code","source":"","metadata":{"trusted":true},"outputs":[],"execution_count":null}]}
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2026-01-16T03:42:30.469065Z",
9
+ "iopub.status.busy": "2026-01-16T03:42:30.467530Z",
10
+ "iopub.status.idle": "2026-01-16T03:42:30.474262Z",
11
+ "shell.execute_reply": "2026-01-16T03:42:30.473090Z",
12
+ "shell.execute_reply.started": "2026-01-16T03:42:30.468918Z"
13
+ },
14
+ "trusted": true
15
+ },
16
+ "outputs": [],
17
+ "source": [
18
+ "import pandas as pd\n",
19
+ "import numpy as np\n",
20
+ "\n",
21
+ "from sklearn.model_selection import train_test_split\n",
22
+ "from sklearn.preprocessing import StandardScaler\n",
23
+ "from sklearn.metrics import accuracy_score, classification_report"
24
+ ]
25
+ },
26
+ {
27
+ "cell_type": "code",
28
+ "execution_count": null,
29
+ "metadata": {
30
+ "execution": {
31
+ "iopub.execute_input": "2026-01-16T03:42:44.598336Z",
32
+ "iopub.status.busy": "2026-01-16T03:42:44.598005Z",
33
+ "iopub.status.idle": "2026-01-16T03:42:44.666341Z",
34
+ "shell.execute_reply": "2026-01-16T03:42:44.665147Z",
35
+ "shell.execute_reply.started": "2026-01-16T03:42:44.598308Z"
36
+ },
37
+ "trusted": true
38
+ },
39
+ "outputs": [],
40
+ "source": [
41
+ "# DATA_PATH = \"/kaggle/input/bot-detection-data/bot_detection_data.csv\"\n",
42
+ "DATA_PATH = \"/kaggle/input/bot-detection-data/training_data.csv\"\n",
43
+ "\n",
44
+ "df = pd.read_csv(DATA_PATH)\n",
45
+ "print(df.shape)"
46
+ ]
47
+ },
48
+ {
49
+ "cell_type": "code",
50
+ "execution_count": null,
51
+ "metadata": {},
52
+ "outputs": [],
53
+ "source": []
54
+ },
55
+ {
56
+ "cell_type": "code",
57
+ "execution_count": null,
58
+ "metadata": {
59
+ "execution": {
60
+ "iopub.execute_input": "2026-01-16T03:42:50.039918Z",
61
+ "iopub.status.busy": "2026-01-16T03:42:50.039522Z",
62
+ "iopub.status.idle": "2026-01-16T03:42:50.059844Z",
63
+ "shell.execute_reply": "2026-01-16T03:42:50.058651Z",
64
+ "shell.execute_reply.started": "2026-01-16T03:42:50.039876Z"
65
+ },
66
+ "trusted": true
67
+ },
68
+ "outputs": [],
69
+ "source": [
70
+ "df.head()"
71
+ ]
72
+ },
73
+ {
74
+ "cell_type": "code",
75
+ "execution_count": null,
76
+ "metadata": {
77
+ "execution": {
78
+ "iopub.execute_input": "2026-01-16T03:43:06.917403Z",
79
+ "iopub.status.busy": "2026-01-16T03:43:06.916688Z",
80
+ "iopub.status.idle": "2026-01-16T03:43:06.924961Z",
81
+ "shell.execute_reply": "2026-01-16T03:43:06.924063Z",
82
+ "shell.execute_reply.started": "2026-01-16T03:43:06.917366Z"
83
+ },
84
+ "trusted": true
85
+ },
86
+ "outputs": [],
87
+ "source": [
88
+ "FEATURES = [\n",
89
+ " \"followers_count\",\n",
90
+ " \"friends_count\",\n",
91
+ " \"listedcount\",\n",
92
+ " \"favourites_count\",\n",
93
+ " \"statuses_count\",\n",
94
+ " \"verified\",\n",
95
+ " \"default_profile\",\n",
96
+ " \"default_profile_image\",\n",
97
+ " \"has_extended_profile\"\n",
98
+ "]\n",
99
+ "\n",
100
+ "X = df[FEATURES].fillna(0)\n",
101
+ "y = df[\"bot\"]"
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "code",
106
+ "execution_count": null,
107
+ "metadata": {
108
+ "execution": {
109
+ "iopub.execute_input": "2026-01-16T03:43:16.183239Z",
110
+ "iopub.status.busy": "2026-01-16T03:43:16.182880Z",
111
+ "iopub.status.idle": "2026-01-16T03:43:16.189999Z",
112
+ "shell.execute_reply": "2026-01-16T03:43:16.188760Z",
113
+ "shell.execute_reply.started": "2026-01-16T03:43:16.183210Z"
114
+ },
115
+ "trusted": true
116
+ },
117
+ "outputs": [],
118
+ "source": [
119
+ "bool_cols = [\n",
120
+ " \"verified\",\n",
121
+ " \"default_profile\",\n",
122
+ " \"default_profile_image\",\n",
123
+ " \"has_extended_profile\"\n",
124
+ "]\n",
125
+ "\n",
126
+ "for col in bool_cols:\n",
127
+ " X[col] = X[col].astype(int)"
128
+ ]
129
+ },
130
+ {
131
+ "cell_type": "code",
132
+ "execution_count": null,
133
+ "metadata": {
134
+ "execution": {
135
+ "iopub.execute_input": "2026-01-16T03:43:52.115697Z",
136
+ "iopub.status.busy": "2026-01-16T03:43:52.115333Z",
137
+ "iopub.status.idle": "2026-01-16T03:43:52.121777Z",
138
+ "shell.execute_reply": "2026-01-16T03:43:52.120660Z",
139
+ "shell.execute_reply.started": "2026-01-16T03:43:52.115666Z"
140
+ },
141
+ "trusted": true
142
+ },
143
+ "outputs": [],
144
+ "source": [
145
+ "X[\"follow_ratio\"] = X[\"followers_count\"] / (X[\"friends_count\"] + 1)"
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "code",
150
+ "execution_count": null,
151
+ "metadata": {
152
+ "execution": {
153
+ "iopub.execute_input": "2026-01-16T03:38:57.765197Z",
154
+ "iopub.status.busy": "2026-01-16T03:38:57.764874Z",
155
+ "iopub.status.idle": "2026-01-16T03:38:57.794042Z",
156
+ "shell.execute_reply": "2026-01-16T03:38:57.793068Z",
157
+ "shell.execute_reply.started": "2026-01-16T03:38:57.765161Z"
158
+ },
159
+ "trusted": true
160
+ },
161
+ "outputs": [],
162
+ "source": [
163
+ "df[\"created_at\"] = pd.to_datetime(df[\"created_at\"], errors=\"coerce\")\n",
164
+ "\n",
165
+ "X[\"account_age_days\"] = (\n",
166
+ " pd.Timestamp.now() - df[\"created_at\"]\n",
167
+ ").dt.days.fillna(0)\n"
168
+ ]
169
+ },
170
+ {
171
+ "cell_type": "code",
172
+ "execution_count": null,
173
+ "metadata": {
174
+ "execution": {
175
+ "iopub.execute_input": "2026-01-16T03:38:57.795374Z",
176
+ "iopub.status.busy": "2026-01-16T03:38:57.795084Z",
177
+ "iopub.status.idle": "2026-01-16T03:38:57.817354Z",
178
+ "shell.execute_reply": "2026-01-16T03:38:57.816386Z",
179
+ "shell.execute_reply.started": "2026-01-16T03:38:57.795348Z"
180
+ },
181
+ "trusted": true
182
+ },
183
+ "outputs": [],
184
+ "source": [
185
+ "from sklearn.model_selection import train_test_split\n",
186
+ "\n",
187
+ "X_train, X_test, y_train, y_test = train_test_split(\n",
188
+ " X,\n",
189
+ " y,\n",
190
+ " test_size=0.2,\n",
191
+ " random_state=42\n",
192
+ ")\n"
193
+ ]
194
+ },
195
+ {
196
+ "cell_type": "code",
197
+ "execution_count": null,
198
+ "metadata": {
199
+ "execution": {
200
+ "iopub.execute_input": "2026-01-16T03:38:57.818883Z",
201
+ "iopub.status.busy": "2026-01-16T03:38:57.818519Z",
202
+ "iopub.status.idle": "2026-01-16T03:38:59.208010Z",
203
+ "shell.execute_reply": "2026-01-16T03:38:59.207044Z",
204
+ "shell.execute_reply.started": "2026-01-16T03:38:57.818853Z"
205
+ },
206
+ "trusted": true
207
+ },
208
+ "outputs": [],
209
+ "source": [
210
+ "from sklearn.ensemble import RandomForestClassifier\n",
211
+ "\n",
212
+ "rf = RandomForestClassifier(\n",
213
+ " n_estimators=300,\n",
214
+ " max_depth=20,\n",
215
+ " min_samples_leaf=2,\n",
216
+ " class_weight=\"balanced\",\n",
217
+ " random_state=42,\n",
218
+ " n_jobs=-1\n",
219
+ ")\n",
220
+ "\n",
221
+ "rf.fit(X_train, y_train)\n"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "code",
226
+ "execution_count": null,
227
+ "metadata": {
228
+ "execution": {
229
+ "iopub.execute_input": "2026-01-16T03:38:59.210120Z",
230
+ "iopub.status.busy": "2026-01-16T03:38:59.209455Z",
231
+ "iopub.status.idle": "2026-01-16T03:38:59.361078Z",
232
+ "shell.execute_reply": "2026-01-16T03:38:59.360209Z",
233
+ "shell.execute_reply.started": "2026-01-16T03:38:59.210087Z"
234
+ },
235
+ "trusted": true
236
+ },
237
+ "outputs": [],
238
+ "source": [
239
+ "preds = rf.predict(X_test)\n",
240
+ "\n",
241
+ "print(\"Accuracy:\", accuracy_score(y_test, preds))\n",
242
+ "print(classification_report(y_test, preds))\n"
243
+ ]
244
+ },
245
+ {
246
+ "cell_type": "code",
247
+ "execution_count": null,
248
+ "metadata": {
249
+ "execution": {
250
+ "iopub.execute_input": "2026-01-16T03:38:59.363663Z",
251
+ "iopub.status.busy": "2026-01-16T03:38:59.363334Z",
252
+ "iopub.status.idle": "2026-01-16T03:38:59.445148Z",
253
+ "shell.execute_reply": "2026-01-16T03:38:59.444321Z",
254
+ "shell.execute_reply.started": "2026-01-16T03:38:59.363633Z"
255
+ },
256
+ "trusted": true
257
+ },
258
+ "outputs": [],
259
+ "source": [
260
+ "imp = pd.DataFrame({\n",
261
+ " \"feature\": X.columns,\n",
262
+ " \"importance\": rf.feature_importances_\n",
263
+ "}).sort_values(by=\"importance\", ascending=False)\n",
264
+ "\n",
265
+ "print(imp)"
266
+ ]
267
+ },
268
+ {
269
+ "cell_type": "code",
270
+ "execution_count": null,
271
+ "metadata": {
272
+ "trusted": true
273
+ },
274
+ "outputs": [],
275
+ "source": []
276
+ }
277
+ ],
278
+ "metadata": {
279
+ "kaggle": {
280
+ "accelerator": "none",
281
+ "dataSources": [
282
+ {
283
+ "datasetId": 9259817,
284
+ "sourceId": 14497523,
285
+ "sourceType": "datasetVersion"
286
+ }
287
+ ],
288
+ "dockerImageVersionId": 31234,
289
+ "isGpuEnabled": false,
290
+ "isInternetEnabled": true,
291
+ "language": "python",
292
+ "sourceType": "notebook"
293
+ },
294
+ "kernelspec": {
295
+ "display_name": "Python 3",
296
+ "language": "python",
297
+ "name": "python3"
298
+ },
299
+ "language_info": {
300
+ "codemirror_mode": {
301
+ "name": "ipython",
302
+ "version": 3
303
+ },
304
+ "file_extension": ".py",
305
+ "mimetype": "text/x-python",
306
+ "name": "python",
307
+ "nbconvert_exporter": "python",
308
+ "pygments_lexer": "ipython3",
309
+ "version": "3.12.12"
310
+ }
311
+ },
312
+ "nbformat": 4,
313
+ "nbformat_minor": 4
314
+ }
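The `account_age_days` cell in the notebook subtracts `created_at` from `pd.Timestamp.now()`, so the feature value depends on the day the notebook is run. A minimal sketch of one alternative, pinning a reference date, is shown below. `REFERENCE_DATE` and `account_age_days` are illustrative names, the date is an assumption chosen to match the execution timestamps in the notebook metadata, and the ISO-8601 input format is assumed rather than taken from the dataset.

```python
from datetime import datetime, timezone

# Assumed snapshot date; any fixed value keeps the feature reproducible.
REFERENCE_DATE = datetime(2026, 1, 16, tzinfo=timezone.utc)


def account_age_days(created_at, reference=REFERENCE_DATE):
    """Whole days between an ISO-8601 creation timestamp and the reference date.

    Returns 0 for unparseable or missing input, mirroring the notebook's
    errors="coerce" followed by fillna(0).
    """
    try:
        created = datetime.fromisoformat(created_at)
    except (TypeError, ValueError):
        return 0
    if created.tzinfo is None:
        # Naive timestamps are treated as UTC (an assumption).
        created = created.replace(tzinfo=timezone.utc)
    return (reference - created).days
```

If the app computes account age at prediction time, using the same reference date in both places would keep training and inference inputs on the same scale.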
bot_detector_model.pkl → bot_model.joblib RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:4a0e804df2cdf0b0c78673cd4c64e0ea1e0d89f74ef678d672c1ff753cc9c92e
3
- size 433620210
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42ceefedd106c136212ada4eb5cb49325228010ebde56edc0b1379da44d23a95
3
+ size 4234857
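The renamed model is stored as a Git LFS pointer, which records only a SHA-256 digest and a byte size. Below is a small sketch of how a downloaded file could be checked against such a pointer using only the standard library. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of the Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.strip().partition(" ")
        if key:
            fields[key] = value
    algo, _, digest = fields.get("oid", "").partition(":")
    return {"algo": algo, "digest": digest, "size": int(fields.get("size", -1))}


def matches_pointer(data, pointer):
    """True if the bytes have the size and SHA-256 digest the pointer records."""
    if pointer["algo"] != "sha256" or len(data) != pointer["size"]:
        return False
    return hashlib.sha256(data).hexdigest() == pointer["digest"]
```

For a file of about 4 MB, reading the whole payload into memory before hashing is reasonable; a much larger file would be better hashed in chunks.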
requirements.txt CHANGED
@@ -1,10 +1,6 @@
1
  streamlit
2
- scikit-learn
3
  pandas
4
- numpy
5
- seaborn
6
- matplotlib
7
-
8
- torch
9
  plotly
10
- transformers
 
1
  streamlit
 
2
  pandas
3
+ numpy==1.26.4
4
+ scikit-learn==1.3.2
5
+ joblib==1.3.2
 
 
6
  plotly
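The new `requirements.txt` pins numpy, scikit-learn, and joblib, presumably so the serialized model loads under the versions it was saved with (the commit does not state the reason). Here is a hedged sketch of a startup check that compares installed versions with the pins. `parse_pins` and `find_mismatches` are illustrative names, and only simple `name==version` lines are handled.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Map package name -> pinned version (None when the line has no '==' pin)."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        pins[name.strip()] = version.strip() if sep else None
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed)} for pinned packages that differ or are missing."""
    problems = {}
    for name, pinned in pins.items():
        if pinned is None:
            continue
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            problems[name] = (pinned, installed)
    return problems
```

The `installed_version` parameter exists only so the lookup can be replaced in a test; in normal use the default `importlib.metadata.version` is called.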