honey234 committed on
Commit bfc366c · 1 Parent(s): 3976c3d
Files changed (2)
  1. CONTRIBUTING.md +0 -35
  2. main.ipynb +0 -1044
CONTRIBUTING.md DELETED
@@ -1,35 +0,0 @@
- # Contributing to selenium-twitter-scraper
-
- We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
-
- - Reporting a bug
- - Discussing the current state of the code
- - Submitting a fix
- - Proposing new features
- - Becoming a maintainer
-
- ## We Develop with GitHub
-
- We use GitHub to host code, to track issues and feature requests, as well as accept pull requests.
-
- ## We Use [GitHub Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
-
- Pull requests are the best way to propose changes to the codebase (we use [GitHub Flow](https://docs.github.com/en/get-started/quickstart/github-flow)). We actively welcome your pull requests:
-
- 1. Fork the repo and create your branch from `master`.
- 2. If you've added code that should be tested, add tests.
- 3. Ensure the test suite passes.
- 4. Make sure your code lints.
- 5. Issue that pull request!
-
- ## Any contributions you make will be under the Apache License, Version 2.0
-
- In short, when you submit code changes, your submissions are understood to be under the same [Apache License, Version 2.0](https://choosealicense.com/licenses/apache-2.0/) that covers the project. Feel free to contact the maintainers if that's a concern.
-
- ## Report bugs using GitHub's [issues](https://github.com/godkingjay/selenium-twitter-scraper/issues)
-
- We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/godkingjay/selenium-twitter-scraper/issues/new); it's that easy!
-
- ## License
-
- By contributing, you agree that your contributions will be licensed under the project's Apache License, Version 2.0.
main.ipynb DELETED
@@ -1,1044 +0,0 @@
- {
-  "cells": [
-   {
-    "attachments": {},
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "# Twitter Scraper using Selenium\n",
-     "\n",
-     "Scraper for Twitter Tweets using Selenium. It can scrape tweets from:\n",
-     "- Home/New Feeds\n",
-     "- User Profile Tweets\n",
-     "- Query or Search Tweets\n",
-     "- Hashtag Tweets\n",
-     "- Advanced Search Tweets"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "import os\n",
-     "import sys\n",
-     "import pandas as pd\n",
-     "\n",
-     "from datetime import datetime\n",
-     "from fake_headers import Headers\n",
-     "from time import sleep\n",
-     "from selenium import webdriver\n",
-     "from selenium.webdriver import Chrome\n",
-     "from selenium.webdriver.common.keys import Keys\n",
-     "from selenium.common.exceptions import (\n",
-     "    NoSuchElementException,\n",
-     "    StaleElementReferenceException,\n",
-     "    WebDriverException,\n",
-     ")\n",
-     "from selenium.webdriver.common.action_chains import ActionChains\n",
-     "\n",
-     "from selenium.webdriver.chrome.webdriver import WebDriver\n",
-     "from selenium.webdriver.chrome.options import Options as ChromeOptions\n",
-     "from selenium.webdriver.chrome.service import Service as ChromeService\n",
-     "\n",
-     "from webdriver_manager.chrome import ChromeDriverManager"
-    ]
-   },
-   {
-    "attachments": {},
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "# Progress Class\n",
-     "\n",
-     "Class for tracking the progress of a scraper instance."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "class Progress:\n",
-     "    def __init__(self, current, total) -> None:\n",
-     "        self.current = current\n",
-     "        self.total = total\n",
-     "        pass\n",
-     "\n",
-     "    def print_progress(self, current) -> None:\n",
-     "        self.current = current\n",
-     "        progress = current / self.total\n",
-     "        bar_length = 40\n",
-     "        progress_bar = (\n",
-     "            \"[\"\n",
-     "            + \"=\" * int(bar_length * progress)\n",
-     "            + \"-\" * (bar_length - int(bar_length * progress))\n",
-     "            + \"]\"\n",
-     "        )\n",
-     "        sys.stdout.write(\n",
-     "            \"\\rProgress: [{:<40}] {:.2%} {} of {}\".format(\n",
-     "                progress_bar, progress, current, self.total\n",
-     "            )\n",
-     "        )\n",
-     "        sys.stdout.flush()\n"
-    ]
-   },
-   {
-    "attachments": {},
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "# Scroller Class\n",
-     "\n",
-     "Class for controlling scrolling of the web page."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "class Scroller:\n",
-     "    def __init__(self, driver) -> None:\n",
-     "        self.driver = driver\n",
-     "        self.current_position = 0\n",
-     "        self.last_position = driver.execute_script(\"return window.pageYOffset;\")\n",
-     "        self.scrolling = True\n",
-     "        self.scroll_count = 0\n",
-     "        pass\n",
-     "\n",
-     "    def reset(self) -> None:\n",
-     "        self.current_position = 0\n",
-     "        self.last_position = self.driver.execute_script(\"return window.pageYOffset;\")\n",
-     "        self.scroll_count = 0\n",
-     "        pass\n",
-     "\n",
-     "    def scroll_to_top(self) -> None:\n",
-     "        self.driver.execute_script(\"window.scrollTo(0, 0);\")\n",
-     "        pass\n",
-     "\n",
-     "    def scroll_to_bottom(self) -> None:\n",
-     "        self.driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n",
-     "        pass\n",
-     "\n",
-     "    def update_scroll_position(self) -> None:\n",
-     "        self.current_position = self.driver.execute_script(\"return window.pageYOffset;\")\n",
-     "        pass\n"
-    ]
-   },
-   {
-    "attachments": {},
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "# Tweet Class\n",
-     "\n",
-     "Object representing a tweet and its data."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "class Tweet:\n",
-     "    def __init__(\n",
-     "        self,\n",
-     "        card: WebDriver,\n",
-     "        driver: WebDriver,\n",
-     "        actions: ActionChains,\n",
-     "        scrape_poster_details=False\n",
-     "    ) -> None:\n",
-     "        self.card = card\n",
-     "        self.error = False\n",
-     "        self.tweet = None\n",
-     "\n",
-     "        try:\n",
-     "            self.user = card.find_element(\n",
-     "                \"xpath\", './/div[@data-testid=\"User-Name\"]//span'\n",
-     "            ).text\n",
-     "        except NoSuchElementException:\n",
-     "            self.error = True\n",
-     "            self.user = \"skip\"\n",
-     "\n",
-     "        try:\n",
-     "            self.handle = card.find_element(\n",
-     "                \"xpath\", './/span[contains(text(), \"@\")]'\n",
-     "            ).text\n",
-     "        except NoSuchElementException:\n",
-     "            self.error = True\n",
-     "            self.handle = \"skip\"\n",
-     "\n",
-     "        try:\n",
-     "            self.date_time = card.find_element(\"xpath\", \".//time\").get_attribute(\n",
-     "                \"datetime\"\n",
-     "            )\n",
-     "\n",
-     "            if self.date_time is not None:\n",
-     "                self.is_ad = False\n",
-     "        except NoSuchElementException:\n",
-     "            self.is_ad = True\n",
-     "            self.error = True\n",
-     "            self.date_time = \"skip\"\n",
-     "\n",
-     "        if self.error:\n",
-     "            return\n",
-     "\n",
-     "        try:\n",
-     "            card.find_element(\n",
-     "                \"xpath\", './/*[local-name()=\"svg\" and @data-testid=\"icon-verified\"]'\n",
-     "            )\n",
-     "\n",
-     "            self.verified = True\n",
-     "        except NoSuchElementException:\n",
-     "            self.verified = False\n",
-     "\n",
-     "        self.content = \"\"\n",
-     "        contents = card.find_elements(\n",
-     "            \"xpath\",\n",
-     "            '(.//div[@data-testid=\"tweetText\"])[1]/span | (.//div[@data-testid=\"tweetText\"])[1]/a',\n",
-     "        )\n",
-     "\n",
-     "        for index, content in enumerate(contents):\n",
-     "            self.content += content.text\n",
-     "\n",
-     "        try:\n",
-     "            self.reply_cnt = card.find_element(\n",
-     "                \"xpath\", './/div[@data-testid=\"reply\"]//span'\n",
-     "            ).text\n",
-     "\n",
-     "            if self.reply_cnt == \"\":\n",
-     "                self.reply_cnt = \"0\"\n",
-     "        except NoSuchElementException:\n",
-     "            self.reply_cnt = \"0\"\n",
-     "\n",
-     "        try:\n",
-     "            self.retweet_cnt = card.find_element(\n",
-     "                \"xpath\", './/div[@data-testid=\"retweet\"]//span'\n",
-     "            ).text\n",
-     "\n",
-     "            if self.retweet_cnt == \"\":\n",
-     "                self.retweet_cnt = \"0\"\n",
-     "        except NoSuchElementException:\n",
-     "            self.retweet_cnt = \"0\"\n",
-     "\n",
-     "        try:\n",
-     "            self.like_cnt = card.find_element(\n",
-     "                \"xpath\", './/div[@data-testid=\"like\"]//span'\n",
-     "            ).text\n",
-     "\n",
-     "            if self.like_cnt == \"\":\n",
-     "                self.like_cnt = \"0\"\n",
-     "        except NoSuchElementException:\n",
-     "            self.like_cnt = \"0\"\n",
-     "\n",
-     "        try:\n",
-     "            self.analytics_cnt = card.find_element(\n",
-     "                \"xpath\", './/a[contains(@href, \"/analytics\")]//span'\n",
-     "            ).text\n",
-     "\n",
-     "            if self.analytics_cnt == \"\":\n",
-     "                self.analytics_cnt = \"0\"\n",
-     "        except NoSuchElementException:\n",
-     "            self.analytics_cnt = \"0\"\n",
-     "\n",
-     "        try:\n",
-     "            self.tags = card.find_elements(\n",
-     "                \"xpath\",\n",
-     "                './/a[contains(@href, \"src=hashtag_click\")]',\n",
-     "            )\n",
-     "\n",
-     "            self.tags = [tag.text for tag in self.tags]\n",
-     "        except NoSuchElementException:\n",
-     "            self.tags = []\n",
-     "\n",
-     "        try:\n",
-     "            self.mentions = card.find_elements(\n",
-     "                \"xpath\",\n",
-     "                '(.//div[@data-testid=\"tweetText\"])[1]//a[contains(text(), \"@\")]',\n",
-     "            )\n",
-     "\n",
-     "            self.mentions = [mention.text for mention in self.mentions]\n",
-     "        except NoSuchElementException:\n",
-     "            self.mentions = []\n",
-     "\n",
-     "        try:\n",
-     "            raw_emojis = card.find_elements(\n",
-     "                \"xpath\",\n",
-     "                '(.//div[@data-testid=\"tweetText\"])[1]/img[contains(@src, \"emoji\")]',\n",
-     "            )\n",
-     "\n",
-     "            self.emojis = [emoji.get_attribute(\"alt\").encode(\"unicode-escape\").decode(\"ASCII\") for emoji in raw_emojis]\n",
-     "        except NoSuchElementException:\n",
-     "            self.emojis = []\n",
-     "\n",
-     "        try:\n",
-     "            self.profile_img = card.find_element(\n",
-     "                \"xpath\", './/div[@data-testid=\"Tweet-User-Avatar\"]//img'\n",
-     "            ).get_attribute(\"src\")\n",
-     "        except NoSuchElementException:\n",
-     "            self.profile_img = \"\"\n",
-     "\n",
-     "        try:\n",
-     "            self.tweet_link = self.card.find_element(\n",
-     "                \"xpath\",\n",
-     "                \".//a[contains(@href, '/status/')]\",\n",
-     "            ).get_attribute(\"href\")\n",
-     "            self.tweet_id = str(self.tweet_link.split(\"/\")[-1])\n",
-     "        except NoSuchElementException:\n",
-     "            self.tweet_link = \"\"\n",
-     "            self.tweet_id = \"\"\n",
-     "\n",
-     "        self.following_cnt = \"0\"\n",
-     "        self.followers_cnt = \"0\"\n",
-     "        self.user_id = None\n",
-     "\n",
-     "        if scrape_poster_details:\n",
-     "            el_name = card.find_element(\n",
-     "                \"xpath\", './/div[@data-testid=\"User-Name\"]//span'\n",
-     "            )\n",
-     "\n",
-     "            ext_hover_card = False\n",
-     "            ext_user_id = False\n",
-     "            ext_following = False\n",
-     "            ext_followers = False\n",
-     "            hover_attempt = 0\n",
-     "\n",
-     "            while not ext_hover_card or not ext_user_id or not ext_following or not ext_followers:\n",
-     "                try:\n",
-     "                    actions.move_to_element(el_name).perform()\n",
-     "\n",
-     "                    hover_card = driver.find_element(\n",
-     "                        \"xpath\",\n",
-     "                        '//div[@data-testid=\"hoverCardParent\"]'\n",
-     "                    )\n",
-     "\n",
-     "                    ext_hover_card = True\n",
-     "\n",
-     "                    while not ext_user_id:\n",
-     "                        try:\n",
-     "                            raw_user_id = hover_card.find_element(\n",
-     "                                \"xpath\",\n",
-     "                                '(.//div[contains(@data-testid, \"-follow\")]) | (.//div[contains(@data-testid, \"-unfollow\")])'\n",
-     "                            ).get_attribute(\"data-testid\")\n",
-     "\n",
-     "                            if raw_user_id == \"\":\n",
-     "                                self.user_id = None\n",
-     "                            else:\n",
-     "                                self.user_id = str(raw_user_id.split(\"-\")[0])\n",
-     "\n",
-     "                            ext_user_id = True\n",
-     "                        except NoSuchElementException:\n",
-     "                            continue\n",
-     "                        except StaleElementReferenceException:\n",
-     "                            self.error = True\n",
-     "                            return\n",
-     "\n",
-     "                    while not ext_following:\n",
-     "                        try:\n",
-     "                            self.following_cnt = hover_card.find_element(\n",
-     "                                \"xpath\",\n",
-     "                                './/a[contains(@href, \"/following\")]//span'\n",
-     "                            ).text\n",
-     "\n",
-     "                            if self.following_cnt == \"\":\n",
-     "                                self.following_cnt = \"0\"\n",
-     "\n",
-     "                            ext_following = True\n",
-     "                        except NoSuchElementException:\n",
-     "                            continue\n",
-     "                        except StaleElementReferenceException:\n",
-     "                            self.error = True\n",
-     "                            return\n",
-     "\n",
-     "                    while not ext_followers:\n",
-     "                        try:\n",
-     "                            self.followers_cnt = hover_card.find_element(\n",
-     "                                \"xpath\",\n",
-     "                                './/a[contains(@href, \"/verified_followers\")]//span'\n",
-     "                            ).text\n",
-     "\n",
-     "                            if self.followers_cnt == \"\":\n",
-     "                                self.followers_cnt = \"0\"\n",
-     "\n",
-     "                            ext_followers = True\n",
-     "                        except NoSuchElementException:\n",
-     "                            continue\n",
-     "                        except StaleElementReferenceException:\n",
-     "                            self.error = True\n",
-     "                            return\n",
-     "                except NoSuchElementException:\n",
-     "                    if hover_attempt == 3:\n",
-     "                        self.error = True\n",
-     "                        return\n",
-     "                    hover_attempt += 1\n",
-     "                    sleep(0.5)\n",
-     "                    continue\n",
-     "                except StaleElementReferenceException:\n",
-     "                    self.error = True\n",
-     "                    return\n",
-     "\n",
-     "            if ext_hover_card and ext_following and ext_followers:\n",
-     "                actions.reset_actions()\n",
-     "\n",
-     "        self.tweet = (\n",
-     "            self.user,\n",
-     "            self.handle,\n",
-     "            self.date_time,\n",
-     "            self.verified,\n",
-     "            self.content,\n",
-     "            self.reply_cnt,\n",
-     "            self.retweet_cnt,\n",
-     "            self.like_cnt,\n",
-     "            self.analytics_cnt,\n",
-     "            self.tags,\n",
-     "            self.mentions,\n",
-     "            self.emojis,\n",
-     "            self.profile_img,\n",
-     "            self.tweet_link,\n",
-     "            self.tweet_id,\n",
-     "            self.user_id,\n",
-     "            self.following_cnt,\n",
-     "            self.followers_cnt,\n",
-     "        )\n",
-     "\n",
-     "        pass\n"
-    ]
-   },
-   {
-    "attachments": {},
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "# Twitter Scraper Class\n",
-     "\n",
-     "Class for the Twitter Scraper."
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "TWITTER_LOGIN_URL = \"https://twitter.com/i/flow/login\"\n",
-     "\n",
-     "class Twitter_Scraper:\n",
-     "    def __init__(\n",
-     "        self,\n",
-     "        username,\n",
-     "        password,\n",
-     "        max_tweets=50,\n",
-     "        scrape_username=None,\n",
-     "        scrape_hashtag=None,\n",
-     "        scrape_query=None,\n",
-     "        scrape_poster_details=False,\n",
-     "        scrape_latest=True,\n",
-     "        scrape_top=False,\n",
-     "    ):\n",
-     "        print(\"Initializing Twitter Scraper...\")\n",
-     "        self.username = username\n",
-     "        self.password = password\n",
-     "        self.interrupted = False\n",
-     "        self.tweet_ids = set()\n",
-     "        self.data = []\n",
-     "        self.tweet_cards = []\n",
-     "        self.scraper_details = {\n",
-     "            \"type\": None,\n",
-     "            \"username\": None,\n",
-     "            \"hashtag\": None,\n",
-     "            \"query\": None,\n",
-     "            \"tab\": None,\n",
-     "            \"poster_details\": False,\n",
-     "        }\n",
-     "        self.max_tweets = max_tweets\n",
-     "        self.progress = Progress(0, max_tweets)\n",
-     "        self.router = self.go_to_home\n",
-     "        self.driver = self._get_driver()\n",
-     "        self.actions = ActionChains(self.driver)\n",
-     "        self.scroller = Scroller(self.driver)\n",
-     "        self._config_scraper(\n",
-     "            max_tweets,\n",
-     "            scrape_username,\n",
-     "            scrape_hashtag,\n",
-     "            scrape_query,\n",
-     "            scrape_latest,\n",
-     "            scrape_top,\n",
-     "            scrape_poster_details,\n",
-     "        )\n",
-     "\n",
-     "    def _config_scraper(\n",
-     "        self,\n",
-     "        max_tweets=50,\n",
-     "        scrape_username=None,\n",
-     "        scrape_hashtag=None,\n",
-     "        scrape_query=None,\n",
-     "        scrape_latest=True,\n",
-     "        scrape_top=False,\n",
-     "        scrape_poster_details=False,\n",
-     "    ):\n",
-     "        self.tweet_ids = set()\n",
-     "        self.data = []\n",
-     "        self.tweet_cards = []\n",
-     "        self.max_tweets = max_tweets\n",
-     "        self.progress = Progress(0, max_tweets)\n",
-     "        self.scraper_details = {\n",
-     "            \"type\": None,\n",
-     "            \"username\": scrape_username,\n",
-     "            \"hashtag\": str(scrape_hashtag).replace(\"#\", \"\")\n",
-     "            if scrape_hashtag is not None\n",
-     "            else None,\n",
-     "            \"query\": scrape_query,\n",
-     "            \"tab\": \"Latest\" if scrape_latest else \"Top\" if scrape_top else \"Latest\",\n",
-     "            \"poster_details\": scrape_poster_details,\n",
-     "        }\n",
-     "        self.router = self.go_to_home\n",
-     "        self.scroller = Scroller(self.driver)\n",
-     "\n",
-     "        if scrape_username is not None:\n",
-     "            self.scraper_details[\"type\"] = \"Username\"\n",
-     "            self.router = self.go_to_profile\n",
-     "        elif scrape_hashtag is not None:\n",
-     "            self.scraper_details[\"type\"] = \"Hashtag\"\n",
-     "            self.router = self.go_to_hashtag\n",
-     "        elif scrape_query is not None:\n",
-     "            self.scraper_details[\"type\"] = \"Query\"\n",
-     "            self.router = self.go_to_search\n",
-     "        else:\n",
-     "            self.scraper_details[\"type\"] = \"Home\"\n",
-     "            self.router = self.go_to_home\n",
-     "        pass\n",
-     "\n",
-     "    def _get_driver(self):\n",
-     "        print(\"Setup WebDriver...\")\n",
-     "        header = Headers().generate()[\"User-Agent\"]\n",
-     "\n",
-     "        browser_option = ChromeOptions()\n",
-     "        browser_option.add_argument(\"--no-sandbox\")\n",
-     "        browser_option.add_argument(\"--disable-dev-shm-usage\")\n",
-     "        browser_option.add_argument(\"--ignore-certificate-errors\")\n",
-     "        browser_option.add_argument(\"--disable-gpu\")\n",
-     "        browser_option.add_argument(\"--log-level=3\")\n",
-     "        browser_option.add_argument(\"--disable-notifications\")\n",
-     "        browser_option.add_argument(\"--disable-popup-blocking\")\n",
-     "        browser_option.add_argument(\"--user-agent={}\".format(header))\n",
-     "\n",
-     "        # For Hiding Browser\n",
-     "        browser_option.add_argument(\"--headless\")\n",
-     "\n",
-     "        try:\n",
-     "            print(\"Initializing ChromeDriver...\")\n",
-     "            driver = webdriver.Chrome(\n",
-     "                options=browser_option,\n",
-     "            )\n",
-     "\n",
-     "            print(\"WebDriver Setup Complete\")\n",
-     "            return driver\n",
-     "        except WebDriverException:\n",
-     "            try:\n",
-     "                print(\"Downloading ChromeDriver...\")\n",
-     "                chromedriver_path = ChromeDriverManager().install()\n",
-     "                chrome_service = ChromeService(executable_path=chromedriver_path)\n",
-     "\n",
-     "                print(\"Initializing ChromeDriver...\")\n",
-     "                driver = webdriver.Chrome(\n",
-     "                    service=chrome_service,\n",
-     "                    options=browser_option,\n",
-     "                )\n",
-     "\n",
-     "                print(\"WebDriver Setup Complete\")\n",
-     "                return driver\n",
-     "            except Exception as e:\n",
-     "                print(f\"Error setting up WebDriver: {e}\")\n",
-     "                sys.exit(1)\n",
-     "        pass\n",
-     "\n",
-     "    def login(self):\n",
-     "        print()\n",
-     "        print(\"Logging in to Twitter...\")\n",
-     "\n",
-     "        try:\n",
-     "            self.driver.maximize_window()\n",
-     "            self.driver.get(TWITTER_LOGIN_URL)\n",
-     "            sleep(3)\n",
-     "\n",
-     "            self._input_username()\n",
-     "            self._input_unusual_activity()\n",
-     "            self._input_password()\n",
-     "\n",
-     "            cookies = self.driver.get_cookies()\n",
-     "\n",
-     "            auth_token = None\n",
-     "\n",
-     "            for cookie in cookies:\n",
-     "                if cookie[\"name\"] == \"auth_token\":\n",
-     "                    auth_token = cookie[\"value\"]\n",
-     "                    break\n",
-     "\n",
-     "            if auth_token is None:\n",
-     "                raise ValueError(\n",
-     "                    \"\"\"This may be due to the following:\n",
-     "\n",
-     "- Internet connection is unstable\n",
-     "- Username is incorrect\n",
-     "- Password is incorrect\n",
-     "\"\"\"\n",
-     "                )\n",
-     "\n",
-     "            print()\n",
-     "            print(\"Login Successful\")\n",
-     "            print()\n",
-     "        except Exception as e:\n",
-     "            print()\n",
-     "            print(f\"Login Failed: {e}\")\n",
-     "            sys.exit(1)\n",
-     "\n",
-     "        pass\n",
-     "\n",
-     "    def _input_username(self):\n",
-     "        input_attempt = 0\n",
-     "\n",
-     "        while True:\n",
-     "            try:\n",
-     "                username = self.driver.find_element(\n",
-     "                    \"xpath\", \"//input[@autocomplete='username']\"\n",
-     "                )\n",
-     "\n",
-     "                username.send_keys(self.username)\n",
-     "                username.send_keys(Keys.RETURN)\n",
-     "                sleep(3)\n",
-     "                break\n",
-     "            except NoSuchElementException:\n",
-     "                input_attempt += 1\n",
-     "                if input_attempt >= 3:\n",
-     "                    print()\n",
-     "                    print(\n",
-     "                        \"\"\"There was an error inputting the username.\n",
-     "\n",
-     "It may be due to the following:\n",
-     "- Internet connection is unstable\n",
-     "- Username is incorrect\n",
-     "- Twitter is experiencing unusual activity\"\"\"\n",
-     "                    )\n",
-     "                    self.driver.quit()\n",
-     "                    sys.exit(1)\n",
-     "                else:\n",
-     "                    print(\"Re-attempting to input username...\")\n",
-     "                    sleep(2)\n",
-     "\n",
-     "    def _input_unusual_activity(self):\n",
-     "        input_attempt = 0\n",
-     "\n",
-     "        while True:\n",
-     "            try:\n",
-     "                unusual_activity = self.driver.find_element(\n",
-     "                    \"xpath\", \"//input[@data-testid='ocfEnterTextTextInput']\"\n",
-     "                )\n",
-     "                unusual_activity.send_keys(self.username)\n",
-     "                unusual_activity.send_keys(Keys.RETURN)\n",
-     "                sleep(3)\n",
-     "                break\n",
-     "            except NoSuchElementException:\n",
-     "                input_attempt += 1\n",
-     "                if input_attempt >= 3:\n",
-     "                    break\n",
-     "\n",
-     "    def _input_password(self):\n",
-     "        input_attempt = 0\n",
-     "\n",
-     "        while True:\n",
-     "            try:\n",
-     "                password = self.driver.find_element(\n",
-     "                    \"xpath\", \"//input[@autocomplete='current-password']\"\n",
-     "                )\n",
-     "\n",
-     "                password.send_keys(self.password)\n",
-     "                password.send_keys(Keys.RETURN)\n",
-     "                sleep(3)\n",
-     "                break\n",
-     "            except NoSuchElementException:\n",
-     "                input_attempt += 1\n",
-     "                if input_attempt >= 3:\n",
-     "                    print()\n",
-     "                    print(\n",
-     "                        \"\"\"There was an error inputting the password.\n",
-     "\n",
-     "It may be due to the following:\n",
-     "- Internet connection is unstable\n",
-     "- Password is incorrect\n",
-     "- Twitter is experiencing unusual activity\"\"\"\n",
-     "                    )\n",
-     "                    self.driver.quit()\n",
-     "                    sys.exit(1)\n",
-     "                else:\n",
-     "                    print(\"Re-attempting to input password...\")\n",
-     "                    sleep(2)\n",
-     "\n",
-     "    def go_to_home(self):\n",
-     "        self.driver.get(\"https://twitter.com/home\")\n",
-     "        sleep(3)\n",
-     "        pass\n",
-     "\n",
-     "    def go_to_profile(self):\n",
-     "        if (\n",
-     "            self.scraper_details[\"username\"] is None\n",
-     "            or self.scraper_details[\"username\"] == \"\"\n",
-     "        ):\n",
-     "            print(\"Username is not set.\")\n",
-     "            sys.exit(1)\n",
-     "        else:\n",
-     "            self.driver.get(f\"https://twitter.com/{self.scraper_details['username']}\")\n",
-     "            sleep(3)\n",
-     "        pass\n",
-     "\n",
-     "    def go_to_hashtag(self):\n",
-     "        if (\n",
-     "            self.scraper_details[\"hashtag\"] is None\n",
-     "            or self.scraper_details[\"hashtag\"] == \"\"\n",
-     "        ):\n",
-     "            print(\"Hashtag is not set.\")\n",
-     "            sys.exit(1)\n",
-     "        else:\n",
-     "            url = f\"https://twitter.com/hashtag/{self.scraper_details['hashtag']}?src=hashtag_click\"\n",
-     "            if self.scraper_details[\"tab\"] == \"Latest\":\n",
-     "                url += \"&f=live\"\n",
-     "\n",
-     "            self.driver.get(url)\n",
-     "            sleep(3)\n",
-     "        pass\n",
-     "\n",
-     "    def go_to_search(self):\n",
-     "        if self.scraper_details[\"query\"] is None or self.scraper_details[\"query\"] == \"\":\n",
-     "            print(\"Query is not set.\")\n",
-     "            sys.exit(1)\n",
-     "        else:\n",
-     "            url = f\"https://twitter.com/search?q={self.scraper_details['query']}&src=typed_query\"\n",
-     "            if self.scraper_details[\"tab\"] == \"Latest\":\n",
-     "                url += \"&f=live\"\n",
-     "\n",
-     "            self.driver.get(url)\n",
-     "            sleep(3)\n",
-     "        pass\n",
-     "\n",
-     "    def get_tweet_cards(self):\n",
-     "        self.tweet_cards = self.driver.find_elements(\n",
-     "            \"xpath\", '//article[@data-testid=\"tweet\" and not(@disabled)]'\n",
-     "        )\n",
-     "        pass\n",
-     "\n",
-     "    def remove_hidden_cards(self):\n",
-     "        try:\n",
-     "            hidden_cards = self.driver.find_elements(\n",
-     "                \"xpath\", '//article[@data-testid=\"tweet\" and @disabled]'\n",
-     "            )\n",
-     "\n",
-     "            for card in hidden_cards[1:-2]:\n",
-     "                self.driver.execute_script(\n",
-     "                    \"arguments[0].parentNode.parentNode.parentNode.remove();\", card\n",
-     "                )\n",
-     "        except Exception as e:\n",
-     "            return\n",
-     "        pass\n",
-     "\n",
-     "    def scrape_tweets(\n",
-     "        self,\n",
-     "        max_tweets=50,\n",
-     "        scrape_username=None,\n",
-     "        scrape_hashtag=None,\n",
-     "        scrape_query=None,\n",
-     "        scrape_latest=True,\n",
-     "        scrape_top=False,\n",
-     "        scrape_poster_details=False,\n",
-     "        router=None,\n",
-     "    ):\n",
-     "        self._config_scraper(\n",
-     "            max_tweets,\n",
-     "            scrape_username,\n",
-     "            scrape_hashtag,\n",
-     "            scrape_query,\n",
-     "            scrape_latest,\n",
-     "            scrape_top,\n",
-     "            scrape_poster_details,\n",
-     "        )\n",
-     "\n",
-     "        if router is None:\n",
-     "            router = self.router\n",
-     "\n",
-     "        router()\n",
-     "\n",
-     "        if self.scraper_details[\"type\"] == \"Username\":\n",
-     "            print(\n",
-     "                \"Scraping Tweets from @{}...\".format(self.scraper_details[\"username\"])\n",
-     "            )\n",
-     "        elif self.scraper_details[\"type\"] == \"Hashtag\":\n",
-     "            print(\n",
-     "                \"Scraping {} Tweets from #{}...\".format(\n",
-     "                    self.scraper_details[\"tab\"], self.scraper_details[\"hashtag\"]\n",
-     "                )\n",
-     "            )\n",
-     "        elif self.scraper_details[\"type\"] == \"Query\":\n",
-     "            print(\n",
-     "                \"Scraping {} Tweets from {} search...\".format(\n",
-     "                    self.scraper_details[\"tab\"], self.scraper_details[\"query\"]\n",
-     "                )\n",
-     "            )\n",
-     "        elif self.scraper_details[\"type\"] == \"Home\":\n",
-     "            print(\"Scraping Tweets from Home...\")\n",
-     "\n",
-     "        self.progress.print_progress(0)\n",
-     "\n",
-     "        refresh_count = 0\n",
-     "        added_tweets = 0\n",
-     "        empty_count = 0\n",
-     "\n",
-     "        while self.scroller.scrolling:\n",
-     "            try:\n",
-     "                self.get_tweet_cards()\n",
-     "                added_tweets = 0\n",
-     "\n",
-     "                for card in self.tweet_cards[-15:]:\n",
-     "                    try:\n",
-     "                        tweet_id = str(card)\n",
-     "\n",
-     "                        if tweet_id not in self.tweet_ids:\n",
-     "                            self.tweet_ids.add(tweet_id)\n",
-     "\n",
-     "                            if not self.scraper_details[\"poster_details\"]:\n",
-     "                                self.driver.execute_script(\n",
-     "                                    \"arguments[0].scrollIntoView();\", card\n",
-     "                                )\n",
-     "\n",
-     "                            tweet = Tweet(\n",
-     "                                card=card,\n",
-     "                                driver=self.driver,\n",
-     "                                actions=self.actions,\n",
-     "                                scrape_poster_details=self.scraper_details[\n",
-     "                                    \"poster_details\"\n",
-     "                                ],\n",
-     "                            )\n",
-     "\n",
-     "                            if tweet:\n",
-     "                                if not tweet.error and tweet.tweet is not None:\n",
-     "                                    if not tweet.is_ad:\n",
-     "                                        self.data.append(tweet.tweet)\n",
-     "                                        added_tweets += 1\n",
-     "                                        self.progress.print_progress(len(self.data))\n",
-     "\n",
-     "                                        if len(self.data) >= self.max_tweets:\n",
-     "                                            self.scroller.scrolling = False\n",
-     "                                            break\n",
-     "                                    else:\n",
-     "                                        continue\n",
-     "                                else:\n",
-     "                                    continue\n",
-     "                            else:\n",
-     "                                continue\n",
-     "                        else:\n",
-     "                            continue\n",
-     "                    except NoSuchElementException:\n",
-     "                        continue\n",
-     "\n",
-     "                if len(self.data) >= self.max_tweets:\n",
-     "                    break\n",
-     "\n",
-     "                if added_tweets == 0:\n",
-     "                    if empty_count >= 5:\n",
-     "                        if refresh_count >= 3:\n",
-     "                            print()\n",
-     "                            print(\"No more tweets to scrape\")\n",
-     "                            break\n",
-     "                        refresh_count += 1\n",
-     "                    empty_count += 1\n",
-     "                    sleep(1)\n",
-     "                else:\n",
-     "                    empty_count = 0\n",
-     "                    refresh_count = 0\n",
-     "            except StaleElementReferenceException:\n",
-     "                sleep(2)\n",
-     "                continue\n",
-     "            except KeyboardInterrupt:\n",
-     "                print(\"\\n\")\n",
-     "                print(\"Keyboard Interrupt\")\n",
-     "                self.interrupted = True\n",
-     "                break\n",
-     "            except Exception as e:\n",
-     "                print(\"\\n\")\n",
-     "                print(f\"Error scraping tweets: {e}\")\n",
-     "                break\n",
-     "\n",
-     "        print(\"\")\n",
-     "\n",
-     "        if len(self.data) >= self.max_tweets:\n",
-     "            print(\"Scraping Complete\")\n",
-     "        else:\n",
-     "            print(\"Scraping Incomplete\")\n",
-     "\n",
-     "        print(\"Tweets: {} out of {}\\n\".format(len(self.data), self.max_tweets))\n",
-     "\n",
-     "        pass\n",
-     "\n",
-     "    def save_to_csv(self):\n",
-     "        print(\"Saving Tweets to CSV...\")\n",
-     "        now = datetime.now()\n",
-     "        folder_path = \"./tweets/\"\n",
-     "\n",
-     "        if not os.path.exists(folder_path):\n",
-     "            os.makedirs(folder_path)\n",
-     "            print(\"Created Folder: {}\".format(folder_path))\n",
-     "\n",
-     "        data = {\n",
-     "            \"Name\": [tweet[0] for tweet in self.data],\n",
-     "            \"Handle\": [tweet[1] for tweet in self.data],\n",
-     "            \"Timestamp\": [tweet[2] for tweet in self.data],\n",
-     "            \"Verified\": [tweet[3] for tweet in self.data],\n",
-     "            \"Content\": [tweet[4] for tweet in self.data],\n",
-     "            \"Comments\": [tweet[5] for tweet in self.data],\n",
-     "            \"Retweets\": [tweet[6] for tweet in self.data],\n",
-     "            \"Likes\": [tweet[7] for tweet in self.data],\n",
-     "            \"Analytics\": [tweet[8] for tweet in self.data],\n",
-     "            \"Tags\": [tweet[9] for tweet in self.data],\n",
-     "            \"Mentions\": [tweet[10] for tweet in self.data],\n",
-     "            \"Emojis\": [tweet[11] for tweet in self.data],\n",
-     "            \"Profile Image\": [tweet[12] for tweet in self.data],\n",
-     "            \"Tweet Link\": [tweet[13] for tweet in self.data],\n",
-     "            \"Tweet ID\": [f'tweet_id:{tweet[14]}' for tweet in self.data],\n",
-     "        }\n",
-     "\n",
-     "        if self.scraper_details[\"poster_details\"]:\n",
-     "            data[\"Tweeter ID\"] = [f'user_id:{tweet[15]}' for tweet in self.data]\n",
-     "            data[\"Following\"] = [tweet[16] for tweet in self.data]\n",
-     "            data[\"Followers\"] = [tweet[17] for tweet in self.data]\n",
-     "\n",
-     "        df = pd.DataFrame(data)\n",
-     "\n",
-     "        current_time = now.strftime(\"%Y-%m-%d_%H-%M-%S\")\n",
-     "        file_path = f\"{folder_path}{current_time}_tweets_1-{len(self.data)}.csv\"\n",
-     "        pd.set_option(\"display.max_colwidth\", None)\n",
-     "        df.to_csv(file_path, index=False, encoding=\"utf-8\")\n",
-     "\n",
-     "        print(\"CSV Saved: {}\".format(file_path))\n",
-     "\n",
-     "        pass\n",
-     "\n",
-     "    def get_tweets(self):\n",
-     "        return self.data"
-    ]
-   },
-   {
-    "attachments": {},
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "# Create a new instance of the Twitter Scraper class"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "USER_UNAME = os.environ['TWITTER_USERNAME']\n",
-     "USER_PASSWORD = os.environ['TWITTER_PASSWORD']\n",
-     "\n",
-     "scraper = Twitter_Scraper(\n",
-     "    username=USER_UNAME,\n",
-     "    password=USER_PASSWORD,\n",
-     "    # max_tweets=10,\n",
-     "    # scrape_username=\"something\",\n",
-     "    # scrape_hashtag=\"something\",\n",
-     "    # scrape_query=\"something\",\n",
-     "    # scrape_latest=False,\n",
-     "    # scrape_top=True,\n",
-     "    # scrape_poster_details=True\n",
-     ")"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "scraper.login()"
-    ]
-   },
-   {
-    "attachments": {},
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "# Run Twitter Scraper"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "scraper.scrape_tweets(\n",
-     "    # max_tweets=100,\n",
-     "    # scrape_username=\"something\",\n",
-     "    # scrape_hashtag=\"something\",\n",
-     "    # scrape_query=\"something\",\n",
-     "    # scrape_latest=False,\n",
-     "    # scrape_top=True,\n",
-     "    # scrape_poster_details=True,\n",
-     ")"
-    ]
-   },
-   {
-    "attachments": {},
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
-     "# Save Scraped Tweets in a CSV"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "scraper.save_to_csv()"
-    ]
-   },
-   {
-    "cell_type": "code",
-    "execution_count": null,
-    "metadata": {},
-    "outputs": [],
-    "source": [
-     "scraper.driver.close()"
-    ]
-   }
-  ],
-  "metadata": {
-   "kernelspec": {
-    "display_name": "ml",
-    "language": "python",
-    "name": "python3"
-   },
-   "language_info": {
-    "codemirror_mode": {
-     "name": "ipython",
-     "version": 3
-    },
-    "file_extension": ".py",
-    "mimetype": "text/x-python",
-    "name": "python",
-    "nbconvert_exporter": "python",
-    "pygments_lexer": "ipython3",
-    "version": "3.11.5"
-   },
-   "orig_nbformat": 4
-  },
-  "nbformat": 4,
-  "nbformat_minor": 2
- }