cgrumbach committed on
Commit 6adb5cf · verified · 1 Parent(s): 3c72861

Upload 15 files
scraper/.DS_Store ADDED
Binary file (6.15 kB)
 
scraper/.gitignore ADDED
@@ -0,0 +1,169 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # pdm
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+ #pdm.lock
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+ # in version control.
+ # https://pdm.fming.dev/#use-with-ide
+ .pdm.toml
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # PyCharm
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
+ #.idea/
+
+ .vscode
+
+ data/
+ preprocessed-data/
+ raw-data/
+ sorted-preprocessed-data/
+ sorted-raw-data/
+ *.zip
scraper/README.md ADDED
@@ -0,0 +1,77 @@
+ # bitcointalk_crawler
+
+ ---
+
+ ## DataFrame Columns Description
+
+ ### 1. `start_edit`
+ - **Description**: The date when the post or content was initially created.
+ - **Type**: Date (format: YYYY-MM-DD)
+ - **Example**: `2013-11-02`
+
+ ### 2. `last_edit`
+ - **Description**: The date when the post or content was last edited.
+ - **Type**: Date (format: YYYY-MM-DD)
+ - **Example**: `2013-11-02`
+
+ ### 3. `author`
+ - **Description**: The user who created the post.
+ - **Type**: String
+ - **Example**: `guyver`
+
+ ### 4. `post`
+ - **Description**: The actual content or message of the post.
+ - **Type**: String
+ - **Example**: `before we all get excited about the second batch...`
+
+ ### 5. `topic`
+ - **Description**: The topic or title of the thread in which the post was made.
+ - **Type**: String
+ - **Example**: `[EU/UK GROUP BUY] Blue Fury USB miner 2.2 ...`
+
+ ### 6. `attachment`
+ - **Description**: Indicates whether the post has an attachment. A value of `1` means there is an attachment (image or video), and `0` means there is not. The site renders emojis with `img` tags, but these are not real attachments, so emojis are ignored when computing this flag.
+ - **Type**: Integer (0 or 1)
+ - **Example**: `0`
+ - **Note**: The script `attachment_fix.py` is run after the crawling process, as the initial values populated in this column during crawling are not accurate.
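The emoji-filtering rule described above can be sketched in a few lines. This is a hypothetical illustration, not the actual `attachment_fix.py`: it assumes forum emojis are `<img>` tags served from a path containing `Smileys` (the SMF default smiley directory), so only other images or a `<video>` tag count as attachments.

```python
import re

def attachment_flag(post_html: str) -> int:
    """Return 1 if the post HTML contains a non-emoji image or a video, else 0."""
    # Collect the src of every <img> tag in the post body.
    srcs = re.findall(r'<img[^>]+src="([^"]+)"', post_html)
    # Assumption: emoji images live under a "Smileys" path and are ignored.
    real_images = [s for s in srcs if "Smileys" not in s]
    has_video = "<video" in post_html
    return 1 if real_images or has_video else 0
```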
+
+ ### 7. `link`
+ - **Description**: Indicates whether the post contains a link. A value of `1` means there is a link, and `0` means there is not.
+ - **Type**: Integer (0 or 1)
+ - **Example**: `0`
+
+ ### 8. `original_info`
+ - **Description**: Raw HTML or metadata related to the post. It may contain styling and layout information.
+ - **Type**: String (HTML format)
+ - **Example**: `<td class="td_headerandpost" height="100%" sty...`
+
+ ### 9. `preprocessed_post`
+ - **Description**: A preprocessed version of the `post` column, intended for analysis or other downstream tasks.
+ - **Type**: String
+ - **Example**: `get excited second batch.let us wait first bat...`
+
+ ---
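Taken together, the columns above define a row schema. A minimal sanity check, using only the spec in this README (the row values and the `validate_row` helper are invented for illustration):

```python
from datetime import datetime

COLUMNS = ["start_edit", "last_edit", "author", "post", "topic",
           "attachment", "link", "original_info", "preprocessed_post"]

def validate_row(row: dict) -> bool:
    # Every documented column must be present, and nothing extra.
    if set(row) != set(COLUMNS):
        return False
    # Dates follow the YYYY-MM-DD format described above.
    try:
        for col in ("start_edit", "last_edit"):
            datetime.strptime(row[col], "%Y-%m-%d")
    except (TypeError, ValueError):
        return False
    # attachment and link are 0/1 integer flags.
    return row["attachment"] in (0, 1) and row["link"] in (0, 1)

example_row = {
    "start_edit": "2013-11-02", "last_edit": "2013-11-02",
    "author": "guyver", "post": "before we all get excited about the second batch...",
    "topic": "[EU/UK GROUP BUY] Blue Fury USB miner 2.2",
    "attachment": 0, "link": 0,
    "original_info": '<td class="td_headerandpost">...</td>',
    "preprocessed_post": "get excited second batch",
}
```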
+
+ ## Usage
+
+ ### 1. `main.py` and `auto_crawl.sh`
+ - **Description**: The `main.py` script crawls a Bitcointalk board, given the URL of the board's first page. The `auto_crawl.sh` script automates running `main.py`.
+ - **Example**:
+ ```shell
+ python main.py
+     https://bitcointalk.org/index.php?board=40.0  # board url
+     --board mining_support                        # board name
+     -pages 183                                    # number of pages in the board
+ ```
+
+ ### 2. `topic_crawling.py` and `auto_crawl_topic.sh`
+
+ - **Description**: The `topic_crawling.py` script crawls a specific topic from the Bitcointalk forum, given the URL of the topic's first page. The `auto_crawl_topic.sh` script automates running `topic_crawling.py`.
+
+ - **Example**:
+ ```shell
+ python topic_crawling.py
+     https://bitcointalk.org/index.php?topic=168174.0  # topic url
+     --board miners                                    # board name that the topic belongs to
+     --num_of_pages 165                                # total pages of this topic
+ ```
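The flags in these examples hint at the argument parsing underneath; `main.py` also carries a commented-out `parse_args`. A minimal sketch of a compatible parser (the `required` setting and defaults are assumptions):

```python
import argparse

def parse_args(argv=None):
    # Flag names mirror the usage examples above.
    parser = argparse.ArgumentParser(description="Bitcointalk crawler")
    parser.add_argument("url", help="URL of the first page of the board or topic")
    parser.add_argument("--board", required=True, help="board name used for output paths")
    parser.add_argument("--num_of_pages", "-pages", type=int, help="number of pages to crawl")
    parser.add_argument("--num_of_posts_start", "-posts", type=int, default=0,
                        help="index of the first post to crawl")
    parser.add_argument("--update", action="store_true",
                        help="delete cached URL lists and re-crawl")
    return vars(parser.parse_args(argv))
```

Passing `argv=None` falls back to `sys.argv`, matching how the shell scripts invoke the crawler.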
scraper/__pycache__/topic_crawling.cpython-310.pyc ADDED
Binary file (6.34 kB)
 
scraper/auto_crawl.sh ADDED
@@ -0,0 +1,7 @@
+ # python main.py https://bitcointalk.org/index.php?board=42.0 --board miners -pages 41 -posts 523
+ # python main.py https://bitcointalk.org/index.php?board=40.0 --board mining_support -pages 183
+ # python main.py https://bitcointalk.org/index.php?board=76.0 --board hardware -pages 145
+ # python main.py https://bitcointalk.org/index.php?board=137.0 --board groupbuys -pages 24 -posts 322
+ # python main.py https://bitcointalk.org/index.php?board=81.0 --board mining_speculation -pages 95 --update
+ # python main.py https://bitcointalk.org/index.php?board=41.0 --board pools -pages 52 -posts 32
+ # python main.py https://bitcointalk.org/index.php?board=14.0 --board mining -pages 143 -posts 1524
scraper/auto_crawl_topic.sh ADDED
@@ -0,0 +1,2 @@
+ # python topic_crawling.py https://bitcointalk.org/index.php?topic=168174.0 --board miners --num_of_pages 165
+ # python topic_crawling.py https://bitcointalk.org/index.php?topic=6458.0 --board miners --num_of_pages 57
scraper/main.ipynb ADDED
@@ -0,0 +1,416 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "DEMO_MODE = True"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Importing necessary libraries:\n",
+ "# - os, json, time for file, data and time operations respectively.\n",
+ "# - requests for making HTTP requests.\n",
+ "# - BeautifulSoup for parsing HTML content.\n",
+ "# - Other imports for logging, data manipulation, progress indication, and more.\n",
+ "import os\n",
+ "import json\n",
+ "import time\n",
+ "import munch\n",
+ "import requests\n",
+ "import argparse\n",
+ "import pandas as pd\n",
+ "from tqdm import tqdm\n",
+ "from datetime import date\n",
+ "from loguru import logger\n",
+ "from random import randint\n",
+ "from bs4 import BeautifulSoup, NavigableString\n",
+ "\n",
+ "from preprocessing.preprocessing_sub_functions import remove_emojis\n",
+ "from topic_crawling import loop_through_posts\n",
+ "\n",
+ "\n",
+ "# This function reads a JSON file named \"website_format.json\".\n",
+ "# The file contains a list of user agents.\n",
+ "# User agents are strings that browsers send to websites to identify themselves.\n",
+ "# This list is likely used to rotate between different user agents when making requests,\n",
+ "# making the scraper seem like different browsers and reducing the chances of being blocked.\n",
+ "def get_web_component():\n",
+ "    with open(\"website_format.json\") as json_file:\n",
+ "        website_format = json.load(json_file)\n",
+ "    website_format = munch.munchify(website_format)\n",
+ "    return website_format.USER_AGENTS\n",
+ "\n",
+ "\n",
+ "# This function fetches a webpage's content.\n",
+ "# It randomly selects a user agent from the provided list to make the request.\n",
+ "# After fetching, it uses BeautifulSoup to parse the page's HTML content.\n",
+ "# def get_web_content(url, USER_AGENTS):\n",
+ "#     random_agent = USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]\n",
+ "#     headers = {\"User-Agent\": random_agent}\n",
+ "#     req = requests.get(url, headers=headers)\n",
+ "#     req.encoding = req.apparent_encoding\n",
+ "#     soup = BeautifulSoup(req.text, features=\"lxml\")\n",
+ "#     return soup\n",
+ "from topic_crawling import get_web_content\n",
+ "\n",
+ "\n",
+ "# This function extracts pagination links from a page.\n",
+ "# These links point to other pages of content, often seen at the bottom of forums or search results.\n",
+ "# The function returns both the individual page links and the \"next\" link,\n",
+ "# which points to the next set of results.\n",
+ "def get_pages_urls(url, USER_AGENTS):\n",
+ "    time.sleep(1)\n",
+ "    soup = get_web_content(url, USER_AGENTS)\n",
+ "    # Finding the pagination links based on their HTML structure and CSS classes.\n",
+ "    first_td = soup.find(\"td\", class_=\"middletext\", id=\"toppages\")\n",
+ "    nav_pages_links = first_td.find_all(\"a\", class_=\"navPages\")\n",
+ "    href_links = [link[\"href\"] for link in nav_pages_links]\n",
+ "    next_50_link = href_links[-3]  # Assuming the third-last link is the \"next\" link.\n",
+ "    href_links.insert(0, url)\n",
+ "    return href_links, next_50_link\n",
+ "\n",
+ "\n",
+ "# This function extracts individual post URLs from a page.\n",
+ "# It's likely targeting a forum or blog structure, where multiple posts or threads are listed on one page.\n",
+ "def get_post_urls(url, USER_AGENTS):\n",
+ "    time.sleep(1)\n",
+ "    soup = get_web_content(url, USER_AGENTS)\n",
+ "    # Finding post links based on their HTML structure and CSS classes.\n",
+ "    links_elements = soup.select(\"td.windowbg span a\")\n",
+ "    links = [link[\"href\"] for link in links_elements]\n",
+ "\n",
+ "    # # If including rules and announcements posts\n",
+ "    # links_elements = soup.select('td.windowbg3 span a')\n",
+ "    # links_ = [link['href'] for link in links_elements]\n",
+ "    # links.extend(links_)\n",
+ "\n",
+ "    return links\n",
+ "\n",
+ "\n",
+ "# This function loops through the main page and its paginated versions to collect URLs.\n",
+ "# It repeatedly calls 'get_pages_urls' to fetch batches of URLs until the desired number (num_of_pages) is reached.\n",
+ "def loop_through_source_url(USER_AGENTS, url, num_of_pages):\n",
+ "    pages_urls = []\n",
+ "    counter = 0\n",
+ "    while len(pages_urls) < num_of_pages:\n",
+ "        print(\"loop_through_source_url: \", len(pages_urls))\n",
+ "        href_links, next_50_link = get_pages_urls(url, USER_AGENTS)\n",
+ "        pages_urls.extend(href_links)\n",
+ "        pages_urls = list(dict.fromkeys(pages_urls))  # Remove any duplicate URLs.\n",
+ "        url = next_50_link\n",
+ "    return pages_urls\n",
+ "\n",
+ "\n",
+ "# This function loops through the provided list of page URLs and extracts post URLs from each of these pages.\n",
+ "# It ensures that there are no duplicate post URLs by converting the list into a dictionary and back to a list.\n",
+ "# It returns a list of unique post URLs.\n",
+ "def loop_through_pages(USER_AGENTS, pages_urls):\n",
+ "    post_urls = []\n",
+ "    for url in tqdm(pages_urls):\n",
+ "        href_links = get_post_urls(url, USER_AGENTS)\n",
+ "        post_urls.extend(href_links)\n",
+ "        post_urls = list(dict.fromkeys(post_urls))\n",
+ "        if DEMO_MODE:\n",
+ "            break\n",
+ "    return post_urls\n",
+ "\n",
+ "# This function processes a post page. It extracts various details like timestamps, author information, post content, topic, attachments, links, and original HTML information.\n",
+ "# The function returns a dictionary containing all this extracted data.\n",
+ "def read_subject_page(USER_AGENTS, post_url, df, remove_emoji):\n",
+ "    time.sleep(1)\n",
+ "    soup = get_web_content(post_url, USER_AGENTS)\n",
+ "    form_tag = soup.find(\"form\", id=\"quickModForm\")\n",
+ "    table_tag = form_tag.find(\"table\", class_=\"bordercolor\")\n",
+ "    td_tag = table_tag.find_all(\"td\", class_=\"windowbg\")\n",
+ "    td_tag.extend(table_tag.find_all(\"td\", class_=\"windowbg2\"))\n",
+ "\n",
+ "    for comment in tqdm(td_tag):\n",
+ "        res = extract_useful_content_windowbg(comment, remove_emoji)\n",
+ "        if res is not None:\n",
+ "            df = pd.concat([df, pd.DataFrame([res])])\n",
+ "\n",
+ "    return df\n",
+ "\n",
+ "# This function extracts meaningful content from a given HTML element (`tr_tag`). This tag is likely a row in a table, given its name.\n",
+ "# The function checks the presence of specific tags and classes within this row to extract information such as timestamps, author, post content, topic, attachments, and links.\n",
+ "# The extracted data is returned as a dictionary.\n",
+ "def extract_useful_content_windowbg(tr_tag, remove_emoji=True):\n",
+ "    \"\"\"\n",
+ "    Timestamp of the post (ex: September 11, 2023, 07:49:45 AM; but if you want just 11/09/2023 is enough)\n",
+ "    Author of the post (ex: SupermanBitcoin)\n",
+ "    The post itself\n",
+ "\n",
+ "    The topic where the post was posted (ex: [INFO - DISCUSSION] Security Budget Problem) eg. Whats your thoughts: Next-Gen Bitcoin Mining Machine With 1X Efficiency Rating.\n",
+ "    Number of characters in the post --> so this is an integer\n",
+ "    Does the post contain at least one attachment (image, video etc.) --> if yes put '1' in the column, if no, just put '0'\n",
+ "    Does the post contain at least one link --> if yes put '1' in the column, if no, just put '0'\n",
+ "    \"\"\"\n",
+ "    headerandpost = tr_tag.find(\"td\", class_=\"td_headerandpost\")\n",
+ "    if not headerandpost:\n",
+ "        return None\n",
+ "\n",
+ "    timestamp = headerandpost.find(\"div\", class_=\"smalltext\").get_text()\n",
+ "    timestamps = timestamp.split(\"Last edit: \")\n",
+ "    timestamp = timestamps[0].strip()\n",
+ "    last_edit = None\n",
+ "    if len(timestamps) > 1:\n",
+ "        if 'Today ' in timestamps[1]:\n",
+ "            last_edit = date.today().strftime(\"%B %d, %Y\")+', '+timestamps[1].split('by')[0].split(\"Today at\")[1].strip()\n",
+ "        else:\n            last_edit = timestamps[1].split('by')[0].strip()\n",
+ "\n",
+ "    poster_info_tag = tr_tag.find('td', class_='poster_info')\n",
+ "    anchor_tag = poster_info_tag.find('a')\n",
+ "    author = \"Anonymous\" if anchor_tag is None else anchor_tag.get_text()\n",
+ "\n",
+ "    link = 0\n",
+ "\n",
+ "    post_ = tr_tag.find('div', class_='post')\n",
+ "    texts = []\n",
+ "    for child in post_.children:\n",
+ "        if isinstance(child, NavigableString):\n",
+ "            texts.append(child.strip())\n",
+ "        elif child.has_attr('class') and 'ul' in child['class']:\n",
+ "            link = 1\n",
+ "            texts.append(child.get_text(strip=True))\n",
+ "    post = ' '.join(texts)\n",
+ "\n",
+ "    topic = headerandpost.find('div', class_='subject').get_text()\n",
+ "\n",
+ "    image = headerandpost.find('div', class_='post').find_all('img')\n",
+ "    if remove_emoji:\n",
+ "        image = remove_emojis(image)\n",
+ "    image_ = min(len(image), 1)\n",
+ "    \n",
+ "    video = headerandpost.find('div', class_='post').find('video')\n",
+ "    video_ = 0 if video is None else 1\n",
+ "    attachment = max(image_, video_)\n",
+ "\n",
+ "    original_info = headerandpost\n",
+ "\n",
+ "    return {\n",
+ "        \"timestamp\": timestamp,\n",
+ "        \"last_edit\": last_edit,\n",
+ "        \"author\": author.strip(),\n",
+ "        \"post\": post.strip(),\n",
+ "        \"topic\": topic.strip(),\n",
+ "        \"attachment\": attachment,\n",
+ "        \"link\": link,\n",
+ "        \"original_info\": original_info,\n",
+ "    }\n",
+ "\n",
+ "\n",
+ "# A utility function to save a list (e.g., URLs) to a text file.\n",
+ "# Each item in the list gets its own line in the file.\n",
+ "def save_page_file(data, file_name):\n",
+ "    with open(file_name, \"w\") as filehandle:\n",
+ "        for listitem in data:\n",
+ "            filehandle.write(\"%s\\n\" % listitem)\n",
+ "\n",
+ "def get_post_max_page(url, USER_AGENTS):\n",
+ "    soup = get_web_content(url, USER_AGENTS)\n",
+ "    # Finding the pagination links based on their HTML structure and CSS classes.\n",
+ "    first_td = soup.find('td', class_='middletext')\n",
+ "    nav_pages_links = first_td.find_all('a', class_='navPages')\n",
+ "\n",
+ "    href_links = [int(link.text) if link.text.isdigit() else 0 for link in nav_pages_links]\n",
+ "    if len(href_links) == 0:\n",
+ "        # print('No pagination links found: ', url)\n",
+ "        return 1\n",
+ "    m = max(href_links)\n",
+ "    # we can't use more than 10 pages\n",
+ "    m = m if m < 10 else 10\n",
+ "    return m\n",
+ "\n",
+ "\n",
+ "# def parse_args():\n",
+ "#     parser = argparse.ArgumentParser()\n",
+ "#     parser.add_argument(\"url\", help=\"url for the extraction\")\n",
+ "#     parser.add_argument(\"--update\", help=\"extract updated data\", action=\"store_true\")\n",
+ "#     parser.add_argument(\"--board\", help=\"board name\")\n",
+ "#     parser.add_argument(\"--num_of_pages\", '-pages', help=\"number of pages to extract\", type=int)\n",
+ "#     parser.add_argument(\"--num_of_posts_start\", '-posts', help=\"the number of posts start to extract\", type=int, default=0)\n",
+ "\n",
+ "#     parser.add_argument(\"remove_emoji\", help=\"remove emoji from the post\", action=\"store_true\")\n",
+ "#     return vars(parser.parse_args())\n",
+ "\n",
+ "# \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "mining_section = True\n",
+ "if mining_section:\n",
+ "    url = \"https://bitcointalk.org/index.php?board=14.0\"\n",
+ "else:\n",
+ "    url = \"https://bitcointalk.org/index.php?board=1.0\"\n",
+ "update = False\n",
+ "\n",
+ "if DEMO_MODE:\n",
+ "    board = \"Demo\"\n",
+ "    num_of_pages = 1\n",
+ "    num_of_posts_start = 0\n",
+ "else:\n",
+ "    board = \"Bitcoin\"\n",
+ "    num_of_pages = 1528\n",
+ "    num_of_posts_start = 248\n",
+ "\n",
+ "\n",
+ "\n",
+ "remove_emoji = True\n",
+ "\n",
+ "USER_AGENTS = get_web_component()\n",
+ "# Ensuring the data directory exists.\n",
+ "os.makedirs(f\"data/{board}/\", exist_ok=True)\n",
+ "pages_file_path = f\"data/{board}/pages_urls.txt\"\n",
+ "post_file_path = f\"data/{board}/post_urls.txt\"\n",
+ "# If the user chose to update the data, existing files are deleted to make way for new data.\n",
+ "if update:\n",
+ "    if os.path.exists(pages_file_path):\n",
+ "        os.remove(pages_file_path)\n",
+ "    if os.path.exists(post_file_path):\n",
+ "        os.remove(post_file_path)\n",
+ "    \n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "loop_through_source_url:  0\n"
+ ]
+ }
+ ],
+ "source": [
+ "# If the pages file doesn't exist, the script collects page URLs.\n",
+ "if not os.path.exists(pages_file_path):\n",
+ "    pages_urls = loop_through_source_url(USER_AGENTS, url, num_of_pages)\n",
+ "    save_page_file(pages_urls, pages_file_path)\n",
+ "# Reading the existing page URLs from the file.\n",
+ "with open(pages_file_path, \"r\") as filehandle:\n",
+ "    pages_urls = [\n",
+ "        current_place.rstrip() for current_place in filehandle.readlines()\n",
+ "    ]\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "  0%|          | 0/52 [00:01<?, ?it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "\n",
+ "# If the posts file doesn't exist, the script collects post URLs.\n",
+ "if not os.path.exists(post_file_path):\n",
+ "    post_urls = loop_through_pages(USER_AGENTS, pages_urls)\n",
+ "    save_page_file(post_urls, post_file_path)\n",
+ "# Reading the existing post URLs from the file.\n",
+ "with open(post_file_path, \"r\") as filehandle:\n",
+ "    post_urls = [current_place.rstrip() for current_place in filehandle.readlines()]\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # for post_url in [\"https://bitcointalk.org/index.php?topic=1306983.0\"]:\n",
+ "# for post_url in [\"https://bitcointalk.org/index.php?topic=5489570.0\"]:\n",
+ "#     time.sleep(0.8)\n",
+ "#     num_of_post_pages = get_post_max_page(post_url, USER_AGENTS)\n",
+ "#     loop_through_posts(USER_AGENTS, post_url, board, num_of_post_pages, remove_emoji)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 39/39 [01:02<00:00, 1.60s/it]\n"
+ ]
+ }
+ ],
+ "source": [
+ "post_urls_to_process = []\n",
+ "for post_url in post_urls:\n",
+ "    topic_id = post_url.split('topic=')[1]\n",
+ "    if os.path.exists(f'data/{board}/data_{topic_id}.csv'):\n",
+ "        # print(f'data/{board}/data_{topic_id}.csv already exists')\n",
+ "        continue\n",
+ "    post_urls_to_process.append(post_url)\n",
+ "\n",
+ "\n",
+ "# for (i,post_url) in enumerate(post_urls_to_process):\n",
+ "for post_url in tqdm(post_urls_to_process):\n",
+ "    num_of_post_pages = get_post_max_page(post_url, USER_AGENTS)\n",
+ "    loop_through_posts(USER_AGENTS, post_url, board, num_of_post_pages, remove_emoji)\n",
+ "    # print(f'{i+1}/{len(post_urls_to_process)} urls done')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# import time\n",
+ "# import winsound\n",
+ "# from tqdm import tqdm\n",
+ "# winsound.MessageBeep(winsound.MB_ICONEXCLAMATION)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "py310",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }
scraper/main.py ADDED
@@ -0,0 +1,258 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Importing necessary libraries:
2
+ # - os, json, time for file, data and time operations respectively.
3
+ # - requests for making HTTP requests.
4
+ # - BeautifulSoup for parsing HTML content.
5
+ # - Other imports for logging, data manipulation, progress indication, and more.
6
+ import os
7
+ import json
8
+ import time
9
+ import munch
10
+ import requests
11
+ import argparse
12
+ import pandas as pd
13
+ from tqdm import tqdm
14
+ from datetime import date
15
+ from loguru import logger
16
+ from random import randint
17
+ from bs4 import BeautifulSoup, NavigableString
18
+
19
+ from preprocessing.preprocessing_sub_functions import remove_emojis
20
+ from topic_crawling import loop_through_posts
21
+
22
+
23
+ # This function reads a JSON file named "website_format.json".
24
+ # The file contain a list of user agents.
25
+ # User agents are strings that browsers send to websites to identify themselves.
26
+ # This list is likely used to rotate between different user agents when making requests,
27
+ # making the scraper seem like different browsers and reducing the chances of being blocked.
28
+ def get_web_component():
29
+ with open("website_format.json") as json_file:
30
+ website_format = json.load(json_file)
31
+ website_format = munch.munchify(website_format)
32
+ return website_format.USER_AGENTS
33
+
34
+
35
+ # This function fetches a webpage's content.
36
+ # It randomly selects a user agent from the provided list to make the request.
37
+ # After fetching, it uses BeautifulSoup to parse the page's HTML content.
38
+ def get_web_content(url, USER_AGENTS):
39
+ random_agent = USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
40
+ headers = {"User-Agent": random_agent}
41
+ req = requests.get(url, headers=headers)
42
+ req.encoding = req.apparent_encoding
43
+ soup = BeautifulSoup(req.text, features="lxml")
44
+ return soup
45
+
46
+
47
+ # This function extracts pagination links from a page.
48
+ # These links point to other pages of content, often seen at the bottom of forums or search results.
49
+ # The function returns both the individual page links and the "next" link,
50
+ # which points to the next set of results.
51
+ def get_pages_urls(url, USER_AGENTS):
52
+ time.sleep(1)
53
+ soup = get_web_content(url, USER_AGENTS)
54
+ # Finding the pagination links based on their HTML structure and CSS classes.
55
+ first_td = soup.find("td", class_="middletext", id="toppages")
56
+ nav_pages_links = first_td.find_all("a", class_="navPages")
57
+ href_links = [link["href"] for link in nav_pages_links]
58
+ next_50_link = href_links[-3] # Assuming the third-last link is the "next" link.
59
+ href_links.insert(0, url)
60
+ return href_links, next_50_link
61
+
62
+
63
+ # This function extracts individual post URLs from a page.
64
+ # It's likely targeting a forum or blog structure, where multiple posts or threads are listed on one page.
65
+ def get_post_urls(url, USER_AGENTS):
66
+ time.sleep(1)
67
+ soup = get_web_content(url, USER_AGENTS)
68
+ # Finding post links based on their HTML structure and CSS classes.
69
+ links_elements = soup.select("td.windowbg span a")
70
+ links = [link["href"] for link in links_elements]
71
+
72
+ # # If including rules and announcements posts
73
+ # links_elements = soup.select('td.windowbg3 span a')
74
+ # links_ = [link['href'] for link in links_elements]
75
+ # links.extend(links_)
76
+
77
+ return links
78
+
79
+
80
+ # This function loops through the main page and its paginated versions to collect URLs.
81
+ # It repeatedly calls 'get_pages_urls' to fetch batches of URLs until the desired number (num_of_pages) is reached.
82
+ def loop_through_source_url(USER_AGENTS, url, num_of_pages):
83
+ pages_urls = []
84
+ counter = 0
85
+ while len(pages_urls) != num_of_pages:
86
+ href_links, next_50_link = get_pages_urls(url, USER_AGENTS)
87
+ pages_urls.extend(href_links)
88
+ pages_urls = list(dict.fromkeys(pages_urls)) # Remove any duplicate URLs.
89
+ url = next_50_link
90
+ return pages_urls
91
+
92
+
93
+ # This function loops through the provided list of page URLs and extracts post URLs from each of these pages.
94
+ # It ensures that there are no duplicate post URLs by converting the list into a dictionary and back to a list.
95
+ # It returns a list of unique post URLs.
96
+ def loop_through_pages(USER_AGENTS, pages_urls):
97
+ post_urls = []
98
+ for url in tqdm(pages_urls):
99
+ href_links = get_post_urls(url, USER_AGENTS)
100
+ post_urls.extend(href_links)
101
+ post_urls = list(dict.fromkeys(post_urls))
102
+ return post_urls
103
+
104
+ # This function processes a post page. It extracts various details like timestamps, author information, post content, topic, attachments, links, and original HTML information.
105
+ # The function returns a dictionary containing all this extracted data.
106
+ def read_subject_page(USER_AGENTS, post_url, df, remove_emoji):
107
+ time.sleep(1)
108
+ soup = get_web_content(post_url, USER_AGENTS)
109
+ form_tag = soup.find("form", id="quickModForm")
110
+ table_tag = form_tag.find("table", class_="bordercolor")
111
+ td_tag = table_tag.find_all("td", class_="windowbg")
112
+ td_tag.extend(table_tag.find_all("td", class_="windowbg2"))
113
+
114
+ for comment in tqdm(td_tag):
115
+ res = extract_useful_content_windowbg(comment, remove_emoji)
116
+ if res is not None:
117
+ df = pd.concat([df, pd.DataFrame([res])])
118
+
119
+ return df
120
+
121
+ # This function extracts meaningful content from a given HTML element (`tr_tag`). This tag is likely a row in a table, given its name.
122
+ # The function checks the presence of specific tags and classes within this row to extract information such as timestamps, author, post content, topic, attachments, and links.
123
+ # The extracted data is returned as a dictionary.
124
+ def extract_useful_content_windowbg(tr_tag, remove_emoji=True):
125
+ """
126
+ Timestamp of the post (ex: September 11, 2023, 07:49:45 AM; but if you want just 11/09/2023 is enough)
127
+ Author of the post (ex: SupermanBitcoin)
128
+ The post itself
129
+
130
+ The topic where the post was posted (ex: [INFO - DISCUSSION] Security Budget Problem) eg. Whats your thoughts: Next-Gen Bitcoin Mining Machine With 1X Efficiency Rating.
131
+ Number of characters in the post --> so this is an integer
132
+ Does the post contain at least one attachment (image, video etc.) --> if yes put '1' in the column, if no, just put '0'
133
+ Does the post contain at least one link --> if yes put '1' in the column, if no, just put '0'
134
+ """
135
+ headerandpost = tr_tag.find("td", class_="td_headerandpost")
136
+ if not headerandpost:
137
+ return None
138
+
139
+ timestamp = headerandpost.find("div", class_="smalltext").get_text()
140
+ timestamps = timestamp.split("Last edit: ")
141
+ timestamp = timestamps[0].strip()
142
+ last_edit = None
143
+ if len(timestamps) > 1:
144
+ if 'Today ' in timestamps[1]:
145
+ last_edit = date.today().strftime("%B %d, %Y")+', '+timestamps[1].split('by')[0].split("Today at")[1].strip()
146
+ else:
+ last_edit = timestamps[1].split('by')[0].strip()
147
+
148
+ poster_info_tag = tr_tag.find('td', class_='poster_info')
149
+ anchor_tag = poster_info_tag.find('a')
150
+ author = "Anonymous" if anchor_tag is None else anchor_tag.get_text()
151
+
152
+ link = 0
153
+
154
+ post_ = tr_tag.find('div', class_='post')
155
+ texts = []
156
+ for child in post_.children:
157
+ if isinstance(child, NavigableString):
158
+ texts.append(child.strip())
159
+ elif child.has_attr('class') and 'ul' in child['class']:
160
+ link = 1
161
+ texts.append(child.get_text(strip=True))
162
+ post = ' '.join(texts)
163
+
164
+ topic = headerandpost.find('div', class_='subject').get_text()
165
+
166
+ image = headerandpost.find('div', class_='post').find_all('img')
167
+ if remove_emoji:
168
+ image = remove_emojis(image)
169
+ image_ = min(len(image), 1)
170
+
171
+ video = headerandpost.find('div', class_='post').find('video')
172
+ video_ = 0 if video is None else 1
173
+ attachment = max(image_, video_)
174
+
175
+ original_info = headerandpost
176
+
177
+ return {
178
+ "timestamp": timestamp,
179
+ "last_edit": last_edit,
180
+ "author": author.strip(),
181
+ "post": post.strip(),
182
+ "topic": topic.strip(),
183
+ "attachment": attachment,
184
+ "link": link,
185
+ "original_info": original_info,
186
+ }
187
+
188
+
189
+ # A utility function to save a list (e.g., URLs) to a text file.
190
+ # Each item in the list gets its own line in the file.
191
+ def save_page_file(data, file_name):
192
+ with open(file_name, "w") as filehandle:
193
+ for listitem in data:
194
+ filehandle.write("%s\n" % listitem)
195
+
196
+ def get_post_max_page(url, USER_AGENTS):
197
+ soup = get_web_content(url, USER_AGENTS)
198
+ # Finding the pagination links based on their HTML structure and CSS classes.
199
+ first_td = soup.find('td', class_='middletext')
200
+ nav_pages_links = first_td.find_all('a', class_='navPages')
201
+
202
+ href_links = [int(link.text) if link.text.isdigit() else 0 for link in nav_pages_links]
203
+ return max(href_links)
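In isolation, `get_post_max_page` reduces to: keep the numeric link texts, map everything else to 0, and take the maximum. A minimal standalone sketch (the link texts below are hypothetical stand-ins for the scraped `navPages` anchors):

```python
# Hypothetical navPages link texts for one topic page.
link_texts = ["2", "3", "58", "All", "»"]

# Non-numeric entries ("All", "»") map to 0 so max() still works
# even when a topic has no numbered pagination.
page_numbers = [int(t) if t.isdigit() else 0 for t in link_texts]
max_page = max(page_numbers)
print(max_page)  # 58
```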
204
+
205
+ # This function sets up command-line arguments for the script, allowing users to provide input without modifying the code.
206
+ # Possible inputs include the starting URL, whether or not to update data, the board's name, and how many pages or posts to process.
207
+ def parse_args():
208
+ parser = argparse.ArgumentParser()
209
+ parser.add_argument("url", help="url for the extraction")
210
+ parser.add_argument("--update", help="extract updated data", action="store_true")
211
+ parser.add_argument("--board", help="board name")
212
+ parser.add_argument("--num_of_pages", '-pages', help="number of pages to extract", type=int)
213
+ parser.add_argument("--num_of_posts_start", '-posts', help="the number of posts start to extract", type=int, default=0)
214
+
215
+ parser.add_argument("--remove_emoji", help="remove emoji from the post", action="store_true")
216
+ return vars(parser.parse_args())
217
+
218
+
219
+
220
+ def main(url, update, board, num_of_pages, num_of_posts_start, remove_emoji):
221
+ USER_AGENTS = get_web_component()
222
+ # Ensuring the data directory exists.
223
+ os.makedirs(f"data/{board}/", exist_ok=True)
224
+ pages_file_path = f"data/{board}/pages_urls.txt"
225
+ post_file_path = f"data/{board}/post_urls.txt"
226
+ # If the user chose to update the data, existing files are deleted to make way for new data.
227
+ if update:
228
+ if os.path.exists(pages_file_path):
229
+ os.remove(pages_file_path)
230
+ if os.path.exists(post_file_path):
231
+ os.remove(post_file_path)
232
+
233
+ # If the pages file doesn't exist, the script collects page URLs.
234
+ if not os.path.exists(pages_file_path):
235
+ pages_urls = loop_through_source_url(USER_AGENTS, url, num_of_pages)
236
+ save_page_file(pages_urls, pages_file_path)
237
+ # Reading the existing page URLs from the file.
238
+ with open(pages_file_path, "r") as filehandle:
239
+ pages_urls = [
240
+ current_place.rstrip() for current_place in filehandle.readlines()
241
+ ]
242
+
243
+ # If the posts file doesn't exist, the script collects post URLs.
244
+ if not os.path.exists(post_file_path):
245
+ post_urls = loop_through_pages(USER_AGENTS, pages_urls)
246
+ save_page_file(post_urls, post_file_path)
247
+ # Reading the existing post URLs from the file.
248
+ with open(post_file_path, "r") as filehandle:
249
+ post_urls = [current_place.rstrip() for current_place in filehandle.readlines()]
250
+
251
+ for post_url in tqdm(post_urls[num_of_posts_start:]):
252
+ time.sleep(0.8)
253
+ num_of_post_pages = get_post_max_page(post_url, USER_AGENTS)
254
+ loop_through_posts(USER_AGENTS, post_url, board, num_of_post_pages, remove_emoji)
255
+
256
+
257
+ if __name__ == "__main__":
258
+ main(**parse_args())
scraper/preprocessing/__pycache__/preprocessing_sub_functions.cpython-310.pyc ADDED
Binary file (7.91 kB). View file
 
scraper/preprocessing/preprocessing.py ADDED
@@ -0,0 +1,130 @@
1
+ # Importing standard libraries
2
+ import os
3
+ import glob
4
+ import argparse
5
+ import pandas as pd
6
+ from tqdm import tqdm
7
+ from pathlib import Path
8
+
9
+ # Additional preprocessing functions are imported from another module.
10
+ from preprocessing_sub_functions import *
11
+
12
+
13
+ # This function returns a list of all CSV files in the given directory path.
14
+ def get_files(path):
15
+ return glob.glob(path + "/*.csv")
16
+
17
+
18
+ # This function aims to remove meta information from the text.
19
+ # The specifics of what meta information is removed depends on the function 'remove_meta_info'.
20
+ def raw_preprocess(text):
21
+ text = remove_meta_info(text)
22
+ return text
23
+
24
+
25
+ # A comprehensive text preprocessing function that applies several common preprocessing steps:
26
+ # - URLs are removed from the text.
27
+ # - The entire text is converted to lowercase to ensure uniformity.
28
+ # - Punctuation is stripped from the text.
29
+ # - Extra whitespaces (if any) are removed.
30
+ # - The text is tokenized (split into individual words or tokens).
31
+ # - Contractions (like "can't" or "won't") are expanded to their full forms.
32
+ # - Common words (stopwords) that don't add significant meaning are removed.
33
+ # Finally, the cleaned tokens are joined back into a string.
34
+ def text_preprocess(text):
35
+ text = remove_urls(text)
36
+ text = to_lowercase(text)
37
+ text = remove_sentence_punctuation(text)
38
+ text = remove_extra_whitespace(text)
39
+ tokens = tokenize(text)
40
+ tokens = expand_contractions(tokens)
41
+ tokens = remove_stopwords(tokens)
42
+ text = " ".join(tokens)
43
+ return text
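The pipeline above can be sketched end-to-end without the project's helpers. This is a minimal, dependency-free approximation: a tiny hard-coded stopword list stands in for NLTK's, and contraction expansion is reduced to a two-entry lookup table (both assumptions made so the example runs standalone):

```python
import re

# Stand-ins for NLTK stopwords and the contractions package (assumptions).
STOPWORDS = {"the", "is", "a", "to"}
CONTRACTIONS = {"isn't": "is not", "can't": "can not"}

def tiny_preprocess(text):
    text = re.sub(r"http\S+", "", text)                   # drop URLs
    text = text.lower()                                    # normalise case
    text = text.translate(str.maketrans("", "", "!?;:"))   # strip some punctuation
    tokens = " ".join(text.split()).split(" ")             # collapse whitespace, tokenize
    tokens = [CONTRACTIONS.get(t, t) for t in tokens]      # expand contractions
    tokens = [t for t in tokens if t not in STOPWORDS]     # remove stopwords
    return " ".join(tokens)

print(tiny_preprocess("The miner isn't slow! see http://example.com"))
```

Note that an expanded contraction stays one token here, so "is not" is not re-filtered by the stopword pass; the real pipeline has the same property.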
44
+
45
+
46
+ # This function preprocesses a dataframe.
47
+ # Specific preprocessing steps include:
48
+ # - Removing rows marked as 'deleted'.
49
+ # - Removing posts marked as 'deleted'.
50
+ # - Updating the 'lastEdit' column.
51
+ # - Converting timestamps to a datetime format.
52
+ # - Renaming the 'timestamp' column to 'start_edit'.
53
+ def csv_preprocess(df):
54
+ df = remove_deleted(df)
55
+ df = remove_deleted_post(df)
56
+ df = update_lastEdit(df)
57
+ df = convert_to_datetime(df)
58
+ df.rename(columns={"timestamp": "start_edit"}, inplace=True)
59
+ return df
60
+
61
+
62
+ # This function processes individual CSV files:
63
+ # - Reads the CSV into a DataFrame.
64
+ # - Applies dataframe preprocessing.
65
+ # - Applies raw text preprocessing to the 'post' column.
66
+ # - Saves the raw preprocessed data into a 'raw-data' folder.
67
+ # - Applies comprehensive text preprocessing to the 'post' column.
68
+ # - Saves the fully preprocessed data into a 'preprocessed-data' folder.
69
+ def loop_through_csvs(filePath):
70
+ file = os.path.basename(filePath)
71
+ folder = os.path.basename(os.path.dirname(filePath))
72
+ df = pd.read_csv(filePath)
73
+ df = csv_preprocess(df)
74
+
75
+ # Create a directory for raw data if it doesn't exist.
76
+ raw_folder = Path(f"raw-data/{folder}")
77
+ raw_folder.mkdir(parents=True, exist_ok=True)
78
+
79
+ # Apply raw preprocessing to the 'post' column of the dataframe.
80
+ df["post"] = df["post"].apply(raw_preprocess)
81
+
82
+ # Sort the dataframe by the 'last_edit' column.
83
+ df.sort_values(by=["last_edit"], inplace=True)
84
+
85
+ # Save the raw preprocessed dataframe to a CSV file.
86
+ df.to_csv(f"{raw_folder}/{file}", index=False)
87
+
88
+ # Create a directory for fully preprocessed data if it doesn't exist.
89
+ clean_folder = Path(f"preprocessed-data/{folder}")
90
+ clean_folder.mkdir(parents=True, exist_ok=True)
91
+
92
+ # Apply the comprehensive text preprocessing to the 'post' column and store the result in a new column.
93
+ df["preprocessed_post"] = df["post"].apply(text_preprocess)
94
+
95
+ # Sort the dataframe by the 'last_edit' column again.
96
+ df.sort_values(by=["last_edit"], inplace=True)
97
+
98
+ # Save the fully preprocessed dataframe to a CSV file.
99
+ df.to_csv(f"{clean_folder}/{file}", index=False)
100
+
101
+ return df
102
+
103
+
104
+ # A function to parse command-line arguments.
105
+ # The script expects a 'path' argument which indicates the directory where the raw CSV files are located.
106
+ def parse_args():
107
+ parser = argparse.ArgumentParser()
108
+ parser.add_argument("path", help="path for the extraction")
109
+ return vars(parser.parse_args())
110
+
111
+
112
+ # The main function of the script:
113
+ # - It retrieves all the CSV files from the specified directory.
114
+ # - Loops through each file, applying the preprocessing steps.
115
+ # - If an error occurs during processing, the error message is appended to an 'error_log.txt' file.
116
+ def main(path):
117
+ print(f'Preprocessing data in {path}')
118
+ rawFiles = get_files(path)
119
+ for filePath in tqdm(rawFiles):
120
+ try:
121
+ df = loop_through_csvs(filePath)
122
+ except Exception as e:
123
+ # If an error occurs, log the error message to a file.
124
+ with open(f"{path}/error_log.txt", "a") as f:
125
+ f.write(f"{filePath} -- {e}\n")
126
+ continue
127
+
128
+
129
+ if __name__ == "__main__":
130
+ main(**parse_args())
scraper/preprocessing/preprocessing.sh ADDED
@@ -0,0 +1,7 @@
1
+ python preprocessing.py /local/home/puwong/bitcoin/bitcointalk_crawler/data/mining_support
2
+ python preprocessing.py /local/home/puwong/bitcoin/bitcointalk_crawler/data/mining_speculation
3
+ python preprocessing.py /local/home/puwong/bitcoin/bitcointalk_crawler/data/miners
4
+ python preprocessing.py /local/home/puwong/bitcoin/bitcointalk_crawler/data/hardware
5
+ python preprocessing.py /local/home/puwong/bitcoin/bitcointalk_crawler/data/groupbuys
6
+ python preprocessing.py /local/home/puwong/bitcoin/bitcointalk_crawler/data/pools
7
+ python preprocessing.py /local/home/puwong/bitcoin/bitcointalk_crawler/data/mining
scraper/preprocessing/preprocessing_sub_functions.py ADDED
@@ -0,0 +1,248 @@
1
+ # preprocessing sub functions
2
+
3
+ import re
4
+ import os
5
+ import glob
6
+ import string
7
+ import pandas as pd
8
+ from datetime import datetime
9
+ import nltk
10
+ from nltk.corpus import stopwords
11
+ from nltk.stem import WordNetLemmatizer
12
+ import contractions
13
+
14
+
15
+ def remove_deleted(df):
16
+ r"""
17
+ remove_deleted function.
18
+ This function appears to remove deleted posts from crawled website data.
19
+
20
+ Args:
21
+ df: dataframe of crawled website data.
22
+
23
+ Returns:
24
+ df: dataframe of crawled website data without deleted posts.
25
+ """
26
+ # Remove rows where the 'timestamp' column is numeric
27
+ df = df[~df['timestamp'].str.isnumeric()]
28
+ df.reset_index(drop=True, inplace=True)
29
+ return df
30
+
31
+
32
+ def remove_deleted_post(df):
33
+ r"""
34
+ remove_deleted_post function.
35
+ This function appears to remove deleted posts that are stored in a different format.
36
+
37
+ Args:
38
+ df: dataframe of crawled website data.
39
+
40
+ Returns:
41
+ df: dataframe of crawled website data without deleted posts.
42
+ """
43
+ # Remove rows where the 'post' column contains 'del'
44
+ df = df[df['post'] != 'del']
45
+ df.reset_index(drop=True, inplace=True)
46
+ return df
47
+
48
+
49
+ def update_lastEdit(df):
50
+ r"""
51
+ update_lastEdit function.
52
+ This function appears to fill NaN values in the 'last_edit' column with corresponding values from the 'timestamp' column
53
+
54
+ Args:
55
+ df: dataframe of crawled website data.
56
+
57
+ Returns:
58
+ df: dataframe of crawled website data with updated last_edit.
59
+ """
60
+ df.loc[:, 'last_edit'] = df['last_edit'].fillna(df['timestamp'])
61
+ return df
62
+
63
+
64
+ def preprocess_date(date_str):
65
+ r"""
66
+ preprocess_date function.
67
+ This function appears to convert occurrences of 'Today' in a date string to the current date
68
+ Args:
69
+ date_str: str that contains date information.
70
+
71
+ Returns:
72
+ str that contains date information with updated 'Today' to current date.
73
+ """
74
+ if "Today " in date_str:
75
+ current_date = datetime.now().strftime("%B %d, %Y")
76
+ return date_str.replace("Today", current_date)
77
+ return date_str
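The function above rewrites a forum-style relative timestamp into an absolute one so the later datetime parsing can succeed. The same logic, reproduced standalone:

```python
from datetime import datetime

# Rewrite a forum-style "Today at ..." timestamp into an absolute
# date; other strings pass through unchanged.
def preprocess_date(date_str):
    if "Today " in date_str:
        current_date = datetime.now().strftime("%B %d, %Y")
        return date_str.replace("Today", current_date)
    return date_str

print(preprocess_date("Today at 07:49:45 AM"))
```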
78
+
79
+
80
+ def convert_datetime_with_multiple_formats(date_str, formats):
81
+ r"""
82
+ convert_datetime_with_multiple_formats function.
83
+ This function appears to convert a date string to a datetime object using multiple possible formats.
84
+
85
+ Args:
86
+ date_str: str that contains date information.
87
+ formats: list of possible date formats.
88
+
89
+ Returns:
90
+ datetime object.
91
+ """
92
+ for fmt in formats:
93
+ try:
94
+ return pd.to_datetime(date_str, format=fmt)
95
+ except ValueError:
96
+ continue
97
+ raise ValueError(f"Time data {date_str} doesn't match provided formats")
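The same try-each-format fallback can be sketched with the standard library instead of pandas (a simplification for illustration; the real function returns a pandas timestamp):

```python
from datetime import datetime

# Try each format in turn; only raise if none of them matches.
def parse_with_formats(date_str, formats):
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    raise ValueError(f"Time data {date_str} doesn't match provided formats")

formats = ["%B %d, %Y at %I:%M:%S %p", "%B %d, %Y, %I:%M:%S %p"]
print(parse_with_formats("September 11, 2023, 07:49:45 AM", formats))
```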
98
+
99
+
100
+ def convert_to_datetime(df_):
101
+ r"""
102
+ convert_to_datetime function.
103
+ This function appears to convert 'timestamp' and 'last_edit' columns to datetime format
104
+
105
+ Args:
106
+ df_: dataframe of crawled website data.
107
+
108
+ Returns:
109
+ df: dataframe of crawled website data with datetime format in 'timestamp' and 'last_edit' columns.
110
+ """
111
+ df = df_.copy()
112
+
113
+ # Preprocess 'timestamp' and 'last_edit' columns to handle 'Today' values
114
+ df['timestamp'] = df['timestamp'].apply(preprocess_date)
115
+ df['last_edit'] = df['last_edit'].apply(preprocess_date)
116
+
117
+ # List of potential datetime formats
118
+ datetime_formats = ["%B %d, %Y at %I:%M:%S %p", "%B %d, %Y, %I:%M:%S %p"]
119
+
120
+ df['timestamp'] = df['timestamp'].apply(
121
+ convert_datetime_with_multiple_formats, formats=datetime_formats)
122
+ df['timestamp'] = df['timestamp'].dt.date
123
+ df['last_edit'] = df['last_edit'].apply(
124
+ convert_datetime_with_multiple_formats, formats=datetime_formats)
125
+ df['last_edit'] = df['last_edit'].dt.date
126
+
127
+ return df
128
+
129
+
130
+ def remove_urls(text):
131
+ r"""
132
+ remove_urls function.
133
+ This function appears to remove URLs from a text.
134
+ """
135
+ return re.sub(r'http\S+', '', text)
136
+
137
+ #
138
+
139
+
140
+ def remove_extra_whitespace(text):
141
+ r"""
142
+ remove_extra_whitespace function.
143
+ This function appears to remove extra whitespace characters from a text.
144
+ """
145
+ return ' '.join(text.split())
146
+
147
+
148
+ def remove_special_characters(text):
149
+ r"""
150
+ remove_special_characters function.
151
+ This function appears to remove special characters from a text.
152
+ """
153
+ return re.sub(r'[^\w\s]', '', text)
154
+
155
+
156
+ def to_lowercase(text):
157
+ r"""
158
+ to_lowercase function.
159
+ This function appears to convert a text to lowercase.
160
+ """
161
+ return text.lower()
162
+
163
+
164
+ def remove_meta_info(text):
165
+ r"""
166
+ remove_meta_info function.
167
+ This function appears to remove quote meta information (e.g. "Quote from: user on date ... AM/PM" headers).
168
+ """
169
+ text = str(text)
170
+ return re.sub(r'Quote from: [a-zA-Z0-9_]+ on [a-zA-Z0-9, :]+ (AM|PM)', '', text)
171
+
172
+
173
+ def tokenize(text):
174
+ r"""
175
+ tokenize function.
176
+ This function appears to tokenize a text into individual words.
177
+ """
178
+ return text.split(' ')
179
+
180
+
181
+ def remove_sentence_punctuation(text):
182
+ r"""
183
+ remove_sentence_punctuation function.
184
+ This function appears to remove punctuation from a text, excluding math symbols.
185
+ """
186
+ math_symbols = "+-×*÷/=()[]{},.<>%^"
187
+ punctuations_to_remove = ''.join(
188
+ set(string.punctuation) - set(math_symbols))
189
+ return text.translate(str.maketrans(punctuations_to_remove, ' ' * len(punctuations_to_remove)))
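The translation table built above maps every punctuation character *except* the math symbols to a space, so an expression like `x+y=2` survives while sentence punctuation disappears. The same table in isolation:

```python
import string

# Punctuation minus math symbols, each mapped to a single space.
math_symbols = "+-×*÷/=()[]{},.<>%^"
to_remove = ''.join(set(string.punctuation) - set(math_symbols))
table = str.maketrans(to_remove, ' ' * len(to_remove))

cleaned = "hedge: x+y=2 works; right?".translate(table)
print(cleaned)
```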
190
+
191
+
192
+ def lemmatize_text(text):
193
+ r"""
194
+ lemmatize_text function.
195
+ This function appears to lemmatize text, converting words to their base form.
196
+ """
197
+ lemmatizer = WordNetLemmatizer()
198
+ return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
199
+
200
+
201
+ def replace_numbers(text, replace_with="<NUM>"):
202
+ r"""
203
+ replace_numbers function.
204
+ This function appears to replace numbers in a text with a specified string (default is "<NUM>").
205
+ """
206
+ return re.sub(r'\b\d+\b', replace_with, text)
207
+
208
+
209
+ def remove_stopwords(tokens):
210
+ r"""
211
+ remove_stopwords function.
212
+ This function appears to remove stopwords from a list of tokens.
213
+ """
214
+ stop_words = set(stopwords.words('english'))
215
+ return [word for word in tokens if word not in stop_words]
216
+
217
+
218
+ def expand_contractions(tokens):
219
+ r"""
220
+ expand_contractions function.
221
+ This function appears to expand contractions in a list of tokens (e.g., "isn't" to "is not")
222
+ """
223
+ return [contractions.fix(word) for word in tokens]
224
+
225
+
226
+ def remove_repeated_phrases(text):
227
+ r"""
228
+ remove_repeated_phrases function.
229
+ This function appears to remove repeated words from a text, keeping the first occurrence of each word.
230
+ eg. "hello hello world" -> "hello world"
231
+ """
232
+ phrases = text.split()
233
+ seen = set()
234
+ output = []
235
+ for phrase in phrases:
236
+ if phrase not in seen:
237
+ seen.add(phrase)
238
+ output.append(phrase)
239
+ return ' '.join(output)
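The function above is a word-level deduplication: each word is kept only the first time it is seen, anywhere in the text. The same behaviour, standalone:

```python
# Keep only the first occurrence of each word, preserving order.
def dedupe_words(text):
    seen = set()
    output = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            output.append(word)
    return ' '.join(output)

print(dedupe_words("hello hello world"))  # hello world
```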
240
+
241
+ def remove_emojis(images):
242
+ pattern = r"https://bitcointalk\.org/Smileys/default/[a-zA-Z0-9_-]+\.gif"
243
+ filtered_images = []
244
+ for i in images:
245
+ emoji_urls = re.findall(pattern, i["src"])
246
+ if len(emoji_urls)<1:
247
+ filtered_images.append(i)
248
+ return filtered_images
scraper/sort.py ADDED
@@ -0,0 +1,53 @@
1
+ import os
2
+ import pandas as pd
3
+ from pathlib import Path
4
+ from tqdm import tqdm
5
+
6
+ # Function to process and sort CSV files within a given folder
7
+
8
+
9
+ def process_csvs(folder_path, new_folder_name):
10
+ # Extracting the name of the board from the folder path
11
+ board = os.path.basename(folder_path)
12
+ # Creating a new directory to store the sorted CSV files
13
+ sorted_folder = Path(new_folder_name)
14
+ sorted_folder.mkdir(parents=True, exist_ok=True)
15
+
16
+ # Retrieving all CSV files from the given folder path
17
+ all_files = [
18
+ os.path.join(folder_path, file)
19
+ for file in os.listdir(folder_path)
20
+ if file.endswith(".csv")
21
+ ]
22
+ # Reading each CSV file into a dataframe
23
+ list_of_dataframes = [pd.read_csv(file) for file in all_files]
24
+ # Combining all dataframes into a single dataframe
25
+ combined_df = pd.concat(list_of_dataframes, ignore_index=True)
26
+
27
+ # Sorting the combined dataframe based on the "last_edit" column
28
+ combined_df = combined_df.sort_values(by="last_edit")
29
+
30
+ # Splitting the sorted dataframe into chunks of 10,000 rows each
31
+ num_chunks = len(combined_df) // 10000 + (1 if len(combined_df) % 10000 else 0)
32
+ chunks = [combined_df.iloc[i * 10000 : (i + 1) * 10000] for i in range(num_chunks)]
33
+
34
+ # Saving each chunk as a separate CSV with a filename based on date ranges
35
+ for idx, chunk in tqdm(enumerate(chunks)):
36
+ start_date = pd.to_datetime(chunk["last_edit"].iloc[0]).strftime("%d%m%y")
37
+ end_date = pd.to_datetime(chunk["last_edit"].iloc[-1]).strftime("%d%m%y")
38
+ filename = f"BitcoinForum_{board}_{start_date}_to_{end_date}.csv"
39
+ chunk.to_csv(os.path.join(sorted_folder, filename), index=False)
40
+
41
+
42
+ folder_paths = [
43
+ "./raw-data",
44
+ "./preprocessed-data",
45
+ ]
46
+
47
+ # Iterating over each folder path and processing its CSV files
48
+ for folder_path in folder_paths:
49
+ folder_name = os.path.basename(folder_path)
50
+ new_folder_name = f"sorted-{folder_name}"
51
+ for folder in tqdm(os.listdir(folder_path)):
52
+ if os.path.isdir(os.path.join(folder_path, folder)):
53
+ process_csvs(os.path.join(folder_path, folder), new_folder_name)
scraper/topic_crawling.py ADDED
@@ -0,0 +1,243 @@
1
+ # Importing necessary libraries:
2
+ # - os, json, time for file, data and time operations respectively.
3
+ # - requests for making HTTP requests.
4
+ # - BeautifulSoup for parsing HTML content.
5
+ # - Other imports for logging, data manipulation, progress indication, and more.
6
+ import os
7
+ import json
8
+ import time
9
+ import munch
10
+ import requests
11
+ import argparse
12
+ import pandas as pd
13
+ from tqdm import tqdm
14
+ from datetime import date
15
+ from loguru import logger
16
+ from random import randint
17
+ from bs4 import BeautifulSoup, NavigableString
18
+ import random
19
+
20
+ from preprocessing.preprocessing_sub_functions import remove_emojis
21
+ from torch import save,load
22
+
23
+
24
+ # This function reads a JSON file named "website_format.json".
25
+ # The file contains a list of user agents.
26
+ # User agents are strings that browsers send to websites to identify themselves.
27
+ # This list is likely used to rotate between different user agents when making requests,
28
+ # making the scraper seem like different browsers and reducing the chances of being blocked.
29
+ def get_web_component():
30
+ # Opening JSON file
31
+ with open("website_format.json") as json_file:
32
+ website_format = json.load(json_file)
33
+ website_format = munch.munchify(website_format)
34
+ return website_format.USER_AGENTS
35
+
36
+
37
+ # This function fetches a webpage's content.
38
+ # It randomly selects a user agent from the provided list to make the request.
39
+ # After fetching, it uses BeautifulSoup to parse the page's HTML content.
40
+ from collections import OrderedDict
41
+ cache = OrderedDict()
42
+
43
+ def get_web_content(url, USER_AGENTS):
44
+ global cache # Refer to the global cache variable
45
+
46
+ # Check if the URL is in cache
47
+ if url in cache:
48
+ # print(f"Using cache for {url}")
49
+ return cache[url]
50
+ else:
51
+ # print(f"Waiting to fetch {url}")
52
+ time.sleep(0.7) # Simulate delay
53
+ # print(f"Fetching {url}")
54
+
55
+ # Choose a random user agent
56
+ random_agent = USER_AGENTS[random.randint(0, len(USER_AGENTS) - 1)]
57
+ headers = {"User-Agent": random_agent}
58
+
59
+ # Fetch the web content
60
+ req = requests.get(url, headers=headers)
61
+ req.encoding = req.apparent_encoding
62
+ soup = BeautifulSoup(req.text, features="lxml")
63
+
64
+ # Add the new content to the cache
65
+ # If the cache already has 3 items, remove the oldest one
66
+ if len(cache) >= 3:
67
+ cache.popitem(last=False) # Remove the oldest item
68
+ cache[url] = soup # Add the new item
69
+
70
+ # print("Fetching done")
71
+ return soup
72
+
73
+
74
+ # This function extracts pagination links from a page.
75
+ # These links point to other pages of content, often seen at the bottom of forums or search results.
76
+ # The function returns both the individual page links and the "next" link,
77
+ # which points to the next set of results.
78
+ def get_pages_urls(url, USER_AGENTS, next_50_pages):
79
+ soup = get_web_content(url, USER_AGENTS)
80
+ # Finding the pagination links based on their HTML structure and CSS classes.
81
+ first_td = soup.find('td', class_='middletext')
82
+ nav_pages_links = first_td.find_all('a', class_='navPages')
83
+ href_links = [link['href'] for link in nav_pages_links]
84
+
85
+ next_50_link = None
86
+ if next_50_pages:
87
+ next_50_link = href_links[-3] # HACK: Assuming the third-last link is the "next" link.
88
+
89
+ href_links.insert(0, url)
90
+ return href_links, next_50_link
91
+
92
+ # This function loops through the main page and its paginated versions to collect URLs.
93
+ # It repeatedly calls 'get_pages_urls' to fetch batches of URLs until the desired number (num_of_pages) is reached.
94
+ def loop_through_source_url(USER_AGENTS, url, num_of_pages):
95
+ pages_urls = []
96
+ while len(pages_urls) < num_of_pages:
97
+ next_50_pages = num_of_pages >= 50
98
+ href_links, next_50_link = get_pages_urls(url, USER_AGENTS, next_50_pages)
99
+ pages_urls.extend(href_links)
100
+ pages_urls = list(dict.fromkeys(pages_urls)) # Remove any duplicate URLs.
101
+ url = next_50_link
102
+ return pages_urls
103
+
104
+ def get_subpages_urls(url, USER_AGENTS):
105
+ soup = get_web_content(url, USER_AGENTS)
106
+ middletext = soup.find('td', class_='middletext')
107
+ nav_pages_links = middletext.find_all('a', class_='navPages')
108
+
109
+ return nav_pages_links[:-1]
110
+
111
+ def loop_through_posts(USER_AGENTS, post_url, board, num_of_pages, remove_emoji):
112
+ # print("loop_through_posts: num_of_pages =",num_of_pages)
113
+ try:
114
+ href_links = loop_through_source_url(USER_AGENTS, post_url, num_of_pages)
115
+
116
+ df = pd.DataFrame(columns=['timestamp', 'last_edit', 'author', 'post', 'topic', 'attachment', 'link', 'original_info'])
117
+
118
+ for url in href_links[:1]: # we only need the first page to analyse the category of the thread
119
+ df = read_subject_page(USER_AGENTS, url, df, remove_emoji)
120
+
121
+ topic_id = post_url.split('topic=')[1]
122
+ df.to_csv(f'data/{board}/data_{topic_id}.csv', mode='w', index=False)
123
+
124
+ except Exception as e:
125
+ print(e)
126
+ with open(f"data/{board}/error_log.txt", "a") as f:
127
+ f.write(f"{post_url}\n -- {e}\n")
128
+
129
+ # This function processes a post page. It extracts various details like timestamps, author information, post content, topic, attachments, links, and original HTML information.
130
+ # The function returns a dictionary containing all this extracted data.
131
+ def read_subject_page(USER_AGENTS, post_url, df, remove_emoji):
132
+ soup = get_web_content(post_url, USER_AGENTS)
133
+ form_tag = soup.find('form', id='quickModForm')
134
+ table_tag = form_tag.find('table', class_='bordercolor')
135
+ td_tag = table_tag.find_all('td', class_='windowbg')
136
+ td_tag.extend(table_tag.find_all('td', class_='windowbg2'))
137
+
138
+ for comment in td_tag:
139
+ res = extract_useful_content_windowbg(comment, remove_emoji)
140
+ if res is not None:
141
+ df = pd.concat([df, pd.DataFrame([res])])
142
+
143
+ return df
144
+
145
+ # This function extracts meaningful content from a given HTML element (`tr_tag`). This tag is likely a row in a table, given its name.
146
+ # The function checks the presence of specific tags and classes within this row to extract information such as timestamps, author, post content, topic, attachments, and links.
147
+ # The extracted data is returned as a dictionary.
148
+ def extract_useful_content_windowbg(tr_tag, remove_emoji=True):
+     """
+     Fields extracted per post row:
+     - timestamp of the post (e.g. September 11, 2023, 07:49:45 AM)
+     - author of the post (e.g. SupermanBitcoin)
+     - the post text itself
+     - the topic where the post was posted (e.g. [INFO - DISCUSSION] Security Budget Problem)
+     - number of characters in the post (an integer)
+     - attachment: '1' if the post contains at least one attachment (image, video, etc.), else '0'
+     - link: '1' if the post contains at least one link, else '0'
+     """
+     headerandpost = tr_tag.find('td', class_='td_headerandpost')
+     if not headerandpost:
+         return None
+
+     timestamp = headerandpost.find('div', class_='smalltext').get_text()
+     timestamps = timestamp.split('Last edit: ')
+     timestamp = timestamps[0].strip()
+     last_edit = None
+     if len(timestamps) > 1:
+         if 'Today ' in timestamps[1]:
+             last_edit = date.today().strftime("%B %d, %Y")+', '+timestamps[1].split('by')[0].split("Today at")[1].strip()
+         else:
+             last_edit = timestamps[1].split('by')[0].strip()
+
+     poster_info_tag = tr_tag.find('td', class_='poster_info')
+     anchor_tag = poster_info_tag.find('a')
+     author = "Anonymous" if anchor_tag is None else anchor_tag.get_text()
+
+     link = 0
+
+     post_ = tr_tag.find('div', class_='post')
+     texts = []
+     for child in post_.children:
+         if isinstance(child, NavigableString):
+             texts.append(child.strip())
+         elif child.has_attr('class') and 'ul' in child['class']:
+             link = 1
+             texts.append(child.get_text(strip=True))
+     post = ' '.join(texts)
+
+     topic = headerandpost.find('div', class_='subject').get_text()
+
+     image = headerandpost.find('div', class_='post').find_all('img')
+     if remove_emoji:
+         image = remove_emojis(image)
+     image_ = min(len(image), 1)
+
+     video = headerandpost.find('div', class_='post').find('video')
+     video_ = 0 if video is None else 1
+     attachment = max(image_, video_)
+
+     original_info = headerandpost
+
+     return {
+         'timestamp': timestamp,
+         'last_edit': last_edit,
+         'author': author.strip(),
+         'post': post.strip(),
+         'topic': topic.strip(),
+         'attachment': attachment,
+         'link': link,
+         'original_info': original_info,
+     }
+
+
+ # A utility function to save a list (e.g., URLs) to a text file.
+ # Each item in the list gets its own line in the file.
+ def save_page_file(data, file_name):
+     with open(file_name, 'w') as filehandle:
+         for listitem in data:
+             filehandle.write('%s\n' % listitem)
+
+
+ # This function sets up command-line arguments for the script, allowing users to provide input without modifying the code.
+ # Possible inputs include the starting URL, the board's name, how many pages to process, and whether to strip emoji.
+ def parse_args():
+     """Parse command-line arguments."""
+     parser = argparse.ArgumentParser()
+     parser.add_argument("url", help="url for the extraction")
+     parser.add_argument("--board", help="board name")
+     parser.add_argument("--num_of_pages", '-pages', help="number of pages to extract", type=int)
+     # store_true is only valid on optional arguments, so remove_emoji must be a -- flag
+     parser.add_argument("--remove_emoji", help="remove emoji from the post", action="store_true")
+     return vars(parser.parse_args())
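A minimal standalone sketch of how these arguments parse; argparse rejects `action="store_true"` on positional arguments, so `remove_emoji` has to be an optional `--` flag. The argv list below is illustrative:

```python
import argparse

# Mirrors the arguments defined in parse_args(); the argv list is illustrative.
parser = argparse.ArgumentParser()
parser.add_argument("url", help="url for the extraction")
parser.add_argument("--board", help="board name")
parser.add_argument("--num_of_pages", "-pages", type=int, help="number of pages to extract")
# store_true is only valid on optional arguments, hence the leading dashes
parser.add_argument("--remove_emoji", action="store_true", help="remove emoji from the post")

args = vars(parser.parse_args([
    "https://bitcointalk.org/index.php?topic=28402.0",
    "--board", "miners",
    "--num_of_pages", "843",
    "--remove_emoji",
]))
print(args)
```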
+
+
+ def main(url, board, num_of_pages, remove_emoji):
+     USER_AGENTS = get_web_component()
+     loop_through_posts(USER_AGENTS, url, board, num_of_pages, remove_emoji)
+
+
+ if __name__ == "__main__":
+     main(**parse_args())
+
+ # python topic_crawling.py https://bitcointalk.org/index.php?topic=28402.0 --board miners --num_of_pages 843
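The `'Last edit: '` splitting done inside `extract_useful_content_windowbg` can be exercised in isolation. The sample string below is hypothetical but follows the forum's `smalltext` format (post timestamp, then an optional "Last edit: … by <author>" suffix):

```python
from datetime import date

# Hypothetical sample of the smalltext content of a post header
raw = "September 11, 2023, 07:49:45 AM Last edit: September 12, 2023, 01:00:00 AM by SupermanBitcoin"

timestamps = raw.split('Last edit: ')
timestamp = timestamps[0].strip()
last_edit = None
if len(timestamps) > 1:
    if 'Today ' in timestamps[1]:
        # relative dates become today's absolute date plus the time of day
        last_edit = date.today().strftime("%B %d, %Y") + ', ' + timestamps[1].split('by')[0].split("Today at")[1].strip()
    else:
        # absolute dates: drop the trailing "by <author>" part
        last_edit = timestamps[1].split('by')[0].strip()

print(timestamp)   # September 11, 2023, 07:49:45 AM
print(last_edit)   # September 12, 2023, 01:00:00 AM
```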
scraper/website_format.json ADDED
@@ -0,0 +1,130 @@
+ {
+     "USER_AGENTS": [
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
+
+         "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
+
+         "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
+
+         "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
+
+         "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
+
+         "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
+         "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
+
+         "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
+
+         "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
+
+         "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
+
+         "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
+
+         "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
+
+         "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
+
+         "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30",
+
+         "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
+
+         "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
+
+         "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
+
+         "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
+
+         "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
+
+         "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
+
+         "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
+
+         "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
+
+         "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
+
+         "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
+
+         "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
+
+         "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
+
+         "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
+
+         "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
+
+         "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
+
+         "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
+
+         "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
+
+         "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
+
+         "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
+
+         "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
+
+         "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
+
+         "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
+
+         "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
+
+         "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
+
+         "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30",
+
+         "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
+
+         "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)"
+     ]
+ }
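`website_format.json` is presumably what `get_web_component` loads so the scraper can rotate `User-Agent` headers between requests (its implementation is not part of this diff). A minimal sketch of that consumption pattern, using an inline stand-in for the file contents:

```python
import json
import random

# Inline stand-in for the contents of website_format.json (the real file
# holds a long list of agents; a single entry keeps this sketch deterministic).
config_text = '{"USER_AGENTS": ["Mozilla/5.0 (test agent)"]}'
config = json.loads(config_text)

# Pick a random agent for the request headers, as a rotating-UA scraper would
headers = {"User-Agent": random.choice(config["USER_AGENTS"])}
print(headers["User-Agent"])
```

Duplicate entries in the list simply weight `random.choice` toward those agents; they are harmless but could be deduplicated.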