{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Preprocesses e621's [data exports](https://e621.net/db_export/) and stores them in feather files. The feather format was chosen because it loads quickly!\n", "\n", "Usage notes:\n", "* Feel free to change `INPUT_FOLDER` and `OUTPUT_FOLDER` to anywhere you want to store your data.\n", "* `DATE` is whatever date is on your input files.\n", "* Files will only be generated if they don't already exist. Delete them if you want to regenerate." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas\n", "import os\n", "from tqdm.notebook import tqdm\n", "tqdm.pandas()\n", "\n", "INPUT_FOLDER = \"H:/Data/TagSuggest/e621_metadata\"\n", "OUTPUT_FOLDER = \"H:/Data/TagSuggest/e621_dataframes\"\n", "DATE = \"2023-08-23\"\n", "\n", "os.makedirs(OUTPUT_FOLDER, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing to process is the tags themselves, since we'll be using their IDs\n", "* `tag_id` - An arbitrary number from e621's database. Very useful.\n", "* `name` - The tag!\n", "* `category` - A number to say whether it's an artist, species, and so on. Constants for these are defined elsewhere, this notebook doesn't need to know them.\n", "* `post_count` - The approximate number of posts the tag has. It's not perfectly aligned with the actual post data, but it's close enough for most purposes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tags_file = f\"{OUTPUT_FOLDER}/tags.feather\"\n", "if os.path.exists(tags_file):\n", " tags = pandas.read_feather(tags_file)\n", "else:\n", " tags = pandas.read_csv(f\"{INPUT_FOLDER}/tags-{DATE}.csv.gz\", na_values=[], keep_default_na=False).astype({\"name\":\"string\"}).rename(columns={\"id\": \"tag_id\"}).reset_index(drop=True)\n", " tags.to_feather(tags_file)\n", "tags.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tags_by_name = tags.copy(deep=True)\n", "tags_by_name.set_index(\"name\", inplace=True)\n", "tags_by_name.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This part takes a couple minutes! There are about 4 million posts to go through, and each one has the tags listed in string format, so they have to be parsed and translated to IDs for more compact storage. The progress bar is based on exactly four million posts, which is low now, but it's not worth actually counting the lines. Two dataframes are generated:\n", "\n", "* The posts file contains most of the post data.\n", " * `post_id` - From e621. Used for linking to the other dataframe.\n", " * `rating` - Whether the post is safe, questionable, or explicit. Handy if you want to generate SFW wildcards.\n", " * `score` - The overall user score of the post, if you're curious. Score doesn't necessarily correlate to aesthetic quality; posts can be highly upvoted because of their content or themes irrespective of their art style.\n", " * `up_score` - The upvote component of the score. Just guessing, but people probably upvote and downvote for totally different reasons, so it could be useful.\n", " * `down_score` - The downvote component of the score as a negative. If it's big, it's probably an unpopular niche kink or a political meme or something.\n", "* The post tags file stores the links between posts and tags as numbers. 
It's surprisingly large." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "post_tags_file = f\"{OUTPUT_FOLDER}/post_tags.feather\"\n", "posts_file = f\"{OUTPUT_FOLDER}/posts.feather\"\n", "if os.path.exists(post_tags_file) and os.path.exists(posts_file):\n", " post_tags = pandas.read_feather(post_tags_file)\n", " posts = pandas.read_feather(posts_file)\n", "else:\n", " post_tags_parts = []\n", " posts_parts = []\n", " with pandas.read_csv(f\"{INPUT_FOLDER}/posts-{DATE}.csv.gz\", usecols=[\"id\", \"tag_string\", \"is_deleted\", \"is_pending\", \"rating\", \"score\", \"up_score\", \"down_score\"], chunksize=100_000) as reader:\n", " progress = tqdm(total=4_000_000)\n", " for posts in reader:\n", " post_count = len(posts)\n", " posts: pandas.DataFrame\n", " posts = posts[posts[\"is_deleted\"] == \"f\"]\n", " posts = posts[posts[\"is_pending\"] == \"f\"]\n", " posts = posts.rename(columns={\"id\": \"post_id\"})\n", " posts_parts.append(posts[[\"post_id\", \"rating\", \"score\", \"up_score\", \"down_score\"]].astype({\"rating\":\"string\"}))\n", " posts = posts[[\"post_id\", \"tag_string\"]].set_index(\"post_id\")\n", " posts = posts.apply(lambda x: x.str.split(' ')).explode(\"tag_string\")\n", " posts = posts.join(tags_by_name, on=\"tag_string\")[[\"tag_id\"]].reset_index()\n", " post_tags_parts.append(posts[[\"post_id\", \"tag_id\"]])\n", " progress.update(post_count)\n", " post_tags = pandas.concat(post_tags_parts)\n", " post_tags.reset_index(drop=True, inplace=True)\n", " post_tags.to_feather(post_tags_file)\n", " posts = pandas.concat(posts_parts)\n", " posts.reset_index(drop=True, inplace=True)\n", " posts.to_feather(posts_file)\n", "print(\"\\npost_tags\")\n", "post_tags.info()\n", "print(\"\\nposts\")\n", "posts.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also generate and store two different ways of looking at the `post_tags` frame, because it's a lot faster to cache this once than to 
join a many-to-many frame that size for every single query. This can also take a few minutes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "posts_by_tag_file = f\"{OUTPUT_FOLDER}/posts_by_tag.feather\"\n", "if os.path.exists(posts_by_tag_file):\n", " posts_by_tag = pandas.read_feather(posts_by_tag_file)\n", "else:\n", " posts_by_tag = post_tags.groupby(\"tag_id\").progress_aggregate(list)\n", " posts_by_tag.reset_index(inplace=True)\n", " posts_by_tag.to_feather(posts_by_tag_file)\n", "posts_by_tag.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tags_by_post_file = f\"{OUTPUT_FOLDER}/tags_by_post.feather\"\n", "if os.path.exists(tags_by_post_file):\n", " tags_by_post = pandas.read_feather(tags_by_post_file)\n", "else:\n", " tags_by_post = post_tags.groupby(\"post_id\").progress_aggregate(list)\n", " tags_by_post.reset_index(inplace=True)\n", " tags_by_post.to_feather(tags_by_post_file)\n", "tags_by_post.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also make a SFW post tags list, then use it to build a list of tags that only appear in SFW posts. Optional." 
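, "\n", "As a sketch of what these optional frames enable once built (the 100-post threshold here is arbitrary, not from this notebook):\n", "\n", "```python\n", "# Reasonably common tags that appear in safe posts, sorted by safe-post count.\n", "common = safe_tags[safe_tags[\"post_count\"] >= 100]\n", "print(common.sort_values(\"post_count\", ascending=False)[\"name\"].head(20).to_list())\n", "```"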
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "safe_posts_by_tag_file = f\"{OUTPUT_FOLDER}/safe_posts_by_tag.feather\"\n", "if os.path.exists(safe_posts_by_tag_file):\n", " safe_posts_by_tag = pandas.read_feather(safe_posts_by_tag_file)\n", "else:\n", " safe_posts_by_tag = post_tags.set_index(\"post_id\").join(posts.set_index(\"post_id\"))\n", " safe_posts_by_tag = safe_posts_by_tag[safe_posts_by_tag[\"rating\"].isin([\"s\"])].reset_index()\n", " safe_posts_by_tag = safe_posts_by_tag[[\"tag_id\", \"post_id\"]].groupby(\"tag_id\").progress_aggregate(list)\n", " safe_posts_by_tag.reset_index(inplace=True)\n", " safe_posts_by_tag.to_feather(safe_posts_by_tag_file)\n", "safe_posts_by_tag.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "safe_tags_by_post_file = f\"{OUTPUT_FOLDER}/safe_tags_by_post.feather\"\n", "if os.path.exists(safe_tags_by_post_file):\n", " safe_tags_by_post = pandas.read_feather(safe_tags_by_post_file)\n", "else:\n", " safe_tags_by_post = post_tags.set_index(\"post_id\").join(posts.set_index(\"post_id\"))\n", " safe_tags_by_post = safe_tags_by_post[safe_tags_by_post[\"rating\"].isin([\"s\"])].reset_index()\n", " safe_tags_by_post = safe_tags_by_post[[\"tag_id\", \"post_id\"]].groupby(\"post_id\").progress_aggregate(list)\n", " safe_tags_by_post.reset_index(inplace=True)\n", " safe_tags_by_post.to_feather(safe_tags_by_post_file)\n", "safe_tags_by_post.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "safe_tags_file = f\"{OUTPUT_FOLDER}/safe_tags.feather\"\n", "if os.path.exists(safe_tags_file):\n", " safe_tags = pandas.read_feather(safe_tags_file)\n", "else:\n", " safe_tags = safe_posts_by_tag.set_index(\"tag_id\").join(tags.set_index(\"tag_id\"), how=\"inner\")\n", " safe_tags[\"post_count\"] = safe_tags[\"post_id\"].apply(len)\n", " safe_tags = safe_tags[[\"name\", \"category\", 
\"post_count\"]]\n", " safe_tags.reset_index(inplace=True)\n", " safe_tags.to_feather(safe_tags_file)\n", "safe_tags.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And lastly, parse and store the implications file. Useful for filtering out tag suggestions that are implied by higher scoring ones, and for building the species hierarchy." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "implications_file = f\"{OUTPUT_FOLDER}/implications.feather\"\n", "if os.path.exists(implications_file):\n", " implications = pandas.read_feather(implications_file)\n", "else:\n", " implications = pandas.read_csv(f\"{INPUT_FOLDER}/tag_implications-{DATE}.csv.gz\")\\\n", " .join(tags_by_name, on=\"antecedent_name\", how=\"inner\")\\\n", " .join(tags_by_name, on=\"consequent_name\", rsuffix=\"_con\")\\\n", " [[\"tag_id\", \"tag_id_con\"]]\\\n", " .rename(columns={\"tag_id\": \"antecedent_id\", \"tag_id_con\": \"consequent_id\"})\n", " implications.reset_index(inplace=True,drop=True)\n", " implications.to_feather(implications_file)\n", "implications.info()" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }