{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "AXHLDxJdRzBi" }, "source": [ "# **BERTopic - Tutorial**\n", "We start by installing BERTopic from PyPI before preparing the data.\n", "\n", "**NOTE**: Make sure to select a GPU runtime. Otherwise, the model can take quite some time to create the document embeddings!" ] }, { "cell_type": "markdown", "metadata": { "id": "Y3VGFZ1USMTu" }, "source": [ "# **Prepare data**\n", "For this example, we use the popular 20 Newsgroups dataset, which contains roughly 18,000 newsgroup posts on 20 topics." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JJij3WP6SEQD", "outputId": "6b4d3f7b-9f7f-426f-dea8-ab1e5083eb94" }, "outputs": [], "source": [ "from bertopic import BERTopic\n", "from sklearn.datasets import fetch_20newsgroups\n", "\n", "docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']" ] }, { "cell_type": "markdown", "metadata": { "id": "SBcNmZJzSTY8" }, "source": [ "# **Create Topics**\n", "We select \"english\" as the main language for our documents. If you want a multilingual model that supports 50+ languages, select \"multilingual\" instead." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "TfhfzqkoSJ1I", "outputId": "d51e1f2c-d5db-44b6-d172-58881d54d8e6" }, "outputs": [], "source": [ "model = BERTopic(language=\"english\")\n", "topics, probs = model.fit_transform(docs)" ] }, { "cell_type": "markdown", "metadata": { "id": "0ua80usww-rj" }, "source": [ "We can then extract the most frequent topics:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 202 }, "id": "nNptKBzHSbyS", "outputId": "25855ec5-d642-4864-cc64-404df61fabc6" }, "outputs": [ { "data": { "text/html": [ "
|   | Topic | Count |
|---|---|---|
| 0 | -1 | 6224 |
| 1 | 0 | 1833 |
| 2 | 1 | 586 |
| 3 | 2 | 526 |
| 4 | 3 | 480 |