BERTopic approaches topic modeling as a clustering task and attempts to cluster semantically similar documents to extract common topics. A disadvantage of such an approach is that each document is assigned to a single cluster and therefore also a single topic. In practice, documents may contain a mixture of topics. This can be accounted for by splitting up the documents into sentences and feeding those to BERTopic.
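If you want to go the sentence-splitting route, a minimal sketch could look like the following. Note that NLTK's `sent_tokenize` is only one possible sentence splitter and is an assumption here, not something BERTopic requires:

```python
from bertopic import BERTopic
from nltk.tokenize import sent_tokenize  # pip install nltk; run nltk.download("punkt") once
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Split every document into sentences so that each sentence can receive its own topic
sentences = [sentence for doc in docs for sentence in sent_tokenize(doc)]

topic_model = BERTopic().fit(sentences)
```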
Another option is to use a cluster model that can perform soft clustering, like HDBSCAN. As BERTopic focuses on modularity, we may still want to model that mixture of topics even when we are using a hard-clustering model, like k-Means, without the need to split up our documents. This is where `.approximate_distribution` comes in!
<br>
<div class="svg_image">
--8<-- "docs/getting_started/distribution/approximate_distribution.svg"
</div>
<br>
To perform this approximation, each document is split into tokens according to the tokenizer provided in the `CountVectorizer`. Then, a **sliding window** is applied to each document, creating subsets of the document. For example, with a window size of 3 and a stride of 1, the document:

> Solving the right problem is difficult.

can be split up into `solving the right`, `the right problem`, `right problem is`, and `problem is difficult`. These are called token sets. For each of these token sets, we calculate their c-TF-IDF representation and find out how similar they are to the previously generated topics. Then, the similarities of each token set to the topics are summed to create a topic distribution for the entire document.
Although it is often said that documents can contain a mixture of topics, these are often modeled by assigning each word to a single topic. With this approach, we take into account that there may be multiple topics for a single word. We can make this multiple-topic word assignment a bit more accurate by then splitting these token sets up into individual tokens and assigning the topic distributions for each token set to each individual token. That way, we can visualize the extent to which a certain word contributes to a document's topic distribution.
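To make the token-set construction concrete, below is a small standalone sketch that reproduces the example above with a window of 3 and a stride of 1. It is purely illustrative; BERTopic tokenizes with the `CountVectorizer` rather than the simple regex used here:

```python
import re

def token_sets(document, window=3, stride=1):
    """Illustrative sliding window over tokens (not BERTopic's internal code)."""
    tokens = re.findall(r"\w+", document.lower())
    return [
        " ".join(tokens[start:start + window])
        for start in range(0, len(tokens) - window + 1, stride)
    ]

print(token_sets("Solving the right problem is difficult."))
# ['solving the right', 'the right problem', 'right problem is', 'problem is difficult']
```

Each of these token sets then gets a c-TF-IDF representation that is compared against the topic representations, and the resulting similarities are summed per document.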
## **Example**

To calculate our topic distributions, we first need to fit a basic topic model:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic().fit(docs)
```
After doing so, we can approximate the topic distributions for our documents:

```python
topic_distr, _ = topic_model.approximate_distribution(docs)
```
The resulting `topic_distr` is an *n* x *m* matrix where *n* is the number of documents and *m* the number of topics. We can then visualize the distribution of topics in a document:

```python
topic_model.visualize_distribution(topic_distr[1])
```
<iframe src="distribution_viz.html" style="width:1000px; height: 620px; border: 0px;"></iframe>
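Besides visualizing, you can inspect `topic_distr` directly. The snippet below is a quick sketch using plain NumPy rather than a BERTopic function, and it assumes that column *i* of the matrix corresponds to topic *i*:

```python
import numpy as np

# Find the three topics that contribute most to document 1
doc_id = 1
top_topics = np.argsort(topic_distr[doc_id])[::-1][:3]

for topic_id in top_topics:
    print(topic_id, round(topic_distr[doc_id][topic_id], 3), topic_model.get_topic(topic_id)[:3])
```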
Although a topic distribution is nice, we may want to see how each token contributes to a specific topic. To do so, we first need to calculate the topic distributions on a token level and then visualize the results:

```python
# Calculate the topic distributions on a token level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)

# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(docs[1], topic_token_distr[1])
df
```
<br><br>
<img src="distribution.png">
<br><br>
!!! tip
    You can also approximate the topic distributions for unseen documents. It will not be as accurate as `.transform`, but it is quite fast and can serve you well in a production setting.
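    For example, a minimal sketch (the unseen document below is made up for illustration; the fitted `topic_model` from above is reused):

    ```python
    # A hypothetical unseen document that was not part of the training data
    new_docs = ["The rover sent back new images of the Martian surface."]
    topic_distr, _ = topic_model.approximate_distribution(new_docs)
    ```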
!!! note
    To get the stylized dataframe for `.visualize_approximate_distribution` you will need to have Jinja installed. If you do not have this installed, an unstylized dataframe will be returned instead. You can install Jinja via `pip install jinja2`.
## **Parameters**

There are a few parameters of interest, which are discussed below.
### **batch_size**

Creating token sets for each document can result in quite a large list of token sets. Computing the similarity of these token sets with the topics can produce a matrix so large that it no longer fits into memory. To circumvent this, we can process batches of documents instead to minimize the memory overhead. The value of `batch_size` indicates the number of documents that will be processed at once:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, batch_size=500)
```
### **window**

The number of tokens that are combined into a token set is defined by the `window` parameter. Seeing as we are performing a sliding window, we can change the size of the window. A larger window takes more tokens into account, but setting it too large can result in considering too much information. Personally, I like to have this window between 4 and 8:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, window=4)
```
### **stride**

The sliding window that is performed on a document shifts, by default, 1 token to the right each time to create its token sets. As a result, especially with large windows, a single token gets judged several times. We can use the `stride` parameter to increase the number of tokens the window shifts to the right. By increasing this value, we judge each token less frequently, which often results in a much faster calculation. Combining this parameter with `window` is preferred. For example, if we have a very large dataset, we can set `stride=4` and `window=8` to judge token sets that contain 8 tokens but that are shifted by 4 tokens each time. As a result, this speeds up the calculation quite a bit:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, window=8, stride=4)
```
### **use_embedding_model**

By default, we compare the c-TF-IDF representations of the token sets with those of all topics. Due to its bag-of-words representation, this is quite fast. However, you might want to use the selected `embedding_model` for this comparison instead. Do note that, due to the many token sets, this is often computationally quite a bit slower:

```python
topic_distr, _ = topic_model.approximate_distribution(docs, use_embedding_model=True)
```