	Visualizing BERTopic and its derivatives is important for understanding the model, how it works, and, more importantly, where it works.
	Because topic modeling can be quite subjective, it is difficult for users to validate their models. Inspecting the topics and checking
	whether they make sense is an important step in alleviating this issue.

	## Visualize Topics
	After having trained our `BERTopic` model, we can iteratively go through hundreds of topics to get a good
	understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
	Instead, we can visualize the topics that were generated in a way very similar to
	[LDAvis](https://github.com/cpsievert/LDAvis).

	We embed our c-TF-IDF representation of the topics in 2D using UMAP and then visualize the two dimensions using
	Plotly so that we can create an interactive view.

	First, we need to train our model:

	```python
	from bertopic import BERTopic
	from sklearn.datasets import fetch_20newsgroups

	docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
	topic_model = BERTopic()
	topics, probs = topic_model.fit_transform(docs)
	```

	Then, we can call `.visualize_topics` to create a 2D representation of the topics. The resulting graph is an
	interactive Plotly figure that can be converted to HTML:

	```python
	topic_model.visualize_topics()
	```

	<iframe src="viz.html" style="width:1000px; height: 680px; border: 0px;"></iframe>

	You can use the slider to select a topic, which then lights up red. Hovering over a topic shows general
	information about it, including its size and its corresponding words.
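
To give a feel for the c-TF-IDF topic representation that is embedded in 2D above, here is a minimal pure-Python sketch. It assumes the commonly described weighting `tf * log(1 + A / f)`, where `tf` is a word's frequency within a topic, `f` is its frequency across all topics, and `A` is the average number of words per topic. The function name `c_tf_idf` and the toy documents are hypothetical, and the library's actual implementation differs in details such as normalization and tokenization.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Toy class-based TF-IDF: returns {topic: {word: weight}}."""
    # Term frequencies per topic: all documents of a topic are joined
    # and treated as one large document.
    tf = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # Frequency of each word across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)

    # Average number of words per topic.
    avg_words = sum(total.values()) / len(tf)

    return {
        topic: {
            word: n * math.log(1 + avg_words / total[word])
            for word, n in counts.items()
        }
        for topic, counts in tf.items()
    }


# Hypothetical, hand-made clusters of documents
docs_per_topic = {
    0: ["the game team win", "game score team"],
    1: ["the rocket space launch", "rocket orbit space"],
}
weights = c_tf_idf(docs_per_topic)

# The two highest-weighted words per topic
top_words = {
    topic: sorted(w, key=w.get, reverse=True)[:2]
    for topic, w in weights.items()
}
print(top_words)
```

Words that are frequent within one topic but rare elsewhere receive the highest weights, whereas a word shared by every topic (here `the`) is pushed toward the bottom of each topic's ranking.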


	## Visualize Topic Similarity
	Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity
	matrix by computing the cosine similarity between each pair of topic embeddings. The result is a
	matrix indicating how similar the topics are to each other.
	To visualize the heatmap, run the following:

	```python
	topic_model.visualize_heatmap()
	```

	<iframe src="heatmap.html" style="width:1000px; height: 720px; border: 0px;"></iframe>


	!!! note
	    You can set `n_clusters` in `visualize_heatmap` to order the topics by their similarity.
	    This will result in blocks being formed in the heatmap indicating which clusters of topics are
	    similar to each other. This step is very much recommended as it will make reading the heatmap easier.
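
As an illustration of the similarity matrix that the heatmap displays, the following sketch computes pairwise cosine similarities for a few made-up topic embeddings in plain Python. The helper names and the three-dimensional vectors are assumptions chosen for readability; real topic embeddings have many more dimensions, and this is not the library's own code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities between all embeddings."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Hypothetical topic embeddings (one row per topic)
topic_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]

matrix = similarity_matrix(topic_embeddings)
for row in matrix:
    print([round(value, 2) for value in row])
```

Topics 0 and 1 point in nearly the same direction and therefore score close to 1, while topic 2 is orthogonal to topic 0 and scores 0. Grouping such similar topics next to each other is what produces the blocks in the heatmap.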

	## Visualize Topics over Time
	After creating topics over time with Dynamic Topic Modeling, we can visualize these topics by
	leveraging the interactive abilities of Plotly. Plotly allows us to show the frequency
	of topics over time whilst giving the option of hovering over the points to show the time-specific topic representations.
	Simply call `.visualize_topics_over_time` with the newly created topics over time:


	```python
	import re
	import pandas as pd
	from bertopic import BERTopic

	# Prepare data
	trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
	# Remove URLs and lowercase the text
	trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
	# Drop @-mentions
	trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
	# Keep alphabetic characters only and collapse whitespace
	trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
	# Keep only non-empty tweets that are not retweets
	trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
	timestamps = trump.date.to_list()
	tweets = trump.text.to_list()

	# Create topics over time
	model = BERTopic(verbose=True)
	topics, probs = model.fit_transform(tweets)
	topics_over_time = model.topics_over_time(tweets, timestamps)
	```

	Then, we visualize some interesting topics:

	```python
	model.visualize_topics_over_time(topics_over_time, topics=[9, 10, 72, 83, 87, 91])
	```
	<iframe src="trump.html" style="width:1000px; height: 680px; border: 0px;"></iframe>
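
The frequencies plotted above are, conceptually, counts of documents per topic and time bin. The sketch below shows one way such counts could be tabulated in plain Python. The function name, the monthly binning, and the `%Y-%m-%d %H:%M:%S` timestamp format are all assumptions made for illustration; it only counts documents and does not compute the time-specific topic representations.

```python
from collections import Counter
from datetime import datetime


def topic_frequency_per_month(topics, timestamps):
    """Count how many documents each topic has in each calendar month."""
    counts = Counter()
    for topic, stamp in zip(topics, timestamps):
        # Assumed timestamp format; adjust to match your data.
        month = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m")
        counts[(month, topic)] += 1
    return counts


# Hypothetical topic assignments and timestamps, one per document
topics = [9, 9, 10, 9, 10]
timestamps = [
    "2020-01-03 10:15:00",
    "2020-01-20 08:00:00",
    "2020-01-25 12:30:00",
    "2020-02-01 09:45:00",
    "2020-02-14 18:05:00",
]

frequency = topic_frequency_per_month(topics, timestamps)
for (month, topic), n in sorted(frequency.items()):
    print(month, topic, n)
```

Each `(month, topic)` pair corresponds to one point on a line in the plot: the x-axis position is the time bin and the y-axis value is the count.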

	## Visualize Topics per Class
	You might want to extract and visualize the topic representation per class. For example, if you have
	specific groups of users that might approach topics differently, then extracting the representation per group helps you understand
	how these users talk about certain topics. In other words, this simply creates a topic representation for
	each of the classes that you might have in your data.

	First, we need to train our model:

	```python
	from bertopic import BERTopic
	from sklearn.datasets import fetch_20newsgroups

	# Prepare data and classes
	data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
	docs = data["data"]
	classes = [data["target_names"][i] for i in data["target"]]

	# Create topic model and calculate topics per class
	topic_model = BERTopic()
	topics, probs = topic_model.fit_transform(docs)
	topics_per_class = topic_model.topics_per_class(docs, classes=classes)
	```

	Then, we visualize the topic representation of major topics per class:

	```python
	topic_model.visualize_topics_per_class(topics_per_class)
	```

	<iframe src="topics_per_class.html" style="width:1400px; height: 1000px; border: 0px;"></iframe>
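
To make the idea of comparing topics across classes concrete, this small sketch computes what share of each class's documents falls into each topic. The function name and the toy labels are hypothetical, and the sketch only looks at topic frequencies. It does not recompute the per-class word representations that `topics_per_class` returns.

```python
from collections import Counter, defaultdict


def topic_share_per_class(topics, classes):
    """For every class, the fraction of its documents assigned to each topic."""
    counts = defaultdict(Counter)
    for topic, cls in zip(topics, classes):
        counts[cls][topic] += 1
    return {
        cls: {topic: n / sum(c.values()) for topic, n in c.items()}
        for cls, c in counts.items()
    }


# Hypothetical topic assignments and class labels, one per document
topics = [0, 0, 1, 1, 1, 0]
classes = ["sci.space", "sci.space", "sci.space",
           "rec.autos", "rec.autos", "rec.autos"]

shares = topic_share_per_class(topics, classes)
for cls, share in shares.items():
    print(cls, {topic: round(value, 2) for topic, value in share.items()})
```

Normalizing by class size makes classes with very different numbers of documents comparable. Raw counts would work just as well if absolute frequencies are of interest.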