The topics extracted by BERTopic are represented by words. These words are extracted from the documents
assigned to each topic using a class-based TF-IDF (c-TF-IDF). This allows us to extract words that are important to one topic but
less so to the others.
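
To make the idea concrete, here is a minimal, standard-library sketch of class-based TF-IDF on a toy corpus. It assumes the commonly documented weighting, term frequency within a topic multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the frequency of the word across all topics. The function name `c_tf_idf` and the toy documents are made up for illustration, and this is not BERTopic's actual implementation:

```python
import math
from collections import Counter

# Toy corpus (illustrative only): documents already grouped by topic.
docs_per_topic = {
    0: ["the clipper chip uses key escrow", "encryption key escrow debate"],
    1: ["nasa launched the space shuttle", "shuttle reached orbit"],
}

def c_tf_idf(docs_per_topic):
    """Score words per topic as tf(word, topic) * log(1 + A / f(word))."""
    # Treat all documents of a topic as one "class document" and count its words.
    counts = {t: Counter(" ".join(d).split()) for t, d in docs_per_topic.items()}
    # f(word): frequency of each word across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)
    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in topic_counts.items()
        }
    return scores

scores = c_tf_idf(docs_per_topic)
# Highest-scoring words per topic.
top_words = {t: sorted(s, key=s.get, reverse=True)[:3] for t, s in scores.items()}
print(top_words)
```

In this toy corpus, a word such as `the`, which occurs in both topics, receives a lower score than words that are frequent in only one topic.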
### **Update Topic Representation after Training**
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stop words or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update
the topic representation with new parameters for `c-TF-IDF`:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Create topics
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic(n_gram_range=(2, 3))
topics, probs = topic_model.fit_transform(docs)
```
From the model created above, one of the most frequent topics is the following:
```python
>>> topic_model.get_topic(31)[:10]
[('clipper chip', 0.007240771542316232),
('key escrow', 0.004601603973377443),
('law enforcement', 0.004277247929596332),
('intercon com', 0.0035961920238955824),
('amanda walker', 0.003474856425297157),
('serial number', 0.0029876119137150358),
('com amanda', 0.002789303096817983),
('intercon com amanda', 0.0027386688593327084),
('amanda intercon', 0.002585262048515583),
('amanda intercon com', 0.002585262048515583)]
```
Although there does seem to be some relation between the words, it is difficult, at least for me, to intuitively understand
what the topic is about. Instead, let's simplify the topic representation by setting `n_gram_range` to `(1, 3)` to
also allow for single words.
```python
>>> topic_model.update_topics(docs, n_gram_range=(1, 3))
>>> topic_model.get_topic(31)[:10]
[('encryption', 0.008021846079148017),
('clipper', 0.00789642647602742),
('chip', 0.00637127942464045),
('key', 0.006363124787175884),
('escrow', 0.005030980365244285),
('clipper chip', 0.0048271268437973395),
('keys', 0.0043245812747907545),
('crypto', 0.004311198708675516),
('intercon', 0.0038772934659295076),
('amanda', 0.003516026493904586)]
```
To me, the combination of the words above seems a bit more intuitive than the words we previously had! You can play
around with `n_gram_range` or use your own custom `sklearn.feature_extraction.text.CountVectorizer` and pass that
instead:
```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 5))
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
```
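
To see why the n-gram range changes the candidate vocabulary so much, the following plain-Python sketch lists the word n-grams of a short phrase for two ranges. The helper `extract_ngrams` is hypothetical and only approximates word-level n-gram extraction. A real `CountVectorizer` also applies its own tokenization and optional stop-word removal:

```python
def extract_ngrams(text, n_gram_range):
    """Return all word n-grams of `text` with lengths in the inclusive range."""
    tokens = text.lower().split()
    low, high = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

text = "clipper chip key escrow"
only_phrases = extract_ngrams(text, (2, 3))   # bigrams and trigrams only
with_unigrams = extract_ngrams(text, (1, 3))  # single words are included too
print(only_phrases)
print(with_unigrams)
```

With `(2, 3)`, a single word such as `clipper` can never appear in a topic representation. With `(1, 3)`, it becomes a candidate alongside the longer phrases.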
!!! Tip "Tip!"
    If you want to change the topics to something else, whether that is merging them or removing outliers, you can pass
    a custom list of topics to update them: `topic_model.update_topics(docs, topics=my_updated_topics)`
### **Custom labels**
The topic labels are currently generated automatically by taking the top 3 words and combining them
using the `_` separator, prefixed with the topic ID. Although such a label is informative, in practice it is neither the prettiest nor necessarily the most accurate one. For example, although the topic label
`1_space_nasa_orbit` is informative, we would prefer a more intuitive label, such as
`space travel`. The difficulty with creating such topic labels is that much of the interpretation is left to the user. Would `space travel` be more accurate, or perhaps `space exploration`? To truly understand which labels are most suitable, it is especially helpful to read some of the documents in each topic.
Although we can go through every single topic ourselves and try to label them, we can start by creating an overview of labels with the length and number of words that we are looking for. To do so, we can generate our list of topic labels with `.generate_topic_labels` and define the number of words, the separator, the word length, etc.:
```python
topic_labels = topic_model.generate_topic_labels(
    nr_words=3,
    topic_prefix=False,
    word_length=10,
    separator=", ",
)
```
!!! Tip
    If you created [**multiple topic representations**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) or aspects, you can choose one of these aspects with `aspect="Aspect1"` or whatever you named the aspect.
In the above example, `1_space_nasa_orbit` would turn into `space, nasa, orbit` since we selected 3 words, no topic prefix, and the `, ` separator. We can then either change our `topic_labels` to whatever we want or directly pass them to `.set_topic_labels` so that they can be used across most visualization functions:
```python
topic_model.set_topic_labels(topic_labels)
```
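
As a rough illustration of how the label parameters interact, the sketch below rebuilds a label from a list of top words. The helper `make_label` is hypothetical and is not BERTopic's implementation. In particular, treating `word_length` as a simple cut-off on the number of characters per word is an assumption:

```python
def make_label(topic_id, words, nr_words=3, topic_prefix=True,
               word_length=None, separator="_"):
    """Join the top `nr_words` words of a topic into a single label."""
    # Keep only the top words and, optionally, cut long words short (assumed behavior).
    parts = [w[:word_length] if word_length else w for w in words[:nr_words]]
    # Optionally put the topic ID in front of the words.
    if topic_prefix:
        parts = [str(topic_id)] + parts
    return separator.join(parts)

top_words = ["space", "nasa", "orbit", "launch"]
default_label = make_label(1, top_words)
custom_label = make_label(1, top_words, nr_words=3, topic_prefix=False,
                          word_length=10, separator=", ")
print(default_label)  # 1_space_nasa_orbit
print(custom_label)   # space, nasa, orbit
```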
It is also possible to only change a few topic labels at a time by passing a dictionary
where the key represents the *topic ID* and the value is the *topic label*:
```python
topic_model.set_topic_labels({1: "Space Travel", 7: "Religion"})
```
Then, to make use of those custom topic labels across visualizations, such as `.visualize_hierarchy()`,
we can use the `custom_labels=True` parameter that is found in most visualizations.
```python
fig = topic_model.visualize_barchart(custom_labels=True)
```
#### Optimize labels
The great advantage of passing custom labels to BERTopic is that when more accurate zero-shot classifiers are released,
we can simply use those on top of BERTopic to further fine-tune the labeling. For example, let's say you
have a set of potential topic labels that you want to use instead of the ones generated by BERTopic. You could
use the [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) model to find which user-defined
labels best represent the BERTopic-generated labels:
```python
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
# A selected topic representation
# 'god jesus atheists atheism belief atheist believe exist beliefs existence'
sequence_to_classify = " ".join([word for word, _ in topic_model.get_topic(1)])
# Our set of potential topic labels
candidate_labels = ['cooking', 'dancing', 'religion']
classifier(sequence_to_classify, candidate_labels)
# The pipeline returns the labels sorted by score, highest first:
#{'labels': ['religion', 'cooking', 'dancing'],
# 'scores': [0.850, 0.086, 0.063],
# 'sequence': 'god jesus atheists atheism belief atheist believe exist beliefs existence'}
```
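
One way to turn such a classification result into a custom label is sketched below. The helper `pick_label` and its `min_score` threshold are hypothetical choices for illustration. The `result` dictionary is hard-coded in the shape of the zero-shot output above, so the snippet runs without downloading a model:

```python
def pick_label(result, min_score=0.5):
    """Return the highest-scoring candidate label, or None if it is not confident enough."""
    label, score = max(zip(result["labels"], result["scores"]), key=lambda pair: pair[1])
    return label if score >= min_score else None

# Hard-coded stand-in for the output of the zero-shot classifier.
result = {
    "sequence": "god jesus atheists atheism belief atheist believe exist beliefs existence",
    "labels": ["religion", "cooking", "dancing"],
    "scores": [0.850, 0.086, 0.063],
}

# Map topic IDs to labels, skipping topics without a confident match.
custom_labels = {}
label = pick_label(result)
if label is not None:
    custom_labels[1] = label
print(custom_labels)  # {1: 'religion'}
```

The resulting dictionary maps topic IDs to labels, so it could then be passed to `topic_model.set_topic_labels(custom_labels)`. Topics without a confident match simply keep their existing labels.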