BERTopic represents each extracted topic by a set of words. These words are drawn from the documents 
assigned to the topic using a class-based TF-IDF (c-TF-IDF), which favors words that are frequent in one topic but 
comparatively rare in the others. 
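
To make the idea concrete, here is a minimal sketch of a class-based TF-IDF in plain Python. It assumes the commonly documented form of the score (term frequency within a class, multiplied by `log(1 + A / f)` where `A` is the average number of words per class and `f` is the word's frequency across all classes). The function name `c_tf_idf`, the whitespace tokenization, and the toy documents are illustrative assumptions, not BERTopic's actual implementation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic with a simplified class-based TF-IDF.

    docs_per_topic maps a topic ID to the list of documents assigned to it.
    """
    # Treat all documents of a topic as one long "class document".
    class_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # Frequency of each word across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # Average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores


# Toy data (hypothetical): two topics with two short documents each.
toy_topics = {
    0: ["the rocket reached orbit", "nasa launched the rocket"],
    1: ["the team won the game", "the game went to overtime"],
}
scores = c_tf_idf(toy_topics)
top_space = sorted(scores[0], key=scores[0].get, reverse=True)[:3]
print(top_space)
```

In this toy example, a word such as `the` appears in both topics and is therefore down-weighted relative to words that occur in only one of them.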

### **Update Topic Representation after Training**
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stop words or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update 
the topic representation with new parameters for `c-TF-IDF`. We start by training a model that only uses bigrams and trigrams: 

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Create topics
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic(n_gram_range=(2, 3))
topics, probs = topic_model.fit_transform(docs)
```

From the model created above, one of the most frequent topics is the following:

```python
>>> topic_model.get_topic(31)[:10]
[('clipper chip', 0.007240771542316232),
 ('key escrow', 0.004601603973377443),
 ('law enforcement', 0.004277247929596332),
 ('intercon com', 0.0035961920238955824),
 ('amanda walker', 0.003474856425297157),
 ('serial number', 0.0029876119137150358),
 ('com amanda', 0.002789303096817983),
 ('intercon com amanda', 0.0027386688593327084),
 ('amanda intercon', 0.002585262048515583),
 ('amanda intercon com', 0.002585262048515583)]
```

Although there does seem to be some relation between the words, it is difficult, at least for me, to intuitively understand 
what the topic is about. Instead, let's simplify the topic representation by setting `n_gram_range` to `(1, 3)` so that 
single words are allowed as well.

```python
>>> topic_model.update_topics(docs, n_gram_range=(1, 3))
>>> topic_model.get_topic(31)[:10]
[('encryption', 0.008021846079148017),
 ('clipper', 0.00789642647602742),
 ('chip', 0.00637127942464045),
 ('key', 0.006363124787175884),
 ('escrow', 0.005030980365244285),
 ('clipper chip', 0.0048271268437973395),
 ('keys', 0.0043245812747907545),
 ('crypto', 0.004311198708675516),
 ('intercon', 0.0038772934659295076),
 ('amanda', 0.003516026493904586)]
```

To me, the combination of the words above seems a bit more intuitive than the words we previously had! You can play 
around with `n_gram_range` or use your own custom `sklearn.feature_extraction.text.CountVectorizer` and pass that 
instead: 

```python
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 5))
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
```
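
To get a feel for why the n-gram range changes the candidate terms so much, the following sketch enumerates word n-grams for a short string. The helper `ngrams` and its whitespace tokenization are assumptions made for illustration; scikit-learn's `CountVectorizer` uses its own tokenizer and preprocessing.

```python
def ngrams(text, n_gram_range):
    """Return all word n-grams whose length lies within n_gram_range (inclusive)."""
    tokens = text.lower().split()
    low, high = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


sentence = "clipper chip key escrow"
bigrams_and_trigrams = ngrams(sentence, (2, 3))  # no single words
with_unigrams = ngrams(sentence, (1, 3))         # single words included
print(bigrams_and_trigrams)
print(with_unigrams)
```

With a range of `(2, 3)`, a single word like `clipper` can never become part of the topic representation, which is roughly why the first representation consisted only of multi-word phrases.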

!!! Tip "Tip!"
    If you want to change the topics to something else, whether that is merging them or removing outliers, you can pass 
    a custom list of topics to update them: `topic_model.update_topics(docs, topics=my_updated_topics)`
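
As a rough sketch of how such a list could be built, the snippet below folds one topic into another. The `topics` list and the `merge_map` dictionary are hypothetical stand-ins; in practice `topics` would be the per-document assignments returned by `fit_transform`.

```python
# Hypothetical per-document topic assignments, as returned by fit_transform.
topics = [0, 3, 7, 3, -1, 7, 0]

# Hypothetical mapping: fold topic 7 into topic 3, leave everything else as is.
merge_map = {7: 3}

# One topic ID per document, in the same order as the documents.
my_updated_topics = [merge_map.get(topic, topic) for topic in topics]
print(my_updated_topics)
```

The resulting list keeps one entry per document and could then be passed as `topics=my_updated_topics` in the call shown in the tip.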

### **Custom labels**

The topic labels are currently generated automatically by taking the topic ID and the top 3 words and joining them 
with the `_` separator. Although such a label is informative, in practice it is neither the prettiest nor necessarily the most accurate one. For example, the topic label 
`1_space_nasa_orbit` is informative, but we would prefer a more intuitive label, such as 
`space travel`. The difficulty with creating such topic labels is that much of the interpretation is left to the user. Would `space travel` be more accurate, or perhaps `space exploration`? To truly understand which labels are most suitable, reading some of the documents in each topic is especially helpful. 

Although we can go through every single topic ourselves and try to label them, we can start by creating an overview of labels that have the length and number of words that we are looking for. To do so, we can generate our list of topic labels with `.generate_topic_labels` and define the number of words, the separator, word length, etc:

```python
topic_labels = topic_model.generate_topic_labels(nr_words=3,
                                                 topic_prefix=False,
                                                 word_length=10,
                                                 separator=", ")
```

!!! Tip
    If you created [**multiple topic representations**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) or aspects, you can choose one of these aspects with `aspect="Aspect1"` or whatever you named the aspect.

In the above example, `1_space_nasa_orbit` would turn into `space, nasa, orbit` since we selected 3 words, no topic prefix, and the `, ` separator. We can then either change our `topic_labels` to whatever we want or directly pass them to `.set_topic_labels` so that they can be used across most visualization functions:

```python
topic_model.set_topic_labels(topic_labels)
```
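
The effect of the label parameters can be illustrated with a small stand-alone function. This is a simplified sketch of the formatting described above, not BERTopic's own code; the helper `make_label` and the word list are assumptions chosen to mirror the `1_space_nasa_orbit` example.

```python
def make_label(topic_id, words, nr_words=3, topic_prefix=True,
               word_length=None, separator="_"):
    """Build a label from a topic's top words (illustrative only)."""
    top_words = words[:nr_words]
    if word_length is not None:
        # Truncate each word to at most word_length characters.
        top_words = [word[:word_length] for word in top_words]
    label = separator.join(top_words)
    if topic_prefix:
        # Prepend the topic ID, joined with the same separator.
        label = f"{topic_id}{separator}{label}"
    return label


# Hypothetical top words for topic 1.
words = ["space", "nasa", "orbit", "launch", "shuttle"]
default_label = make_label(1, words)
custom_label = make_label(1, words, topic_prefix=False,
                          word_length=10, separator=", ")
print(default_label)
print(custom_label)
```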

It is also possible to only change a few topic labels at a time by passing a dictionary 
where the key represents the *topic ID* and the value is the *topic label*:

```python
topic_model.set_topic_labels({1: "Space Travel", 7: "Religion"})
```
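
Conceptually, a partial update like this behaves like merging a small dictionary of overrides into the existing labels. The sketch below shows that idea with plain dictionaries; the `current_labels` values are hypothetical, and the assumption that unlisted topics keep their existing label should be checked against the BERTopic version you use.

```python
# Hypothetical existing labels, keyed by topic ID.
current_labels = {
    0: "0_game_team_season",
    1: "1_space_nasa_orbit",
    7: "7_god_jesus_bible",
}

# Only the topics we want to rename.
overrides = {1: "Space Travel", 7: "Religion"}

# Overrides win; every other topic keeps its current label.
final_labels = {**current_labels, **overrides}
print(final_labels)
```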

Then, to make use of those custom topic labels across visualizations, such as `.visualize_hierarchy()`, 
we can use the `custom_labels=True` parameter that is found in most visualizations. 

```python
fig = topic_model.visualize_barchart(custom_labels=True)
```

#### Optimize labels
The great advantage of passing custom labels to BERTopic is that when more accurate zero-shot classifiers are released, 
we can simply use those on top of BERTopic to further fine-tune the labeling. For example, let's say you 
have a set of potential topic labels that you want to use instead of the ones generated by BERTopic. You could
use the [bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli) model to find which user-defined 
labels best represent the BERTopic-generated labels:


```python
from transformers import pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# A selected topic representation
# 'god jesus atheists atheism belief atheist believe exist beliefs existence'
sequence_to_classify =  " ".join([word for word, _ in topic_model.get_topic(1)])

# Our set of potential topic labels
candidate_labels = ['cooking', 'dancing', 'religion']
classifier(sequence_to_classify, candidate_labels)

# The pipeline returns the labels sorted by descending score:
# {'labels': ['religion', 'cooking', 'dancing'],
#  'scores': [0.850, 0.086, 0.063],
#  'sequence': 'god jesus atheists atheism belief atheist believe exist beliefs existence'}
```
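
To turn such a result into a label, one option is to keep the highest-scoring candidate only when it clears a confidence threshold. The sketch below assumes a classifier that is callable as `classifier(text, candidate_labels)` and returns a dictionary with parallel `labels` and `scores` lists, as in the output above. The helper `best_label`, the `min_score` threshold, and the `fake_classifier` stand-in (used so the example runs without downloading a model) are all illustrative assumptions.

```python
def best_label(classifier, topic_words, candidate_labels, min_score=0.5):
    """Return the highest-scoring candidate label, or None if it scores too low.

    `classifier` is any callable that takes (text, candidate_labels) and returns
    a dict with parallel "labels" and "scores" lists.
    """
    result = classifier(" ".join(topic_words), candidate_labels)
    # Pick the best pair explicitly instead of relying on the result's ordering.
    label, score = max(zip(result["labels"], result["scores"]),
                       key=lambda pair: pair[1])
    return label if score >= min_score else None


def fake_classifier(text, candidate_labels):
    """Stand-in for a real zero-shot model so the sketch runs offline."""
    scores = [0.85 if label == "religion" else 0.05 for label in candidate_labels]
    return {"sequence": text, "labels": list(candidate_labels), "scores": scores}


words = ["god", "jesus", "atheists", "belief"]
chosen = best_label(fake_classifier, words, ["cooking", "dancing", "religion"])
print(chosen)
```

With a real pipeline in place of `fake_classifier`, a chosen label could be applied to a single topic with `topic_model.set_topic_labels({1: chosen})`, while a `None` result would leave the generated label untouched.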