One of the core components of BERTopic is its Bag-of-Words representation and weighting with c-TF-IDF. This method is fast and can quickly generate a number of keywords for a topic without depending on the clustering task. As a result, topics can easily and quickly be updated after training the model without the need to re-train it.
Although c-TF-IDF gives good topic representations, we may want to fine-tune them further.
As such, there are a number of representation models implemented in BERTopic that allow for further fine-tuning of the topic representations. These are optional
and are **not used by default**. You are not restricted in how the representation can be fine-tuned, from GPT-like models to fast keyword extraction
with KeyBERT-like models:
<iframe width="1200" height="500" src="https://user-images.githubusercontent.com/25746895/218417067-a81cc179-9055-49ba-a2b0-f2c1db535159.mp4" title="BERTopic Overview" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
For each model below, an example will be shown on how it may change or improve upon the default topic keywords that are generated. The dataset used in these examples can be found [here](https://www.kaggle.com/datasets/maartengr/kurzgesagt-transcriptions).
If you want to have multiple representations of a single topic, it might be worthwhile to also check out [**multi-aspect**](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) topic modeling with BERTopic.
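As background for the c-TF-IDF candidates that every model below starts from, here is a minimal pure-Python sketch of the weighting. It assumes the commonly cited form `tf * log(1 + A / f)`, where `tf` is a word's frequency within a topic, `f` its frequency across all topics, and `A` the average number of words per topic. That formula, the whitespace tokenization, and the toy documents are assumptions for illustration and are not taken from this page:

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Return {topic: [(word, weight), ...]}, sorted by weight (descending)."""
    # Join all documents of a topic into one "class document" and count words.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f: frequency of each word across all topics.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    weights = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scored = [
            (word, (freq / n_words) * math.log(1 + avg_words / total[word]))
            for word, freq in counter.items()
        ]
        weights[topic] = sorted(scored, key=lambda item: item[1], reverse=True)
    return weights


# Toy documents, already grouped by topic (hypothetical data).
toy_docs = {
    0: ["rocket reached orbit", "rocket engine test"],
    1: ["virus infects cell", "cell test detects virus"],
}
keywords = c_tf_idf(toy_docs)
print(keywords[0][:3])
```

Words that are frequent inside one topic but rare elsewhere, such as "rocket", rise to the top, while a word shared by both topics, such as "test", is pushed down.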
## **KeyBERTInspired**
After having generated our topics with c-TF-IDF, we might want to do some fine-tuning based on the semantic
relationship between keywords/keyphrases and the set of documents in each topic. Although we can use a centroid-based
technique for this, it can be costly and does not take the structure of a cluster into account. Instead, we leverage
c-TF-IDF to create a set of representative documents per topic and use those as our updated topic embedding. Then, we calculate
the similarity between candidate keywords and the topic embedding using the same embedding model that embedded the documents.
<br>
<div class="svg_image">
--8<-- "docs/getting_started/representation/keybertinspired.svg"
</div>
<br>
Thus, the algorithm follows some principles of [KeyBERT](https://github.com/MaartenGr/KeyBERT) but does some optimization in
order to speed up inference. Usage is straightforward:
```python
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
# Create your representation model
representation_model = KeyBERTInspired()
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
<br>
<div class="svg_image">
--8<-- "docs/getting_started/representation/keybert.svg"
</div>
<br>
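To make the ranking step described in this section more concrete, the following sketch scores candidate keywords against a topic embedding. It assumes the topic embedding is the mean of the representative document embeddings and uses hand-written 3-dimensional vectors in place of real embeddings; `rank_keywords` and the toy data are hypothetical and not part of BERTopic:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_keywords(candidate_embeddings, representative_doc_embeddings, top_n=3):
    """Rank candidate keywords by similarity to the topic embedding."""
    n_docs = len(representative_doc_embeddings)
    dims = len(representative_doc_embeddings[0])
    # Topic embedding: mean of the representative document embeddings (assumption).
    topic_embedding = [
        sum(doc[i] for doc in representative_doc_embeddings) / n_docs
        for i in range(dims)
    ]
    scored = [
        (word, cosine(vector, topic_embedding))
        for word, vector in candidate_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Toy "embeddings"; real ones would come from the embedding model.
docs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
candidates = {
    "rocket": [1.0, 0.1, 0.0],
    "launch": [0.7, 0.3, 0.1],
    "people": [0.1, 0.9, 0.3],
    "thing": [0.0, 0.2, 1.0],
}
ranked = rank_keywords(candidates, docs, top_n=2)
print(ranked)
```

Keywords whose vectors point in the same direction as the representative documents ("rocket", "launch") are kept, whereas generic words are dropped.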
## **PartOfSpeech**
Our candidate topics, as extracted with c-TF-IDF, do not take into account a keyword's part of speech, as extracting noun phrases from
all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part-of-speech tagging on a subset of
keywords and documents that best represent a topic.
<br>
<div class="svg_image">
--8<-- "docs/getting_started/representation/partofspeech.svg"
</div>
<br>
More specifically, we find documents that contain the keywords from our candidate topics as calculated with c-TF-IDF. These documents serve
as the representative set of documents from which the spaCy model can extract a set of candidate keywords for each topic.
These candidate keywords are first put through spaCy's POS module to see whether they match the `DEFAULT_PATTERNS`:
```python
DEFAULT_PATTERNS = [
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'NOUN'}],
    [{'POS': 'ADJ'}]
]
```
These patterns follow spaCy's [Rule-Based Matching](https://spacy.io/usage/rule-based-matching). Then, the resulting keywords are sorted by
their respective c-TF-IDF values.
```python
from bertopic.representation import PartOfSpeech
from bertopic import BERTopic
# Create your representation model
representation_model = PartOfSpeech("en_core_web_sm")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
<br>
<div class="svg_image">
--8<-- "docs/getting_started/representation/pos.svg"
</div>
<br>
You can define custom POS patterns to be extracted:
```python
pos_patterns = [
    [{'POS': 'ADJ'}, {'POS': 'NOUN'}],
    [{'POS': 'NOUN'}],
    [{'POS': 'ADJ'}]
]
representation_model = PartOfSpeech("en_core_web_sm", pos_patterns=pos_patterns)
```
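To illustrate what matching against such patterns and sorting by c-TF-IDF amounts to, here is a small sketch that does not require spaCy. The tokens are tagged by hand, the weights are made up, and `match_patterns` is a hypothetical helper that only handles plain `POS` sequences, which is far less than spaCy's matcher supports:

```python
def match_patterns(tagged_tokens, patterns):
    """Return every token span whose POS sequence equals one of the patterns."""
    matches = []
    for pattern in patterns:
        wanted = [step["POS"] for step in pattern]
        for start in range(len(tagged_tokens) - len(wanted) + 1):
            window = tagged_tokens[start:start + len(wanted)]
            if [pos for _, pos in window] == wanted:
                matches.append(" ".join(word for word, _ in window))
    return matches


patterns = [
    [{"POS": "ADJ"}, {"POS": "NOUN"}],
    [{"POS": "NOUN"}],
    [{"POS": "ADJ"}],
]

# Hand-tagged tokens standing in for the output of a spaCy pipeline.
tagged = [
    ("the", "DET"), ("black", "ADJ"), ("hole", "NOUN"),
    ("emits", "VERB"), ("radiation", "NOUN"),
]

# Hypothetical c-TF-IDF weights for the matched keywords.
weights = {"black hole": 0.8, "radiation": 0.5, "hole": 0.3, "black": 0.1}

candidates = set(match_patterns(tagged, patterns))
keywords = sorted(candidates, key=lambda kw: weights.get(kw, 0.0), reverse=True)
print(keywords)
```

The determiner and the verb never match a pattern, so only nouns, adjectives, and adjective-noun phrases survive before being ordered by weight.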
## **MaximalMarginalRelevance**
When we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like "car" and "cars"
essentially represent the same information and are often redundant.
<br>
<div class="svg_image">
--8<-- "docs/getting_started/representation/mmr.svg"
</div>
<br>
<!-- MMR = arg \underset{D_i\in R\setminus S}{max} [\lambda Sim_{1}(D_{i}, Q) - (1-\lambda) \,\, \underset{D_{j}\in S}{max} \,\, Sim_{2}(D_{i}, D_{j})] -->
To decrease this redundancy and improve the diversity of keywords, we can use an algorithm called Maximal Marginal Relevance (MMR). MMR considers the similarity of keywords/keyphrases with the document, along with their similarity to the keywords and keyphrases that have already been selected. This results in a selection of keywords
that are relevant to the document while being as diverse as possible among themselves.
```python
from bertopic.representation import MaximalMarginalRelevance
from bertopic import BERTopic
# Create your representation model
representation_model = MaximalMarginalRelevance(diversity=0.3)
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
<br>
<div class="svg_image">
--8<-- "docs/getting_started/representation/mmr_output.svg"
</div>
<br>
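The selection rule can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than BERTopic's implementation: the score is taken to be `(1 - diversity) * relevance - diversity * redundancy`, the 2-dimensional vectors are toy embeddings, and `mmr` is a hypothetical function name:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mmr(doc_embedding, word_embeddings, top_n=3, diversity=0.3):
    """Greedily select keywords that are relevant yet dissimilar to each other."""
    words = list(word_embeddings)
    relevance = {w: cosine(word_embeddings[w], doc_embedding) for w in words}

    # Start with the single most relevant keyword.
    selected = [max(words, key=relevance.get)]
    remaining = [w for w in words if w not in selected]

    while remaining and len(selected) < top_n:
        def score(word):
            # Redundancy: similarity to the closest already-selected keyword.
            redundancy = max(
                cosine(word_embeddings[word], word_embeddings[s]) for s in selected
            )
            return (1 - diversity) * relevance[word] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: "car" and "cars" are nearly identical vectors.
doc = [1.0, 0.5]
embeddings = {"car": [1.0, 0.4], "cars": [1.0, 0.45], "engine": [0.6, 1.0]}

relevance_only = mmr(doc, embeddings, top_n=2, diversity=0.0)
diverse = mmr(doc, embeddings, top_n=2, diversity=0.7)
print(relevance_only, diverse)
```

With `diversity=0.0` the two near-duplicates are both selected, while a higher value trades a little relevance for a keyword that adds new information.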
## **Zero-Shot Classification**
For some use cases, you might already have a set of candidate labels that you would like to automatically assign to some of the topics.
Although we can use guided or supervised BERTopic for that, we can also use zero-shot classification to assign labels to our topics.
For that, we can make use of 🤗 Transformers and the models on the [model hub](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads).
To perform this classification, we feed the model with the keywords as generated through c-TF-IDF and a set of candidate labels.
If, for a certain topic, we find a similar enough label, then it is assigned. If not, then we keep the original c-TF-IDF keywords.
We use it in BERTopic as follows:
```python
from bertopic.representation import ZeroShotClassification
from bertopic import BERTopic
# Create your representation model
candidate_topics = ["space and nasa", "bicycles", "sports"]
representation_model = ZeroShotClassification(candidate_topics, model="facebook/bart-large-mnli")
# Use the representation model in BERTopic on top of the default pipeline
topic_model = BERTopic(representation_model=representation_model)
```
<br>
<div class="svg_image">
--8<-- "docs/getting_started/representation/zero.svg"
</div>
<br>
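The assign-or-keep logic described in this section can be sketched without loading a model. Here `toy_score_fn` is a stand-in for a real zero-shot classifier, and `assign_labels` and its `min_prob` threshold are hypothetical names chosen for this illustration:

```python
def assign_labels(topics, candidate_labels, score_fn, min_prob=0.8):
    """Replace a topic's keywords with a label if the label scores high enough."""
    updated = {}
    for topic_id, keywords in topics.items():
        # The classifier sees the c-TF-IDF keywords as one piece of text.
        text = ", ".join(word for word, _ in keywords)
        scores = score_fn(text, candidate_labels)
        best_label = max(scores, key=scores.get)
        if scores[best_label] >= min_prob:
            updated[topic_id] = [(best_label, scores[best_label])]
        else:
            # No label is similar enough: keep the original keywords.
            updated[topic_id] = keywords
    return updated


def toy_score_fn(text, labels):
    """Stand-in scorer: fraction of a label's words that occur in the text."""
    words = set(text.replace(",", " ").split())
    return {
        label: len(words & set(label.split())) / len(label.split())
        for label in labels
    }


topics = {
    0: [("nasa", 0.9), ("space", 0.8), ("rocket", 0.5)],
    1: [("cell", 0.7), ("virus", 0.6)],
}
candidate_topics = ["space and nasa", "bicycles", "sports"]
updated = assign_labels(topics, candidate_topics, toy_score_fn, min_prob=0.5)
print(updated)
```

Topic 0 receives the label "space and nasa", while topic 1 matches none of the candidates and keeps its c-TF-IDF keywords.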
## **Chain Models**
All of the above models can make use of the candidate topics, as generated by c-TF-IDF, to further fine-tune the topic representations. For example, `MaximalMarginalRelevance` takes the keywords in the candidate topics and re-ranks them. Similarly, the keywords in the candidate topic can be used as the input for GPT-prompts in `OpenAI`.
Although the default candidate topics are generated by c-TF-IDF, what if we were to chain these models? For example, we can use `MaximalMarginalRelevance` to improve upon the keywords in each topic before passing them to `OpenAI`.
This is supported in BERTopic by simply passing a list of representation models when instantiating the topic model:
```python
from bertopic.representation import MaximalMarginalRelevance, OpenAI
from bertopic import BERTopic
import openai
# Create your representation models
client = openai.OpenAI(api_key="sk-...")
openai_generator = OpenAI(client)
mmr = MaximalMarginalRelevance(diversity=0.3)
representation_models = [mmr, openai_generator]
# Use the chained models
topic_model = BERTopic(representation_model=representation_models)
```
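Conceptually, chaining means that each model receives the output of the previous one instead of the raw c-TF-IDF candidates. The sketch below shows that idea with two toy stand-ins written as plain functions; `drop_plural_duplicates`, `join_as_label`, and `run_chain` are hypothetical and only mimic the roles of a re-ranking step and a labeling step:

```python
def drop_plural_duplicates(topics):
    """Toy de-duplication step: drop "cars" when "car" is also a keyword."""
    updated = {}
    for topic_id, keywords in topics.items():
        words = {word for word, _ in keywords}
        updated[topic_id] = [
            (word, weight)
            for word, weight in keywords
            if not (word.endswith("s") and word[:-1] in words)
        ]
    return updated


def join_as_label(topics):
    """Toy labeling step: build one label from the two highest-ranked keywords."""
    return {
        topic_id: [(" / ".join(word for word, _ in keywords[:2]), keywords[0][1])]
        for topic_id, keywords in topics.items()
    }


def run_chain(models, candidate_topics):
    """Apply the models in order, feeding each one the previous output."""
    topics = candidate_topics
    for model in models:
        topics = model(topics)
    return topics


candidate_topics = {0: [("car", 0.9), ("cars", 0.8), ("engine", 0.5)]}
chained = run_chain([drop_plural_duplicates, join_as_label], candidate_topics)
label_only = run_chain([join_as_label], candidate_topics)
print(chained, label_only)
```

Because the labeling step only sees what the first step passes on, the order of the models in the list matters.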
## **Custom Model**
Although several representation models have been implemented in BERTopic, new technologies get released often and we should not have to wait until they get implemented in BERTopic. Therefore, you can create your own representation model and use that to fine-tune the topics.
The following is the basic structure for creating your custom model. Note that it returns the same topics as those
calculated with c-TF-IDF:
```python
from typing import List, Mapping, Tuple
from bertopic.representation._base import BaseRepresentation

class CustomRepresentationModel(BaseRepresentation):
    def extract_topics(self, topic_model, documents, c_tf_idf, topics
                       ) -> Mapping[str, List[Tuple[str, float]]]:
        """ Extract topics

        Arguments:
            topic_model: The BERTopic model
            documents: A dataframe of documents with their related topics
            c_tf_idf: The c-TF-IDF matrix
            topics: The candidate topics as calculated with c-TF-IDF

        Returns:
            updated_topics: Updated topic representations
        """
        # Return the candidate topics unchanged; replace this with your own logic
        updated_topics = topics.copy()
        return updated_topics
```
Then, we can use that model as follows:
```python
from bertopic import BERTopic
# Create our custom representation model
representation_model = CustomRepresentationModel()
# Pass our custom representation model to BERTopic
topic_model = BERTopic(representation_model=representation_model)
```
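As an example of logic that could replace the pass-through in the skeleton, the sketch below filters out very short keywords. It is written as a plain class so that it runs without BERTopic installed; to use it in BERTopic, it would subclass `BaseRepresentation` as in the skeleton. The class name and the `min_length` parameter are hypothetical:

```python
class MinLengthRepresentation:
    """Keep only keywords with at least `min_length` characters."""

    def __init__(self, min_length=4):
        self.min_length = min_length

    def extract_topics(self, topic_model, documents, c_tf_idf, topics):
        updated_topics = {}
        for topic_id, keywords in topics.items():
            kept = [
                (word, weight)
                for word, weight in keywords
                if len(word) >= self.min_length
            ]
            # Fall back to the original keywords if the filter removes everything.
            updated_topics[topic_id] = kept or keywords
        return updated_topics


# Toy candidate topics in the structure produced by c-TF-IDF.
candidate_topics = {
    "1": [("space", 0.9), ("of", 0.8), ("nasa", 0.7)],
    "2": [("is", 0.66), ("an", 0.6)],
}
model = MinLengthRepresentation(min_length=4)
# The first three arguments are unused by this sketch, so None is passed.
updated = model.extract_topics(None, None, None, candidate_topics)
print(updated)
```

The method keeps the four-parameter signature and returns a dictionary with the same keys and the same list-of-tuples values as its input.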
There are a few things to take into account when creating your custom model:
* It needs to have the exact same parameter input: `topic_model`, `documents`, `c_tf_idf`, `topics`.
* Make sure that `updated_topics` has the exact same structure as `topics`:
```python
updated_topics = {
    "1": [("space", 0.9), ("nasa", 0.7)],
    "2": [("science", 0.66), ("article", 0.6)]
}
```
!!! Tip
    You can change the `__init__` however you want; it does not influence the underlying structure. This
    also means that you can save data/embeddings/representations/sentiment in your custom representation
    model.
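While developing a custom model, a small check can catch structural mistakes in `updated_topics` early. The helper below is a hypothetical utility, not part of BERTopic, and it assumes that "same structure" means the same topic keys, each mapped to a list of `(keyword, weight)` tuples:

```python
def has_same_structure(topics, updated_topics):
    """Check that updated_topics mirrors the structure of topics."""
    if not isinstance(updated_topics, dict) or set(topics) != set(updated_topics):
        return False
    for keywords in updated_topics.values():
        if not isinstance(keywords, list):
            return False
        for item in keywords:
            is_pair = isinstance(item, tuple) and len(item) == 2
            if not (is_pair and isinstance(item[0], str)
                    and isinstance(item[1], (int, float))):
                return False
    return True


topics = {
    "1": [("space", 0.9), ("nasa", 0.7)],
    "2": [("science", 0.66), ("article", 0.6)],
}
good = {"1": [("space", 0.9)], "2": [("science", 0.66)]}
missing_topic = {"1": [("space", 0.9)]}
wrong_items = {"1": ["space", "nasa"], "2": [("science", 0.66)]}

print(has_same_structure(topics, good))           # True
print(has_same_structure(topics, missing_topic))  # False
print(has_same_structure(topics, wrong_items))    # False
```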