File size: 12,296 Bytes
19b102a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 |
# Embedding Models
BERTopic starts with transforming our input documents into numerical representations. Although there are many ways this can be achieved, we typically use sentence-transformers (`"all-MiniLM-L6-v2"`) as it is quite capable of capturing the semantic similarity between documents.
However, there is not one perfect
embedding model and you might want to be using something entirely different for your use case. Since BERTopic assumes some independence among steps, we can allow for this modularity:
<figure markdown>

<figcaption></figcaption>
</figure>
This modularity allows us not only to choose any embedding model to convert our documents into numerical representations, we can use essentially any data to perform our clustering.
When new state-of-the-art pre-trained embedding models are released, BERTopic will be able to use them. As a result, BERTopic grows with any new models being released.
Out of the box, BERTopic supports several embedding techniques. In this section, we will go through several of them and how they can be implemented.
### **Sentence Transformers**
You can select any model from sentence-transformers [here](https://www.sbert.net/docs/pretrained_models.html)
and pass it through BERTopic with `embedding_model`:
```python
from bertopic import BERTopic
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
```
Or select a SentenceTransformer model with your parameters:
```python
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=sentence_model)
```
!!! tip "Tip 1!"
This embedding back-end was put here first for a reason, sentence-transformers works amazing out of the box! Playing around with different models can give you great results. Also, make sure to frequently visit [this](https://www.sbert.net/docs/pretrained_models.html) page as new models are often released.
!!! tip "Tip 2!"
New embedding models are released frequently and their performance keeps getting better. To keep track of the best embedding models out there, you can visit the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). It is an excellent place for selecting the embedding that works best for you. For example, if you want the best of the best, then the top 5 models might the place to look.
Many of these models can be used with `SentenceTransformers` in BERTopic, like so:
```python
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
topic_model = BERTopic(embedding_model=embedding_model)
```
### 🤗 Hugging Face Transformers
To use a Hugging Face transformers model, load in a pipeline and point
to any model found on their model hub (https://huggingface.co/models):
```python
from transformers.pipelines import pipeline
embedding_model = pipeline("feature-extraction", model="distilbert-base-cased")
topic_model = BERTopic(embedding_model=embedding_model)
```
!!! tip "Tip!"
These transformers also work quite well using `sentence-transformers` which has great optimizations tricks that make using it a bit faster.
### **Flair**
[Flair](https://github.com/flairNLP/flair) allows you to choose almost any embedding model that
is publicly available. Flair can be used as follows:
```python
from flair.embeddings import TransformerDocumentEmbeddings
roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)
```
You can select any 🤗 transformers model [here](https://huggingface.co/models).
Moreover, you can also use Flair to use word embeddings and pool them to create document embeddings.
Under the hood, Flair simply averages all word embeddings in a document. Then, we can easily
pass it to BERTopic to use those word embeddings as document embeddings:
```python
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings
glove_embedding = WordEmbeddings('crawl')
document_glove_embeddings = DocumentPoolEmbeddings([glove_embedding])
topic_model = BERTopic(embedding_model=document_glove_embeddings)
```
### **Spacy**
[Spacy](https://github.com/explosion/spaCy) is an amazing framework for processing text. There are
many models available across many languages for modeling text.
To use Spacy's non-transformer models in BERTopic:
```python
import spacy
nlp = spacy.load("en_core_web_md", exclude=['tagger', 'parser', 'ner',
'attribute_ruler', 'lemmatizer'])
topic_model = BERTopic(embedding_model=nlp)
```
Using spacy-transformer models:
```python
import spacy
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner',
'attribute_ruler', 'lemmatizer'])
topic_model = BERTopic(embedding_model=nlp)
```
If you run into memory issues with spacy-transformer models, try:
```python
import spacy
from thinc.api import set_gpu_allocator, require_gpu
nlp = spacy.load("en_core_web_trf", exclude=['tagger', 'parser', 'ner',
'attribute_ruler', 'lemmatizer'])
set_gpu_allocator("pytorch")
require_gpu(0)
topic_model = BERTopic(embedding_model=nlp)
```
### **Universal Sentence Encoder (USE)**
The Universal Sentence Encoder encodes text into high-dimensional vectors that are used here
for embedding the documents. The model is trained and optimized for greater-than-word length text,
such as sentences, phrases, or short paragraphs.
Using USE in BERTopic is rather straightforward:
```python
import tensorflow_hub
embedding_model = tensorflow_hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
topic_model = BERTopic(embedding_model=embedding_model)
```
### **Gensim**
BERTopic supports the `gensim.downloader` module, which allows it to download any word embedding model supported by Gensim.
Typically, these are Glove, Word2Vec, or FastText embeddings:
```python
import gensim.downloader as api
ft = api.load('fasttext-wiki-news-subwords-300')
topic_model = BERTopic(embedding_model=ft)
```
!!! tip "Tip!"
Gensim is primarily used for Word Embedding models. This works typically best for short documents since the word embeddings are pooled.
### **Scikit-Learn Embeddings**
Scikit-Learn is a framework for more than just machine learning.
It offers many preprocessing tools, some of which can be used to create representations
for text. Many of these tools are relatively lightweight and do not require a GPU.
While the representations may be less expressive than many BERT models, the fact that
it runs much faster can make it a relevant candidate to consider.
If you have a scikit-learn compatible pipeline that you'd like to use to embed
text then you can also pass this to BERTopic.
```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
pipe = make_pipeline(
TfidfVectorizer(),
TruncatedSVD(100)
)
topic_model = BERTopic(embedding_model=pipe)
```
!!! Warning
One caveat to be aware of is that scikit-learns base `Pipeline` class does not
support the `.partial_fit()`-API. If you have a pipeline that theoretically should
be able to support online learning then you might want to explore
the [scikit-partial](https://github.com/koaning/scikit-partial) project.
Moreover, since this backend does not generate representations on a word level,
it does not support the `bertopic.representation` models.
### OpenAI
To use OpenAI's external API, we need to define our key and explicitly call `bertopic.backend.OpenAIBackend`
to be used in our topic model:
```python
import openai
from bertopic.backend import OpenAIBackend
client = openai.OpenAI(api_key="sk-...")
embedding_model = OpenAIBackend(client, "text-embedding-ada-002")
topic_model = BERTopic(embedding_model=embedding_model)
```
### Cohere
To use Cohere's external API, we need to define our key and explicitly call `bertopic.backend.CohereBackend`
to be used in our topic model:
```python
import cohere
from bertopic.backend import CohereBackend
client = cohere.Client("MY_API_KEY")
embedding_model = CohereBackend(client)
topic_model = BERTopic(embedding_model=embedding_model)
```
### Multimodal
To create embeddings for both text and images in the same vector space, we can use the `MultiModalBackend`.
This model uses a clip-vit based model that is capable of embedding text, images, or both:
```python
from bertopic.backend import MultiModalBackend
model = MultiModalBackend('clip-ViT-B-32', batch_size=32)
# Embed documents only
doc_embeddings = model.embed_documents(docs)
# Embeding images only
image_embeddings = model.embed_images(images)
# Embed both images and documents, then average them
doc_image_embeddings = model.embed(docs, images)
```
### **Custom Backend**
If your backend or model cannot be found in the ones currently available, you can use the `bertopic.backend.BaseEmbedder` class to
create your backend. Below, you will find an example of creating a SentenceTransformer backend for BERTopic:
```python
from bertopic.backend import BaseEmbedder
from sentence_transformers import SentenceTransformer
class CustomEmbedder(BaseEmbedder):
def __init__(self, embedding_model):
super().__init__()
self.embedding_model = embedding_model
def embed(self, documents, verbose=False):
embeddings = self.embedding_model.encode(documents, show_progress_bar=verbose)
return embeddings
# Create custom backend
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
custom_embedder = CustomEmbedder(embedding_model=embedding_model)
# Pass custom backend to bertopic
topic_model = BERTopic(embedding_model=custom_embedder)
```
### **Custom Embeddings**
The base models in BERTopic are BERT-based models that work well with document similarity tasks. Your documents,
however, might be too specific for a general pre-trained model to be used. Fortunately, you can use the embedding
model in BERTopic to create document features.
You only need to prepare the document embeddings yourself and pass them through `fit_transform` of BERTopic:
```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
# Train our topic model using our pre-trained sentence-transformers embeddings
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs, embeddings)
```
As you can see above, we used a SentenceTransformer model to create the embedding. You could also have used
`🤗 transformers`, `Doc2Vec`, or any other embedding method.
#### **TF-IDF**
As mentioned above, any embedding technique can be used. However, when running UMAP, the typical distance metric is
`cosine` which does not work quite well for a TF-IDF matrix. Instead, BERTopic will recognize that a sparse matrix
is passed and use `hellinger` instead which works quite well for the similarity between probability distributions.
We simply create a TF-IDF matrix and use them as embeddings in our `fit_transform` method:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
# Create TF-IDF sparse matrix
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
vectorizer = TfidfVectorizer(min_df=5)
embeddings = vectorizer.fit_transform(docs)
# Train our topic model using TF-IDF vectors
topic_model = BERTopic(stop_words="english")
topics, probs = topic_model.fit_transform(docs, embeddings)
```
Here, you will probably notice that creating the embeddings is quite fast whereas `fit_transform` is quite slow.
This is to be expected as reducing the dimensionality of a large sparse matrix takes some time. The inverse of using
transformer embeddings is true: creating the embeddings is slow whereas `fit_transform` is quite fast.
|