## **Installation**

Installation, with sentence-transformers, can be done using [PyPI](https://pypi.org/project/bertopic/):

```bash
pip install bertopic
```

Depending on the transformers and language backends that you will be using, you may want to install more. The possible installations are:

```bash
# Choose an embedding backend (quoted so the shell passes the extras as one argument)
pip install "bertopic[flair,gensim,spacy,use]"

# Topic modeling with images
pip install "bertopic[vision]"
```
## **Quick Start**

We start by extracting topics from the well-known 20 newsgroups dataset, which consists of English documents:

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Fetch the documents without headers, footers, and quotes
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

# Train the model and extract the topic per document
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```
After generating topics, we can access the most frequent topics that were generated:

```python
>>> topic_model.get_topic_info()

Topic  Count  Name
   -1   4630  -1_can_your_will_any
    0    693  0_windows_drive_dos_file
    1    466  1_jesus_bible_christian_faith
    2    441  2_space_launch_orbit_lunar
    3    381  3_key_encryption_keys_encrypted
```

Topic `-1` refers to all outliers and should typically be ignored. Next, let's take a look at the most
frequent topic that was generated, topic 0:
```python
>>> topic_model.get_topic(0)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
```
Using `.get_document_info`, we can also extract information on a document level, such as their corresponding topics, probabilities, whether they are representative documents for a topic, etc.:

```python
>>> topic_model.get_document_info(docs)

Document                              Topic  Name                        Top_n_words             Probability  ...
I am sure some bashers of Pens...         0  0_game_team_games_season    game - team - games...     0.200010  ...
My brother is in the market for...       -1  -1_can_your_will_any        can - your - will...       0.420668  ...
Finally you said what you dream...       -1  -1_can_your_will_any        can - your - will...       0.807259  ...
Think! It is the SCSI card doing...      49  49_windows_drive_dos_file   windows - drive - dos...   0.071746  ...
1) I have an old Jasmine drive...        49  49_windows_drive_dos_file   windows - drive - dos...   0.038983  ...
```
!!! tip "Multilingual"
    Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
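    As a minimal sketch, training then looks the same as before, only with the `language` parameter set:

    ```python
    from bertopic import BERTopic

    # "multilingual" selects a multilingual embedding model under the hood
    topic_model = BERTopic(language="multilingual")
    topics, probs = topic_model.fit_transform(docs)
    ```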
## **Fine-tune Topic Representations**

In BERTopic, there are a number of different [topic representations](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html) that we can choose from. They are all quite different from one another and give interesting perspectives and variations of topic representations. A great start is `KeyBERTInspired`, which for many users increases the coherence and reduces stopwords in the resulting topic representations:
```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# Fine-tune your topic representations
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)
```
However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:

```python
import openai
from bertopic import BERTopic
from bertopic.representation import OpenAI

# Fine-tune topic representations with GPT
client = openai.OpenAI(api_key="sk-...")
representation_model = OpenAI(client, model="gpt-3.5-turbo", chat=True)
topic_model = BERTopic(representation_model=representation_model)
```
!!! tip "Multi-aspect Topic Modeling"
    Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) in BERTopic.
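    As a rough sketch of the dictionary-based API described on that page, multiple representations can be trained at once by passing a dictionary of representation models:

    ```python
    from bertopic import BERTopic
    from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

    # Each key becomes a separately named "aspect" of the topic representations
    representation_models = {
        "KeyBERT": KeyBERTInspired(),
        "MMR": MaximalMarginalRelevance(diversity=0.3),
    }
    topic_model = BERTopic(representation_model=representation_models)
    ```

    The different aspects can then be retrieved together with, for example, `topic_model.get_topic(0, full=True)`.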
## **Visualizations**

After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the [many visualization options](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html) in BERTopic. For example, we can visualize the topics that were generated in a way very similar to [LDAvis](https://github.com/cpsievert/LDAvis):

```python
topic_model.visualize_topics()
```
<iframe src="viz.html" style="width:1000px; height: 680px; border: 0px;"></iframe>
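The same pattern applies to the other visualization methods; as a brief sketch of a few commonly used ones:

```python
# Bar charts of the top words per topic
topic_model.visualize_barchart()

# Hierarchical structure of the topics
topic_model.visualize_hierarchy()

# Similarity matrix between topics
topic_model.visualize_heatmap()
```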
## **Save/Load BERTopic model**

There are three methods for saving BERTopic:

1. A light model with `.safetensors` and config files
2. A light model with pytorch `.bin` and config files
3. A full model with `.pickle`

Method 3 allows for saving the entire topic model but has several drawbacks:

* Arbitrary code can be run from `.pickle` files
* The resulting model is rather large (often > 500MB) since all sub-models need to be saved
* Explicit and specific version control is needed, as pickled models typically only run in the exact environment in which they were saved

> **It is advised to use methods 1 or 2 for saving.**

These methods have a number of advantages:

* `.safetensors` is a relatively **safe format**
* The resulting model can be **very small** (often < 20MB) since no sub-models need to be saved
* Although version control is important, there is a bit more **flexibility** with respect to specific versions of packages
* More easily used in **production**
* **Share** models with the HuggingFace Hub
!!! tip "Tip"
    For more detail about how to load in a custom vectorizer, representation model, and more, it is highly advised to check out the [serialization](https://maartengr.github.io/BERTopic/getting_started/serialization/serialization.html) page. It contains more examples, details, and some tips and tricks for loading and saving your environment.
The methods are used as follows:

```python
topic_model = BERTopic().fit(my_docs)

# Method 1 - safetensors
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 2 - pytorch
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
topic_model.save("path/to/my/model_dir", serialization="pytorch", save_ctfidf=True, save_embedding_model=embedding_model)

# Method 3 - pickle
topic_model.save("my_model", serialization="pickle")
```
To load a model:

```python
# Load from directory
loaded_model = BERTopic.load("path/to/my/model_dir")

# Load from file
loaded_model = BERTopic.load("my_model")

# Load from HuggingFace
loaded_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
```
!!! warning "Warning"
    When saving the model, make sure to also keep track of the versions of dependencies and Python used.
    Loading and saving the model should be done using the same dependencies and Python version. Moreover,
    models saved in one version of BERTopic should not be loaded in other versions.
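    As a minimal sketch for recording these versions (assuming `bertopic.__version__` is exposed, as is common for Python packages):

    ```python
    import sys
    import bertopic

    # Record the exact versions used when saving the model
    print(f"bertopic=={bertopic.__version__}")
    print(f"python=={sys.version.split()[0]}")
    ```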