kisejin's picture
Upload 261 files
19b102a verified
Visualizing BERTopic and its derivatives is important in understanding the model, how it works, and more importantly, where it works.
Since topic modeling can be quite a subjective field it is difficult for users to validate their models. Looking at the topics and seeing
if they make sense is an important factor in alleviating this issue.
## **Visualize Topics**
After having trained our `BERTopic` model, we can iteratively go through hundreds of topics to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis).
We embed our c-TF-IDF representation of the topics in 2D using Umap and then visualize the two dimensions using
plotly such that we can create an interactive view.
First, we need to train our model:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```
Then, we can call `.visualize_topics` to create a 2D representation of your topics. The resulting graph is a
plotly interactive graph which can be converted to HTML:
```python
topic_model.visualize_topics()
```
<iframe src="viz.html" style="width:1000px; height: 680px; border: 0px;""></iframe>
You can use the slider to select the topic which then lights up red. If you hover over a topic, then general
information is given about the topic, including the size of the topic and its corresponding words.
## **Visualize Documents**
Using the previous method, we can visualize the topics and get insight into their relationships. However,
you might want a more fine-grained approach where we can visualize the documents inside the topics to see
if they were assigned correctly or whether they make sense. To do so, we can use the `topic_model.visualize_documents()`
function. This function recalculates the document embeddings and reduces them to 2-dimensional space for easier visualization
purposes. This process can be quite expensive, so it is advised to adhere to the following pipeline:
```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
# Train BERTopic
topic_model = BERTopic().fit(docs, embeddings)
# Run the visualization with the original embeddings
topic_model.visualize_documents(docs, embeddings=embeddings)
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(docs, reduced_embeddings=reduced_embeddings)
```
<iframe src="documents.html" style="width:1200px; height: 800px; border: 0px;""></iframe>
!!! note
The visualization above was generated with the additional parameter `hide_document_hover=True` which disables the
option to hover over the individual points and see the content of the documents. This was done for demonstration purposes
as saving all those documents in the visualization can be quite expensive and result in large files. However,
it might be interesting to set `hide_document_hover=False` in order to hover over the points and see the content of the documents.
### **Custom Hover**
When you visualize the documents, you might not always want to see the complete document over hover. Many documents have shorter information that might be more interesting to visualize, such as its title. To create the hover based on a documents' title instead of its content, you can simply pass a variable (`titles`) containing the title for each document:
```python
topic_model.visualize_documents(titles, reduced_embeddings=reduced_embeddings)
```
## **Visualize Topic Hierarchy**
The topics that were created can be hierarchically reduced. In order to understand the potential hierarchical
structure of the topics, we can use `scipy.cluster.hierarchy` to create clusters and visualize how
they relate to one another. This might help to select an appropriate `nr_topics` when reducing the number
of topics that you have created. To visualize this hierarchy, run the following:
```python
topic_model.visualize_hierarchy()
```
<iframe src="hierarchy.html" style="width:1000px; height: 680px; border: 0px;""></iframe>
!!! note
Do note that this is not the actual procedure of `.reduce_topics()` when `nr_topics` is set to
auto since HDBSCAN is used to automatically extract topics. The visualization above closely resembles
the actual procedure of `.reduce_topics()` when any number of `nr_topics` is selected.
### **Hierarchical labels**
Although visualizing this hierarchy gives us information about the structure, it would be helpful to see what happens
to the topic representations when merging topics. To do so, we first need to calculate the representations of the
hierarchical topics:
First, we train a basic BERTopic model:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"]
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)
hierarchical_topics = topic_model.hierarchical_topics(docs)
```
To visualize these results, we simply need to pass the resulting `hierarchical_topics` to our `.visualize_hierarchy` function:
```python
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)
```
<iframe src="hierarchical_topics.html" style="width:1000px; height: 2150px; border: 0px;""></iframe>
If you **hover** over the black circles, you will see the topic representation at that level of the hierarchy. These representations
help you understand the effect of merging certain topics. Some might be logical to merge whilst others might not. Moreover,
we can now see which sub-topics can be found within certain larger themes.
### **Text-based topic tree**
Although this gives a nice overview of the potential hierarchy, hovering over all black circles can be tiresome. Instead, we can
use `topic_model.get_topic_tree` to create a text-based representation of this hierarchy. Although the general structure is more difficult
to view, we can see better which topics could be logically merged:
```python
>>> tree = topic_model.get_topic_tree(hierarchical_topics)
>>> print(tree)
.
└─atheists_atheism_god_moral_atheist
β”œβ”€atheists_atheism_god_atheist_argument
β”‚ β”œβ”€β– β”€β”€atheists_atheism_god_atheist_argument ── Topic: 21
β”‚ └─■──br_god_exist_genetic_existence ── Topic: 124
└─■──moral_morality_objective_immoral_morals ── Topic: 29
```
<details>
<summary>Click here to view the full tree.</summary>
```bash
.
β”œβ”€people_armenian_said_god_armenians
β”‚ β”œβ”€god_jesus_jehovah_lord_christ
β”‚ β”‚ β”œβ”€god_jesus_jehovah_lord_christ
β”‚ β”‚ β”‚ β”œβ”€jehovah_lord_mormon_mcconkie_god
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€ra_satan_thou_god_lucifer ── Topic: 94
β”‚ β”‚ β”‚ β”‚ └─■──jehovah_lord_mormon_mcconkie_unto ── Topic: 78
β”‚ β”‚ β”‚ └─jesus_mary_god_hell_sin
β”‚ β”‚ β”‚ β”œβ”€jesus_hell_god_eternal_heaven
β”‚ β”‚ β”‚ β”‚ β”œβ”€hell_jesus_eternal_god_heaven
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€jesus_tomb_disciples_resurrection_john ── Topic: 69
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──hell_eternal_god_jesus_heaven ── Topic: 53
β”‚ β”‚ β”‚ β”‚ └─■──aaron_baptism_sin_law_god ── Topic: 89
β”‚ β”‚ β”‚ └─■──mary_sin_maria_priest_conception ── Topic: 56
β”‚ β”‚ └─■──marriage_married_marry_ceremony_marriages ── Topic: 110
β”‚ └─people_armenian_armenians_said_mr
β”‚ β”œβ”€people_armenian_armenians_said_israel
β”‚ β”‚ β”œβ”€god_homosexual_homosexuality_atheists_sex
β”‚ β”‚ β”‚ β”œβ”€homosexual_homosexuality_sex_gay_homosexuals
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€kinsey_sex_gay_men_sexual ── Topic: 44
β”‚ β”‚ β”‚ β”‚ └─homosexuality_homosexual_sin_homosexuals_gay
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€gay_homosexual_homosexuals_sexual_cramer ── Topic: 50
β”‚ β”‚ β”‚ β”‚ └─■──homosexuality_homosexual_sin_paul_sex ── Topic: 27
β”‚ β”‚ β”‚ └─god_atheists_atheism_moral_atheist
β”‚ β”‚ β”‚ β”œβ”€islam_quran_judas_islamic_book
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€jim_context_challenges_articles_quote ── Topic: 36
β”‚ β”‚ β”‚ β”‚ └─islam_quran_judas_islamic_book
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€islam_quran_islamic_rushdie_muslims ── Topic: 31
β”‚ β”‚ β”‚ β”‚ └─■──judas_scripture_bible_books_greek ── Topic: 33
β”‚ β”‚ β”‚ └─atheists_atheism_god_moral_atheist
β”‚ β”‚ β”‚ β”œβ”€atheists_atheism_god_atheist_argument
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€atheists_atheism_god_atheist_argument ── Topic: 21
β”‚ β”‚ β”‚ β”‚ └─■──br_god_exist_genetic_existence ── Topic: 124
β”‚ β”‚ β”‚ └─■──moral_morality_objective_immoral_morals ── Topic: 29
β”‚ β”‚ └─armenian_armenians_people_israel_said
β”‚ β”‚ β”œβ”€armenian_armenians_israel_people_jews
β”‚ β”‚ β”‚ β”œβ”€tax_rights_government_income_taxes
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€rights_right_slavery_slaves_residence ── Topic: 106
β”‚ β”‚ β”‚ β”‚ └─tax_government_taxes_income_libertarians
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€government_libertarians_libertarian_regulation_party ── Topic: 58
β”‚ β”‚ β”‚ β”‚ └─■──tax_taxes_income_billion_deficit ── Topic: 41
β”‚ β”‚ β”‚ └─armenian_armenians_israel_people_jews
β”‚ β”‚ β”‚ β”œβ”€gun_guns_militia_firearms_amendment
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€blacks_penalty_death_cruel_punishment ── Topic: 55
β”‚ β”‚ β”‚ β”‚ └─■──gun_guns_militia_firearms_amendment ── Topic: 7
β”‚ β”‚ β”‚ └─armenian_armenians_israel_jews_turkish
β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€israel_israeli_jews_arab_jewish ── Topic: 4
β”‚ β”‚ β”‚ └─■──armenian_armenians_turkish_armenia_azerbaijan ── Topic: 15
β”‚ β”‚ └─stephanopoulos_president_mr_myers_ms
β”‚ β”‚ β”œβ”€β– β”€β”€serbs_muslims_stephanopoulos_mr_bosnia ── Topic: 35
β”‚ β”‚ └─■──myers_stephanopoulos_president_ms_mr ── Topic: 87
β”‚ └─batf_fbi_koresh_compound_gas
β”‚ β”œβ”€β– β”€β”€reno_workers_janet_clinton_waco ── Topic: 77
β”‚ └─batf_fbi_koresh_gas_compound
β”‚ β”œβ”€batf_koresh_fbi_warrant_compound
β”‚ β”‚ β”œβ”€β– β”€β”€batf_warrant_raid_compound_fbi ── Topic: 42
β”‚ β”‚ └─■──koresh_batf_fbi_children_compound ── Topic: 61
β”‚ └─■──fbi_gas_tear_bds_building ── Topic: 23
└─use_like_just_dont_new
β”œβ”€game_team_year_games_like
β”‚ β”œβ”€game_team_games_25_year
β”‚ β”‚ β”œβ”€game_team_games_25_season
β”‚ β”‚ β”‚ β”œβ”€window_printer_use_problem_mhz
β”‚ β”‚ β”‚ β”‚ β”œβ”€mhz_wire_simms_wiring_battery
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€simms_mhz_battery_cpu_heat
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€simms_pds_simm_vram_lc
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€pds_nubus_lc_slot_card ── Topic: 119
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──simms_simm_vram_meg_dram ── Topic: 32
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─mhz_battery_cpu_heat_speed
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€mhz_cpu_speed_heat_fan
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€mhz_cpu_speed_heat_fan
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€fan_cpu_heat_sink_fans ── Topic: 92
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──mhz_speed_cpu_fpu_clock ── Topic: 22
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──monitor_turn_power_computer_electricity ── Topic: 91
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─battery_batteries_concrete_duo_discharge
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€duo_battery_apple_230_problem ── Topic: 121
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──battery_batteries_concrete_discharge_temperature ── Topic: 75
β”‚ β”‚ β”‚ β”‚ β”‚ └─wire_wiring_ground_neutral_outlets
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€wire_wiring_ground_neutral_outlets
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€wire_wiring_ground_neutral_outlets
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€leds_uv_blue_light_boards ── Topic: 66
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──wire_wiring_ground_neutral_outlets ── Topic: 120
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─scope_scopes_phone_dial_number
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€dial_number_phone_line_output ── Topic: 93
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──scope_scopes_motorola_generator_oscilloscope ── Topic: 113
β”‚ β”‚ β”‚ β”‚ β”‚ └─celp_dsp_sampling_antenna_digital
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€antenna_antennas_receiver_cable_transmitter ── Topic: 70
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──celp_dsp_sampling_speech_voice ── Topic: 52
β”‚ β”‚ β”‚ β”‚ └─window_printer_xv_mouse_windows
β”‚ β”‚ β”‚ β”‚ β”œβ”€window_xv_error_widget_problem
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€error_symbol_undefined_xterm_rx
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€symbol_error_undefined_doug_parse ── Topic: 63
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──rx_remote_server_xdm_xterm ── Topic: 45
β”‚ β”‚ β”‚ β”‚ β”‚ └─window_xv_widget_application_expose
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€window_widget_expose_application_event
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€gc_mydisplay_draw_gxxor_drawing ── Topic: 103
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──window_widget_application_expose_event ── Topic: 25
β”‚ β”‚ β”‚ β”‚ β”‚ └─xv_den_polygon_points_algorithm
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€den_polygon_points_algorithm_polygons ── Topic: 28
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──xv_24bit_image_bit_images ── Topic: 57
β”‚ β”‚ β”‚ β”‚ └─printer_fonts_print_mouse_postscript
β”‚ β”‚ β”‚ β”‚ β”œβ”€printer_fonts_print_font_deskjet
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€scanner_logitech_grayscale_ocr_scanman ── Topic: 108
β”‚ β”‚ β”‚ β”‚ β”‚ └─printer_fonts_print_font_deskjet
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€printer_print_deskjet_hp_ink ── Topic: 18
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──fonts_font_truetype_tt_atm ── Topic: 49
β”‚ β”‚ β”‚ β”‚ └─mouse_ghostscript_midi_driver_postscript
β”‚ β”‚ β”‚ β”‚ β”œβ”€ghostscript_midi_postscript_files_file
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€ghostscript_postscript_pageview_ghostview_dsc ── Topic: 104
β”‚ β”‚ β”‚ β”‚ β”‚ └─midi_sound_file_windows_driver
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€location_mar_file_host_rwrr ── Topic: 83
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──midi_sound_driver_blaster_soundblaster ── Topic: 98
β”‚ β”‚ β”‚ β”‚ └─■──mouse_driver_mice_ball_problem ── Topic: 68
β”‚ β”‚ β”‚ └─game_team_games_25_season
β”‚ β”‚ β”‚ β”œβ”€1st_sale_condition_comics_hulk
β”‚ β”‚ β”‚ β”‚ β”œβ”€sale_condition_offer_asking_cd
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€condition_stereo_amp_speakers_asking
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€miles_car_amfm_toyota_cassette ── Topic: 62
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──amp_speakers_condition_stereo_audio ── Topic: 24
β”‚ β”‚ β”‚ β”‚ β”‚ └─games_sale_pom_cds_shipping
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€pom_cds_sale_shipping_cd
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€size_shipping_sale_condition_mattress ── Topic: 100
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──pom_cds_cd_sale_picture ── Topic: 37
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──games_game_snes_sega_genesis ── Topic: 40
β”‚ β”‚ β”‚ β”‚ └─1st_hulk_comics_art_appears
β”‚ β”‚ β”‚ β”‚ β”œβ”€1st_hulk_comics_art_appears
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€lens_tape_camera_backup_lenses
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€tape_backup_tapes_drive_4mm ── Topic: 107
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──lens_camera_lenses_zoom_pouch ── Topic: 114
β”‚ β”‚ β”‚ β”‚ β”‚ └─1st_hulk_comics_art_appears
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€1st_hulk_comics_art_appears ── Topic: 105
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──books_book_cover_trek_chemistry ── Topic: 125
β”‚ β”‚ β”‚ β”‚ └─tickets_hotel_ticket_voucher_package
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€hotel_voucher_package_vacation_room ── Topic: 74
β”‚ β”‚ β”‚ β”‚ └─■──tickets_ticket_june_airlines_july ── Topic: 84
β”‚ β”‚ β”‚ └─game_team_games_season_hockey
β”‚ β”‚ β”‚ β”œβ”€game_hockey_team_25_550
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€espn_pt_pts_game_la ── Topic: 17
β”‚ β”‚ β”‚ β”‚ └─■──team_25_game_hockey_550 ── Topic: 2
β”‚ β”‚ β”‚ └─■──year_game_hit_baseball_players ── Topic: 0
β”‚ β”‚ └─bike_car_greek_insurance_msg
β”‚ β”‚ β”œβ”€car_bike_insurance_cars_engine
β”‚ β”‚ β”‚ β”œβ”€car_insurance_cars_radar_engine
β”‚ β”‚ β”‚ β”‚ β”œβ”€insurance_health_private_care_canada
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€insurance_health_private_care_canada ── Topic: 99
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──insurance_car_accident_rates_sue ── Topic: 82
β”‚ β”‚ β”‚ β”‚ └─car_cars_radar_engine_detector
β”‚ β”‚ β”‚ β”‚ β”œβ”€car_radar_cars_detector_engine
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€radar_detector_detectors_ka_alarm ── Topic: 39
β”‚ β”‚ β”‚ β”‚ β”‚ └─car_cars_mustang_ford_engine
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€clutch_shift_shifting_transmission_gear ── Topic: 88
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──car_cars_mustang_ford_v8 ── Topic: 14
β”‚ β”‚ β”‚ β”‚ └─oil_diesel_odometer_diesels_car
β”‚ β”‚ β”‚ β”‚ β”œβ”€odometer_oil_sensor_car_drain
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€odometer_sensor_speedo_gauge_mileage ── Topic: 96
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──oil_drain_car_leaks_taillights ── Topic: 102
β”‚ β”‚ β”‚ β”‚ └─■──diesel_diesels_emissions_fuel_oil ── Topic: 79
β”‚ β”‚ β”‚ └─bike_riding_ride_bikes_motorcycle
β”‚ β”‚ β”‚ β”œβ”€bike_ride_riding_bikes_lane
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€bike_ride_riding_lane_car ── Topic: 11
β”‚ β”‚ β”‚ β”‚ └─■──bike_bikes_miles_honda_motorcycle ── Topic: 19
β”‚ β”‚ β”‚ └─■──countersteering_bike_motorcycle_rear_shaft ── Topic: 46
β”‚ β”‚ └─greek_msg_kuwait_greece_water
β”‚ β”‚ β”œβ”€greek_msg_kuwait_greece_water
β”‚ β”‚ β”‚ β”œβ”€greek_msg_kuwait_greece_dog
β”‚ β”‚ β”‚ β”‚ β”œβ”€greek_msg_kuwait_greece_dog
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€greek_kuwait_greece_turkish_greeks
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€greek_greece_turkish_greeks_cyprus ── Topic: 71
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──kuwait_iraq_iran_gulf_arabia ── Topic: 76
β”‚ β”‚ β”‚ β”‚ β”‚ └─msg_dog_drugs_drug_food
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€dog_dogs_cooper_trial_weaver
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€clinton_bush_quayle_reagan_panicking ── Topic: 101
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─dog_dogs_cooper_trial_weaver
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€cooper_trial_weaver_spence_witnesses ── Topic: 90
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──dog_dogs_bike_trained_springer ── Topic: 67
β”‚ β”‚ β”‚ β”‚ β”‚ └─msg_drugs_drug_food_chinese
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€msg_food_chinese_foods_taste ── Topic: 30
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──drugs_drug_marijuana_cocaine_alcohol ── Topic: 72
β”‚ β”‚ β”‚ β”‚ └─water_theory_universe_science_larsons
β”‚ β”‚ β”‚ β”‚ β”œβ”€water_nuclear_cooling_steam_dept
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€rocketry_rockets_engines_nuclear_plutonium ── Topic: 115
β”‚ β”‚ β”‚ β”‚ β”‚ └─water_cooling_steam_dept_plants
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€water_dept_phd_environmental_atmospheric ── Topic: 97
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──cooling_water_steam_towers_plants ── Topic: 109
β”‚ β”‚ β”‚ β”‚ └─theory_universe_larsons_larson_science
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€theory_universe_larsons_larson_science ── Topic: 54
β”‚ β”‚ β”‚ β”‚ └─■──oort_cloud_grbs_gamma_burst ── Topic: 80
β”‚ β”‚ β”‚ └─helmet_kirlian_photography_lock_wax
β”‚ β”‚ β”‚ β”œβ”€helmet_kirlian_photography_leaf_mask
β”‚ β”‚ β”‚ β”‚ β”œβ”€kirlian_photography_leaf_pictures_deleted
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€deleted_joke_stuff_maddi_nickname
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€joke_maddi_nickname_nicknames_frank ── Topic: 43
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ └─■──deleted_stuff_bookstore_joke_motto ── Topic: 81
β”‚ β”‚ β”‚ β”‚ β”‚ └─■──kirlian_photography_leaf_pictures_aura ── Topic: 85
β”‚ β”‚ β”‚ β”‚ └─helmet_mask_liner_foam_cb
β”‚ β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€helmet_liner_foam_cb_helmets ── Topic: 112
β”‚ β”‚ β”‚ β”‚ └─■──mask_goalies_77_santore_tl ── Topic: 123
β”‚ β”‚ β”‚ └─lock_wax_paint_plastic_ear
β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€lock_cable_locks_bike_600 ── Topic: 117
β”‚ β”‚ β”‚ └─wax_paint_ear_plastic_skin
β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€wax_paint_plastic_scratches_solvent ── Topic: 65
β”‚ β”‚ β”‚ └─■──ear_wax_skin_greasy_acne ── Topic: 116
β”‚ β”‚ └─m4_mp_14_mw_mo
β”‚ β”‚ β”œβ”€m4_mp_14_mw_mo
β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€m4_mp_14_mw_mo ── Topic: 111
β”‚ β”‚ β”‚ └─■──test_ensign_nameless_deane_deanebinahccbrandeisedu ── Topic: 118
β”‚ β”‚ └─■──ites_cheek_hello_hi_ken ── Topic: 3
β”‚ └─space_medical_health_disease_cancer
β”‚ β”œβ”€medical_health_disease_cancer_patients
β”‚ β”‚ β”œβ”€β– β”€β”€cancer_centers_center_medical_research ── Topic: 122
β”‚ β”‚ └─health_medical_disease_patients_hiv
β”‚ β”‚ β”œβ”€patients_medical_disease_candida_health
β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€candida_yeast_infection_gonorrhea_infections ── Topic: 48
β”‚ β”‚ β”‚ └─patients_disease_cancer_medical_doctor
β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€hiv_medical_cancer_patients_doctor ── Topic: 34
β”‚ β”‚ β”‚ └─■──pain_drug_patients_disease_diet ── Topic: 26
β”‚ β”‚ └─■──health_newsgroup_tobacco_vote_votes ── Topic: 9
β”‚ └─space_launch_nasa_shuttle_orbit
β”‚ β”œβ”€space_moon_station_nasa_launch
β”‚ β”‚ β”œβ”€β– β”€β”€sky_advertising_billboard_billboards_space ── Topic: 59
β”‚ β”‚ └─■──space_station_moon_redesign_nasa ── Topic: 16
β”‚ └─space_mission_hst_launch_orbit
β”‚ β”œβ”€space_launch_nasa_orbit_propulsion
β”‚ β”‚ β”œβ”€β– β”€β”€space_launch_nasa_propulsion_astronaut ── Topic: 47
β”‚ β”‚ └─■──orbit_km_jupiter_probe_earth ── Topic: 86
β”‚ └─■──hst_mission_shuttle_orbit_arrays ── Topic: 60
└─drive_file_key_windows_use
β”œβ”€key_file_jpeg_encryption_image
β”‚ β”œβ”€key_encryption_clipper_chip_keys
β”‚ β”‚ β”œβ”€β– β”€β”€key_clipper_encryption_chip_keys ── Topic: 1
β”‚ β”‚ └─■──entry_file_ripem_entries_key ── Topic: 73
β”‚ └─jpeg_image_file_gif_images
β”‚ β”œβ”€motif_graphics_ftp_available_3d
β”‚ β”‚ β”œβ”€motif_graphics_openwindows_ftp_available
β”‚ β”‚ β”‚ β”œβ”€β– β”€β”€openwindows_motif_xview_windows_mouse ── Topic: 20
β”‚ β”‚ β”‚ └─■──graphics_widget_ray_3d_available ── Topic: 95
β”‚ β”‚ └─■──3d_machines_version_comments_contact ── Topic: 38
β”‚ └─jpeg_image_gif_images_format
β”‚ β”œβ”€β– β”€β”€gopher_ftp_files_stuffit_images ── Topic: 51
β”‚ └─■──jpeg_image_gif_format_images ── Topic: 13
└─drive_db_card_scsi_windows
β”œβ”€db_windows_dos_mov_os2
β”‚ β”œβ”€β– β”€β”€copy_protection_program_software_disk ── Topic: 64
β”‚ └─■──db_windows_dos_mov_os2 ── Topic: 8
└─drive_card_scsi_drives_ide
β”œβ”€drive_scsi_drives_ide_disk
β”‚ β”œβ”€β– β”€β”€drive_scsi_drives_ide_disk ── Topic: 6
β”‚ └─■──meg_sale_ram_drive_shipping ── Topic: 12
└─card_modem_monitor_video_drivers
β”œβ”€β– β”€β”€card_monitor_video_drivers_vga ── Topic: 5
└─■──modem_port_serial_irq_com ── Topic: 10
```
</details>
## **Visualize Hierarchical Documents**
We can extend the previous method by calculating the topic representation at different levels of the hierarchy and
plotting them on a 2D plane. To do so, we first need to calculate the hierarchical topics:
```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
# Prepare embeddings
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(docs, show_progress_bar=False)
# Train BERTopic and extract hierarchical topics
topic_model = BERTopic().fit(docs, embeddings)
hierarchical_topics = topic_model.hierarchical_topics(docs)
```
Then, we can visualize the hierarchical documents by either supplying it with our embeddings or by
reducing their dimensionality ourselves:
```python
# Run the visualization with the original embeddings
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings)
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings)
```
<iframe src="hierarchical_documents.html" style="width:1200px; height: 800px; border: 0px;""></iframe>
!!! note
The visualization above was generated with the additional parameter `hide_document_hover=True` which disables the
option to hover over the individual points and see the content of the documents. This makes the resulting visualization
smaller and fit into your RAM. However, it might be interesting to set `hide_document_hover=False` to hover
over the points and see the content of the documents.
## **Visualize Terms**
We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores
for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within
topics. Moreover, you can easily compare topic representations to each other.
To visualize this hierarchy, run the following:
```python
topic_model.visualize_barchart()
```
<iframe src="bar_chart.html" style="width:1100px; height: 660px; border: 0px;""></iframe>
## **Visualize Topic Similarity**
Having generated topic embeddings, through both c-TF-IDF and embeddings, we can create a similarity
matrix by simply applying cosine similarities through those topic embeddings. The result will be a
matrix indicating how similar certain topics are to each other.
To visualize the heatmap, run the following:
```python
topic_model.visualize_heatmap()
```
<iframe src="heatmap.html" style="width:1000px; height: 720px; border: 0px;""></iframe>
!!! note
You can set `n_clusters` in `visualize_heatmap` to order the topics by their similarity.
This will result in blocks being formed in the heatmap indicating which clusters of topics are
similar to each other. This step is very much recommended as it will make reading the heatmap easier.
## **Visualize Term Score Decline**
Topics are represented by a number of words starting with the best representative word.
Each word is represented by a c-TF-IDF score. The higher the score, the more representative a word
to the topic is. Since the topic words are sorted by their c-TF-IDF score, the scores slowly decline
with each word that is added. At some point adding words to the topic representation only marginally
increases the total c-TF-IDF score and would not be beneficial for its representation.
To visualize this effect, we can plot the c-TF-IDF scores for each topic by the term rank of each word.
In other words, the position of the words (term rank), where the words with
the highest c-TF-IDF score will have a rank of 1, will be put on the x-axis. Whereas the y-axis
will be populated by the c-TF-IDF scores. The result is a visualization that shows you the decline
of c-TF-IDF score when adding words to the topic representation. It allows you, using the elbow method,
the select the best number of words in a topic.
To visualize the c-TF-IDF score decline, run the following:
```python
topic_model.visualize_term_rank()
```
<iframe src="term_rank.html" style="width:1000px; height: 530px; border: 0px;""></iframe>
To enable the log scale on the y-axis for a better view of individual topics, run the following:
```python
topic_model.visualize_term_rank(log_scale=True)
```
<iframe src="term_rank_log.html" style="width:1000px; height: 530px; border: 0px;""></iframe>
This visualization was heavily inspired by the "Term Probability Decline" visualization found in an
analysis by the amazing [tmtoolkit](https://tmtoolkit.readthedocs.io/).
Reference to that specific analysis can be found
[here](https://wzbsocialsciencecenter.github.io/tm_corona/tm_analysis.html).
## **Visualize Topics over Time**
After creating topics over time with Dynamic Topic Modeling, we can visualize these topics by
leveraging the interactive abilities of Plotly. Plotly allows us to show the frequency
of topics over time whilst giving the option of hovering over the points to show the time-specific topic representations.
Simply call `.visualize_topics_over_time` with the newly created topics over time:
```python
import re
import pandas as pd
from bertopic import BERTopic
# Prepare data
trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()
# Create topics over time
model = BERTopic(verbose=True)
topics, probs = model.fit_transform(tweets)
topics_over_time = model.topics_over_time(tweets, timestamps)
```
Then, we visualize some interesting topics:
```python
model.visualize_topics_over_time(topics_over_time, topics=[9, 10, 72, 83, 87, 91])
```
<iframe src="trump.html" style="width:1000px; height: 680px; border: 0px;""></iframe>
## **Visualize Topics per Class**
You might want to extract and visualize the topic representation per class. For example, if you have
specific groups of users that might approach topics differently, then extracting them would help understanding
how these users talk about certain topics. In other words, this is simply creating a topic representation for
certain classes that you might have in your data.
First, we need to train our model:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# Prepare data and classes
data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = data["data"]
classes = [data["target_names"][i] for i in data["target"]]
# Create topic model and calculate topics per class
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topics_per_class = topic_model.topics_per_class(docs, classes=classes)
```
Then, we visualize the topic representation of major topics per class:
```python
topic_model.visualize_topics_per_class(topics_per_class)
```
<iframe src="topics_per_class.html" style="width:1400px; height: 1000px; border: 0px;""></iframe>
## **Visualize Probablities or Distribution**
We can generate the topic-document probability matrix by simply setting `calculate_probabilities=True` if a HDBSCAN model is used:
```python
from bertopic import BERTopic
topic_model = BERTopic(calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)
```
The resulting `probs` variable contains the soft-clustering as done through HDBSCAN.
If a non-HDBSCAN model is used, we can estimate the topic distributions after training our model:
```python
from bertopic import BERTopic
topic_model = BERTopic()
topics, _ = topic_model.fit_transform(docs)
topic_distr, _ = topic_model.approximate_distribution(docs, min_similarity=0)
```
Then, we either pass the `probs` or `topic_distr` variable to `.visualize_distribution` to visualize either the probability distributions or the topic distributions:
```python
# To visualize the probabilities of topic assignment
topic_model.visualize_distribution(probs[0])
# To visualize the topic distributions in a document
topic_model.visualize_distribution(topic_distr[0])
```
<iframe src="probabilities.html" style="width:1000px; height: 500px; border: 0px;""></iframe>
Although a topic distribution is nice, we may want to see how each token contributes to a specific topic. To do so, we need to first calculate topic distributions on a token level and then visualize the results:
```python
# Calculate the topic distributions on a token-level
topic_distr, topic_token_distr = topic_model.approximate_distribution(docs, calculate_tokens=True)
# Visualize the token-level distributions
df = topic_model.visualize_approximate_distribution(docs[1], topic_token_distr[1])
df
```
<br><br>
<img src="../distribution/distribution.png">
<br><br>
!!! note
To get the stylized dataframe for `.visualize_approximate_distribution` you will need to have Jinja installed. If you do not have this installed, an unstylized dataframe will be returned instead. You can install Jinja via `pip install jinja2`
!!! note
The distribution of the probabilities does not give an indication to
the distribution of the frequencies of topics across a document. It merely shows
how confident BERTopic is that certain topics can be found in a document.