| The topics that you create can be hierarchically reduced. To understand the potential hierarchical | |
| structure of the topics, we can use `scipy.cluster.hierarchy` to cluster them and visualize how | |
| they relate to one another. This can help you select an appropriate `nr_topics` when reducing the number | |
| of topics that you have created. To visualize this hierarchy, run the following: | |
| ```python | |
| topic_model.visualize_hierarchy() | |
| ``` | |
| <iframe src="hierarchy.html" style="width:1000px; height: 680px; border: 0px;"></iframe> | |
| !!! note | |
| This is not the actual procedure of `.reduce_topics()` when `nr_topics` is set to | |
| `"auto"`, since HDBSCAN is then used to automatically extract topics. The visualization above does closely resemble | |
| the actual procedure of `.reduce_topics()` when `nr_topics` is set to a specific number. | |
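
To make the idea of cutting a hierarchy at a chosen number of topics more concrete, here is a minimal sketch that uses `scipy.cluster.hierarchy` directly. It is not BERTopic's internal code: the toy `topic_vectors` stand in for real topic representations, and the choice of Ward linkage is an assumption made for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy as sch

# Toy stand-in for topic vectors (assumed data): 6 "topics" in 3 loose groups
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
topic_vectors = np.vstack([c + 0.05 * rng.normal(size=(2, 3)) for c in centers])

# Build the hierarchy bottom-up; Ward linkage is an illustrative assumption
Z = sch.linkage(topic_vectors, method="ward")

# Cut the hierarchy so that at most 3 clusters remain,
# similar in spirit to picking `nr_topics=3`
labels = sch.fcluster(Z, t=3, criterion="maxclust")
n_clusters = len(set(labels))

print(labels, n_clusters)
```

Each row of `Z` records one merge, so inspecting where the merge distances jump is one way to judge how many topics the data naturally supports before choosing `nr_topics`.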
| ### **Hierarchical labels** | |
| Although visualizing this hierarchy gives us information about the structure, it would be helpful to see what happens | |
| to the topic representations when merging topics. To do so, we first need to calculate the representations of the | |
| hierarchical topics. We start by training a basic BERTopic model and then call `.hierarchical_topics`: | |
| ```python | |
| from bertopic import BERTopic | |
| from sklearn.datasets import fetch_20newsgroups | |
| docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))["data"] | |
| topic_model = BERTopic(verbose=True) | |
| topics, probs = topic_model.fit_transform(docs) | |
| hierarchical_topics = topic_model.hierarchical_topics(docs) | |
| ``` | |
| To visualize these results, we simply need to pass the resulting `hierarchical_topics` to our `.visualize_hierarchy` function: | |
| ```python | |
| topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics) | |
| ``` | |
| <iframe src="hierarchical_topics.html" style="width:1000px; height: 2150px; border: 0px;"></iframe> | |
| If you **hover** over the black circles, you will see the topic representation at that level of the hierarchy. These representations | |
| help you understand the effect of merging certain topics. Some might be logical to merge whilst others might not. Moreover, | |
| we can now see which sub-topics can be found within certain larger themes. | |
| ### **Text-based topic tree** | |
| Although this gives a nice overview of the potential hierarchy, hovering over all black circles can be tiresome. Instead, we can | |
| use `topic_model.get_topic_tree` to create a text-based representation of this hierarchy. Although the general structure is more difficult | |
| to view, we can see better which topics could be logically merged: | |
| ```python | |
| >>> tree = topic_model.get_topic_tree(hierarchical_topics) | |
| >>> print(tree) | |
| . | |
| └─atheists_atheism_god_moral_atheist | |
|      ├─atheists_atheism_god_atheist_argument | |
|      │    ├─■──atheists_atheism_god_atheist_argument ── Topic: 21 | |
|      │    └─■──br_god_exist_genetic_existence ── Topic: 124 | |
|      └─■──moral_morality_objective_immoral_morals ── Topic: 29 | |
| ``` | |
| <details> | |
| <summary>Click here to view the full tree.</summary> | |
| ```bash | |
| . | |
| ββpeople_armenian_said_god_armenians | |
| β ββgod_jesus_jehovah_lord_christ | |
| β β ββgod_jesus_jehovah_lord_christ | |
| β β β ββjehovah_lord_mormon_mcconkie_god | |
| β β β β βββ ββra_satan_thou_god_lucifer ββ Topic: 94 | |
| β β β β βββ ββjehovah_lord_mormon_mcconkie_unto ββ Topic: 78 | |
| β β β ββjesus_mary_god_hell_sin | |
| β β β ββjesus_hell_god_eternal_heaven | |
| β β β β ββhell_jesus_eternal_god_heaven | |
| β β β β β βββ ββjesus_tomb_disciples_resurrection_john ββ Topic: 69 | |
| β β β β β βββ ββhell_eternal_god_jesus_heaven ββ Topic: 53 | |
| β β β β βββ ββaaron_baptism_sin_law_god ββ Topic: 89 | |
| β β β βββ ββmary_sin_maria_priest_conception ββ Topic: 56 | |
| β β βββ ββmarriage_married_marry_ceremony_marriages ββ Topic: 110 | |
| β ββpeople_armenian_armenians_said_mr | |
| β ββpeople_armenian_armenians_said_israel | |
| β β ββgod_homosexual_homosexuality_atheists_sex | |
| β β β ββhomosexual_homosexuality_sex_gay_homosexuals | |
| β β β β βββ ββkinsey_sex_gay_men_sexual ββ Topic: 44 | |
| β β β β ββhomosexuality_homosexual_sin_homosexuals_gay | |
| β β β β βββ ββgay_homosexual_homosexuals_sexual_cramer ββ Topic: 50 | |
| β β β β βββ ββhomosexuality_homosexual_sin_paul_sex ββ Topic: 27 | |
| β β β ββgod_atheists_atheism_moral_atheist | |
| β β β ββislam_quran_judas_islamic_book | |
| β β β β βββ ββjim_context_challenges_articles_quote ββ Topic: 36 | |
| β β β β ββislam_quran_judas_islamic_book | |
| β β β β βββ ββislam_quran_islamic_rushdie_muslims ββ Topic: 31 | |
| β β β β βββ ββjudas_scripture_bible_books_greek ββ Topic: 33 | |
| β β β ββatheists_atheism_god_moral_atheist | |
| β β β ββatheists_atheism_god_atheist_argument | |
| β β β β βββ ββatheists_atheism_god_atheist_argument ββ Topic: 21 | |
| β β β β βββ ββbr_god_exist_genetic_existence ββ Topic: 124 | |
| β β β βββ ββmoral_morality_objective_immoral_morals ββ Topic: 29 | |
| β β ββarmenian_armenians_people_israel_said | |
| β β ββarmenian_armenians_israel_people_jews | |
| β β β ββtax_rights_government_income_taxes | |
| β β β β βββ ββrights_right_slavery_slaves_residence ββ Topic: 106 | |
| β β β β ββtax_government_taxes_income_libertarians | |
| β β β β βββ ββgovernment_libertarians_libertarian_regulation_party ββ Topic: 58 | |
| β β β β βββ ββtax_taxes_income_billion_deficit ββ Topic: 41 | |
| β β β ββarmenian_armenians_israel_people_jews | |
| β β β ββgun_guns_militia_firearms_amendment | |
| β β β β βββ ββblacks_penalty_death_cruel_punishment ββ Topic: 55 | |
| β β β β βββ ββgun_guns_militia_firearms_amendment ββ Topic: 7 | |
| β β β ββarmenian_armenians_israel_jews_turkish | |
| β β β βββ ββisrael_israeli_jews_arab_jewish ββ Topic: 4 | |
| β β β βββ ββarmenian_armenians_turkish_armenia_azerbaijan ββ Topic: 15 | |
| β β ββstephanopoulos_president_mr_myers_ms | |
| β β βββ ββserbs_muslims_stephanopoulos_mr_bosnia ββ Topic: 35 | |
| β β βββ ββmyers_stephanopoulos_president_ms_mr ββ Topic: 87 | |
| β ββbatf_fbi_koresh_compound_gas | |
| β βββ ββreno_workers_janet_clinton_waco ββ Topic: 77 | |
| β ββbatf_fbi_koresh_gas_compound | |
| β ββbatf_koresh_fbi_warrant_compound | |
| β β βββ ββbatf_warrant_raid_compound_fbi ββ Topic: 42 | |
| β β βββ ββkoresh_batf_fbi_children_compound ββ Topic: 61 | |
| β βββ ββfbi_gas_tear_bds_building ββ Topic: 23 | |
| ββuse_like_just_dont_new | |
| ββgame_team_year_games_like | |
| β ββgame_team_games_25_year | |
| β β ββgame_team_games_25_season | |
| β β β ββwindow_printer_use_problem_mhz | |
| β β β β ββmhz_wire_simms_wiring_battery | |
| β β β β β ββsimms_mhz_battery_cpu_heat | |
| β β β β β β ββsimms_pds_simm_vram_lc | |
| β β β β β β β βββ ββpds_nubus_lc_slot_card ββ Topic: 119 | |
| β β β β β β β βββ ββsimms_simm_vram_meg_dram ββ Topic: 32 | |
| β β β β β β ββmhz_battery_cpu_heat_speed | |
| β β β β β β ββmhz_cpu_speed_heat_fan | |
| β β β β β β β ββmhz_cpu_speed_heat_fan | |
| β β β β β β β β βββ ββfan_cpu_heat_sink_fans ββ Topic: 92 | |
| β β β β β β β β βββ ββmhz_speed_cpu_fpu_clock ββ Topic: 22 | |
| β β β β β β β βββ ββmonitor_turn_power_computer_electricity ββ Topic: 91 | |
| β β β β β β ββbattery_batteries_concrete_duo_discharge | |
| β β β β β β βββ ββduo_battery_apple_230_problem ββ Topic: 121 | |
| β β β β β β βββ ββbattery_batteries_concrete_discharge_temperature ββ Topic: 75 | |
| β β β β β ββwire_wiring_ground_neutral_outlets | |
| β β β β β ββwire_wiring_ground_neutral_outlets | |
| β β β β β β ββwire_wiring_ground_neutral_outlets | |
| β β β β β β β βββ ββleds_uv_blue_light_boards ββ Topic: 66 | |
| β β β β β β β βββ ββwire_wiring_ground_neutral_outlets ββ Topic: 120 | |
| β β β β β β ββscope_scopes_phone_dial_number | |
| β β β β β β βββ ββdial_number_phone_line_output ββ Topic: 93 | |
| β β β β β β βββ ββscope_scopes_motorola_generator_oscilloscope ββ Topic: 113 | |
| β β β β β ββcelp_dsp_sampling_antenna_digital | |
| β β β β β βββ ββantenna_antennas_receiver_cable_transmitter ββ Topic: 70 | |
| β β β β β βββ ββcelp_dsp_sampling_speech_voice ββ Topic: 52 | |
| β β β β ββwindow_printer_xv_mouse_windows | |
| β β β β ββwindow_xv_error_widget_problem | |
| β β β β β ββerror_symbol_undefined_xterm_rx | |
| β β β β β β βββ ββsymbol_error_undefined_doug_parse ββ Topic: 63 | |
| β β β β β β βββ ββrx_remote_server_xdm_xterm ββ Topic: 45 | |
| β β β β β ββwindow_xv_widget_application_expose | |
| β β β β β ββwindow_widget_expose_application_event | |
| β β β β β β βββ ββgc_mydisplay_draw_gxxor_drawing ββ Topic: 103 | |
| β β β β β β βββ ββwindow_widget_application_expose_event ββ Topic: 25 | |
| β β β β β ββxv_den_polygon_points_algorithm | |
| β β β β β βββ ββden_polygon_points_algorithm_polygons ββ Topic: 28 | |
| β β β β β βββ ββxv_24bit_image_bit_images ββ Topic: 57 | |
| β β β β ββprinter_fonts_print_mouse_postscript | |
| β β β β ββprinter_fonts_print_font_deskjet | |
| β β β β β βββ ββscanner_logitech_grayscale_ocr_scanman ββ Topic: 108 | |
| β β β β β ββprinter_fonts_print_font_deskjet | |
| β β β β β βββ ββprinter_print_deskjet_hp_ink ββ Topic: 18 | |
| β β β β β βββ ββfonts_font_truetype_tt_atm ββ Topic: 49 | |
| β β β β ββmouse_ghostscript_midi_driver_postscript | |
| β β β β ββghostscript_midi_postscript_files_file | |
| β β β β β βββ ββghostscript_postscript_pageview_ghostview_dsc ββ Topic: 104 | |
| β β β β β ββmidi_sound_file_windows_driver | |
| β β β β β βββ ββlocation_mar_file_host_rwrr ββ Topic: 83 | |
| β β β β β βββ ββmidi_sound_driver_blaster_soundblaster ββ Topic: 98 | |
| β β β β βββ ββmouse_driver_mice_ball_problem ββ Topic: 68 | |
| β β β ββgame_team_games_25_season | |
| β β β ββ1st_sale_condition_comics_hulk | |
| β β β β ββsale_condition_offer_asking_cd | |
| β β β β β ββcondition_stereo_amp_speakers_asking | |
| β β β β β β βββ ββmiles_car_amfm_toyota_cassette ββ Topic: 62 | |
| β β β β β β βββ ββamp_speakers_condition_stereo_audio ββ Topic: 24 | |
| β β β β β ββgames_sale_pom_cds_shipping | |
| β β β β β ββpom_cds_sale_shipping_cd | |
| β β β β β β βββ ββsize_shipping_sale_condition_mattress ββ Topic: 100 | |
| β β β β β β βββ ββpom_cds_cd_sale_picture ββ Topic: 37 | |
| β β β β β βββ ββgames_game_snes_sega_genesis ββ Topic: 40 | |
| β β β β ββ1st_hulk_comics_art_appears | |
| β β β β ββ1st_hulk_comics_art_appears | |
| β β β β β ββlens_tape_camera_backup_lenses | |
| β β β β β β βββ ββtape_backup_tapes_drive_4mm ββ Topic: 107 | |
| β β β β β β βββ ββlens_camera_lenses_zoom_pouch ββ Topic: 114 | |
| β β β β β ββ1st_hulk_comics_art_appears | |
| β β β β β βββ ββ1st_hulk_comics_art_appears ββ Topic: 105 | |
| β β β β β βββ ββbooks_book_cover_trek_chemistry ββ Topic: 125 | |
| β β β β ββtickets_hotel_ticket_voucher_package | |
| β β β β βββ ββhotel_voucher_package_vacation_room ββ Topic: 74 | |
| β β β β βββ ββtickets_ticket_june_airlines_july ββ Topic: 84 | |
| β β β ββgame_team_games_season_hockey | |
| β β β ββgame_hockey_team_25_550 | |
| β β β β βββ ββespn_pt_pts_game_la ββ Topic: 17 | |
| β β β β βββ ββteam_25_game_hockey_550 ββ Topic: 2 | |
| β β β βββ ββyear_game_hit_baseball_players ββ Topic: 0 | |
| β β ββbike_car_greek_insurance_msg | |
| β β ββcar_bike_insurance_cars_engine | |
| β β β ββcar_insurance_cars_radar_engine | |
| β β β β ββinsurance_health_private_care_canada | |
| β β β β β βββ ββinsurance_health_private_care_canada ββ Topic: 99 | |
| β β β β β βββ ββinsurance_car_accident_rates_sue ββ Topic: 82 | |
| β β β β ββcar_cars_radar_engine_detector | |
| β β β β ββcar_radar_cars_detector_engine | |
| β β β β β βββ ββradar_detector_detectors_ka_alarm ββ Topic: 39 | |
| β β β β β ββcar_cars_mustang_ford_engine | |
| β β β β β βββ ββclutch_shift_shifting_transmission_gear ββ Topic: 88 | |
| β β β β β βββ ββcar_cars_mustang_ford_v8 ββ Topic: 14 | |
| β β β β ββoil_diesel_odometer_diesels_car | |
| β β β β ββodometer_oil_sensor_car_drain | |
| β β β β β βββ ββodometer_sensor_speedo_gauge_mileage ββ Topic: 96 | |
| β β β β β βββ ββoil_drain_car_leaks_taillights ββ Topic: 102 | |
| β β β β βββ ββdiesel_diesels_emissions_fuel_oil ββ Topic: 79 | |
| β β β ββbike_riding_ride_bikes_motorcycle | |
| β β β ββbike_ride_riding_bikes_lane | |
| β β β β βββ ββbike_ride_riding_lane_car ββ Topic: 11 | |
| β β β β βββ ββbike_bikes_miles_honda_motorcycle ββ Topic: 19 | |
| β β β βββ ββcountersteering_bike_motorcycle_rear_shaft ββ Topic: 46 | |
| β β ββgreek_msg_kuwait_greece_water | |
| β β ββgreek_msg_kuwait_greece_water | |
| β β β ββgreek_msg_kuwait_greece_dog | |
| β β β β ββgreek_msg_kuwait_greece_dog | |
| β β β β β ββgreek_kuwait_greece_turkish_greeks | |
| β β β β β β βββ ββgreek_greece_turkish_greeks_cyprus ββ Topic: 71 | |
| β β β β β β βββ ββkuwait_iraq_iran_gulf_arabia ββ Topic: 76 | |
| β β β β β ββmsg_dog_drugs_drug_food | |
| β β β β β ββdog_dogs_cooper_trial_weaver | |
| β β β β β β βββ ββclinton_bush_quayle_reagan_panicking ββ Topic: 101 | |
| β β β β β β ββdog_dogs_cooper_trial_weaver | |
| β β β β β β βββ ββcooper_trial_weaver_spence_witnesses ββ Topic: 90 | |
| β β β β β β βββ ββdog_dogs_bike_trained_springer ββ Topic: 67 | |
| β β β β β ββmsg_drugs_drug_food_chinese | |
| β β β β β βββ ββmsg_food_chinese_foods_taste ββ Topic: 30 | |
| β β β β β βββ ββdrugs_drug_marijuana_cocaine_alcohol ββ Topic: 72 | |
| β β β β ββwater_theory_universe_science_larsons | |
| β β β β ββwater_nuclear_cooling_steam_dept | |
| β β β β β βββ ββrocketry_rockets_engines_nuclear_plutonium ββ Topic: 115 | |
| β β β β β ββwater_cooling_steam_dept_plants | |
| β β β β β βββ ββwater_dept_phd_environmental_atmospheric ββ Topic: 97 | |
| β β β β β βββ ββcooling_water_steam_towers_plants ββ Topic: 109 | |
| β β β β ββtheory_universe_larsons_larson_science | |
| β β β β βββ ββtheory_universe_larsons_larson_science ββ Topic: 54 | |
| β β β β βββ ββoort_cloud_grbs_gamma_burst ββ Topic: 80 | |
| β β β ββhelmet_kirlian_photography_lock_wax | |
| β β β ββhelmet_kirlian_photography_leaf_mask | |
| β β β β ββkirlian_photography_leaf_pictures_deleted | |
| β β β β β ββdeleted_joke_stuff_maddi_nickname | |
| β β β β β β βββ ββjoke_maddi_nickname_nicknames_frank ββ Topic: 43 | |
| β β β β β β βββ ββdeleted_stuff_bookstore_joke_motto ββ Topic: 81 | |
| β β β β β βββ ββkirlian_photography_leaf_pictures_aura ββ Topic: 85 | |
| β β β β ββhelmet_mask_liner_foam_cb | |
| β β β β βββ ββhelmet_liner_foam_cb_helmets ββ Topic: 112 | |
| β β β β βββ ββmask_goalies_77_santore_tl ββ Topic: 123 | |
| β β β ββlock_wax_paint_plastic_ear | |
| β β β βββ ββlock_cable_locks_bike_600 ββ Topic: 117 | |
| β β β ββwax_paint_ear_plastic_skin | |
| β β β βββ ββwax_paint_plastic_scratches_solvent ββ Topic: 65 | |
| β β β βββ ββear_wax_skin_greasy_acne ββ Topic: 116 | |
| β β ββm4_mp_14_mw_mo | |
| β β ββm4_mp_14_mw_mo | |
| β β β βββ ββm4_mp_14_mw_mo ββ Topic: 111 | |
| β β β βββ ββtest_ensign_nameless_deane_deanebinahccbrandeisedu ββ Topic: 118 | |
| β β βββ ββites_cheek_hello_hi_ken ββ Topic: 3 | |
| β ββspace_medical_health_disease_cancer | |
| β ββmedical_health_disease_cancer_patients | |
| β β βββ ββcancer_centers_center_medical_research ββ Topic: 122 | |
| β β ββhealth_medical_disease_patients_hiv | |
| β β ββpatients_medical_disease_candida_health | |
| β β β βββ ββcandida_yeast_infection_gonorrhea_infections ββ Topic: 48 | |
| β β β ββpatients_disease_cancer_medical_doctor | |
| β β β βββ ββhiv_medical_cancer_patients_doctor ββ Topic: 34 | |
| β β β βββ ββpain_drug_patients_disease_diet ββ Topic: 26 | |
| β β βββ ββhealth_newsgroup_tobacco_vote_votes ββ Topic: 9 | |
| β ββspace_launch_nasa_shuttle_orbit | |
| β ββspace_moon_station_nasa_launch | |
| β β βββ ββsky_advertising_billboard_billboards_space ββ Topic: 59 | |
| β β βββ ββspace_station_moon_redesign_nasa ββ Topic: 16 | |
| β ββspace_mission_hst_launch_orbit | |
| β ββspace_launch_nasa_orbit_propulsion | |
| β β βββ ββspace_launch_nasa_propulsion_astronaut ββ Topic: 47 | |
| β β βββ ββorbit_km_jupiter_probe_earth ββ Topic: 86 | |
| β βββ ββhst_mission_shuttle_orbit_arrays ββ Topic: 60 | |
| ββdrive_file_key_windows_use | |
| ββkey_file_jpeg_encryption_image | |
| β ββkey_encryption_clipper_chip_keys | |
| β β βββ ββkey_clipper_encryption_chip_keys ββ Topic: 1 | |
| β β βββ ββentry_file_ripem_entries_key ββ Topic: 73 | |
| β ββjpeg_image_file_gif_images | |
| β ββmotif_graphics_ftp_available_3d | |
| β β ββmotif_graphics_openwindows_ftp_available | |
| β β β βββ ββopenwindows_motif_xview_windows_mouse ββ Topic: 20 | |
| β β β βββ ββgraphics_widget_ray_3d_available ββ Topic: 95 | |
| β β βββ ββ3d_machines_version_comments_contact ββ Topic: 38 | |
| β ββjpeg_image_gif_images_format | |
| β βββ ββgopher_ftp_files_stuffit_images ββ Topic: 51 | |
| β βββ ββjpeg_image_gif_format_images ββ Topic: 13 | |
| ββdrive_db_card_scsi_windows | |
| ββdb_windows_dos_mov_os2 | |
| β βββ ββcopy_protection_program_software_disk ββ Topic: 64 | |
| β βββ ββdb_windows_dos_mov_os2 ββ Topic: 8 | |
| ββdrive_card_scsi_drives_ide | |
| ββdrive_scsi_drives_ide_disk | |
| β βββ ββdrive_scsi_drives_ide_disk ββ Topic: 6 | |
| β βββ ββmeg_sale_ram_drive_shipping ββ Topic: 12 | |
| ββcard_modem_monitor_video_drivers | |
| βββ ββcard_monitor_video_drivers_vga ββ Topic: 5 | |
| βββ ββmodem_port_serial_irq_com ββ Topic: 10 | |
| ``` | |
| </details> | |
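
The tree is a rendering of a sequence of pairwise merges. As a rough sketch of how to read such merges programmatically, the snippet below walks a toy list of merge records for the `atheists` subtree shown earlier and collects the original topics under a parent. The dictionary keys imitate the columns I would expect in `hierarchical_topics` (an assumption; check `hierarchical_topics.columns` in your version), the parent IDs `126` and `127` are made up, and `leaves_under` is a hypothetical helper, not part of BERTopic.

```python
# Toy merge records (assumed layout, hypothetical parent IDs)
merges = [
    {"Parent_ID": "127", "Parent_Name": "atheists_atheism_god_moral_atheist",
     "Child_Left_ID": "29", "Child_Right_ID": "126"},
    {"Parent_ID": "126", "Parent_Name": "atheists_atheism_god_atheist_argument",
     "Child_Left_ID": "124", "Child_Right_ID": "21"},
]


def leaves_under(parent_id, merges):
    """Return the sorted original topic IDs that were merged into `parent_id`."""
    by_parent = {m["Parent_ID"]: m for m in merges}
    stack, leaves = [parent_id], []
    while stack:
        node = stack.pop()
        if node in by_parent:
            # Internal node: descend into both children
            row = by_parent[node]
            stack.extend([row["Child_Left_ID"], row["Child_Right_ID"]])
        else:
            # Not a parent of any merge, so it is an original topic
            leaves.append(int(node))
    return sorted(leaves)


merged_topics = leaves_under("127", merges)
print(merged_topics)
```

With a real model, records in this shape could be obtained with something like `hierarchical_topics.to_dict("records")`, provided the column names match. The resulting list of topic IDs is a natural candidate for merging if the grouping looks sensible.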
| ## **Visualize Hierarchical Documents** | |
| We can extend the previous method by calculating the topic representation at different levels of the hierarchy and | |
| plotting them on a 2D plane. To do so, we first need to calculate the hierarchical topics: | |
| ```python | |
| from sklearn.datasets import fetch_20newsgroups | |
| from sentence_transformers import SentenceTransformer | |
| from bertopic import BERTopic | |
| from umap import UMAP | |
| # Prepare embeddings | |
| docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data'] | |
| sentence_model = SentenceTransformer("all-MiniLM-L6-v2") | |
| embeddings = sentence_model.encode(docs, show_progress_bar=False) | |
| # Train BERTopic and extract hierarchical topics | |
| topic_model = BERTopic().fit(docs, embeddings) | |
| hierarchical_topics = topic_model.hierarchical_topics(docs) | |
| ``` | |
| Then, we can visualize the hierarchical documents either by passing our original embeddings or by | |
| reducing their dimensionality ourselves first: | |
| ```python | |
| # Run the visualization with the original embeddings | |
| topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, embeddings=embeddings) | |
| # Optional: reduce the embeddings to 2D yourself, which is much faster when re-running the visualization | |
| reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings) | |
| topic_model.visualize_hierarchical_documents(docs, hierarchical_topics, reduced_embeddings=reduced_embeddings) | |
| ``` | |
| <iframe src="hierarchical_documents.html" style="width:1200px; height: 800px; border: 0px;"></iframe> | |
| !!! note | |
| The visualization above was generated with the additional parameter `hide_document_hover=True`, which disables the | |
| option to hover over the individual points and see the content of the documents. This keeps the resulting visualization | |
| smaller so that it fits into your RAM. However, it might be interesting to set `hide_document_hover=False` to hover | |
| over the points and see the content of the documents. | |