Commit 43df3cc (verified), committed by Thang203
1 Parent(s): c456dd1

Add BERTopic model

Files changed (6)
  1. README.md +182 -0
  2. config.json +17 -0
  3. ctfidf.bin +3 -0
  4. ctfidf_config.json +0 -0
  5. topic_embeddings.bin +3 -0
  6. topics.json +0 -0
README.md ADDED
@@ -0,0 +1,182 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # topic_model_general_auto_april8
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("Thang203/topic_model_general_auto_april8")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 113
+ * Number of training documents: 6795
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | models - language - llms - language models - model | 10 | -1_models_language_llms_language models |
+ | 0 | visual - multimodal - image - images - video | 1955 | 0_visual_multimodal_image_images |
+ | 1 | reasoning - mathematical - cot - math - problems | 429 | 1_reasoning_mathematical_cot_math |
+ | 2 | students - education - chatgpt - student - ai | 315 | 2_students_education_chatgpt_student |
+ | 3 | medical - clinical - biomedical - healthcare - notes | 261 | 3_medical_clinical_biomedical_healthcare |
+ | 4 | translation - languages - machine translation - multilingual - machine | 215 | 4_translation_languages_machine translation_multilingual |
+ | 5 | code - code generation - generation - programming - python | 156 | 5_code_code generation_generation_programming |
+ | 6 | generation - story - text - text generation - gpt2 | 131 | 6_generation_story_text_text generation |
+ | 7 | rlhf - reward - alignment - preference - feedback | 85 | 7_rlhf_reward_alignment_preference |
+ | 8 | financial - sentiment - stock - market - investment | 78 | 8_financial_sentiment_stock_market |
+ | 9 | bias - gender - biases - gender bias - fairness | 77 | 9_bias_gender_biases_gender bias |
+ | 10 | summarization - summaries - abstractive - text summarization - summary | 77 | 10_summarization_summaries_abstractive_text summarization |
+ | 11 | emotion - emotional - empathetic - emotions - affective | 74 | 11_emotion_emotional_empathetic_emotions |
+ | 12 | radiology - medical - reports - radiology reports - image | 74 | 12_radiology_medical_reports_radiology reports |
+ | 13 | fewshot - zeroshot - learning - augmentation - data | 69 | 13_fewshot_zeroshot_learning_augmentation |
+ | 14 | game - games - agents - negotiation - llm agents | 69 | 14_game_games_agents_negotiation |
+ | 15 | dialogue - taskoriented - dialog - dialogue systems - systems | 68 | 15_dialogue_taskoriented_dialog_dialogue systems |
+ | 16 | text - detection - texts - aigenerated - detectors | 62 | 16_text_detection_texts_aigenerated |
+ | 17 | news - misinformation - fake - detection - fake news | 61 | 17_news_misinformation_fake_detection |
+ | 18 | quantization - quantized - weights - 4bit - memory | 61 | 18_quantization_quantized_weights_4bit |
+ | 19 | adversarial - attack - attacks - backdoor - adversarial examples | 60 | 19_adversarial_attack_attacks_backdoor |
+ | 20 | privacy - private - federated - privacypreserving - pii | 59 | 20_privacy_private_federated_privacypreserving |
+ | 21 | retrieval - ranking - rag - reranking - retrievalaugmented | 58 | 21_retrieval_ranking_rag_reranking |
+ | 22 | legal - patent - court - claim - law | 58 | 22_legal_patent_court_claim |
+ | 23 | code - software - developers - commit - code generation | 57 | 23_code_software_developers_commit |
+ | 24 | word - representations - negation - linguistic - sentence | 56 | 24_word_representations_negation_linguistic |
+ | 25 | recommendation - recommender - recommendations - recommender systems - user | 55 | 25_recommendation_recommender_recommendations_recommender systems |
+ | 26 | instruction - instruction tuning - tuning - instructions - data | 54 | 26_instruction_instruction tuning_tuning_instructions |
+ | 27 | pretraining - pretrained - seq2seq - tasks - masked | 54 | 27_pretraining_pretrained_seq2seq_tasks |
+ | 28 | vulnerability - vulnerabilities - security - code - smart | 54 | 28_vulnerability_vulnerabilities_security_code |
+ | 29 | transformer - transformers - layers - layer - attention | 48 | 29_transformer_transformers_layers_layer |
+ | 30 | jailbreak - attacks - jailbreaking - attack - safety | 44 | 30_jailbreak_attacks_jailbreaking_attack |
+ | 31 | ai - regulation - ethical - risk - regulatory | 43 | 31_ai_regulation_ethical_risk |
+ | 32 | materials - chemistry - chemical - molecular - materials science | 42 | 32_materials_chemistry_chemical_molecular |
+ | 33 | repair - bugs - bug - program repair - apr | 42 | 33_repair_bugs_bug_program repair |
+ | 34 | graph - graphs - graph reasoning - graph neural - graph data | 41 | 34_graph_graphs_graph reasoning_graph neural |
+ | 35 | speech - asr - speech recognition - audio - recognition | 41 | 35_speech_asr_speech recognition_audio |
+ | 36 | evaluation - nlg - metrics - human - text | 40 | 36_evaluation_nlg_metrics_human |
+ | 37 | personality - traits - personality traits - psychological - personas | 38 | 37_personality_traits_personality traits_psychological |
+ | 38 | agent - agents - language agents - environments - decisionmaking | 37 | 38_agent_agents_language agents_environments |
+ | 39 | texttosql - sql - database - spider - query | 36 | 39_texttosql_sql_database_spider |
+ | 40 | tom - cognitive - mind - theory mind - humans | 34 | 40_tom_cognitive_mind_theory mind |
+ | 41 | hate - hate speech - speech - offensive - hateful | 34 | 41_hate_hate speech_speech_offensive |
+ | 42 | question - qa - answering - question answering - questions | 34 | 42_question_qa_answering_question answering |
+ | 43 | incontext - icl - demonstrations - incontext learning - learning | 33 | 43_incontext_icl_demonstrations_incontext learning |
+ | 44 | navigation - robot - manipulation - embodied - robots | 33 | 44_navigation_robot_manipulation_embodied |
+ | 45 | hallucinations - hallucination - hallucination detection - detection - llms | 31 | 45_hallucinations_hallucination_hallucination detection_detection |
+ | 46 | commonsense - commonsense knowledge - knowledge - commonsense reasoning - commonsense question answering | 31 | 46_commonsense_commonsense knowledge_knowledge_commonsense reasoning |
+ | 47 | tool - tools - apis - api - tooluse | 31 | 47_tool_tools_apis_api |
+ | 48 | parallelism - training - distributed - distributed training - network | 30 | 48_parallelism_training_distributed_distributed training |
+ | 49 | brain - neural - gpt2 - circuit - attention | 30 | 49_brain_neural_gpt2_circuit |
+ | 50 | context - context window - window - length - extrapolation | 29 | 50_context_context window_window_length |
+ | 51 | knowledge - knowledge graph - kgs - wikidata - graph | 29 | 51_knowledge_knowledge graph_kgs_wikidata |
+ | 52 | chatbots - search - chatgpt - technology - chat | 28 | 52_chatbots_search_chatgpt_technology |
+ | 53 | cultural - political - opinions - values - survey | 28 | 53_cultural_political_opinions_values |
+ | 54 | sentiment - sentiment analysis - analysis - aspectbased - polarity | 28 | 54_sentiment_sentiment analysis_analysis_aspectbased |
+ | 55 | research - writing - ai - scientific - chatgpt | 28 | 55_research_writing_ai_scientific |
+ | 56 | music - musical - audio - lyrics - sounds | 28 | 56_music_musical_audio_lyrics |
+ | 57 | scaling - training - scaling laws - laws - emergent abilities | 28 | 57_scaling_training_scaling laws_laws |
+ | 58 | explanations - counterfactual - explanation - counterfactuals - natural language explanations | 27 | 58_explanations_counterfactual_explanation_counterfactuals |
+ | 59 | lora - lowrank - finetuning - adaptation - peft | 27 | 59_lora_lowrank_finetuning_adaptation |
+ | 60 | safety - unsafe - harmful - safety alignment - 2chat | 26 | 60_safety_unsafe_harmful_safety alignment |
+ | 61 | cybersecurity - cyber - security - genai - threat | 26 | 61_cybersecurity_cyber_security_genai |
+ | 62 | visualization - visualizations - data visualization - chart - natural language | 25 | 62_visualization_visualizations_data visualization_chart |
+ | 63 | attention - memory - matrix - linear - kv | 23 | 63_attention_memory_matrix_linear |
+ | 64 | correction - gec - grammatical - error - error correction | 23 | 64_correction_gec_grammatical_error |
+ | 65 | test - unit - tests - test generation - test cases | 22 | 65_test_unit_tests_test generation |
+ | 66 | entity - relation - ner - extraction - relation extraction | 22 | 66_entity_relation_ner_extraction |
+ | 67 | prompt - prompts - tuning - prompt tuning - optimization | 22 | 67_prompt_prompts_tuning_prompt tuning |
+ | 68 | distillation - teacher - student - kd - student model | 22 | 68_distillation_teacher_student_kd |
+ | 69 | pruning - sparsity - structured pruning - structured - weights | 21 | 69_pruning_sparsity_structured pruning_structured |
+ | 70 | hallucination - hallucinations - lvlms - mllms - visual | 21 | 70_hallucination_hallucinations_lvlms_mllms |
+ | 71 | ideas - creative - ai - creativity - fictional | 21 | 71_ideas_creative_ai_creativity |
+ | 72 | mental - mental health - health - depression - social media | 21 | 72_mental_mental health_health_depression |
+ | 73 | adversarial - vlms - attacks - attack - adversarial examples | 20 | 73_adversarial_vlms_attacks_attack |
+ | 74 | confidence - calibration - uncertainty - probabilities - confidence scores | 19 | 74_confidence_calibration_uncertainty_probabilities |
+ | 75 | crosslingual - multilingual - languages - english - transfer | 19 | 75_crosslingual_multilingual_languages_english |
+ | 76 | verilog - design - hardware - hardware design - rtl | 18 | 76_verilog_design_hardware_hardware design |
+ | 77 | intent - intent detection - slot - slot filling - detection | 17 | 77_intent_intent detection_slot_slot filling |
+ | 78 | arabic - hebrew - cultural - nlp - diacritization | 17 | 78_arabic_hebrew_cultural_nlp |
+ | 79 | watermarking - watermark - copyright - protection - ip | 16 | 79_watermarking_watermark_copyright_protection |
+ | 80 | robot - robots - dialogue - round - humanrobot | 16 | 80_robot_robots_dialogue_round |
+ | 81 | poetry - poems - poetry generation - lyrics - generation | 16 | 81_poetry_poems_poetry generation_lyrics |
+ | 82 | table - tabular - tables - tabular data - data | 16 | 82_table_tabular_tables_tabular data |
+ | 83 | spatial - geospatial - gis - geographic - location | 15 | 83_spatial_geospatial_gis_geographic |
+ | 84 | product - ecommerce - attribute - extraction - product descriptions | 15 | 84_product_ecommerce_attribute_extraction |
+ | 85 | geoscience - astronomy - scientific - astronomical - galactica | 15 | 85_geoscience_astronomy_scientific_astronomical |
+ | 86 | phishing - emails - phishing emails - email - phishing attacks | 15 | 86_phishing_emails_phishing emails_email |
+ | 87 | ai - generative ai - workers - generative - labor | 14 | 87_ai_generative ai_workers_generative |
+ | 88 | planning - robotic - robot - robogpt - task planning | 14 | 88_planning_robotic_robot_robogpt |
+ | 89 | mobile - wireless - edge - devices - aigc | 14 | 89_mobile_wireless_edge_devices |
+ | 90 | simplification - text simplification - sentence - text - readability | 14 | 90_simplification_text simplification_sentence_text |
+ | 91 | editing - knowledge editing - model editing - knowledge - editing methods | 14 | 91_editing_knowledge editing_model editing_knowledge |
+ | 92 | annotation - data annotation - metadata - annotators - data | 14 | 92_annotation_data annotation_metadata_annotators |
+ | 93 | gpu - hardware - communication - memory - accelerators | 14 | 93_gpu_hardware_communication_memory |
+ | 94 | argument - arguments - argumentation - fallacy - fallacies | 14 | 94_argument_arguments_argumentation_fallacy |
+ | 95 | toxicity - toxic - detoxification - content - toxic content | 14 | 95_toxicity_toxic_detoxification_content |
+ | 96 | causal - causal reasoning - causality - causal discovery - causal inference | 14 | 96_causal_causal reasoning_causality_causal discovery |
+ | 97 | design - bid - 3d - designs - generative | 14 | 97_design_bid_3d_designs |
+ | 98 | chinese - questions - subjects - school - ceval | 14 | 98_chinese_questions_subjects_school |
+ | 99 | scientific - papers - review - feedback - reviews | 13 | 99_scientific_papers_review_feedback |
+ | 100 | urban - traffic - transportation - foundation models - foundation | 13 | 100_urban_traffic_transportation_foundation models |
+ | 101 | humor - sarcasm - jokes - sarcasm detection - funny | 13 | 101_humor_sarcasm_jokes_sarcasm detection |
+ | 102 | analogical - analogies - analogy - analogical reasoning - metaphor | 12 | 102_analogical_analogies_analogy_analogical reasoning |
+ | 103 | public - early - sentiments - media - topics | 12 | 103_public_early_sentiments_media |
+ | 104 | optimizers - adam - deep - networks - training | 12 | 104_optimizers_adam_deep_networks |
+ | 105 | log - root - cloud - anomaly detection - anomaly | 12 | 105_log_root_cloud_anomaly detection |
+ | 106 | dialogue - norm - norms - conversations - persona | 12 | 106_dialogue_norm_norms_conversations |
+ | 107 | speculative - decoding - draft - speculative decoding - draft model | 11 | 107_speculative_decoding_draft_speculative decoding |
+ | 108 | protein - sequences - proteins - bioinformatics - protein sequence | 11 | 108_protein_sequences_proteins_bioinformatics |
+ | 109 | forgetting - catastrophic forgetting - catastrophic - continual - continual learning | 11 | 109_forgetting_catastrophic forgetting_catastrophic_continual |
+ | 110 | software - software engineering - software using - chatgpt - software testing | 11 | 110_software_software engineering_software using_chatgpt |
+ | 111 | verification - sva - configuration - proof - verified | 10 | 111_verification_sva_configuration_proof |
+
+ </details>
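
Note on the topic table above: each value in the Label column appears to be the topic ID followed by the first four keywords, joined with underscores. The sketch below splits such a label back into its parts. It is only an illustration: `parse_topic_label` is a hypothetical helper, not part of BERTopic, and it assumes that keywords never contain an underscore, which holds for every row in this table.

```python
from typing import List, Tuple


def parse_topic_label(label: str) -> Tuple[int, List[str]]:
    """Split a label such as '0_visual_multimodal_image_images' into (ID, keywords)."""
    # The first underscore-separated field is the topic ID (it may be -1);
    # the remaining fields are keywords, which may contain spaces.
    topic_id, *keywords = label.split("_")
    return int(topic_id), keywords


print(parse_topic_label("-1_models_language_llms_language models"))
# → (-1, ['models', 'language', 'llms', 'language models'])
```

The topic with ID -1 is the one BERTopic uses for outlier documents, so code that consumes these labels usually needs to accept a negative ID.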
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: False
+ * language: english
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: True
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 1.25.2
+ * HDBSCAN: 0.8.33
+ * UMAP: 0.5.6
+ * Pandas: 2.0.3
+ * Scikit-Learn: 1.2.2
+ * Sentence-transformers: 2.6.1
+ * Transformers: 4.38.2
+ * Numba: 0.58.1
+ * Plotly: 5.15.0
+ * Python: 3.10.12
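
Note on the framework versions above: when loading the model in a different environment, it can help to compare the installed packages against this list. The following is a minimal sketch using only the standard library. The PyPI distribution names (for example `umap-learn` for UMAP and `scikit-learn` for Scikit-Learn) are assumptions, since the README lists display names only, and `compare_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Versions as listed in the README; the distribution names are assumptions.
README_VERSIONS = {
    "numpy": "1.25.2",
    "hdbscan": "0.8.33",
    "umap-learn": "0.5.6",
    "pandas": "2.0.3",
    "scikit-learn": "1.2.2",
    "sentence-transformers": "2.6.1",
    "transformers": "4.38.2",
    "numba": "0.58.1",
    "plotly": "5.15.0",
}


def compare_versions(expected: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package to (expected version, installed version or None)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in compare_versions(README_VERSIONS).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: README {wanted}, installed {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; the report is only a starting point when results differ between environments.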
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "calculate_probabilities": false,
+   "language": "english",
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": true,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null,
+   "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
+ }
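
Note on `config.json`: the file stores `n_gram_range` as a JSON array, while the README lists it as the tuple `(1, 1)`, and it carries the embedding model name alongside the hyperparameters. The sketch below reads the configuration with the standard library and separates the two. `split_config` is a hypothetical helper, and whether the resulting dictionary can be passed unchanged to a `BERTopic(...)` constructor depends on the installed BERTopic version, so treat that as an assumption to verify.

```python
import json
from typing import Optional, Tuple

# Contents of config.json as added in this commit.
CONFIG_JSON = """
{
  "calculate_probabilities": false,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def split_config(config: dict) -> Tuple[dict, Optional[str]]:
    """Return (hyperparameters, embedding model name) from a parsed config."""
    params = dict(config)  # copy so the caller's dict is left untouched
    embedding_model = params.pop("embedding_model", None)
    # JSON has no tuple type, so the range arrives as a list.
    params["n_gram_range"] = tuple(params["n_gram_range"])
    return params, embedding_model


params, embedding_model = split_config(json.loads(CONFIG_JSON))
print(params["n_gram_range"], embedding_model)
# → (1, 1) sentence-transformers/all-MiniLM-L6-v2
```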
ctfidf.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8139319c9fae6ed30a23c9f6943ff70b620d7a15d1b4c1f2670d9ec2e2eb3316
+ size 6458883
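
Note on the `.bin` entries: the diff shows Git LFS pointer files, not the binaries themselves. Each pointer records the spec version, a SHA-256 object ID and the size in bytes. The sketch below parses such a pointer and checks downloaded bytes against it. Both helper names are hypothetical, and the parser assumes the simple `key value` line layout shown in this diff.

```python
import hashlib
from typing import Dict

# Pointer text for ctfidf.bin as shown in this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:8139319c9fae6ed30a23c9f6943ff70b620d7a15d1b4c1f2670d9ec2e2eb3316
size 6458883
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(data: bytes, pointer: Dict[str, str]) -> bool:
    """Check that data has the size and SHA-256 digest named in the pointer."""
    algorithm, _, digest = pointer["oid"].partition(":")
    return (
        algorithm == "sha256"
        and len(data) == int(pointer["size"])
        and hashlib.sha256(data).hexdigest() == digest
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["oid"], pointer["size"])
```

To verify a real download, one would read the file in binary mode and pass its bytes to `matches_pointer`; for large files, hashing in chunks would be the more memory-friendly variant.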
ctfidf_config.json ADDED
The diff for this file is too large to render.
topic_embeddings.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b759d94757ec4223d8a86dca08063fc8985666cf90f1b0961ea2ca41c962b2b6
+ size 174857
topics.json ADDED
The diff for this file is too large to render.