Add BERTopic model

Files changed (6)

README.md +182 -0
config.json +17 -0
ctfidf.bin +3 -0
ctfidf_config.json +0 -0
topic_embeddings.bin +3 -0
topics.json +0 -0

README.md ADDED

	@@ -0,0 +1,182 @@

+---
+tags:
+- bertopic
+library_name: bertopic
+pipeline_tag: text-classification
+---
+# topic_model_general_auto_april8
+This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+BERTopic is a flexible, modular topic modeling framework that generates easily interpretable topics from large datasets.
+## Usage
+To use this model, please install BERTopic:
+```bash
+pip install -U bertopic
+```
+You can use the model as follows:
+```python
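+from bertopic import BERTopic
+
+# Download the model from the Hugging Face Hub and load it.
+topic_model = BERTopic.load("Thang203/topic_model_general_auto_april8")
+# Return a DataFrame with one row per topic.
+topic_model.get_topic_info()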
+```
+## Topic overview
+* Number of topics: 113
+* Number of training documents: 6795
+<details>
+  <summary>Click here for an overview of all topics.</summary>
+
+| Topic ID | Topic Keywords | Topic Frequency | Label |
+|----------|----------------|-----------------|-------|
+| -1 | models - language - llms - language models - model | 10 | -1_models_language_llms_language models |
+| 0 | visual - multimodal - image - images - video | 1955 | 0_visual_multimodal_image_images |
+| 1 | reasoning - mathematical - cot - math - problems | 429 | 1_reasoning_mathematical_cot_math |
+| 2 | students - education - chatgpt - student - ai | 315 | 2_students_education_chatgpt_student |
+| 3 | medical - clinical - biomedical - healthcare - notes | 261 | 3_medical_clinical_biomedical_healthcare |
+| 4 | translation - languages - machine translation - multilingual - machine | 215 | 4_translation_languages_machine translation_multilingual |
+| 5 | code - code generation - generation - programming - python | 156 | 5_code_code generation_generation_programming |
+| 6 | generation - story - text - text generation - gpt2 | 131 | 6_generation_story_text_text generation |
+| 7 | rlhf - reward - alignment - preference - feedback | 85 | 7_rlhf_reward_alignment_preference |
+| 8 | financial - sentiment - stock - market - investment | 78 | 8_financial_sentiment_stock_market |
+| 9 | bias - gender - biases - gender bias - fairness | 77 | 9_bias_gender_biases_gender bias |
+| 10 | summarization - summaries - abstractive - text summarization - summary | 77 | 10_summarization_summaries_abstractive_text summarization |
+| 11 | emotion - emotional - empathetic - emotions - affective | 74 | 11_emotion_emotional_empathetic_emotions |
+| 12 | radiology - medical - reports - radiology reports - image | 74 | 12_radiology_medical_reports_radiology reports |
+| 13 | fewshot - zeroshot - learning - augmentation - data | 69 | 13_fewshot_zeroshot_learning_augmentation |
+| 14 | game - games - agents - negotiation - llm agents | 69 | 14_game_games_agents_negotiation |
+| 15 | dialogue - taskoriented - dialog - dialogue systems - systems | 68 | 15_dialogue_taskoriented_dialog_dialogue systems |
+| 16 | text - detection - texts - aigenerated - detectors | 62 | 16_text_detection_texts_aigenerated |
+| 17 | news - misinformation - fake - detection - fake news | 61 | 17_news_misinformation_fake_detection |
+| 18 | quantization - quantized - weights - 4bit - memory | 61 | 18_quantization_quantized_weights_4bit |
+| 19 | adversarial - attack - attacks - backdoor - adversarial examples | 60 | 19_adversarial_attack_attacks_backdoor |
+| 20 | privacy - private - federated - privacypreserving - pii | 59 | 20_privacy_private_federated_privacypreserving |
+| 21 | retrieval - ranking - rag - reranking - retrievalaugmented | 58 | 21_retrieval_ranking_rag_reranking |
+| 22 | legal - patent - court - claim - law | 58 | 22_legal_patent_court_claim |
+| 23 | code - software - developers - commit - code generation | 57 | 23_code_software_developers_commit |
+| 24 | word - representations - negation - linguistic - sentence | 56 | 24_word_representations_negation_linguistic |
+| 25 | recommendation - recommender - recommendations - recommender systems - user | 55 | 25_recommendation_recommender_recommendations_recommender systems |
+| 26 | instruction - instruction tuning - tuning - instructions - data | 54 | 26_instruction_instruction tuning_tuning_instructions |
+| 27 | pretraining - pretrained - seq2seq - tasks - masked | 54 | 27_pretraining_pretrained_seq2seq_tasks |
+| 28 | vulnerability - vulnerabilities - security - code - smart | 54 | 28_vulnerability_vulnerabilities_security_code |
+| 29 | transformer - transformers - layers - layer - attention | 48 | 29_transformer_transformers_layers_layer |
+| 30 | jailbreak - attacks - jailbreaking - attack - safety | 44 | 30_jailbreak_attacks_jailbreaking_attack |
+| 31 | ai - regulation - ethical - risk - regulatory | 43 | 31_ai_regulation_ethical_risk |
+| 32 | materials - chemistry - chemical - molecular - materials science | 42 | 32_materials_chemistry_chemical_molecular |
+| 33 | repair - bugs - bug - program repair - apr | 42 | 33_repair_bugs_bug_program repair |
+| 34 | graph - graphs - graph reasoning - graph neural - graph data | 41 | 34_graph_graphs_graph reasoning_graph neural |
+| 35 | speech - asr - speech recognition - audio - recognition | 41 | 35_speech_asr_speech recognition_audio |
+| 36 | evaluation - nlg - metrics - human - text | 40 | 36_evaluation_nlg_metrics_human |
+| 37 | personality - traits - personality traits - psychological - personas | 38 | 37_personality_traits_personality traits_psychological |
+| 38 | agent - agents - language agents - environments - decisionmaking | 37 | 38_agent_agents_language agents_environments |
+| 39 | texttosql - sql - database - spider - query | 36 | 39_texttosql_sql_database_spider |
+| 40 | tom - cognitive - mind - theory mind - humans | 34 | 40_tom_cognitive_mind_theory mind |
+| 41 | hate - hate speech - speech - offensive - hateful | 34 | 41_hate_hate speech_speech_offensive |
+| 42 | question - qa - answering - question answering - questions | 34 | 42_question_qa_answering_question answering |
+| 43 | incontext - icl - demonstrations - incontext learning - learning | 33 | 43_incontext_icl_demonstrations_incontext learning |
+| 44 | navigation - robot - manipulation - embodied - robots | 33 | 44_navigation_robot_manipulation_embodied |
+| 45 | hallucinations - hallucination - hallucination detection - detection - llms | 31 | 45_hallucinations_hallucination_hallucination detection_detection |
+| 46 | commonsense - commonsense knowledge - knowledge - commonsense reasoning - commonsense question answering | 31 | 46_commonsense_commonsense knowledge_knowledge_commonsense reasoning |
+| 47 | tool - tools - apis - api - tooluse | 31 | 47_tool_tools_apis_api |
+| 48 | parallelism - training - distributed - distributed training - network | 30 | 48_parallelism_training_distributed_distributed training |
+| 49 | brain - neural - gpt2 - circuit - attention | 30 | 49_brain_neural_gpt2_circuit |
+| 50 | context - context window - window - length - extrapolation | 29 | 50_context_context window_window_length |
+| 51 | knowledge - knowledge graph - kgs - wikidata - graph | 29 | 51_knowledge_knowledge graph_kgs_wikidata |
+| 52 | chatbots - search - chatgpt - technology - chat | 28 | 52_chatbots_search_chatgpt_technology |
+| 53 | cultural - political - opinions - values - survey | 28 | 53_cultural_political_opinions_values |
+| 54 | sentiment - sentiment analysis - analysis - aspectbased - polarity | 28 | 54_sentiment_sentiment analysis_analysis_aspectbased |
+| 55 | research - writing - ai - scientific - chatgpt | 28 | 55_research_writing_ai_scientific |
+| 56 | music - musical - audio - lyrics - sounds | 28 | 56_music_musical_audio_lyrics |
+| 57 | scaling - training - scaling laws - laws - emergent abilities | 28 | 57_scaling_training_scaling laws_laws |
+| 58 | explanations - counterfactual - explanation - counterfactuals - natural language explanations | 27 | 58_explanations_counterfactual_explanation_counterfactuals |
+| 59 | lora - lowrank - finetuning - adaptation - peft | 27 | 59_lora_lowrank_finetuning_adaptation |
+| 60 | safety - unsafe - harmful - safety alignment - 2chat | 26 | 60_safety_unsafe_harmful_safety alignment |
+| 61 | cybersecurity - cyber - security - genai - threat | 26 | 61_cybersecurity_cyber_security_genai |
+| 62 | visualization - visualizations - data visualization - chart - natural language | 25 | 62_visualization_visualizations_data visualization_chart |
+| 63 | attention - memory - matrix - linear - kv | 23 | 63_attention_memory_matrix_linear |
+| 64 | correction - gec - grammatical - error - error correction | 23 | 64_correction_gec_grammatical_error |
+| 65 | test - unit - tests - test generation - test cases | 22 | 65_test_unit_tests_test generation |
+| 66 | entity - relation - ner - extraction - relation extraction | 22 | 66_entity_relation_ner_extraction |
+| 67 | prompt - prompts - tuning - prompt tuning - optimization | 22 | 67_prompt_prompts_tuning_prompt tuning |
+| 68 | distillation - teacher - student - kd - student model | 22 | 68_distillation_teacher_student_kd |
+| 69 | pruning - sparsity - structured pruning - structured - weights | 21 | 69_pruning_sparsity_structured pruning_structured |
+| 70 | hallucination - hallucinations - lvlms - mllms - visual | 21 | 70_hallucination_hallucinations_lvlms_mllms |
+| 71 | ideas - creative - ai - creativity - fictional | 21 | 71_ideas_creative_ai_creativity |
+| 72 | mental - mental health - health - depression - social media | 21 | 72_mental_mental health_health_depression |
+| 73 | adversarial - vlms - attacks - attack - adversarial examples | 20 | 73_adversarial_vlms_attacks_attack |
+| 74 | confidence - calibration - uncertainty - probabilities - confidence scores | 19 | 74_confidence_calibration_uncertainty_probabilities |
+| 75 | crosslingual - multilingual - languages - english - transfer | 19 | 75_crosslingual_multilingual_languages_english |
+| 76 | verilog - design - hardware - hardware design - rtl | 18 | 76_verilog_design_hardware_hardware design |
+| 77 | intent - intent detection - slot - slot filling - detection | 17 | 77_intent_intent detection_slot_slot filling |
+| 78 | arabic - hebrew - cultural - nlp - diacritization | 17 | 78_arabic_hebrew_cultural_nlp |
+| 79 | watermarking - watermark - copyright - protection - ip | 16 | 79_watermarking_watermark_copyright_protection |
+| 80 | robot - robots - dialogue - round - humanrobot | 16 | 80_robot_robots_dialogue_round |
+| 81 | poetry - poems - poetry generation - lyrics - generation | 16 | 81_poetry_poems_poetry generation_lyrics |
+| 82 | table - tabular - tables - tabular data - data | 16 | 82_table_tabular_tables_tabular data |
+| 83 | spatial - geospatial - gis - geographic - location | 15 | 83_spatial_geospatial_gis_geographic |
+| 84 | product - ecommerce - attribute - extraction - product descriptions | 15 | 84_product_ecommerce_attribute_extraction |
+| 85 | geoscience - astronomy - scientific - astronomical - galactica | 15 | 85_geoscience_astronomy_scientific_astronomical |
+| 86 | phishing - emails - phishing emails - email - phishing attacks | 15 | 86_phishing_emails_phishing emails_email |
+| 87 | ai - generative ai - workers - generative - labor | 14 | 87_ai_generative ai_workers_generative |
+| 88 | planning - robotic - robot - robogpt - task planning | 14 | 88_planning_robotic_robot_robogpt |
+| 89 | mobile - wireless - edge - devices - aigc | 14 | 89_mobile_wireless_edge_devices |
+| 90 | simplification - text simplification - sentence - text - readability | 14 | 90_simplification_text simplification_sentence_text |
+| 91 | editing - knowledge editing - model editing - knowledge - editing methods | 14 | 91_editing_knowledge editing_model editing_knowledge |
+| 92 | annotation - data annotation - metadata - annotators - data | 14 | 92_annotation_data annotation_metadata_annotators |
+| 93 | gpu - hardware - communication - memory - accelerators | 14 | 93_gpu_hardware_communication_memory |
+| 94 | argument - arguments - argumentation - fallacy - fallacies | 14 | 94_argument_arguments_argumentation_fallacy |
+| 95 | toxicity - toxic - detoxification - content - toxic content | 14 | 95_toxicity_toxic_detoxification_content |
+| 96 | causal - causal reasoning - causality - causal discovery - causal inference | 14 | 96_causal_causal reasoning_causality_causal discovery |
+| 97 | design - bid - 3d - designs - generative | 14 | 97_design_bid_3d_designs |
+| 98 | chinese - questions - subjects - school - ceval | 14 | 98_chinese_questions_subjects_school |
+| 99 | scientific - papers - review - feedback - reviews | 13 | 99_scientific_papers_review_feedback |
+| 100 | urban - traffic - transportation - foundation models - foundation | 13 | 100_urban_traffic_transportation_foundation models |
+| 101 | humor - sarcasm - jokes - sarcasm detection - funny | 13 | 101_humor_sarcasm_jokes_sarcasm detection |
+| 102 | analogical - analogies - analogy - analogical reasoning - metaphor | 12 | 102_analogical_analogies_analogy_analogical reasoning |
+| 103 | public - early - sentiments - media - topics | 12 | 103_public_early_sentiments_media |
+| 104 | optimizers - adam - deep - networks - training | 12 | 104_optimizers_adam_deep_networks |
+| 105 | log - root - cloud - anomaly detection - anomaly | 12 | 105_log_root_cloud_anomaly detection |
+| 106 | dialogue - norm - norms - conversations - persona | 12 | 106_dialogue_norm_norms_conversations |
+| 107 | speculative - decoding - draft - speculative decoding - draft model | 11 | 107_speculative_decoding_draft_speculative decoding |
+| 108 | protein - sequences - proteins - bioinformatics - protein sequence | 11 | 108_protein_sequences_proteins_bioinformatics |
+| 109 | forgetting - catastrophic forgetting - catastrophic - continual - continual learning | 11 | 109_forgetting_catastrophic forgetting_catastrophic_continual |
+| 110 | software - software engineering - software using - chatgpt - software testing | 11 | 110_software_software engineering_software using_chatgpt |
+| 111 | verification - sva - configuration - proof - verified | 10 | 111_verification_sva_configuration_proof |
+</details>
+## Training hyperparameters
+* calculate_probabilities: False
+* language: english
+* low_memory: False
+* min_topic_size: 10
+* n_gram_range: (1, 1)
+* nr_topics: None
+* seed_topic_list: None
+* top_n_words: 10
+* verbose: True
+* zeroshot_min_similarity: 0.7
+* zeroshot_topic_list: None
+## Framework versions
+* Numpy: 1.25.2
+* HDBSCAN: 0.8.33
+* UMAP: 0.5.6
+* Pandas: 2.0.3
+* Scikit-Learn: 1.2.2
+* Sentence-transformers: 2.6.1
+* Transformers: 4.38.2
+* Numba: 0.58.1
+* Plotly: 5.15.0
+* Python: 3.10.12
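
Reviewer note on the topic table: the `Label` column appears to follow an `<id>_<kw1>_<kw2>_<kw3>_<kw4>` pattern, with multi-word keywords such as `language models` kept intact. Topic `-1` is the bucket BERTopic uses for outlier documents. The sketch below splits a label back into its ID and keywords. It assumes the pattern holds for every row and that no keyword contains an underscore; `parse_label` is a hypothetical helper, not part of BERTopic.

```python
def parse_label(label: str) -> tuple[int, list[str]]:
    """Split a label like '0_visual_multimodal_image_images' into (id, keywords)."""
    # Only the first underscore separates the topic ID from the keywords,
    # so a negative ID such as '-1' is handled without special-casing.
    topic_id, _, rest = label.partition("_")
    return int(topic_id), rest.split("_")


# Three labels copied from the table above.
labels = [
    "-1_models_language_llms_language models",
    "0_visual_multimodal_image_images",
    "46_commonsense_commonsense knowledge_knowledge_commonsense reasoning",
]

# Map topic ID -> list of keywords.
parsed = dict(parse_label(label) for label in labels)
```

Splitting only on the first underscore for the ID keeps the parsing independent of how many keywords a label carries.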

config.json ADDED

	@@ -0,0 +1,17 @@

+{
+  "calculate_probabilities": false,
+  "language": "english",
+  "low_memory": false,
+  "min_topic_size": 10,
+  "n_gram_range": [
+    1,
+    1
+  ],
+  "nr_topics": null,
+  "seed_topic_list": null,
+  "top_n_words": 10,
+  "verbose": true,
+  "zeroshot_min_similarity": 0.7,
+  "zeroshot_topic_list": null,
+  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
+}
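
Reviewer note on `config.json`: the keys mirror the "Training hyperparameters" list in the README, except that `n_gram_range` is stored as a JSON list because JSON has no tuple type. The sketch below turns the file contents into keyword arguments. It assumes the key names match the `BERTopic` constructor parameters in the installed version; `config_to_kwargs` is a hypothetical helper, and the constructor call is left as a comment so the block runs without BERTopic installed.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_TEXT = """
{
  "calculate_probabilities": false,
  "language": "english",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
"""


def config_to_kwargs(config: dict) -> dict:
    """Return a copy of the config with JSON-only types converted."""
    kwargs = dict(config)
    # Restore the tuple form shown in the README: (1, 1).
    kwargs["n_gram_range"] = tuple(kwargs["n_gram_range"])
    return kwargs


kwargs = config_to_kwargs(json.loads(CONFIG_TEXT))

# Assumed next step (requires `pip install -U bertopic`):
#   from bertopic import BERTopic
#   fresh_model = BERTopic(**kwargs)
```

This builds an untrained model with the same settings; it does not reproduce the fitted topics, which live in the other files of the commit.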

ctfidf.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8139319c9fae6ed30a23c9f6943ff70b620d7a15d1b4c1f2670d9ec2e2eb3316
+size 6458883

ctfidf_config.json ADDED

The diff for this file is too large to render.

topic_embeddings.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b759d94757ec4223d8a86dca08063fc8985666cf90f1b0961ea2ca41c962b2b6
+size 174857
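
Reviewer note on the two `.bin` entries: each diff shows a Git LFS pointer (a `version` line, an `oid` with a SHA-256 digest, and a `size` in bytes) rather than the binary itself. The sketch below parses a pointer and checks a downloaded file against it. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers written for this note, and the whole file is read into memory, which is acceptable for files of a few megabytes.

```python
import hashlib
from pathlib import Path

# Pointer for topic_embeddings.bin, copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:b759d94757ec4223d8a86dca08063fc8985666cf90f1b0961ea2ca41c962b2b6
size 174857
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    # The oid has the form '<algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict) -> bool:
    """Return True if the file's size and digest agree with the pointer."""
    data = path.read_bytes()
    if len(data) != pointer["size"]:
        return False
    return hashlib.new(pointer["algo"], data).hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
```

Calling `matches_pointer(Path("topic_embeddings.bin"), pointer)` on a local download would then confirm that the file is the one referenced by this commit.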

topics.json ADDED

The diff for this file is too large to render.