Add BERTopic model
- README.md +189 -0
- config.json +16 -0
- ctfidf.safetensors +3 -0
- ctfidf_config.json +0 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,189 @@
---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# BERTopic_ML-ArXiv-Abstracts

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("b-verma/BERTopic_ML-ArXiv-Abstracts")

topic_model.get_topic_info()
```
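
Once the model is loaded, individual topics can be inspected with `get_topic`, which returns `(word, score)` pairs for a topic ID, or `False` when the ID is unknown. The helper below is a minimal sketch and not part of this repository: `top_keywords` is a hypothetical name, and it assumes only an object that exposes `get_topic`.

```python
def top_keywords(topic_model, topic_id, top_n=5):
    """Return up to top_n keywords for topic_id (hypothetical helper)."""
    # BERTopic.get_topic returns a list of (word, score) tuples,
    # or False when the topic ID does not exist in the model.
    words = topic_model.get_topic(topic_id)
    if not words:
        return []
    return [word for word, _score in words[:top_n]]


# Usage with a model loaded via BERTopic.load (requires bertopic and network access):
# top_keywords(topic_model, 0)
```

Returning an empty list for an unknown ID keeps callers from having to special-case the `False` return value.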

## Topic overview

* Number of topics: 120
* Number of training documents: 117592

<details>
<summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | data - models - model - learning - based | 153 | Machine Learning and Deep Learning |
| 0 | policy - reinforcement - reinforcement learning - rl - agent | 32094 | Reinforcement Learning and Control |
| 1 | graph - node - graphs - nodes - gnns | 10423 | Graph Embedding and Representation Learning |
| 2 | speech - audio - speaker - music - asr | 3598 | Speech Technology |
| 3 | 3d - object - video - segmentation - point | 3527 | 3D Object Understanding |
| 4 | equations - differential - physics - differential equations - pdes | 2706 | Discovering and Solving Partial Differential Equations |
| 5 | adversarial - attacks - adversarial examples - robustness - attack | 2287 | Adversarial Robustness |
| 6 | networks - relu - neural - neural networks - activation | 2210 | Deep Learning Activation Functions |
| 7 | segmentation - medical - images - image - tumor | 2207 | Medical Image Segmentation |
| 8 | gradient - stochastic - sgd - convergence - convex | 1835 | Convergence Analysis of Non-Convex Optimization Algorithms |
| 9 | federated - fl - federated learning - clients - privacy | 1717 | Federated Learning and Privacy |
| 10 | channel - wireless - radio - network - communication | 1698 | Channel Allocation and Estimation in Wireless Communications |
| 11 | privacy - private - differential privacy - dp - differential | 1449 | Privacy-Preserving Machine Learning |
| 12 | clinical - patient - patients - medical - health | 1353 | Clinical Patient Representation Learning |
| 13 | gans - gan - generative - generative adversarial - generator | 1298 | Generative Adversarial Networks (GANs) |
| 14 | bandit - regret - arm - bandits - arms | 1269 | Armed Bandit Problems |
| 15 | financial - stock - market - trading - price | 1246 | Financial Time Series Analysis |
| 16 | recommendation - user - item - recommender - items | 1193 | Recommendation Systems |
| 17 | power - energy - electricity - load - forecasting | 1144 | Power and Energy Forecasting |
| 18 | causal - treatment - observational - effect - causal inference | 1070 | Causal Inference and Learning |
| 19 | explanations - explanation - counterfactual - interpretability - interpretable | 1048 | Explanation Methods for Machine Learning Models |
| 20 | driving - autonomous - vehicle - vehicles - driver | 1007 | Autonomous Driving |
| 21 | malware - detection - iot - security - attacks | 959 | Cybersecurity Threats in IoT Networks |
| 22 | quantum - classical - circuit - circuits - quantum machine | 947 | Quantum Machine Learning |
| 23 | fairness - fair - bias - discrimination - protected | 938 | Fair Machine Learning |
| 24 | hardware - memory - gpu - dnn - accelerators | 924 | Edge AI Hardware for Efficient DNN Inference |
| 25 | clustering - means - clusters - cluster - algorithm | 921 | Clustering Algorithms |
| 26 | crop - images - satellite - remote sensing - hyperspectral | 899 | Remote Sensing and Deep Learning |
| 27 | time series - series - time - forecasting - series forecasting | 847 | Time Series Analysis and Forecasting |
| 28 | pruning - compression - sparsity - sparse - network | 836 | Neural Network Pruning |
| 29 | distributed - communication - sgd - decentralized - gradient | 822 | Distributed Optimization Methods |
| 30 | label - labels - multi label - noisy - noisy labels | 812 | Multi-Label Learning |
| 31 | meta - meta learning - shot - task - shot learning | 776 | Few-Shot Learning and Meta-Learning |
| 32 | traffic - temporal - travel - spatial - road | 770 | Traffic Forecasting and Prediction |
| 33 | anomaly - anomaly detection - detection - anomalies - outlier | 750 | Anomaly Detection |
| 34 | uncertainty - calibration - bayesian - bayesian neural - bayesian neural networks | 735 | Uncertainty Estimation in Deep Learning |
| 35 | variational - inference - posterior - mcmc - carlo | 724 | Inference and Approximation |
| 36 | domain - domain adaptation - adaptation - source - target | 717 | Unsupervised Domain Adaptation |
| 37 | continual - continual learning - forgetting - catastrophic forgetting - catastrophic | 679 | Continual Learning and Forgetting |
| 38 | vae - latent - variational - vaes - generative | 678 | Disentangled Representation Learning |
| 39 | visual - image - vqa - modal - captioning | 627 | Multimodal Vision and Language Understanding |
| 40 | code - program - software - programs - source code | 621 | Software Engineering |
| 41 | brain - fmri - functional - disease - ad | 615 | Brain Connectivity and Disease Diagnosis |
| 42 | spiking - snns - spike - neurons - spiking neural | 603 | Spiking Neural Networks (SNNs) |
| 43 | activity - activity recognition - har - gait - sensor | 600 | Human Activity Recognition (HAR) |
| 44 | dictionary - sparse - signal - dictionary learning - recovery | 595 | Sparse Signal Processing |
| 45 | news - social - media - fake - fake news | 580 | Fake News Detection |
| 46 | automl - ml - machine learning - machine - research | 542 | Automated Machine Learning (AutoML) |
| 47 | class - imbalanced - classifiers - minority - classification | 500 | Class Imbalance in Classification |
| 48 | gravitational - galaxy - solar - simulations - mass | 491 | Gravitational Wave Detection and Analysis |
| 49 | molecular - molecules - chemical - drug - molecule | 481 | Molecular Design and Discovery |
| 50 | recurrent - rnns - rnn - recurrent neural - lstm | 479 | Recurrent Neural Networks (RNNs) |
| 51 | bo - bayesian optimization - optimization - bayesian - function | 476 | Global Optimization with Bayesian Methods |
| 52 | logic - reasoning - symbolic - logical - relational | 473 | Integrating Reasoning and Learning |
| 53 | climate - weather - precipitation - water - forecasting | 472 | Climate and Weather Prediction |
| 54 | gp - gaussian - gaussian process - gaussian processes - processes | 470 | Scalable Gaussian Process Inference for Large Datasets |
| 55 | regret - online - online learning - convex - bounds | 456 | Online Learning and Regret Bounds |
| 56 | language - bert - fine - language models - fine tuning | 455 | Fine-tuning Language Models |
| 57 | nas - search - architecture search - architecture - neural architecture | 453 | Neural Architecture Search (NAS) |
| 58 | eeg - bci - brain - eeg signals - signals | 453 | Emotion and Brain Signals Analysis |
| 59 | dialogue - dialog - conversational - responses - conversation | 433 | Conversational AI Models |
| 60 | emotion - emotion recognition - facial - recognition - emotions | 417 | Emotion Recognition |
| 61 | knowledge - knowledge graph - knowledge graphs - kg - entities | 409 | Embedding Knowledge Graphs |
| 62 | active learning - active - al - learning - labeling | 388 | Active Learning |
| 63 | quantization - precision - bit - quantized - floating | 378 | Quantization for Deep Neural Networks |
| 64 | materials - molecular - chemical - atomic - material | 356 | Materials Discovery and Property Prediction using Machine Learning |
| 65 | bounds - pac - bound - generalization - pac bayes | 352 | Generalization Bounds |
| 66 | fault - maintenance - industrial - manufacturing - monitoring | 329 | Fault Detection and Diagnosis in Industrial Settings |
| 67 | translation - machine translation - nmt - neural machine translation - neural machine | 329 | Machine Translation |
| 68 | tensor - tensors - rank - decomposition - tensor completion | 328 | Tensor Completion and Rank Decomposition |
| 69 | topic - topics - topic models - lda - topic modeling | 325 | Topic Modeling |
| 70 | covid - covid 19 - 19 - chest - ct | 312 | Computer-Aided Diagnosis of COVID-19 |
| 71 | teacher - distillation - student - knowledge distillation - knowledge | 310 | Knowledge Transfer and Distillation |
| 72 | students - student - course - courses - educational | 310 | Education Technology |
| 73 | combinatorial - problems - combinatorial optimization - problem - solvers | 310 | Combinatorial Optimization |
| 74 | trees - tree - forest - decision - decision trees | 304 | Interpretable Machine Learning Models |
| 75 | contrastive - contrastive learning - self supervised - supervised - self | 303 | Contrastive Learning for Representation Learning |
| 76 | face - face recognition - facial - deepfake - recognition | 298 | Face Recognition and Bias |
| 77 | lasso - regression - sparse - screening - sparsity | 298 | High-Dimensional Sparse Regression |
| 78 | kernel - kernels - random - regression - ridge | 296 | Kernel Methods and Regression |
| 79 | seismic - inversion - reservoir - oil - velocity | 295 | Seismic Inverse Modeling |
| 80 | backdoor - poisoning - attacks - attack - backdoor attacks | 292 | Backdoor Attacks |
| 81 | manifold - manifold learning - dimensional - manifolds - dimensionality | 288 | Manifold Learning and Dimensionality Reduction |
| 82 | ecg - heart - electrocardiogram - cardiac - signals | 288 | Cardiac Signal Processing and Classification |
| 83 | attention - vision - vit - transformers - transformer | 286 | Computer Vision Transformers |
| 84 | word - embeddings - word embeddings - words - embedding | 283 | "Word Embeddings and Their Applications in Natural Language Processing" |
| 85 | question - qa - questions - answering - answer | 279 | Question Answering |
| 86 | denoising - image - noise - restoration - image denoising | 276 | Image Denoising |
| 87 | ctr - product - commerce - click - user | 274 | Advertising and Predictive Modeling |
| 88 | graphical - graphical models - belief propagation - belief - ising | 266 | Inference and Learning in Graphical Models |
| 89 | transport - ot - optimal transport - wasserstein - optimal | 264 | Optimal Transport and Related Methods |
| 90 | matrix - rank - matrix completion - completion - low rank | 259 | Low Rank Matrix Completion |
| 91 | covid - covid 19 - 19 - pandemic - spread | 255 | COVID-19 Forecasting and Prediction |
| 92 | svm - support vector - support - svms - vector | 255 | Machine Learning - SVM |
| 93 | physics - particle - detector - high energy - energy | 244 | High Energy Particle Physics |
| 94 | feature selection - feature - selection - features - feature selection methods | 240 | Feature Selection for High-Dimensional Data |
| 95 | ranking - items - rank - pairwise - comparisons | 232 | Ranking and Learning from Noisy Comparisons |
| 96 | hyperparameter - hpo - hyperparameters - hyperparameter optimization - optimization | 229 | Hyperparameter Optimization for Deep Learning Models |
| 97 | pricing - revenue - price - auctions - regret | 227 | Dynamic Pricing and Demand Learning |
| 98 | ood - ood detection - distribution - distribution ood - detection | 225 | Out-of-Distribution Detection in Deep Learning |
| 99 | bayesian - bayesian networks - bayesian network - structure - structure learning | 224 | Bayesian Network Structure Learning |
| 100 | pca - principal - principal component - component analysis - principal component analysis | 224 | Principal Component Analysis (PCA) |
| 101 | protein - proteins - sequence - sequences - structure | 214 | Protein Representation and Prediction |
| 102 | hashing - hash - codes - retrieval - search | 209 | Large-Scale Image Retrieval and Hashing |
| 103 | submodular - submodular functions - functions - maximization - approximation | 207 | Submodular Function Minimization |
| 104 | mixture - em - mixtures - em algorithm - mixture models | 206 | Mixture Models and EM Algorithm |
| 105 | metric learning - metric - distance - distance metric - similarity | 206 | Metric Learning for Machine Learning |
| 106 | equivariant - equivariance - group - symmetry - spherical | 200 | Equivariant Deep Learning |
| 107 | nmf - nonnegative - factorization - matrix - matrix factorization | 199 | NMF (Nonnegative Matrix Factorization) |
| 108 | compression - video - coding - distortion - rate distortion | 198 | Neural Compression |
| 109 | mri - reconstruction - pet - imaging - mr | 198 | Magnetic Resonance Imaging Reconstruction |
| 110 | oct - retinal - dr - diabetic - images | 197 | Retinal Imaging and Disease Diagnosis |
| 111 | entity - relation - relation extraction - entities - extraction | 192 | Relation Extraction |
| 112 | handwritten - text - characters - character - recognition | 178 | Handwritten Character Recognition |
| 113 | augmentation - data augmentation - mixup - data - augmentations | 178 | Data Augmentation for Improving Deep Learning Performance |
| 114 | crowdsourcing - workers - crowd - worker - crowdsourced | 168 | Crowdsourcing Labeling and Annotation |
| 115 | summarization - summaries - summary - abstractive - text | 165 | Automatic Summarization |
| 116 | circuit - design - circuits - chip - synthesis | 161 | Circuit Design Optimization |
| 117 | view - multi view - views - multi - clustering | 160 | Multi-View Clustering |
| 118 | cancer - gene - genes - disease - expression | 158 | Cancer Gene Expression Analysis |

</details>
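
The overview table can also be processed programmatically, for example to see how much of the corpus a single topic covers. The sketch below parses rows in the four-column layout used above; `parse_topic_rows` and `SAMPLE` are illustrative names, and only three rows are copied in to keep the example short. In BERTopic, topic `-1` collects documents treated as outliers.

```python
def parse_topic_rows(markdown_table):
    """Parse rows of the topic overview table into dictionaries."""
    rows = []
    for line in markdown_table.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the header and separator rows: their first cell is not an integer.
        if len(cells) != 4 or not cells[0].lstrip("-").isdigit():
            continue
        rows.append({
            "topic_id": int(cells[0]),
            "keywords": cells[1].split(" - "),
            "frequency": int(cells[2]),
            "label": cells[3],
        })
    return rows


# Three rows copied from the table above, for illustration only.
SAMPLE = """
| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | data - models - model - learning - based | 153 | Machine Learning and Deep Learning |
| 0 | policy - reinforcement - reinforcement learning - rl - agent | 32094 | Reinforcement Learning and Control |
| 1 | graph - node - graphs - nodes - gnns | 10423 | Graph Embedding and Representation Learning |
"""

TRAINING_DOCS = 117592  # "Number of training documents" stated above

rows = parse_topic_rows(SAMPLE)
largest = max(rows, key=lambda row: row["frequency"])
share = largest["frequency"] / TRAINING_DOCS
print(f"Topic {largest['topic_id']} ({largest['label']}): {share:.1%} of documents")
```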

## Training hyperparameters

* calculate_probabilities: True
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 1)
* nr_topics: None
* seed_topic_list: None
* top_n_words: 10
* verbose: True
* zeroshot_min_similarity: 0.7
* zeroshot_topic_list: None

## Framework versions

* Numpy: 2.1.3
* HDBSCAN: 0.8.40
* UMAP: 0.5.7
* Pandas: 2.2.3
* Scikit-Learn: 1.6.1
* Sentence-transformers: 3.4.1
* Transformers: 4.49.0
* Numba: 0.61.0
* Plotly: 6.0.1
* Python: 3.10.16
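
To see how a local environment compares with the versions listed above, something like the following sketch may help. It uses only the standard library; the PyPI distribution names (for example `umap-learn` for UMAP) are assumptions of this example and are not stated in the README.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above, keyed by an assumed PyPI distribution name.
EXPECTED = {
    "numpy": "2.1.3",
    "hdbscan": "0.8.40",
    "umap-learn": "0.5.7",
    "pandas": "2.2.3",
    "scikit-learn": "1.6.1",
    "sentence-transformers": "3.4.1",
    "transformers": "4.49.0",
    "numba": "0.61.0",
    "plotly": "6.0.1",
}


def compare_versions(expected):
    """Map each package to (expected, installed); installed is None if missing."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in compare_versions(EXPECTED).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: expected {wanted}, installed {installed} ({status})")
```

A differing version does not necessarily prevent the model from loading; the report is only a starting point when debugging.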
config.json
ADDED
@@ -0,0 +1,16 @@
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [
    1,
    1
  ],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null
}
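
The values in `config.json` correspond to the hyperparameters listed in the README, with JSON types in place of Python ones (`null` for `None`, a two-element list for the `n_gram_range` tuple). The sketch below shows one way to read the file back into Python values; `config_to_kwargs` is a hypothetical helper and makes no claim about how BERTopic itself loads the file.

```python
import json

# Contents of config.json as shown in this commit.
CONFIG_TEXT = """
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null
}
"""


def config_to_kwargs(text):
    """Parse the JSON config and restore Python-native types."""
    config = json.loads(text)
    # JSON has no tuple type, so the (1, 1) range is stored as a list.
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config


kwargs = config_to_kwargs(CONFIG_TEXT)
print(kwargs["n_gram_range"], kwargs["language"])
```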
ctfidf.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e76609192920879b96386c0c4744cd6d61d917b6f6b04365fc32f8b1e220d5a2
size 8087116
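
The three lines above are a Git LFS pointer: the repository stores only the SHA-256 digest and byte size of the real file. A downloaded copy can be checked against the pointer roughly as follows. This is a sketch with hypothetical helper names, and the local file path in the usage comment is an assumption.

```python
import hashlib

# Pointer text as shown in this commit for ctfidf.safetensors.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:e76609192920879b96386c0c4744cd6d61d917b6f6b04365fc32f8b1e220d5a2
size 8087116
"""


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer):
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    hasher = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files are not held in memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
# matches_pointer("ctfidf.safetensors", pointer)  # path to a local download (assumed)
```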
ctfidf_config.json
ADDED
The diff for this file is too large to render.
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c1175de09bb6ff5e71b320a2d0ab51a1806de1ebafa91204cd3f95fce2eb5aac
size 184408
topics.json
ADDED
The diff for this file is too large to render.