MaximSIMO committed on
Commit 7fb3496 · verified · 1 Parent(s): 13f495a

Add BERTopic model

README.md ADDED
@@ -0,0 +1,72 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # bertopic_openai_emb_model
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("MaximSIMO/bertopic_openai_emb_model")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 3
+ * Number of training documents: 100
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | Evening TV Programming | 13 | -1_Evening TV Programming |
+ | 0 | Elettrodotti e ambiente | 15 | 0_Elettrodotti e ambiente |
+ | 1 | Political Tensions | 72 | 1_Political Tensions |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: False
+ * language: multilingual
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 1)
+ * nr_topics: None
+ * seed_topic_list: None
+ * top_n_words: 10
+ * verbose: True
+ * zeroshot_min_similarity: 0.7
+ * zeroshot_topic_list: None
+
+ ## Framework versions
+
+ * Numpy: 2.2.6
+ * HDBSCAN: 0.8.41
+ * UMAP: 0.5.11
+ * Pandas: 2.3.3
+ * Scikit-Learn: 1.7.2
+ * Sentence-transformers: 5.2.2
+ * Transformers: 5.1.0
+ * Numba: 0.63.1
+ * Plotly: 6.5.2
+ * Python: 3.10.19
config.json ADDED
@@ -0,0 +1,17 @@
+ {
+   "calculate_probabilities": false,
+   "language": "multilingual",
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     1
+   ],
+   "nr_topics": null,
+   "seed_topic_list": null,
+   "top_n_words": 10,
+   "verbose": true,
+   "zeroshot_min_similarity": 0.7,
+   "zeroshot_topic_list": null,
+   "embedding_model": "text-embedding-3-large"
+ }
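The README lists `n_gram_range: (1, 1)` while config.json stores `[1, 1]`, since JSON has no tuple type. The following is a minimal sketch of reading the committed config with the standard library and restoring the tuple. The helper name `load_config` is hypothetical, and this is not necessarily how BERTopic's own loader handles the file.

```python
import json

# Config text copied from the config.json hunk in this commit.
CONFIG_TEXT = """
{
  "calculate_probabilities": false,
  "language": "multilingual",
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 1],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "text-embedding-3-large"
}
"""


def load_config(text: str) -> dict:
    """Parse the config and convert n_gram_range from a JSON list to a tuple."""
    config = json.loads(text)
    config["n_gram_range"] = tuple(config["n_gram_range"])
    return config


config = load_config(CONFIG_TEXT)
print(config["n_gram_range"], config["embedding_model"])
```

JSON `null`, `true`, and `false` map to Python `None`, `True`, and `False`, which matches the spelling used in the README's hyperparameter list.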
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d2acc8a20bd1102778a6dcc66b127684ac1e7596be40b95386a497a7a0851f6d
+ size 749500
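The three added lines are a Git LFS pointer, not the tensor data itself: each line is a key and a value separated by a single space. As a rough sketch, the pointer can be split into its fields as shown here. The function name `parse_lfs_pointer` is hypothetical, and the parser assumes only the three keys that appear in this commit.

```python
# Pointer text copied from the ctfidf.safetensors hunk in this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:d2acc8a20bd1102778a6dcc66b127684ac1e7596be40b95386a497a7a0851f6d
size 749500
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into version, hash algorithm, digest, and size."""
    # Each non-empty line has the form "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algorithm"], pointer["size"])
```

The same helper applies unchanged to the topic_embeddings.safetensors pointer later in this commit.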
ctfidf_config.json ADDED
The diff for this file is too large to render.
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:df710c9fa4325239be3897c4c4e6e8d30b9c0fb62054ec4b30c0b5263888c520
+ size 36952
topics.json ADDED
@@ -0,0 +1,154 @@
+ {
+   "topic_representations": {
+     "-1": [
+       [
+         "Evening TV Programming",
+         1
+       ]
+     ],
+     "0": [
+       [
+         "Elettrodotti e ambiente",
+         1
+       ]
+     ],
+     "1": [
+       [
+         "Political Tensions",
+         1
+       ]
+     ]
+   },
+   "topics": [
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     -1,
+     1,
+     -1,
+     -1,
+     1,
+     -1,
+     1,
+     1,
+     0,
+     0,
+     -1,
+     -1,
+     -1,
+     -1,
+     -1,
+     1,
+     0,
+     -1,
+     -1,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     -1,
+     1,
+     1,
+     1,
+     1,
+     -1,
+     -1,
+     -1,
+     1,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     1,
+     1,
+     0,
+     1,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0,
+     0
+   ],
+   "topic_sizes": {
+     "0": 72,
+     "-1": 15,
+     "1": 13
+   },
+   "topic_mapper": [
+     [
+       -1,
+       -1,
+       -1
+     ],
+     [
+       0,
+       0,
+       1
+     ],
+     [
+       1,
+       1,
+       0
+     ]
+   ],
+   "topic_labels": {
+     "-1": "-1_Evening TV Programming",
+     "0": "0_Elettrodotti e ambiente",
+     "1": "1_Political Tensions"
+   },
+   "custom_labels": null,
+   "_outliers": 1,
+   "topic_aspects": {}
+ }
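The `topic_sizes` object in topics.json can be recomputed from the `topics` array, which holds one topic ID per training document. Here is a small sketch of that cross-check, with the array transcribed from this hunk as runs of repeated values. The variable names are illustrative only.

```python
from collections import Counter

# Topic assignments transcribed from the "topics" array in the topics.json hunk,
# written as runs of repeated values to keep the listing short.
topics = (
    [0] * 32
    + [-1, 1, -1, -1, 1, -1, 1, 1, 0, 0, -1, -1, -1, -1, -1, 1, 0, -1, -1]
    + [0] * 16
    + [-1, 1, 1, 1, 1, -1, -1, -1, 1]
    + [0] * 6
    + [1, 1, 0, 1]
    + [0] * 14
)

# "topic_sizes" as committed; JSON object keys are strings, so convert them to int.
topic_sizes = {"0": 72, "-1": 15, "1": 13}
expected = {int(topic_id): size for topic_id, size in topic_sizes.items()}

# Count how many documents were assigned to each topic ID.
counts = Counter(topics)
sizes_match = dict(counts) == expected
print(len(topics), dict(counts), sizes_match)
```

The array has 100 entries, matching the README's "Number of training documents: 100", and the recomputed counts agree with `topic_sizes`. They differ from the Topic Frequency column of the README table in this commit, which lists 13, 15, and 72 for topics -1, 0, and 1, so that table may be worth regenerating.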