D0men1c0 committed · Commit f842177 · verified · 1 Parent(s): c45aeec

Add BERTopic model

README.md ADDED
@@ -0,0 +1,77 @@
+
+ ---
+ tags:
+ - bertopic
+ library_name: bertopic
+ pipeline_tag: text-classification
+ ---
+
+ # ISSR_Dark_Web_7Topics
+
+ This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+ BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+ ## Usage
+
+ To use this model, please install BERTopic:
+
+ ```
+ pip install -U bertopic
+ ```
+
+ You can use the model as follows:
+
+ ```python
+ from bertopic import BERTopic
+ topic_model = BERTopic.load("D0men1c0/ISSR_Dark_Web_7Topics")
+
+ topic_model.get_topic_info()
+ ```
+
+ ## Topic overview
+
+ * Number of topics: 8
+ * Number of training documents: 65529
+
+ <details>
+ <summary>Click here for an overview of all topics.</summary>
+
+ | Topic ID | Topic Keywords | Topic Frequency | Label |
+ |----------|----------------|-----------------|-------|
+ | -1 | anyone - new - help - free - please | 2823 | -1_anyone_new_help_free |
+ | 0 | weed - xanax - vendor - cocaine - mg | 27613 | Drug Vendor Europe |
+ | 1 | market - empire - dream - nightmare - vendor | 8645 | Dream Vendor Nightmare |
+ | 2 | vendor - scammer - scam - looking - scamming | 6236 | Trusted Vendor Scams |
+ | 3 | review - vendor review - vendor - review vendor - review review | 6907 | Vendor MDMA Review |
+ | 4 | mdma - lsd - get - looking - wsm | 4230 | Drug Discussion |
+ | 5 | order - package - shipping - delivery - pack | 6299 | Order Shipping & Tracking |
+ | 6 | bitcoin - card - wallet - btc - bank | 2776 | Financial Services and Products |
+
+ </details>
+
+ ## Training hyperparameters
+
+ * calculate_probabilities: False
+ * language: None
+ * low_memory: False
+ * min_topic_size: 10
+ * n_gram_range: (1, 2)
+ * nr_topics: None
+ * seed_topic_list: [['tor site', 'drug', 'cocaine', 'ketamine', 'weed', 'trafficking', 'scammer', 'market', 'vendor', 'bitcoin', 'mdma', 'coke', 'lsd', 'heroine', 'xanax', 'tor node', 'tor site', 'gun', 'weapon', 'hacking']]
+ * top_n_words: 10
+ * verbose: True
+ * zeroshot_min_similarity: 0.05
+ * zeroshot_topic_list: [['burglary', 'buy drugs', 'buy weapons', 'child abuse', 'check sale', 'corruption', 'counterfeit money', 'drugs', 'espionage', 'fake IDs', 'find vendor', 'fraud', 'gun', 'hacking', 'kidnapping', 'murder', 'organ trafficking', 'pedophilia', 'rape', 'scammer', 'sell drugs', 'terrorism', 'trafficking']]
+
+ ## Framework versions
+
+ * Numpy: 1.26.4
+ * HDBSCAN: 0.8.36
+ * UMAP: 0.5.6
+ * Pandas: 2.2.1
+ * Scikit-Learn: 1.4.1.post1
+ * Sentence-transformers: 3.0.1
+ * Transformers: 4.39.3
+ * Numba: 0.60.0
+ * Plotly: 5.22.0
+ * Python: 3.12.2
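
As a quick consistency check on the topic overview in the README above, the per-topic frequencies should add up to the reported number of training documents. The sketch below hard-codes the eight rows of the table and computes each topic's share of the corpus. The names `TOPIC_FREQUENCIES` and `topic_shares` are illustrative only and are not part of BERTopic.

```python
# Frequencies copied from the README topic table.
# Topic -1 is the bucket BERTopic uses for outlier documents.
TOPIC_FREQUENCIES = {
    -1: 2823,
    0: 27613,
    1: 8645,
    2: 6236,
    3: 6907,
    4: 4230,
    5: 6299,
    6: 2776,
}

# "Number of training documents" as reported in the README.
REPORTED_TRAINING_DOCUMENTS = 65529


def topic_shares(frequencies):
    """Return each topic's fraction of all documents (hypothetical helper)."""
    total = sum(frequencies.values())
    return {topic: count / total for topic, count in frequencies.items()}


total = sum(TOPIC_FREQUENCIES.values())
shares = topic_shares(TOPIC_FREQUENCIES)

print("frequencies match reported total:", total == REPORTED_TRAINING_DOCUMENTS)
# List topics from largest to smallest share of the corpus.
for topic, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"topic {topic:>2}: {share:.1%}")
```

The frequencies sum to 65529, matching the reported document count. Topic 0 alone accounts for roughly 42% of the corpus, and the reported count of 8 topics appears to include the outlier topic -1 alongside the seven named topics.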
config.json ADDED
@@ -0,0 +1,66 @@
+ {
+   "calculate_probabilities": false,
+   "language": null,
+   "low_memory": false,
+   "min_topic_size": 10,
+   "n_gram_range": [
+     1,
+     2
+   ],
+   "nr_topics": null,
+   "seed_topic_list": [
+     [
+       "tor site",
+       "drug",
+       "cocaine",
+       "ketamine",
+       "weed",
+       "trafficking",
+       "scammer",
+       "market",
+       "vendor",
+       "bitcoin",
+       "mdma",
+       "coke",
+       "lsd",
+       "heroine",
+       "xanax",
+       "tor node",
+       "tor site",
+       "gun",
+       "weapon",
+       "hacking"
+     ]
+   ],
+   "top_n_words": 10,
+   "verbose": true,
+   "zeroshot_min_similarity": 0.05,
+   "zeroshot_topic_list": [
+     [
+       "burglary",
+       "buy drugs",
+       "buy weapons",
+       "child abuse",
+       "check sale",
+       "corruption",
+       "counterfeit money",
+       "drugs",
+       "espionage",
+       "fake IDs",
+       "find vendor",
+       "fraud",
+       "gun",
+       "hacking",
+       "kidnapping",
+       "murder",
+       "organ trafficking",
+       "pedophilia",
+       "rape",
+       "scammer",
+       "sell drugs",
+       "terrorism",
+       "trafficking"
+     ]
+   ],
+   "embedding_model": "distiluse-base-multilingual-cased-v1"
+ }
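
The two keyword lists in `config.json` can be inspected with the standard `json` module. The sketch below embeds only the `seed_topic_list` and `zeroshot_topic_list` values from the diff as a string (in practice one would read the file itself) and reports repeated seed terms and terms shared by both lists. The `duplicates` helper and the `CONFIG_EXCERPT` name are illustrative assumptions.

```python
import json

# Excerpt of config.json from the diff above: only the two keyword lists.
CONFIG_EXCERPT = """
{
  "seed_topic_list": [[
    "tor site", "drug", "cocaine", "ketamine", "weed", "trafficking",
    "scammer", "market", "vendor", "bitcoin", "mdma", "coke", "lsd",
    "heroine", "xanax", "tor node", "tor site", "gun", "weapon", "hacking"
  ]],
  "zeroshot_topic_list": [[
    "burglary", "buy drugs", "buy weapons", "child abuse", "check sale",
    "corruption", "counterfeit money", "drugs", "espionage", "fake IDs",
    "find vendor", "fraud", "gun", "hacking", "kidnapping", "murder",
    "organ trafficking", "pedophilia", "rape", "scammer", "sell drugs",
    "terrorism", "trafficking"
  ]]
}
"""


def duplicates(items):
    """Return items that occur more than once, in first-repeat order."""
    seen, repeated = set(), []
    for item in items:
        if item in seen and item not in repeated:
            repeated.append(item)
        seen.add(item)
    return repeated


config = json.loads(CONFIG_EXCERPT)
# Both values are stored as a list containing a single inner list.
seed_terms = config["seed_topic_list"][0]
zeroshot_terms = config["zeroshot_topic_list"][0]

print("seed terms:", len(seed_terms), "zero-shot terms:", len(zeroshot_terms))
print("repeated seed terms:", duplicates(seed_terms))
print("shared terms:", sorted(set(seed_terms) & set(zeroshot_terms)))
```

Run against the values shown in the diff, this flags `tor site` as listed twice in the seed terms and shows which terms appear in both lists.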
ctfidf.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6c2a5e3b58b6a822acb4975ea44011a7ba329b06f1fbb14c18455bc6427f1e18
+ size 5155408
ctfidf_config.json ADDED
The diff for this file is too large to render. See raw diff
 
topic_embeddings.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:de0c7b704f576f6c36ffc3037ff83c1a61f22d80f9bf766faa0e9fd887ecaeea
+ size 16472
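
The two `.safetensors` entries in this diff are Git LFS pointer files, not the tensor data itself: each holds three `key value` lines giving the spec version, the object hash, and the size in bytes. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration and is not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content for topic_embeddings.safetensors, copied from the diff.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:de0c7b704f576f6c36ffc3037ff83c1a61f22d80f9bf766faa0e9fd887ecaeea
size 16472
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

The same function applies to the `ctfidf.safetensors` pointer, whose `size` line reports 5155408 bytes.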
topics.json ADDED
The diff for this file is too large to render. See raw diff