Add BERTopic model
- README.md +77 -0
- config.json +66 -0
- ctfidf.safetensors +3 -0
- ctfidf_config.json +0 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,77 @@

---
tags:
- bertopic
library_name: bertopic
pipeline_tag: text-classification
---

# ISSR_Dark_Web_7Topics

This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.

## Usage

To use this model, please install BERTopic:

```
pip install -U bertopic
```

You can use the model as follows:

```python
from bertopic import BERTopic
topic_model = BERTopic.load("D0men1c0/ISSR_Dark_Web_7Topics")

topic_model.get_topic_info()
```
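
Beyond inspecting the topic table, a loaded model can also assign topics to unseen text. The following is a minimal sketch and not part of the original model card: the helper name `assign_topics` and the example sentences are illustrative assumptions. It relies on `BERTopic.transform`, which returns the predicted topic IDs together with their probabilities or similarity scores. Loading needs network access to the Hugging Face Hub and to the embedding model named in `config.json`.

```python
from typing import List, Tuple


def assign_topics(
    docs: List[str],
    model_id: str = "D0men1c0/ISSR_Dark_Web_7Topics",
) -> List[Tuple[str, int]]:
    """Return (document, topic_id) pairs for previously unseen documents.

    `assign_topics` is a hypothetical helper, not part of BERTopic itself.
    """
    # Imported lazily so the helper can be defined without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic.load(model_id)
    # transform() returns the predicted topic IDs and their scores.
    topics, _ = topic_model.transform(docs)
    return [(doc, int(topic)) for doc, topic in zip(docs, topics)]


# Example call (downloads the model, so it is left commented out):
# assign_topics(["is this vendor legit or a scammer?", "package stuck in customs"])
```

A topic ID of `-1` in the result means the document was treated as an outlier rather than assigned to one of the named topics.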

## Topic overview

* Number of topics: 8
* Number of training documents: 65529

<details>
<summary>Click here for an overview of all topics.</summary>

| Topic ID | Topic Keywords | Topic Frequency | Label |
|----------|----------------|-----------------|-------|
| -1 | anyone - new - help - free - please | 2823 | -1_anyone_new_help_free |
| 0 | weed - xanax - vendor - cocaine - mg | 27613 | Drug Vendor Europe |
| 1 | market - empire - dream - nightmare - vendor | 8645 | Dream Vendor Nightmare |
| 2 | vendor - scammer - scam - looking - scamming | 6236 | Trusted Vendor Scams |
| 3 | review - vendor review - vendor - review vendor - review review | 6907 | Vendor MDMA Review |
| 4 | mdma - lsd - get - looking - wsm | 4230 | Drug Discussion |
| 5 | order - package - shipping - delivery - pack | 6299 | Order Shipping & Tracking |
| 6 | bitcoin - card - wallet - btc - bank | 2776 | Financial Services and Products |

</details>
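
The count of 8 topics covers the 7 labeled topics (0 to 6) plus topic `-1`, which BERTopic uses for outlier documents. As a quick sanity check, the frequencies in the table can be summed and compared with the number of training documents. This is a small sketch that uses only the numbers listed above; the variable names are illustrative.

```python
# Topic frequencies copied from the overview table (topic ID -> document count).
topic_frequency = {
    -1: 2823,
    0: 27613,
    1: 8645,
    2: 6236,
    3: 6907,
    4: 4230,
    5: 6299,
    6: 2776,
}
n_training_documents = 65529

# Every training document belongs to exactly one topic (or to the outlier
# topic -1), so the frequencies should add up to the corpus size.
total = sum(topic_frequency.values())
print(total == n_training_documents)  # True

# Share of the corpus per topic, in percent, rounded to one decimal place.
shares = {
    topic: round(100 * count / total, 1)
    for topic, count in topic_frequency.items()
}
for topic, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"topic {topic:>2}: {share:5.1f}%")
```

Topic 0 accounts for a little over 42% of the documents, so the corpus is dominated by a single topic.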

## Training hyperparameters

* calculate_probabilities: False
* language: None
* low_memory: False
* min_topic_size: 10
* n_gram_range: (1, 2)
* nr_topics: None
* seed_topic_list: [['tor site', 'drug', 'cocaine', 'ketamine', 'weed', 'trafficking', 'scammer', 'market', 'vendor', 'bitcoin', 'mdma', 'coke', 'lsd', 'heroine', 'xanax', 'tor node', 'tor site', 'gun', 'weapon', 'hacking']]
* top_n_words: 10
* verbose: True
* zeroshot_min_similarity: 0.05
* zeroshot_topic_list: [['burglary', 'buy drugs', 'buy weapons', 'child abuse', 'check sale', 'corruption', 'counterfeit money', 'drugs', 'espionage', 'fake IDs', 'find vendor', 'fraud', 'gun', 'hacking', 'kidnapping', 'murder', 'organ trafficking', 'pedophilia', 'rape', 'scammer', 'sell drugs', 'terrorism', 'trafficking']]

## Framework versions

* Numpy: 1.26.4
* HDBSCAN: 0.8.36
* UMAP: 0.5.6
* Pandas: 2.2.1
* Scikit-Learn: 1.4.1.post1
* Sentence-transformers: 3.0.1
* Transformers: 4.39.3
* Numba: 0.60.0
* Plotly: 5.22.0
* Python: 3.12.2

config.json
ADDED
@@ -0,0 +1,66 @@
{
  "calculate_probabilities": false,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [
    1,
    2
  ],
  "nr_topics": null,
  "seed_topic_list": [
    [
      "tor site",
      "drug",
      "cocaine",
      "ketamine",
      "weed",
      "trafficking",
      "scammer",
      "market",
      "vendor",
      "bitcoin",
      "mdma",
      "coke",
      "lsd",
      "heroine",
      "xanax",
      "tor node",
      "tor site",
      "gun",
      "weapon",
      "hacking"
    ]
  ],
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.05,
  "zeroshot_topic_list": [
    [
      "burglary",
      "buy drugs",
      "buy weapons",
      "child abuse",
      "check sale",
      "corruption",
      "counterfeit money",
      "drugs",
      "espionage",
      "fake IDs",
      "find vendor",
      "fraud",
      "gun",
      "hacking",
      "kidnapping",
      "murder",
      "organ trafficking",
      "pedophilia",
      "rape",
      "scammer",
      "sell drugs",
      "terrorism",
      "trafficking"
    ]
  ],
  "embedding_model": "distiluse-base-multilingual-cased-v1"
}
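
JSON has no tuple or `None` type, which is why the README's `n_gram_range: (1, 2)` and `nr_topics: None` appear here as `[1, 2]` and `null`. The sketch below shows one way to read such a config back into Python values. The helper name `load_hyperparameters` and the shortened example string are assumptions made for illustration; this is not how BERTopic itself loads the file.

```python
import json
from typing import Any, Dict

# Keys stored as JSON arrays that are more naturally tuples in Python.
TUPLE_KEYS = {"n_gram_range"}


def load_hyperparameters(config_text: str) -> Dict[str, Any]:
    """Parse a config.json string and restore tuple-valued settings."""
    config = json.loads(config_text)  # null -> None, false -> False
    for key in TUPLE_KEYS:
        if key in config:
            config[key] = tuple(config[key])
    return config


# Shortened excerpt of the config above, used only as example input.
example = """
{
  "min_topic_size": 10,
  "n_gram_range": [1, 2],
  "nr_topics": null,
  "embedding_model": "distiluse-base-multilingual-cased-v1"
}
"""

params = load_hyperparameters(example)
print(params["n_gram_range"])  # (1, 2)
print(params["nr_topics"])     # None
```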
ctfidf.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6c2a5e3b58b6a822acb4975ea44011a7ba329b06f1fbb14c18455bc6427f1e18
size 5155408
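
The three lines above are a Git LFS pointer: the repository stores only the spec version, the SHA-256 digest (`oid`) and the byte size of the real file. A downloaded copy can be checked against the pointer as in the following sketch. The function names `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, not part of Git LFS or BERTopic.

```python
import hashlib
from pathlib import Path
from typing import Dict

POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6c2a5e3b58b6a822acb4975ea44011a7ba329b06f1fbb14c18455bc6427f1e18
size 5155408
"""


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a pointer file into a dict entry."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())


def matches_pointer(path: Path, pointer: Dict[str, str]) -> bool:
    """Check that a local file has the size and digest named in the pointer."""
    algorithm, _, digest = pointer["oid"].partition(":")
    data = path.read_bytes()
    return (
        len(data) == int(pointer["size"])
        and hashlib.new(algorithm, data).hexdigest() == digest
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["size"])  # 5155408

# Usage, assuming the file has been downloaded next to this script:
# matches_pointer(Path("ctfidf.safetensors"), pointer)
```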
ctfidf_config.json
ADDED
The diff for this file is too large to render.
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:de0c7b704f576f6c36ffc3037ff83c1a61f22d80f9bf766faa0e9fd887ecaeea
size 16472
topics.json
ADDED
The diff for this file is too large to render.