Add BERTopic model
- README.md +138 -0
- config.json +17 -0
- ctfidf.safetensors +3 -0
- ctfidf_config.json +0 -0
- topic_embeddings.safetensors +3 -0
- topics.json +0 -0
README.md
ADDED
@@ -0,0 +1,138 @@
+
+---
+tags:
+- bertopic
+library_name: bertopic
+pipeline_tag: text-classification
+---
+
+# ISSR_Dark_Web_68Topics
+
+This is a [BERTopic](https://github.com/MaartenGr/BERTopic) model.
+BERTopic is a flexible and modular topic modeling framework that allows for the generation of easily interpretable topics from large datasets.
+
+## Usage
+
+To use this model, please install BERTopic:
+
+```
+pip install -U bertopic
+```
+
+You can use the model as follows:
+
+```python
+from bertopic import BERTopic
+topic_model = BERTopic.load("D0men1c0/ISSR_Dark_Web_68Topics")
+
+topic_model.get_topic_info()
+```
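Beyond inspecting the topic table, the loaded model can assign topics to unseen text. The sketch below is not part of the original model card: it assumes `bertopic` is installed, that the repository can be downloaded from the Hub, and that the embedding model named in `config.json` can be loaded. The helper names `assign_topics` and `count_topics` are hypothetical.

```python
from collections import Counter


def assign_topics(docs, repo_id="D0men1c0/ISSR_Dark_Web_68Topics"):
    """Load the model from the Hugging Face Hub and return one topic ID per document."""
    # Imported lazily so count_topics() below works without bertopic installed.
    from bertopic import BERTopic

    topic_model = BERTopic.load(repo_id)
    # transform() returns the predicted topic IDs and their probabilities.
    topics, _probs = topic_model.transform(docs)
    return [int(t) for t in topics]


def count_topics(topic_ids):
    """Count how many documents fall into each topic ID (-1 is the outlier topic)."""
    return Counter(topic_ids)
```

For example, `count_topics(assign_topics(docs))` for a list of strings `docs` would give a per-topic document count that can be compared against the topic table in this README.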
+
+## Topic overview
+
+* Number of topics: 69
+* Number of training documents: 65529
+
+<details>
+  <summary>Click here for an overview of all topics.</summary>
+
+  | Topic ID | Topic Keywords | Topic Frequency | Label |
+  |----------|----------------|-----------------|-------|
+  | -1 | anyone - get - update - review - new | 178 | outliers |
+  | 0 | weed - cannabis - cart - thc - review | 18900 | Cannabis Weed Vape Cart Reviews |
+  | 1 | help - guy - sub - need - back | 5021 | Subreddit Help Needed |
+  | 2 | order - shipping - package - pack - delivery | 2035 | USPS Package Delivery |
+  | 3 | empire - empire market - empire empire - market - deposit | 2005 | Empire Market Deposit Issues |
+  | 4 | vendor - vendor vendor - vendor inquiry - inquiry - new vendor | 1815 | Trusted Vendor Inquiries |
+  | 5 | scammer - scam - exit - scamming - scammed | 3142 | Exit Scammer Warning |
+  | 6 | darknet - dark - web - dark web - darkfail | 1649 | Dark Web Drug Trafficking Arrests |
+  | 7 | mdma - mda - mdma vendor - domestic - usa | 1390 | MDMA Vendor USA Sale |
+  | 8 | xanax - mg - diazepam - xanax vendor - valium | 1319 | Xanax Vendor Xanax Bars |
+  | 9 | lsd - ug - tab - lsd vendor - acid | 1233 | LSD Vendor Tab List |
+  | 10 | crosspost - giveaway - review crosspost - crosspost vendor - review | 1012 | Giveaway Crossposts |
+  | 11 | monero - btc - bitcoin - coin - wallet | 999 | Monero Exchange |
+  | 12 | carding - card - credit - credit card - debit | 933 | Credit Card Service |
+  | 13 | dream - dream market - nightmare - market - dream dream | 909 | Dream Market |
+  | 14 | dispute - mod - moderator - dispute dispute - please | 882 | Dispute resolution |
+  | 15 | cocaine - cocaine vendor - fishscale - peruvian - colombian | 763 | Cocaine Vendor Fish |
+  | 16 | review - vendor review - review vendor - vendor - review review | 771 | Vendor Reviews Product |
+  | 17 | market - market market - new market - markets - marketplace | 998 | New Market Core |
+  | 18 | pgp - key - pgp key - public - public pgp | 1020 | Public PGP Key |
+  | 19 | deposit - deposited - ticket - address - double | 654 | Double Deposit Support Ticket |
+  | 20 | bar - bunk - bars - selaminy - hulk | 646 | Bar Reviews |
+  | 21 | oxycodone - mg - oxy - opiate - opiateconnect | 590 | Oxycodone and Dilaudid Purchase |
+  | 22 | id - passport - fake - fake id - license | 608 | Fake IDs and Licenses |
+  | 23 | drug - drugsuk - drugs - selling drug - drug dealer | 544 | Drug Misadvertizing Risks |
+  | 24 | coke - coke vendor - best coke - uk coke - uk | 556 | Coke Topic |
+  | 25 | pill - xtc - xtc pill - ecstasy - pills | 544 | xtc pills for sale |
+  | 26 | counterfeit - note - euro - money - counterfeit money | 509 | Counterfeit Notes |
+  | 27 | ketamine - ketamine vendor - ketamine review - mdma ketamine - review ketamine | 483 | Ketamine Vendor |
+  | 28 | wsm - wsm wsm - wsm vendor - vendor wsm - wsm order | 467 | WSM Vendor |
+  | 29 | meth - crystal meth - crystal - meth vendor - methamphetamine | 485 | Crystal Meth Vendor Review |
+  | 30 | ticket - support ticket - support - please - ticket support | 479 | Ticket Support |
+  | 31 | hacked - hacker - hacking - job - lfw | 480 | Hacker Developer Job Exploits |
+  | 32 | login - account - password - log - fa | 471 | Login Issues |
+  | 33 | adderall - mg - ir - ritalin - vyvanse | 470 | Adderall Pharmacy Brand Name |
+  | 34 | xmr - btc xmr - btc - xmrto - xmr btc | 474 | xmr xmr |
+  | 35 | tails - tail - electrum - wallet - whonix | 433 | Tails Electrum Monero Wallet Issue |
+  | 36 | mushroom - mushrooms - shrooms - magic - cubensis | 393 | Mushrooms Magic Penis |
+  | 37 | dread - dread dread - cafe dread - cafe - dread word | 371 | Cafe Dread Topics |
+  | 38 | cc - cvv - vbv - cc vendor - cc cvv | 419 | cc vending |
+  | 39 | cryptonia - cryptonia market - cryptonia cryptonia - dcdutchconnectionuk - empire cryptonia | 382 | Cryptonia Vendor |
+  | 40 | withdraw - withdrawal - withdrawl - withdraws - btc | 375 | Withdrawal Issues BTC |
+  | 41 | escrow - multisig - escrow escrow - full escrow - escrow order | 397 | Escrow Services and Multisig |
+  | 42 | heroin - heroin vendor - afghan - afghan heroin - synthetic heroin | 335 | Afghan Heroin Sale |
+  | 43 | de - har - noen - som - fra | 378 | Discussion Topics |
+  | 44 | dnm - dnms - dn - bible - dnstars | 341 | DNMS Bible |
+  | 45 | wallstreet - wall - wall street - street - wall st | 339 | Wall Street Market |
+  | 46 | ddos - ddos attack - attack - ddos ddos - ddos attacks | 306 | DDOS Attacks |
+  | 47 | paypal - transfer - paypal transfer - paypal account - western union | 283 | PayPal Transfer Scams |
+  | 48 | heard - happened - anyone - anyone heard - thewizzardnl | 327 | Document Mentions |
+  | 49 | benzos - benzo - rc - benzo vendor - rc benzos | 281 | Benzos Vendors |
+  | 50 | fraud - fraudsters - fraud vendor - loan fraud - fraudfox | 308 | Fraud Vendor Loan |
+  | 51 | dream - dream vendor - dream market - vendor dream - vendor | 326 | Dream Market Vendor Inquiry |
+  | 52 | order - cancel - cancelled - refund - cancel order | 497 | Order Cancelled |
+  | 53 | bank - bank log - bank drop - log - bank account | 331 | Bank Fraud Cards |
+  | 54 | onion - onion site - site - onion link - onion list | 328 | Onion links |
+  | 55 | phishing - phishing link - phished - link - warning | 245 | Phishing Warning |
+  | 56 | apollon - apollon market - market - apollon apollon - mysteryland | 253 | Apollon Market |
+  | 57 | opsec - opsec opsec - opsec question - bad opsec - question | 242 | Opsec and Guides |
+  | 58 | link - working link - working - pm - link please | 226 | PM Working Share Links |
+  | 59 | mirror - working mirror - working - mirror link - empire mirror | 229 | mirror link |
+  | 60 | fentanyl - fent - carfentanil - selling fentanyl - analogue | 219 | Fentanyl |
+  | 61 | cgmc - invite - invite code - code - cgmc invite | 221 | Invite Code CGMC |
+  | 62 | alprazolam - powder - alprazolam powder - flualprazolam - etizolam | 211 | Alprazolam Powder |
+  | 63 | dmt - dmt vendor - dmt vape - odsmt - dmt dmt | 290 | DMT Vendors |
+  | 64 | captcha - rapture - rapture market - captcha captcha - incorrect | 202 | Rapture Market Captcha |
+  | 65 | chemical - research - research chemical - chems - research chemicals | 187 | Research Chemicals |
+  | 66 | tor - tor browser - browser - tor network - network | 198 | Tor Browser Research |
+  | 67 | mephedrone - meopcp - mxe - mescaline - mmc | 222 | Mephedrone |
+
+</details>
+
+## Training hyperparameters
+
+* calculate_probabilities: True
+* language: None
+* low_memory: False
+* min_topic_size: 10
+* n_gram_range: (1, 2)
+* nr_topics: None
+* seed_topic_list: None
+* top_n_words: 10
+* verbose: True
+* zeroshot_min_similarity: 0.7
+* zeroshot_topic_list: None
+
+## Framework versions
+
+* Numpy: 1.26.4
+* HDBSCAN: 0.8.36
+* UMAP: 0.5.6
+* Pandas: 2.2.1
+* Scikit-Learn: 1.4.1.post1
+* Sentence-transformers: 3.0.1
+* Transformers: 4.39.3
+* Numba: 0.60.0
+* Plotly: 5.22.0
+* Python: 3.12.2
config.json
ADDED
@@ -0,0 +1,17 @@
+{
+  "calculate_probabilities": true,
+  "language": null,
+  "low_memory": false,
+  "min_topic_size": 10,
+  "n_gram_range": [
+    1,
+    2
+  ],
+  "nr_topics": null,
+  "seed_topic_list": null,
+  "top_n_words": 10,
+  "verbose": true,
+  "zeroshot_min_similarity": 0.7,
+  "zeroshot_topic_list": null,
+  "embedding_model": "all-MiniLM-L6-v2"
+}
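The JSON above stores the same hyperparameters that the README lists in Python notation (`None`, `True`, `(1, 2)`). As a rough illustration of how the two forms relate, the following sketch parses the config text with the standard library and converts the n-gram range back to a tuple. The function name `read_hyperparameters` is hypothetical, and the snippet makes no claim about how BERTopic itself reads this file.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_TEXT = """
{
  "calculate_probabilities": true,
  "language": null,
  "low_memory": false,
  "min_topic_size": 10,
  "n_gram_range": [1, 2],
  "nr_topics": null,
  "seed_topic_list": null,
  "top_n_words": 10,
  "verbose": true,
  "zeroshot_min_similarity": 0.7,
  "zeroshot_topic_list": null,
  "embedding_model": "all-MiniLM-L6-v2"
}
"""


def read_hyperparameters(config_text):
    """Return (hyperparameters, embedding model name) parsed from config.json text."""
    params = json.loads(config_text)
    # JSON has no tuple type, so the n-gram range is stored as a two-item list.
    params["n_gram_range"] = tuple(params["n_gram_range"])
    # The embedding model is stored as a plain name string; keep it separate.
    embedding_name = params.pop("embedding_model")
    return params, embedding_name


params, embedding_name = read_hyperparameters(CONFIG_TEXT)
print(params["n_gram_range"], embedding_name)  # (1, 2) all-MiniLM-L6-v2
```

Note that `embedding_model` appears only in `config.json`, not in the README's hyperparameter list.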
ctfidf.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f1ad3ce7fe13070f89799cec9cc3b3127d5638ed1017b995f2c3201bc6e93943
+size 5661292
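The three lines above are a Git LFS pointer, not the tensor data itself: they record the spec version, the SHA-256 digest of the real file, and its size in bytes. A minimal sketch of reading such a pointer and checking downloaded bytes against it could look like this; the helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the code assumes the `oid` uses `sha256` as it does here.

```python
import hashlib

# Pointer text for ctfidf.safetensors as added in this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f1ad3ce7fe13070f89799cec9cc3b3127d5638ed1017b995f2c3201bc6e93943
size 5661292
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a small dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Check that raw file bytes have the size and SHA-256 digest the pointer records."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algo"], pointer["size"])  # sha256 5661292
```

Passing the bytes of a downloaded `ctfidf.safetensors` to `matches_pointer` would confirm that the file corresponds to this commit.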
ctfidf_config.json
ADDED
The diff for this file is too large to render.
topic_embeddings.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8fe5935815052732831c8f48547a9a73bdab2727e3ee2e6159734be5d176e196
+size 106072
topics.json
ADDED
The diff for this file is too large to render.