Is it really "dark" BERT?

#1
by ga - opened

"a team of researchers from KAIST in South Korea and S2W Inc. developed DarkBERT, a new language model pre-trained on a Dark Web corpus."

This model is not that model, apparently. Can someone explain this model's name, then?

--- Query and Response ---

Query: What is the official website of the conference where the DarkBERT AI model can be downloaded? Is the conference website the only place from which the model can be downloaded?

Response: The official website of the conference where the DarkBERT AI model can be downloaded is not specified in the text. It is not known whether the conference website is the only place from which the model can be downloaded.

Source(s) used for the answer:
Source 1:
[...] datasets are. More information on the sharing of the datasets and the model itself will be released during the conference, only to be used for academic research purposes. We adhere to this guideline and only utilize the provided data in the context of research for this work. On the other hand, we do not plan to publicly release the Dark Web text corpus used for pretraining DarkBERT, for similar reasons.

Limitations

Limited Usage for Non-English Tasks: As mentioned in Section 3, DarkBERT is pretrained using Dark Web texts in English. This is mainly our design choice, as the vast majority (around 90%) of Dark Web texts is primarily in English (Jin et al., 2022). We believe that, with the limited number of collected pages in non-English languages in our pretraining corpus, building a multilingual language model for the Dark Web domain would pose additional challenges, such as downstream task evaluations becoming more difficult to perform, as they would require high-quality annotations of task-specific datasets in multiple languages. As such, while our language model is suitable for Dark Web tasks in English, further pretraining with language-specific data may be necessary to use DarkBERT for non-English tasks.

Dependence on Task-Specific Data: Although DarkBERT is a useful tool that can be directly applied to many existing Dark Web domain-specific tasks, some tasks may require further fine-tuning through task-specific data, as seen in the Ransomware Leak Site Detection and Noteworthy Thread Detection use case scenarios in Section 5. However, there is a shortage of publicly available Dark Web task-specific data. While we provide the datasets used to fine-tune DarkBERT in this paper, additional research on tasks that do not have readily available datasets may require further manual annotation or handcrafting of the necessary data to leverage DarkBERT to its maximum potential.

Acknowledgements

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00740, The Development of Darkweb Hidden Service Identification and Real IP Trace Technology).
Source 2:
DarkBERT: A Language Model for the Dark Side of the Internet

Youngjin Jin¹, Eugene Jang², Jian Cui², Jin Woo Chung², Yongjae Lee², Seungwon Shin¹
¹KAIST, Daejeon, South Korea; ²S2W Inc., Seongnam, South Korea
¹{ijinjin, claude}@kaist.ac.kr ²{genesith, geeoon19, jwchung, lee}@s2w.inc

Abstract: Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web. As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model pretrained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT to combat the extreme lexical and structural diversity of the Dark Web that may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart along with other widely used language models to validate the benefits that a Dark Web domain-specific model offers in various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web.

1 Introduction

The Dark Web is a subset of the Internet that is not indexed by web search engines such as Google and is inaccessible through a standard web browser. To access the Dark Web, specialized overlay network applications such as Tor (The Onion Router; Dingledine et al., 2004) are required. Tor also hosts hidden services (onion services), web services in which the client and the server IP addresses are hidden from each other (Biryukov et al., 2013). This sense of identity obscurity provided to Dark Web users comes with a catch: many of the underground activities prevalent in the Dark Web are immoral or illegal in nature, ranging from content hosting (such as data leaks) to drug sales (Al-Nabki et al., 2017; Jin et al., 2022). As such, the popularity of the Dark Web as a platform of choice for malicious activities has garnered interest from researchers and security experts alike.

To handle the ever-changing landscape of modern cyber threats, cybersecurity experts and researchers have started to employ natural language processing (NLP) methods. Gaining evidence-based knowledge, such as indicators of compromise (IOC), to mitigate emerging threats is an integral part of modern cybersecurity known as cyber
Source 3:
Table 4: Ransomware leak site detection performance. Boldface indicates the best performance.

| Input        | Model           | Precision | Recall | F1 score |
|--------------|-----------------|-----------|--------|----------|
| Raw          | BERT (cased)    | 75.83     | 69.52  | 71.01    |
| Raw          | BERT (uncased)  | 77.18     | 73.90  | 72.77    |
| Raw          | RoBERTa         | 39.83     | 36.00  | 36.27    |
| Raw          | DarkBERT        | 78.81     | 83.62  | 79.98    |
| Preprocessed | BERT (cased)    | 76.81     | 68.19  | 70.13    |
| Preprocessed | BERT (uncased)  | 71.97     | 71.62  | 70.77    |
| Preprocessed | RoBERTa         | 48.36     | 45.14  | 44.31    |
| Preprocessed | DarkBERT        | 85.16     | 84.57  | 84.11    |

[...] to April 2022 and periodically download HTML files from these sites, especially when new victims are revealed.⁴ Leak sites typically contain the victim organization name, descriptions of leaked data, and threat statements with sample data; refer to Figure 3a for an example leak site page.

We collect pages by randomly choosing a maximum of three pages with different page titles from each of the 54 leak sites and label them as positive examples. To create negative data, rather than collecting random pages in the Dark Web, we consider pages with content similar to that of leak sites to make the task more challenging. To select such pages, we utilize the activity category classifier from Section B used for balancing the pretraining corpus. The intuition behind using the activity classifier to select negative data is that the text content of certain categories like Hacking is more similar to that of leak sites than other, less relevant categories such as Pornography and Gambling. Our pilot study suggests leak sites are mostly classified by the activity classifier as Hacking, followed by Cryptocurrency, Financial, and Others. Thus, we only collect Dark Web pages that are classified into one of these four categories and treat them as negative examples. Our training text data consists of 105 positive and 679 negative example pages. Training is done using 5-fold cross-validation.

Results and Discussion: As shown in Table 4, DarkBERT outperforms other language models, demonstrating the advantages of DarkBERT in understanding the language of underground hacking forums on the Dark Web. Figure 3a shows a leak site sample correctly classified by DarkBERT but [...]

⁴ URLs of such leak sites can be found in cybersecurity news, social media, open-source repositories, and so on. We used URLs taken from https://github.com/fastfire/deepdarkCTI/blob/main/ransomware_gang.md

Table 5: Noteworthy thread detection performance. Boldface indicates best performance.

| Input | Model          | Precision | Recall | F1 score |
|-------|----------------|-----------|--------|----------|
| Raw   | BERT (cased)   | 55.09     | 19.91  | 26.90    |
| Raw   | BERT (uncased) | 52.34     | 23[...]          |
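For anyone curious what the evaluation setup in Source 3 amounts to in practice: it is a binary page classifier (105 leak-site pages vs. 679 near-miss negatives) scored with 5-fold cross-validation. Below is a minimal, hedged sketch of that protocol using a TF-IDF + logistic-regression baseline in place of fine-tuning DarkBERT (whose weights are not publicly released); the toy corpus and all names here are illustrative, not the authors' data or code.

```python
# Sketch of the 5-fold cross-validation protocol described in the paper
# excerpt, with a TF-IDF baseline standing in for DarkBERT fine-tuning.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; the real data would be HTML text scraped from
# ransomware leak sites (positive) and similar Dark Web pages (negative).
texts = ["leaked data from victim corp", "buy crypto here"] * 50
labels = [1, 0] * 50  # 1 = leak site page, 0 = negative example

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_scores = []
for train_idx, test_idx in skf.split(texts, labels):
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    preds = model.predict([texts[i] for i in test_idx])
    _, _, f1, _ = precision_recall_fscore_support(
        [labels[i] for i in test_idx], preds, average="binary",
        zero_division=0)
    f1_scores.append(f1)

mean_f1 = sum(f1_scores) / len(f1_scores)
print(f"mean F1 over 5 folds: {mean_f1:.2f}")
```

Swapping the pipeline for a fine-tuned transformer encoder is where the numbers in Table 4 come from; the cross-validation harness itself stays the same.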
