ehsanaghaei/SecureBERT

@@ -2,50 +2,61 @@
 license: bigscience-openrail-m
 widget:
 - text: >-
-    Native API functions such as <mask>, may be directed invoked via system
-    calls/syscalls, but these features are also often exposed to user-mode
-    applications via interfaces and libraries..
   example_title: Native API functions
 - text: >-
-    One way of explicitly assigning the PPID of a new process is via the <mask>
-    API call, which supports a parameter that defines the PPID to use.
   example_title: Assigning the PPID of a new process
 - text: >-
-    Enable Safe DLL Search Mode to force search for system DLLs in directories
-    with greater restrictions (e.g. %<mask>%) to be used before local directory
-    DLLs (e.g. a user's home directory)
   example_title: Enable Safe DLL Search Mode
 - text: >-
-    GuLoader is a file downloader that has been used since at least December
-    2019 to distribute a variety of <mask>, including NETWIRE, Agent Tesla,
-    NanoCore, and FormBook.
   example_title: GuLoader is a file downloader
 language:
 - en
 tags:
 - cybersecurity
-- cyber threat intellignece
 ---
-# SecureBERT: A Domain-Specific Language Model for Cybersecurity
-SecureBERT is a domain-specific language model based on RoBERTa which is trained on a huge amount of cybersecurity data and fine-tuned/tweaked to understand/represent cybersecurity textual data.
-[SecureBERT](https://link.springer.com/chapter/10.1007/978-3-031-25538-0_3) is a domain-specific language model to represent cybersecurity textual data which is trained on a large amount of in-domain text crawled from online resources. ***See the presentation on [YouTube](https://www.youtube.com/watch?v=G8WzvThGG8c&t=8s)***
-See details at [GitHub Repo](https://github.com/ehsanaghaei/SecureBERT/blob/main/README.md)
 ![image](https://user-images.githubusercontent.com/46252665/195998237-9bbed621-8002-4287-ac0d-19c4f603d919.png)
-## SecureBERT can be used as the base model for any downstream task including text classification, NER, Seq-to-Seq, QA, etc.
-* SecureBERT has demonstrated significantly higher performance in predicting masked words within the text when compared to existing models like RoBERTa (base and large), SciBERT, and SecBERT.
-* SecureBERT has also demonstrated promising performance in preserving general English language understanding (representation).
-# How to use SecureBERT
-SecureBERT has been uploaded to [Huggingface](https://huggingface.co/ehsanaghaei/SecureBERT) framework. You may use the code below
 ```python
 from transformers import RobertaTokenizer, RobertaModel
 import torch
@@ -58,63 +69,45 @@ outputs = model(**inputs)
 last_hidden_states = outputs.last_hidden_state
-## Fill Mask
-SecureBERT has been trained on MLM. Use the code below to predict the masked word within the given sentences:
-```python
-#!pip install transformers
-#!pip install torch
-#!pip install tokenizers
 import torch
 import transformers
-from transformers import RobertaTokenizer, RobertaTokenizerFast
 tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT")
 model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT")
-def predict_mask(sent, tokenizer, model, topk =10, print_results = True):
     token_ids = tokenizer.encode(sent, return_tensors='pt')
-    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
-    masked_pos = [mask.item() for mask in masked_position]
     words = []
     with torch.no_grad():
         output = model(token_ids)
-    last_hidden_state = output[0].squeeze()
-    list_of_list = []
-    for index, mask_index in enumerate(masked_pos):
-        mask_hidden_state = last_hidden_state[mask_index]
-        idx = torch.topk(mask_hidden_state, k=topk, dim=0)[1]
-        words = [tokenizer.decode(i.item()).strip() for i in idx]
-        words = [w.replace(' ','') for w in words]
-        list_of_list.append(words)
         if print_results:
-            print("Mask ", "Predictions : ", words)
-    best_guess = ""
-    for j in list_of_list:
-        best_guess = best_guess + "," + j[0]
     return words
-while True:
-    sent = input("Text here: \t")
-    print("SecureBERT: ")
-    predict_mask(sent, tokenizer, model)
-    print("===========================\n")
-```
 # Reference
 @inproceedings{aghaei2023securebert,
 title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
 author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
 booktitle={Security and Privacy in Communication Networks:
-18th EAI International Conference, SecureComm 2022, Virtual Event,
-October 2022,
-Proceedings},
 pages={39--56},
 year={2023},
-organization={Springer} }

 license: bigscience-openrail-m
 widget:
 - text: >-
+    Native API functions such as <mask> may be directly invoked via system
+    calls (syscalls). However, these features are also commonly exposed to
+    user-mode applications through interfaces and libraries.
   example_title: Native API functions
 - text: >-
+    One way to explicitly assign the PPID of a new process is through the
+    <mask> API call, which includes a parameter for defining the PPID.
   example_title: Assigning the PPID of a new process
 - text: >-
+    Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
+    directories (e.g., %<mask>%) are prioritized over DLLs in less secure
+    locations such as a user’s home directory.
   example_title: Enable Safe DLL Search Mode
 - text: >-
+    GuLoader is a file downloader that has been active since at least December
+    2019. It has been used to distribute a variety of <mask>, including
+    NETWIRE, Agent Tesla, NanoCore, and FormBook.
   example_title: GuLoader is a file downloader
 language:
 - en
 tags:
 - cybersecurity
+- cyber threat intelligence
 ---
+# SecureBERT: A Domain-Specific Language Model for Cybersecurity
+**SecureBERT** is a RoBERTa-based, domain-specific language model trained on a large cybersecurity-focused corpus. It is designed to represent and understand cybersecurity text more effectively than general-purpose models.
+[SecureBERT](https://link.springer.com/chapter/10.1007/978-3-031-25538-0_3) was trained on extensive in-domain data crawled from diverse online resources. It has demonstrated strong performance in a range of cybersecurity NLP tasks.
+👉 See the [presentation on YouTube](https://www.youtube.com/watch?v=G8WzvThGG8c&t=8s).
+👉 Explore details on the [GitHub repository](https://github.com/ehsanaghaei/SecureBERT/blob/main/README.md).
 ![image](https://user-images.githubusercontent.com/46252665/195998237-9bbed621-8002-4287-ac0d-19c4f603d919.png)
+---
+## Applications
+SecureBERT can be used as a base model for downstream NLP tasks in cybersecurity, including:
+- Text classification
+- Named Entity Recognition (NER)
+- Sequence-to-sequence tasks
+- Question answering
+### Key Results
+- Outperforms baseline models such as **RoBERTa (base/large)**, **SciBERT**, and **SecBERT** in masked language modeling tasks within the cybersecurity domain.
+- Maintains strong performance in **general English language understanding**, ensuring broad usability beyond domain-specific tasks.
+---
+## Using SecureBERT
+The model is available on [Hugging Face](https://huggingface.co/ehsanaghaei/SecureBERT).
+### Load the Model
 ```python
 from transformers import RobertaTokenizer, RobertaModel
 import torch
 last_hidden_states = outputs.last_hidden_state
+Masked Language Modeling Example
+SecureBERT is trained with Masked Language Modeling (MLM). Use the following example to predict masked tokens:
+#!pip install transformers torch tokenizers
 import torch
 import transformers
+from transformers import RobertaTokenizerFast
 tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT")
 model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT")
+def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
     token_ids = tokenizer.encode(sent, return_tensors='pt')
+    # Flatten so each mask position is a plain int, not a one-element list
+    masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero().flatten().tolist()
     words = []
     with torch.no_grad():
         output = model(token_ids)
+    for pos in masked_pos:
+        logits = output.logits[0, pos]
+        top_tokens = torch.topk(logits, k=topk).indices
+        predictions = [tokenizer.decode(i).strip().replace(" ", "") for i in top_tokens]
+        words.append(predictions)
         if print_results:
+            print(f"Mask Predictions: {predictions}")
     return words
 # Reference
 @inproceedings{aghaei2023securebert,
 title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
 author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
 booktitle={Security and Privacy in Communication Networks:
+18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
 pages={39--56},
 year={2023},
+organization={Springer}
+}
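
The per-mask loop in `predict_mask` can be illustrated without downloading the model. The following is a toy sketch in plain Python, not the model card's code: the five-word vocabulary, the token ids, the logits, and the helper names `find_mask_positions`, `top_k_tokens`, and `predict_mask_toy` are all hypothetical stand-ins for the `torch` tensors that `RobertaForMaskedLM` would produce.

```python
def find_mask_positions(token_ids, mask_token_id):
    """Return the index of every mask token in a flat list of token ids."""
    return [i for i, tok in enumerate(token_ids) if tok == mask_token_id]


def top_k_tokens(logits, vocab, k):
    """Return the k vocabulary entries with the highest logits, best first."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


def predict_mask_toy(token_ids, logits_per_position, vocab, mask_token_id, topk=2):
    """Mirror the loop in predict_mask: one ranked list per mask position."""
    return [
        top_k_tokens(logits_per_position[pos], vocab, topk)
        for pos in find_mask_positions(token_ids, mask_token_id)
    ]


# Hypothetical vocabulary; id 4 plays the role of <mask>.
VOCAB = ["the", "malware", "ransomware", "via", "<mask>"]
MASK_ID = 4

# A made-up input with two masked positions (indices 1 and 3).
token_ids = [0, 4, 3, 4]

# Made-up logits: one row per input position, one column per vocabulary entry.
logits = [
    [0.1, 0.2, 0.3, 0.0, 0.0],
    [0.0, 2.5, 1.7, -1.0, 0.0],
    [0.4, 0.1, 0.2, 0.3, 0.0],
    [1.2, 0.3, 0.1, 0.9, -2.0],
]

predictions = predict_mask_toy(token_ids, logits, VOCAB, MASK_ID, topk=2)
print(predictions)
```

Each mask position yields its own ranked list, which is why the position indices need to be plain integers: indexing with a nested list would keep an extra dimension and merge all candidates into a single decoded string.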