ehsanaghaei
/

SecureBERT

@@ -1,12 +1,95 @@
 # SecureBERT: A Domain-Specific Language Model for Cybersecurity
 SecureBERT is a domain-specific language model based on RoBERTa which is trained on a huge amount of cybersecurity data and fine-tuned/tweaked to understand/represent cybersecurity textual data.
-See details at [GitHub Repo](https://github.com/ehsanaghaei/SecureBERT/blob/main/README.md)
- # Reference
- Use the paper below to cite our work.
- https://arxiv.org/pdf/2204.02685
   ** The paper has been accepted and presented in "EAI SecureComm 2022 - 18th EAI International Conference on Security and Privacy in Communication Networks".**
-  ***See the presentation on [YouTube](https://www.youtube.com/watch?v=G8WzvThGG8c&t=8s)***

 # SecureBERT: A Domain-Specific Language Model for Cybersecurity
 SecureBERT is a domain-specific language model based on RoBERTa which is trained on a huge amount of cybersecurity data and fine-tuned/tweaked to understand/represent cybersecurity textual data.
+[SecureBERT](https://arxiv.org/pdf/2204.02685) is a domain-specific language model to represent cybersecurity textual data which is trained on a large amount of in-domain text crawled from online resources. ***See the presentation on [YouTube](https://www.youtube.com/watch?v=G8WzvThGG8c&t=8s)***
+See details at [GitHub Repo](https://github.com/ehsanaghaei/SecureBERT/blob/main/README.md)
   ** The paper has been accepted and presented in "EAI SecureComm 2022 - 18th EAI International Conference on Security and Privacy in Communication Networks".**
+![image](https://user-images.githubusercontent.com/46252665/195998237-9bbed621-8002-4287-ac0d-19c4f603d919.png)
+## SecureBERT can be used as the base model for any downstream task including text classification, NER, Seq-to-Seq, QA, etc.
+* SecureBERT has demonstrated significantly higher performance in predicting masked words within the text when compared to existing models like RoBERTa (base and large), SciBERT, and SecBERT.
+* SecureBERT has also demonstrated promising performance in preserving general English language understanding (representation).
+# How to use SecureBERT
+SecureBERT has been uploaded to [Huggingface](https://huggingface.co/ehsanaghaei/SecureBERT) framework. You may use the code below
+```python
+from transformers import RobertaTokenizer, RobertaModel
+import torch
+tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT")
+model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT")
+inputs = tokenizer("This is SecureBERT!", return_tensors="pt")
+outputs = model(**inputs)
+last_hidden_states = outputs.last_hidden_state
+Or just clone the repo:
+```bash
+git lfs install
+git clone https://huggingface.co/ehsanaghaei/SecureBERT
+# if you want to clone without large files – just their pointers
+# prepend your git clone with the following env var:
+GIT_LFS_SKIP_SMUDGE=1
+```
+## Fill Mask
+SecureBERT has been trained on MLM. Use the code below to predict the masked word within the given sentences:
+```python
+#!pip install transformers
+#!pip install torch
+#!pip install tokenizers
+import torch
+import transformers
+from transformers import RobertaTokenizer, RobertaTokenizerFast
+tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT")
+model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT")
+def predict_mask(sent, tokenizer, model, topk =10, print_results = True):
+    token_ids = tokenizer.encode(sent, return_tensors='pt')
+    masked_position = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero()
+    masked_pos = [mask.item() for mask in masked_position]
+    words = []
+    with torch.no_grad():
+        output = model(token_ids)
+    last_hidden_state = output[0].squeeze()
+    list_of_list = []
+    for index, mask_index in enumerate(masked_pos):
+        mask_hidden_state = last_hidden_state[mask_index]
+        idx = torch.topk(mask_hidden_state, k=topk, dim=0)[1]
+        words = [tokenizer.decode(i.item()).strip() for i in idx]
+        words = [w.replace(' ','') for w in words]
+        list_of_list.append(words)
+        if print_results:
+            print("Mask ", "Predictions : ", words)
+    best_guess = ""
+    for j in list_of_list:
+        best_guess = best_guess + "," + j[0]
+    return words
+while True:
+    sent = input("Text here: \t")
+    print("SecureBERT: ")
+    predict_mask(sent, tokenizer, model)
+    print("===========================\n")
+```