---
license: cc-by-nc-4.0
language:
- en
tags:
- cybersecurity
widget:
- text: >-
    Native API functions such as <mask> may be directly invoked via system calls
    (syscalls). However, these features are also commonly exposed to user-mode
    applications through interfaces and libraries.
  example_title: Native API functions
- text: >-
    One way to explicitly assign the PPID of a new process is through the <mask>
    API call, which includes a parameter for defining the PPID.
  example_title: Assigning the PPID of a new process
- text: >-
    Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
    directories (e.g., %<mask>%) are prioritized over DLLs in less secure
    locations such as a user’s home directory.
  example_title: Enable Safe DLL Search Mode
- text: >-
    GuLoader is a file downloader that has been active since at least December
    2019. It has been used to distribute a variety of <mask>, including NETWIRE,
    Agent Tesla, NanoCore, and FormBook.
  example_title: GuLoader is a file downloader
new_version: cisco-ai/SecureBERT2.0-base
base_model:
- ehsanaghaei/SecureBERT
---

# SecureBERT+  

**SecureBERT+** is an enhanced version of [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT), trained on a corpus **five times larger** than its predecessor's using **8×A100 GPUs**.  

This model delivers an **average 6% improvement** in Masked Language Modeling (MLM) performance compared to SecureBERT, representing a significant advancement in language understanding and representation within the cybersecurity domain.  

---

## Dataset  
SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.  

![dataset](https://cdn-uploads.huggingface.co/production/uploads/6340b0bd77fd972573eb2f9b/pO-v6961YI1D0IPcm0027.png)  

---

## Using SecureBERT+  

SecureBERT+ is available on the [Hugging Face Hub](https://huggingface.co/ehsanaghaei/SecureBERT_Plus).  

### Load the Model
```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
with torch.no_grad():  # gradients are not needed for feature extraction
    outputs = model(**inputs)

# Contextual token embeddings, shape: (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
```
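
`last_hidden_state` holds one vector per token, so tasks such as similarity search usually need it reduced to a single vector per sentence. This model card does not prescribe a pooling strategy; masked mean pooling is one common choice. The sketch below is written in plain Python over nested lists so it runs without downloading the model. `mean_pool` is a hypothetical helper, and the toy vectors are made up for illustration.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is non-zero."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    # Keep only real tokens; padding positions have mask value 0.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input (assumption): 3 tokens of dimension 2, the last one being padding.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_vectors, toy_mask)
```

With real model output, the inputs would be something like `outputs.last_hidden_state[0].tolist()` and `inputs["attention_mask"][0].tolist()`; for batches, the same averaging is normally done with tensor operations instead.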

### Masked Language Modeling Example

Use the code below to predict masked words in text:
```python
# pip install transformers torch tokenizers

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Return the top-k predicted tokens for each <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors='pt')
    # Flatten so each mask position is a plain int, not a one-element list.
    masked_pos = (token_ids.squeeze(0) == tokenizer.mask_token_id).nonzero().flatten().tolist()
    words = []

    with torch.no_grad():
        output = model(token_ids)

    for pos in masked_pos:
        logits = output.logits[0, pos]
        top_tokens = torch.topk(logits, k=topk).indices.tolist()
        # Decode each candidate token id on its own.
        predictions = [tokenizer.decode([i]).strip().replace(" ", "") for i in top_tokens]
        words.append(predictions)
        if print_results:
            print(f"Mask Predictions: {predictions}")

    return words
```
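
`predict_mask` returns a nested list with one inner list of candidates per `<mask>`, in left-to-right order. A minimal sketch of how those candidates could be substituted back into the sentence follows. `fill_masks` is a hypothetical helper, not part of the model's API, and the candidate list is made up for illustration rather than taken from real model output.

```python
from typing import Sequence

MASK_TOKEN = "<mask>"  # RoBERTa-style mask token, as used in the examples above


def fill_masks(sentence: str, predictions: Sequence[Sequence[str]], rank: int = 0) -> str:
    """Replace each mask in `sentence`, left to right, with the candidate at `rank`."""
    parts = sentence.split(MASK_TOKEN)
    if len(parts) - 1 != len(predictions):
        raise ValueError("expected one prediction list per mask token")
    filled = [parts[0]]
    for candidates, tail in zip(predictions, parts[1:]):
        filled.append(candidates[rank])
        filled.append(tail)
    return "".join(filled)


# Made-up candidates for illustration only, not real model output.
example_predictions = [["CreateProcess", "OpenProcess"]]
example_sentence = "One way to assign the PPID of a new process is through the <mask> API call."
filled_sentence = fill_masks(example_sentence, example_predictions)
```

With the model loaded, the same helper could be called as `fill_masks(sent, predict_mask(sent, tokenizer, model, print_results=False))`; passing `rank=1` picks the second-ranked candidate for every mask.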

## Limitations & Risks

**Domain-Specific Scope:** SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.

**Bias in Training Data:** The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.

**Potential Misuse:** While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.

**Resource-Intensive:** The larger dataset and model training process require significant compute resources, which may limit reproducibility for smaller research teams.

**Evolving Threats:** The cybersecurity landscape evolves rapidly. Without regular retraining, the model may not capture emerging threats or terminology.

Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.

## Reference
```
@inproceedings{aghaei2023securebert, 
title={SecureBERT: A Domain-Specific Language Model for Cybersecurity}, 
author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab}, 
booktitle={Security and Privacy in Communication Networks: 
18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings}, 
pages={39--56}, 
year={2023}, 
organization={Springer} 
}
```