---
license: cc-by-nc-4.0
language:
- en
tags:
- cybersecurity
widget:
- text: >-
    Native API functions such as <mask> may be directly invoked via system calls
    (syscalls). However, these features are also commonly exposed to user-mode
    applications through interfaces and libraries.
  example_title: Native API functions
- text: >-
    One way to explicitly assign the PPID of a new process is through the <mask>
    API call, which includes a parameter for defining the PPID.
  example_title: Assigning the PPID of a new process
- text: >-
    Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
    directories (e.g., %<mask>%) are prioritized over DLLs in less secure
    locations such as a user’s home directory.
  example_title: Enable Safe DLL Search Mode
- text: >-
    GuLoader is a file downloader that has been active since at least December
    2019. It has been used to distribute a variety of <mask>, including NETWIRE,
    Agent Tesla, NanoCore, and FormBook.
  example_title: GuLoader is a file downloader
new_version: cisco-ai/SecureBERT2.0-base
base_model:
- ehsanaghaei/SecureBERT
---

# SecureBERT+

**SecureBERT+** is an enhanced version of [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT), trained on a corpus **five times larger** than its predecessor and leveraging the computational power of **8×A100 GPUs**.

This model delivers an **average 6% improvement** in masked language modeling (MLM) performance compared to SecureBERT, representing a significant advancement in language understanding and representation within the cybersecurity domain.

---

## Dataset

SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.



---

## Using SecureBERT+

SecureBERT+ is available on the [Hugging Face Hub](https://huggingface.co/ehsanaghaei/SecureBERT_Plus).

### Load the Model

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
```
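
If a single fixed-size sentence embedding is needed, `last_hidden_states` can be mean-pooled over the non-padding tokens. The sketch below substitutes a randomly generated tensor and a hand-written attention mask for the real model outputs (the shapes — batch 1, sequence length 6, hidden size 768 — are assumptions), so it runs without downloading the model:

```python
import torch

# Dummy stand-ins for the model outputs above, so this runs offline.
# Hypothetical shapes: batch=1, seq_len=6, hidden=768.
last_hidden_state = torch.randn(1, 6, 768)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # last two positions are padding

# Zero out padding positions, sum over the sequence, and divide by
# the number of real tokens to get a mean-pooled sentence embedding.
mask = attention_mask.unsqueeze(-1).float()
sentence_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```

With the real model, replace the dummy tensors with `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above.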

# Masked Language Modeling Example

Use the code below to predict masked words in text:

```python
# pip install transformers torch tokenizers

import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Return the top-k candidate tokens for each <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of every mask token in the (single) input sequence.
    masked_positions = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    words = []

    with torch.no_grad():
        output = model(token_ids)

    for pos in masked_positions:
        logits = output.logits[0, pos]
        top_tokens = torch.topk(logits, k=topk).indices
        # Decode each candidate token id individually.
        predictions = [tokenizer.decode([token_id]).strip() for token_id in top_tokens]
        words.append(predictions)
        if print_results:
            print(f"Mask predictions: {predictions}")

    return words
```
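
The top-k selection at the heart of `predict_mask` can be sanity-checked without downloading the model. The sketch below re-implements that step in plain Python over a made-up five-token vocabulary with made-up scores (both purely hypothetical):

```python
# Toy illustration of the top-k step inside predict_mask: given one
# score per vocabulary token, keep the k highest-scoring tokens.
# Vocabulary and logits here are invented for the example.
vocab = ["NtCreateFile", "CreateProcess", "malware", "payloads", "DLL"]
logits = [2.1, 0.4, 3.7, 3.0, -1.2]

def topk_tokens(logits, vocab, k=3):
    # Rank vocabulary indices by descending score, keep the first k.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

print(topk_tokens(logits, vocab))  # ['malware', 'payloads', 'NtCreateFile']
```

In the real function, `torch.topk` performs this ranking over the model's full vocabulary and the tokenizer maps the winning ids back to strings.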

# Limitations & Risks

- **Domain-specific scope:** SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
- **Bias in training data:** The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
- **Potential misuse:** While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
- **Resource-intensive:** The larger dataset and training process require significant compute resources, which may limit reproducibility for smaller research teams.
- **Evolving threats:** The cybersecurity landscape evolves rapidly. Without regular retraining, the model may not capture emerging threats or terminology.

Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.

# Reference

```bibtex
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
```