---
license: cc-by-nc-4.0
language:
- en
tags:
- cybersecurity
widget:
- text: >-
    Native API functions such as <mask> may be directly invoked via system calls
    (syscalls). However, these features are also commonly exposed to user-mode
    applications through interfaces and libraries.
  example_title: Native API functions
- text: >-
    One way to explicitly assign the PPID of a new process is through the <mask>
    API call, which includes a parameter for defining the PPID.
  example_title: Assigning the PPID of a new process
- text: >-
    Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
    directories (e.g., %<mask>%) are prioritized over DLLs in less secure
    locations such as a user’s home directory.
  example_title: Enable Safe DLL Search Mode
- text: >-
    GuLoader is a file downloader that has been active since at least December
    2019. It has been used to distribute a variety of <mask>, including NETWIRE,
    Agent Tesla, NanoCore, and FormBook.
  example_title: GuLoader is a file downloader
new_version: cisco-ai/SecureBERT2.0-base
base_model:
- ehsanaghaei/SecureBERT
---

# SecureBERT+

**SecureBERT+** is an enhanced version of [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT), trained on a corpus **five times larger** than its predecessor and leveraging the computational power of **8×A100 GPUs**.

This model delivers an **average 6% improvement** in masked language modeling (MLM) performance compared to SecureBERT, representing a significant advancement in language understanding and representation within the cybersecurity domain.

---

## Dataset

SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.



---

## Using SecureBERT+

SecureBERT+ is available on the [Hugging Face Hub](https://huggingface.co/ehsanaghaei/SecureBERT_Plus).

### Load the Model

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
```
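
If a single fixed-size sentence embedding is needed, `last_hidden_states` can be mean-pooled over the non-padding tokens. The sketch below substitutes a randomly generated tensor and a hand-written attention mask for the real model outputs (the shapes — batch 1, sequence length 6, hidden size 768 — are assumptions), so it runs without downloading the model:

```python
import torch

# Dummy stand-ins for the model outputs above, so this runs offline.
# Hypothetical shapes: batch=1, seq_len=6, hidden=768.
last_hidden_state = torch.randn(1, 6, 768)
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # last two positions are padding

# Zero out padding positions, sum over the sequence, and divide by
# the number of real tokens to get a mean-pooled sentence embedding.
mask = attention_mask.unsqueeze(-1).float()
sentence_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```

With the real model, replace the dummy tensors with `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above.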

# Masked Language Modeling Example

Use the code below to predict masked words in text:

```python
# pip install transformers torch tokenizers

import torch
from transformers import RobertaTokenizerFast, RobertaForMaskedLM

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Return the top-k candidate tokens for each <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of every mask token in the (single) input sequence.
    masked_positions = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    words = []

    with torch.no_grad():
        output = model(token_ids)

    for pos in masked_positions:
        logits = output.logits[0, pos]
        top_tokens = torch.topk(logits, k=topk).indices
        # Decode each candidate token id individually.
        predictions = [tokenizer.decode([token_id]).strip() for token_id in top_tokens]
        words.append(predictions)
        if print_results:
            print(f"Mask predictions: {predictions}")

    return words
```
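
The top-k selection at the heart of `predict_mask` can be sanity-checked without downloading the model. The sketch below re-implements that step in plain Python over a made-up five-token vocabulary with made-up scores (both purely hypothetical):

```python
# Toy illustration of the top-k step inside predict_mask: given one
# score per vocabulary token, keep the k highest-scoring tokens.
# Vocabulary and logits here are invented for the example.
vocab = ["NtCreateFile", "CreateProcess", "malware", "payloads", "DLL"]
logits = [2.1, 0.4, 3.7, 3.0, -1.2]

def topk_tokens(logits, vocab, k=3):
    # Rank vocabulary indices by descending score, keep the first k.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

print(topk_tokens(logits, vocab))  # ['malware', 'payloads', 'NtCreateFile']
```

In the real function, `torch.topk` performs this ranking over the model's full vocabulary and the tokenizer maps the winning ids back to strings.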

# Limitations & Risks

- **Domain-specific scope:** SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
- **Bias in training data:** The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
- **Potential misuse:** While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
- **Resource-intensive:** The larger dataset and training process require significant compute resources, which may limit reproducibility for smaller research teams.
- **Evolving threats:** The cybersecurity landscape evolves rapidly. Without regular retraining, the model may not capture emerging threats or terminology.

Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.

# Reference

```bibtex
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
```