---
license: apache-2.0
language:
- en
tags:
- ai-security
- llm-security
- prompt-injection
- jailbreak-detection
- anomaly-detection
- threat-detection
- cybersecurity
- nlp
- pytorch
- sklearn
library_name: pytorch
pipeline_tag: text-classification
---
# AISecOps – Trained Security Models
> Fine-tuned models powering the [AISecOps](https://github.com/Tarunvoff/LLM-FIREWALL) AI Security Operations Platform.
These models form the multi-layer threat detection pipeline that protects LLM systems from prompt injection, jailbreaks, and adversarial attacks.
---
## Model Overview
| File | Type | Purpose | Size |
|---|---|---|---|
| `trajectory_model_best.pt` | PyTorch Transformer | Session-level escalation detector (best checkpoint) | 150 MB |
| `trajectory_model_final.pt` | PyTorch Transformer | Session-level escalation detector (final epoch) | 50 MB |
| `isolation_forest.pkl` | scikit-learn | One-class anomaly detector for prompt embeddings | 5.5 MB |
| `fusion_model.pt` | PyTorch MLP | Score fusion combiner (final stage classifier) | 21 KB |
| `fusion_threshold.json` | Config | Optimal decision threshold (Youden J calibration) | – |
| `trajectory_model_best_config.json` | Config | Trajectory model architecture spec | – |
| `training_feature_stats.json` | Config | Feature normalisation statistics | – |
---
## Pipeline Position
These models run inside the AISecOps 6-layer security pipeline:
```
User Prompt
↓
FastPreFilter (regex, <5 ms)
↓
Threat Detection ← isolation_forest.pkl, trajectory_model_best.pt run here
     ↓
Fusion Engine ← fusion_model.pt runs here
↓
Policy Decision
↓
LLM / Target Endpoint
↓
Output Security
↓
Safe Response
```
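The flow above can be sketched as a sequential pipeline. This is a toy illustration with stub scorers standing in for the real models; the function names and the short-circuit rule are assumptions, not the AISecOps API:

```python
from typing import Callable

def run_pipeline(prompt: str,
                 pre_filter: Callable[[str], float],
                 anomaly: Callable[[str], float],
                 trajectory: Callable[[str], float],
                 fuse: Callable[[list], float],
                 threshold: float = 0.46) -> str:
    """Toy sketch of the layered flow: fast regex gate, per-model
    scoring, fusion, then a policy decision."""
    # Layer 1: fast regex pre-filter may short-circuit obvious attacks
    pattern_score = pre_filter(prompt)
    if pattern_score >= 0.99:
        return "BLOCK"
    # Layer 2: threat-detection scores (isolation forest, trajectory model)
    scores = [anomaly(prompt), pattern_score, trajectory(prompt)]
    # Layer 3: fusion engine combines scores into one threat score
    fused = fuse(scores)
    # Layer 4: policy decision against the calibrated threshold
    return "BLOCK" if fused >= threshold else "ALLOW"

# Stub scorers for illustration only
decision = run_pipeline(
    "What is the capital of France?",
    pre_filter=lambda p: 0.0,
    anomaly=lambda p: 0.1,
    trajectory=lambda p: 0.2,
    fuse=lambda s: sum(s) / len(s),
)
print(decision)  # ALLOW
```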
---
## Model Details
### 1. Trajectory Model (`trajectory_model_best.pt`)
A Transformer encoder that tracks **session-level escalation patterns**, detecting when a conversation gradually steers toward adversarial behaviour across multiple turns.
**Architecture:**
| Parameter | Value |
|---|---|
| Input dimension | 1024 (E5-large-v2 embeddings) |
| Hidden dimension | 512 |
| Transformer layers | 4 |
| Attention heads | 8 |
| Dropout | 0.3 |
| Max sequence length | 6 turns |
**Training inputs:** Sequences of E5-large-v2 embeddings (1024-d) from conversation sessions.
**Output:** Scalar escalation score in [0, 1].
**Training data:** Adversarial prompt datasets including JailbreakBench, prompt injection corpora, and synthetic escalation sequences. Safe prompts drawn from ShareGPT and standard assistant conversation datasets.
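A minimal PyTorch sketch of an encoder matching the hyperparameters in the table above. The exact layer layout lives in `trajectory_model_best_config.json`; the projection, pooling, and sigmoid head here are assumptions:

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Illustrative escalation detector built from the published
    hyperparameters only; not the checkpoint's exact architecture."""
    def __init__(self, input_dim=1024, hidden_dim=512,
                 num_layers=4, num_heads=8, dropout=0.3):
        super().__init__()
        self.proj = nn.Linear(input_dim, hidden_dim)  # 1024-d E5 -> 512-d
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(hidden_dim, 1)  # scalar escalation score

    def forward(self, x):  # x: (batch, turns <= 6, 1024)
        h = self.encoder(self.proj(x))
        # Mean-pool over turns, squash to [0, 1]
        return torch.sigmoid(self.head(h.mean(dim=1))).squeeze(-1)

model = TrajectoryEncoder().eval()
session = torch.randn(1, 6, 1024)  # 6 turns of E5-large-v2 embeddings
with torch.no_grad():
    score = model(session).item()
print(f"escalation score: {score:.3f}")  # always in [0, 1]
```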
---
### 2. Isolation Forest (`isolation_forest.pkl`)
A one-class anomaly detector trained **exclusively on benign prompt embeddings**.
- Algorithm: scikit-learn `IsolationForest`
- Training data: Safe prompt embeddings (E5-large-v2, 1024-d)
- Score normalisation: Percentile-based min-max to [0, 1]
- Decision threshold: 0.5 (default)
- Logic: Any prompt that deviates from the learned safe distribution is flagged
**Score interpretation:**
| Score | Meaning |
|---|---|
| 0.0 | Deep inside the safe distribution (very normal) |
| 0.5 | Decision boundary |
| 1.0 | Highly anomalous / likely adversarial |
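The one-class setup can be reproduced with scikit-learn. Synthetic vectors stand in for real E5-large-v2 embeddings, and the 1st/99th percentile bounds used for the min-max normalisation are an assumption (the card only says "percentile-based"):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
safe_embeddings = rng.normal(0, 1, size=(500, 1024))  # stand-in for E5 vectors

# Train exclusively on benign embeddings
iso = IsolationForest(random_state=0).fit(safe_embeddings)

# score_samples: higher = more normal, so negate to get "anomalousness"
raw = -iso.score_samples(safe_embeddings)
lo, hi = np.percentile(raw, [1, 99])  # percentile bounds (assumed values)

def anomaly_score(embedding: np.ndarray) -> float:
    """Map a 1024-d embedding to a [0, 1] anomaly score."""
    s = -iso.score_samples(embedding.reshape(1, -1))[0]
    return float(np.clip((s - lo) / (hi - lo), 0.0, 1.0))

probe = rng.normal(5, 1, size=1024)  # far from the safe distribution
print(anomaly_score(safe_embeddings[0]))  # low-ish: inside safe distribution
print(anomaly_score(probe))              # high: flagged as anomalous
```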
---
### 3. Fusion MLP (`fusion_model.pt`)
A small multi-layer perceptron that combines **all upstream model scores** into a single threat score.
**Input features (6-dimensional):**
| Feature | Source | Mean | Std |
|---|---|---|---|
| `anomaly_score` | IsolationForest | 0.538 | 0.227 |
| `if_score` | IsolationForest (raw) | 0.478 | 0.215 |
| `pattern_score` | Regex pre-filter | 0.311 | 0.341 |
| `max_similarity_score` | FAISS vector search | 0.515 | 0.234 |
| `trajectory_score` | Trajectory model | 0.497 | 0.260 |
| `intent_entropy` | BART zero-shot | 0.494 | 0.250 |
**Output:** Single scalar fusion score in [0, 1].
**Decision threshold:** `0.46` (calibrated by maximising Youden J on the validation set, Youden J = 0.9688).
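A sketch of the fusion stage using the published per-feature statistics for z-normalisation. The hidden size and layer count are assumptions (the 21 KB file only implies a very small network), so this stand-in is untrained and its scores are arbitrary:

```python
import torch
import torch.nn as nn

# Published training statistics, in feature order:
# [anomaly, if_score, pattern, similarity, trajectory, entropy]
MEAN = torch.tensor([0.538, 0.478, 0.311, 0.515, 0.497, 0.494])
STD  = torch.tensor([0.227, 0.215, 0.341, 0.234, 0.260, 0.250])

class FusionMLP(nn.Module):
    """Illustrative stand-in for fusion_model.pt (hidden size assumed)."""
    def __init__(self, in_dim=6, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):
        x = (x - MEAN) / STD  # z-normalise with training statistics
        return self.net(x).squeeze(-1)

model = FusionMLP().eval()
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = model(features).item()
print("THREAT" if score >= 0.46 else "SAFE")  # calibrated threshold
```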
---
## Usage
### Install dependencies
```bash
pip install torch scikit-learn huggingface_hub
```
### Download all models
```python
from huggingface_hub import hf_hub_download
repo = "Tarunvoff/aisecops-models"
# Download trained models
hf_hub_download(repo_id=repo, filename="trajectory_model_best.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="isolation_forest.pkl", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_model.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_threshold.json", local_dir="models/")
```
### Or use the AISecOps download script
```bash
git clone https://github.com/Tarunvoff/LLM-FIREWALL
cd LLM-FIREWALL
cp .env.example .env
# Add to .env:
# HF_TOKEN=your_token
# AISECOPS_MODELS_REPO=Tarunvoff/aisecops-models
python scripts/download_models.py
```
### Load and run inference
```python
import torch
import pickle
import json

# ── Fusion MLP ────────────────────────────────────────────────────────────────
# weights_only=False is required on recent PyTorch (>= 2.6) because the file
# stores a pickled nn.Module rather than a plain state_dict
fusion_model = torch.load("models/fusion_model.pt",
                          map_location="cpu", weights_only=False)
fusion_model.eval()

# 6-D feature vector: [anomaly, if_score, pattern, similarity, trajectory, entropy]
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = fusion_model(features).item()

with open("models/fusion_threshold.json") as f:
    threshold = json.load(f)["optimal_threshold"]  # 0.46

print(f"Fusion score: {score:.3f}")
print(f"Decision: {'THREAT' if score >= threshold else 'SAFE'}")

# ── Isolation Forest ──────────────────────────────────────────────────────────
with open("models/isolation_forest.pkl", "rb") as f:
    iso_forest = pickle.load(f)

# embedding is a 1024-d numpy array from E5-large-v2.
# Note: predict() returns only ±1 labels; use score_samples() for the
# continuous score that feeds the [0, 1] normalisation.
# raw_score = -iso_forest.score_samples(embedding.reshape(1, -1))
```
---
## Evaluation
| Metric | Value |
|---|---|
| Fusion threshold (Youden J optimised) | 0.46 |
| Youden J statistic | 0.9688 |
| Validation ROC-AUC | 0.21 |
| Test ROC-AUC | 0.27 |
> **Note:** The reported ROC-AUC values are below 0.5, i.e. worse than chance ranking, which is inconsistent with a Youden J of 0.9688 (J = sensitivity + specificity - 1, so a value near 1 implies near-perfect separation at the chosen threshold). The AUC figures most likely reflect an inverted score orientation on these splits and should be read with caution; the Youden J statistic describes performance at the calibrated operating point.
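Threshold calibration by maximising Youden J is straightforward with scikit-learn's `roc_curve`. The labels and scores below are synthetic stand-ins for the validation set, so the resulting numbers are illustrative only:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic validation data: two well-separated score distributions
rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(500), np.ones(500)]          # 0 = safe, 1 = threat
scores = np.r_[rng.normal(0.3, 0.05, 500),           # safe scores
               rng.normal(0.7, 0.05, 500)]           # threat scores

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                     # Youden J at every candidate threshold
best = int(np.argmax(j))
print(f"optimal threshold = {thresholds[best]:.2f}, J = {j[best]:.4f}")
```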
---
## Intended Use
These models are designed **exclusively for AI security applications**:
- Detecting prompt injection attacks against LLM systems
- Identifying jailbreak attempts in real-time
- Session-level escalation monitoring in multi-turn conversations
- Anomaly detection on user input to AI assistants
**Out-of-scope uses:** General text classification, sentiment analysis, or any purpose unrelated to AI system security.
---
## Training Data
Models were trained on a combination of:
- **JailbreakBench** – standardised jailbreak prompt benchmark
- **Prompt injection corpora** – curated adversarial prompt datasets
- **Synthetic escalation sequences** – programmatically generated multi-turn escalation patterns
- **Safe prompts** – ShareGPT conversations, standard assistant interactions (IsolationForest negative class)
---
## Limitations
- Models are optimised for English-language prompts. Performance on other languages is not evaluated.
- Novel attack patterns not present in training data may evade detection until the Fusion MLP is retrained with feedback.
- The Trajectory model requires a sequence of at least 2 prompts; single-turn detection relies on IsolationForest and Fusion scores only.
- These models should be used as **one layer** in a defence-in-depth strategy, not as the sole security control.
---
## Citation
If you use these models, please cite the AISecOps project:
```bibtex
@software{aisecops2026,
  author  = {Tarunvoff},
  title   = {AISecOps: AI Security Operations Platform},
  year    = {2026},
  url     = {https://github.com/Tarunvoff/LLM-FIREWALL},
  license = {Apache-2.0}
}
```
---
## License
Apache License 2.0 – see [LICENSE](https://github.com/Tarunvoff/LLM-FIREWALL/blob/public-release/LICENSE).