---
license: apache-2.0
language:
- en
tags:
- ai-security
- llm-security
- prompt-injection
- jailbreak-detection
- anomaly-detection
- threat-detection
- cybersecurity
- nlp
- pytorch
- sklearn
library_name: pytorch
pipeline_tag: text-classification
---

# AISecOps — Trained Security Models

> Finetuned models powering the [AISecOps](https://github.com/Tarunvoff/LLM-FIREWALL) AI Security Operations Platform. These models form the multi-layer threat detection pipeline that protects LLM systems from prompt injection, jailbreaks, and adversarial attacks.

---

## Model Overview

| File | Type | Purpose | Size |
|---|---|---|---|
| `trajectory_model_best.pt` | PyTorch Transformer | Session-level escalation detector (best checkpoint) | 150 MB |
| `trajectory_model_final.pt` | PyTorch Transformer | Session-level escalation detector (final epoch) | 50 MB |
| `isolation_forest.pkl` | scikit-learn | One-class anomaly detector for prompt embeddings | 5.5 MB |
| `fusion_model.pt` | PyTorch MLP | Score fusion combiner (final-stage classifier) | 21 KB |
| `fusion_threshold.json` | Config | Optimal decision threshold (Youden J calibration) | — |
| `trajectory_model_best_config.json` | Config | Trajectory model architecture spec | — |
| `training_feature_stats.json` | Config | Feature normalisation statistics | — |

---

## Pipeline Position

These models run inside the AISecOps 6-layer security pipeline:

```
User Prompt
     ↓
FastPreFilter (regex, <5 ms)
     ↓
Threat Detection    ← isolation_forest.pkl runs here
                    ← trajectory_model_best.pt runs here
     ↓
Fusion Engine       ← fusion_model.pt runs here
     ↓
Policy Decision
     ↓
LLM / Target Endpoint
     ↓
Output Security
     ↓
Safe Response
```

---

## Model Details

### 1. Trajectory Model (`trajectory_model_best.pt`)

A Transformer encoder that tracks **session-level escalation patterns** — detecting when a conversation is gradually steering toward adversarial behaviour across multiple turns.
**Architecture:**

| Parameter | Value |
|---|---|
| Input dimension | 1024 (E5-large-v2 embeddings) |
| Hidden dimension | 512 |
| Transformer layers | 4 |
| Attention heads | 8 |
| Dropout | 0.3 |
| Max sequence length | 6 turns |

**Training inputs:** Sequences of E5-large-v2 embeddings (1024-d) from conversation sessions.

**Output:** Scalar escalation score in [0, 1].

**Training data:** Adversarial prompt datasets including JailbreakBench, prompt injection corpora, and synthetic escalation sequences. Safe prompts drawn from ShareGPT and standard assistant conversation datasets.

---

### 2. Isolation Forest (`isolation_forest.pkl`)

A one-class anomaly detector trained **exclusively on benign prompt embeddings**.

- Algorithm: scikit-learn `IsolationForest`
- Training data: safe prompt embeddings (E5-large-v2, 1024-d)
- Score normalisation: percentile-based min-max to [0, 1]
- Decision threshold: 0.5 (default)
- Logic: any prompt that deviates from the learned safe distribution is flagged

**Score interpretation:**

| Score | Meaning |
|---|---|
| 0.0 | Deep inside safe distribution — very normal |
| 0.5 | Decision boundary |
| 1.0 | Highly anomalous / likely adversarial |

---

### 3. Fusion MLP (`fusion_model.pt`)

A small multi-layer perceptron that combines **all upstream model scores** into a single threat score.

**Input features (6-dimensional):**

| Feature | Source | Mean | Std |
|---|---|---|---|
| `anomaly_score` | IsolationForest | 0.538 | 0.227 |
| `if_score` | IsolationForest (raw) | 0.478 | 0.215 |
| `pattern_score` | Regex pre-filter | 0.311 | 0.341 |
| `max_similarity_score` | FAISS vector search | 0.515 | 0.234 |
| `trajectory_score` | Trajectory model | 0.497 | 0.260 |
| `intent_entropy` | BART zero-shot | 0.494 | 0.250 |

**Output:** Single scalar fusion score in [0, 1].

**Decision threshold:** `0.46` (calibrated by maximising Youden J on the validation set; Youden J = 0.9688).
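The percentile-based min-max normalisation described for the IsolationForest can be sketched in a few lines. This is a minimal illustration under assumptions: the exact percentile bounds AISecOps uses are not documented, and `normalise_anomaly_scores` is a hypothetical helper, not part of the released code.

```python
def normalise_anomaly_scores(raw_scores, lo_pct=1.0, hi_pct=99.0):
    """Percentile-based min-max normalisation of raw scores to [0, 1].

    Hypothetical sketch: clip to the [lo_pct, hi_pct] percentile window so a
    few extreme outliers do not compress the rest of the scale. The actual
    percentile bounds used by AISecOps are an assumption here.
    """
    s = sorted(raw_scores)

    def percentile(p):
        # Linear-interpolation percentile over the sorted scores.
        idx = (len(s) - 1) * p / 100.0
        lo, frac = int(idx), idx - int(idx)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] * (1 - frac) + s[hi] * frac

    p_lo, p_hi = percentile(lo_pct), percentile(hi_pct)
    span = p_hi - p_lo

    def scale(x):
        x = min(max(x, p_lo), p_hi)  # clip to the percentile window
        return (x - p_lo) / span if span else 0.0

    return [scale(x) for x in raw_scores]
```

Note that scikit-learn's raw IsolationForest scores are higher-is-more-normal, so a pipeline using this convention would additionally invert the result (e.g. `1 - normalised`) to match the table above, where 1.0 means highly anomalous.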
---

## Usage

### Install dependencies

```bash
pip install torch scikit-learn huggingface_hub
```

### Download all models

```python
from huggingface_hub import hf_hub_download

repo = "Tarunvoff/aisecops-models"

# Download trained models
hf_hub_download(repo_id=repo, filename="trajectory_model_best.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="isolation_forest.pkl", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_model.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_threshold.json", local_dir="models/")
```

### Or use the AISecOps download script

```bash
git clone https://github.com/Tarunvoff/LLM-FIREWALL
cd LLM-FIREWALL
cp .env.example .env
# Add to .env:
# HF_TOKEN=your_token
# AISECOPS_MODELS_REPO=Tarunvoff/aisecops-models
python scripts/download_models.py
```

### Load and run inference

```python
import json
import pickle

import torch

# ── Fusion MLP ────────────────────────────────────────────────────────────────
# The checkpoint is a pickled nn.Module; on PyTorch >= 2.6 torch.load defaults
# to weights_only=True, so it must be disabled explicitly.
fusion_model = torch.load("models/fusion_model.pt", map_location="cpu", weights_only=False)
fusion_model.eval()

# 6-D feature vector: [anomaly, if_score, pattern, similarity, trajectory, entropy]
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = fusion_model(features).item()

with open("models/fusion_threshold.json") as f:
    threshold = json.load(f)["optimal_threshold"]  # 0.46

print(f"Fusion score: {score:.3f}")
print(f"Decision: {'THREAT' if score >= threshold else 'SAFE'}")

# ── Isolation Forest ──────────────────────────────────────────────────────────
with open("models/isolation_forest.pkl", "rb") as f:
    iso_forest = pickle.load(f)

# embedding is a 1024-d numpy array from E5-large-v2. Note that
# iso_forest.predict() only returns -1/+1 labels; use score_samples() for a
# continuous score (higher = more normal), then normalise to [0, 1]:
# raw_score = iso_forest.score_samples(embedding.reshape(1, -1))
```

---

## Evaluation

| Metric | Value |
|---|---|
| Fusion threshold (Youden J optimised) | 0.46 |
| Youden J statistic | 0.9688 |
| Validation ROC-AUC | 0.21 |
| Test ROC-AUC | 0.27 |

> **Note:** The low ROC-AUC values
reflect the challenge of the task — adversarial prompts are intentionally crafted to evade detection. The Youden J statistic (0.9688) measures the balance between sensitivity and specificity at the optimal threshold, indicating strong calibration despite the difficulty of the distribution.

---

## Intended Use

These models are designed **exclusively for AI security applications**:

- Detecting prompt injection attacks against LLM systems
- Identifying jailbreak attempts in real time
- Session-level escalation monitoring in multi-turn conversations
- Anomaly detection on user input to AI assistants

**Out-of-scope uses:** General text classification, sentiment analysis, or any purpose unrelated to AI system security.

---

## Training Data

Models were trained on a combination of:

- **JailbreakBench** — standardised jailbreak prompt benchmark
- **Prompt injection corpora** — curated adversarial prompt datasets
- **Synthetic escalation sequences** — programmatically generated multi-turn escalation patterns
- **Safe prompts** — ShareGPT conversations and standard assistant interactions (IsolationForest negative class)

---

## Limitations

- Models are optimised for English-language prompts; performance on other languages has not been evaluated.
- Novel attack patterns absent from the training data may evade detection until the Fusion MLP is retrained with feedback.
- The trajectory model requires a sequence of at least 2 prompts; single-turn detection relies on the IsolationForest and Fusion scores only.
- These models should be used as **one layer** in a defence-in-depth strategy, not as the sole security control.
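The minimum-session-length constraint above can be enforced with a small guard before calling the trajectory model. A sketch under assumptions — `score_session`, the neutral default, and the callable model interface are illustrative, not part of the released API:

```python
def score_session(turn_embeddings, trajectory_model, neutral=0.0):
    """Return a session escalation score, or a neutral default when the
    session is too short for the trajectory model.

    The trajectory model needs at least 2 turns and was trained on windows
    of up to 6; for single-turn sessions the pipeline falls back to the
    IsolationForest and Fusion scores. All names here are hypothetical.
    """
    MIN_TURNS, MAX_TURNS = 2, 6
    if len(turn_embeddings) < MIN_TURNS:
        return neutral
    # Keep only the most recent window the model was trained on.
    window = turn_embeddings[-MAX_TURNS:]
    return trajectory_model(window)
```

A caller would treat the neutral default as "no trajectory signal" rather than "safe", letting the fusion stage weigh the remaining scores.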
---

## Citation

If you use these models, please cite the AISecOps project:

```bibtex
@software{aisecops2026,
  author  = {Tarunvoff},
  title   = {AISecOps: AI Security Operations Platform},
  year    = {2026},
  url     = {https://github.com/Tarunvoff/LLM-FIREWALL},
  license = {Apache-2.0}
}
```

---

## License

Apache License 2.0 — see [LICENSE](https://github.com/Tarunvoff/LLM-FIREWALL/blob/public-release/LICENSE).