| ---
|
| license: apache-2.0
|
| language:
|
| - en
|
| tags:
|
| - ai-security
|
| - llm-security
|
| - prompt-injection
|
| - jailbreak-detection
|
| - anomaly-detection
|
| - threat-detection
|
| - cybersecurity
|
| - nlp
|
| - pytorch
|
| - sklearn
|
| library_name: pytorch
|
| pipeline_tag: text-classification
|
| ---
|
|
|
# AISecOps – Trained Security Models
|
|
|
> Fine-tuned models powering the [AISecOps](https://github.com/Tarunvoff/LLM-FIREWALL) AI Security Operations Platform.
|
|
|
| These models form the multi-layer threat detection pipeline that protects LLM systems from prompt injection, jailbreaks, and adversarial attacks.
|
|
|
| ---
|
|
|
| ## Model Overview
|
|
|
| File | Type | Purpose | Size |
|---|---|---|---|
| `trajectory_model_best.pt` | PyTorch Transformer | Session-level escalation detector (best checkpoint) | 150 MB |
| `trajectory_model_final.pt` | PyTorch Transformer | Session-level escalation detector (final epoch) | 50 MB |
| `isolation_forest.pkl` | scikit-learn | One-class anomaly detector for prompt embeddings | 5.5 MB |
| `fusion_model.pt` | PyTorch MLP | Score fusion combiner (final stage classifier) | 21 KB |
| `fusion_threshold.json` | Config | Optimal decision threshold (Youden J calibration) | – |
| `trajectory_model_best_config.json` | Config | Trajectory model architecture spec | – |
| `training_feature_stats.json` | Config | Feature normalisation statistics | – |
|
|
|
| ---
|
|
|
| ## Pipeline Position
|
|
|
| These models run inside the AISecOps 6-layer security pipeline:
|
|
|
| ```
|
| User Prompt
|
| β
|
| FastPreFilter (regex, <5 ms)
|
| β
|
| Threat Detection β isolation_forest.pkl runs here
|
| β trajectory_model_best.pt runs here
|
| Fusion Engine β fusion_model.pt runs here
|
| β
|
| Policy Decision
|
| β
|
| LLM / Target Endpoint
|
| β
|
| Output Security
|
| β
|
| Safe Response
|
| ```
|
|
|
| ---
|
|
|
| ## Model Details
|
|
|
| ### 1. Trajectory Model (`trajectory_model_best.pt`)
|
|
|
A Transformer encoder that tracks **session-level escalation patterns**, detecting when a conversation is gradually being steered toward adversarial behaviour across multiple turns.
|
|
|
| **Architecture:**
|
|
|
| Parameter | Value |
|---|---|
| Input dimension | 1024 (E5-large-v2 embeddings) |
| Hidden dimension | 512 |
| Transformer layers | 4 |
| Attention heads | 8 |
| Dropout | 0.3 |
| Max sequence length | 6 turns |
|
|
|
| **Training inputs:** Sequences of E5-large-v2 embeddings (1024-d) from conversation sessions.
|
| **Output:** Scalar escalation score in [0, 1].
|
|
|
| **Training data:** Adversarial prompt datasets including JailbreakBench, prompt injection corpora, and synthetic escalation sequences. Safe prompts drawn from ShareGPT and standard assistant conversation datasets.
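For orientation, the architecture table above maps onto a standard PyTorch encoder roughly as follows. This is an illustrative sketch with made-up class and layer names, not the repository's actual implementation; the real weights live in `trajectory_model_best.pt`.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Sketch of the spec above: 1024-d E5 turn embeddings ->
    512-d Transformer encoder -> scalar escalation score in [0, 1]."""

    def __init__(self, input_dim=1024, hidden_dim=512, layers=4,
                 heads=8, dropout=0.3, max_turns=6):
        super().__init__()
        self.max_turns = max_turns
        self.proj = nn.Linear(input_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=heads, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, turns):  # turns: (batch, n_turns, 1024)
        h = self.encoder(self.proj(turns[:, -self.max_turns:]))
        return torch.sigmoid(self.head(h[:, -1]))  # score from the last turn

model = TrajectoryEncoder().eval()
session = torch.randn(1, 4, 1024)  # stand-in for a 4-turn session of E5 embeddings
with torch.no_grad():
    score = model(session).item()
print(f"escalation score: {score:.3f}")
```

Sizing check: this configuration lands in the same ballpark as the published checkpoint, consistent with the table rather than a guarantee of the exact layer layout.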
|
|
|
| ---
|
|
|
| ### 2. Isolation Forest (`isolation_forest.pkl`)
|
|
|
| A one-class anomaly detector trained **exclusively on benign prompt embeddings**.
|
|
|
| - Algorithm: scikit-learn `IsolationForest`
|
| - Training data: Safe prompt embeddings (E5-large-v2, 1024-d)
|
| - Score normalisation: Percentile-based min-max to [0, 1]
|
| - Decision threshold: 0.5 (default)
|
| - Logic: Any prompt that deviates from the learned safe distribution is flagged
|
|
|
| **Score interpretation:**
|
|
|
| Score | Meaning |
|---|---|
| 0.0 | Deep inside safe distribution (very normal) |
| 0.5 | Decision boundary |
| 1.0 | Highly anomalous / likely adversarial |
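The training and normalisation steps can be sketched with scikit-learn. Synthetic Gaussian vectors stand in for the real E5 embeddings, and the 1st/99th percentile bounds are an assumption about the pipeline's percentile min-max step:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(500, 1024))  # stand-in for safe E5 embeddings

iso = IsolationForest(random_state=0).fit(benign)

# score_samples: higher = more normal. Flip the sign so higher = more
# anomalous, then min-max scale using percentile bounds from the benign set.
raw = -iso.score_samples(benign)
lo, hi = np.percentile(raw, 1), np.percentile(raw, 99)

def anomaly_score(embedding):
    r = -iso.score_samples(embedding.reshape(1, -1))[0]
    return float(np.clip((r - lo) / (hi - lo), 0.0, 1.0))

normal_score = anomaly_score(rng.normal(0.0, 1.0, 1024))   # in-distribution
weird_score = anomaly_score(np.full(1024, 8.0))            # far outside it
print(normal_score, weird_score)
```

A fresh in-distribution sample scores near the low end of [0, 1], while the 8-sigma outlier saturates toward 1.0, matching the interpretation table above.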
|
|
|
| ---
|
|
|
| ### 3. Fusion MLP (`fusion_model.pt`)
|
|
|
| A small multi-layer perceptron that combines **all upstream model scores** into a single threat score.
|
|
|
| **Input features (6-dimensional):**
|
|
|
| Feature | Source | Mean | Std |
|---|---|---|---|
| `anomaly_score` | IsolationForest | 0.538 | 0.227 |
| `if_score` | IsolationForest (raw) | 0.478 | 0.215 |
| `pattern_score` | Regex pre-filter | 0.311 | 0.341 |
| `max_similarity_score` | FAISS vector search | 0.515 | 0.234 |
| `trajectory_score` | Trajectory model | 0.497 | 0.260 |
| `intent_entropy` | BART zero-shot | 0.494 | 0.250 |
|
|
|
| **Output:** Single scalar fusion score in [0, 1].
|
|
|
| **Decision threshold:** `0.46` (calibrated by maximising Youden J on validation set, Youden J = 0.9688).
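Put together, the fusion step amounts to standardising the six scores with the stored statistics and feeding them through the MLP. A minimal sketch: the stats dict mirrors the table above, but the two-layer architecture and its untrained weights are stand-ins, since the real classifier lives in `fusion_model.pt`.

```python
import torch
import torch.nn as nn

# Means/stds as published in training_feature_stats.json (table above)
STATS = {
    "anomaly_score":        (0.538, 0.227),
    "if_score":             (0.478, 0.215),
    "pattern_score":        (0.311, 0.341),
    "max_similarity_score": (0.515, 0.234),
    "trajectory_score":     (0.497, 0.260),
    "intent_entropy":       (0.494, 0.250),
}
THRESHOLD = 0.46  # from fusion_threshold.json

def standardise(features: dict) -> torch.Tensor:
    """z-score each feature with the training-set mean and std."""
    return torch.tensor([[(features[k] - m) / s for k, (m, s) in STATS.items()]])

# Illustrative stand-in for the trained fusion MLP
fusion = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

x = standardise({"anomaly_score": 0.85, "if_score": 0.78, "pattern_score": 0.60,
                 "max_similarity_score": 0.91, "trajectory_score": 0.72,
                 "intent_entropy": 0.44})
with torch.no_grad():
    score = fusion(x).item()
print("THREAT" if score >= THRESHOLD else "SAFE")
```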
|
|
|
| ---
|
|
|
| ## Usage
|
|
|
| ### Install dependencies
|
|
|
```bash
pip install torch scikit-learn huggingface_hub
```
|
|
|
| ### Download all models
|
|
|
```python
from huggingface_hub import hf_hub_download

repo = "Tarunvoff/aisecops-models"

# Download trained models
hf_hub_download(repo_id=repo, filename="trajectory_model_best.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="isolation_forest.pkl", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_model.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_threshold.json", local_dir="models/")
```
|
|
|
| ### Or use the AISecOps download script
|
|
|
```bash
git clone https://github.com/Tarunvoff/LLM-FIREWALL
cd LLM-FIREWALL

cp .env.example .env
# Add to .env:
#   HF_TOKEN=your_token
#   AISECOPS_MODELS_REPO=Tarunvoff/aisecops-models

python scripts/download_models.py
```
|
|
|
| ### Load and run inference
|
|
|
```python
import json
import pickle

import torch

# ── Fusion MLP ─────────────────────────────────────────────────────────────
# weights_only=False is needed on PyTorch >= 2.6, where torch.load defaults
# to weights-only loading; this checkpoint stores a full nn.Module.
fusion_model = torch.load("models/fusion_model.pt", map_location="cpu",
                          weights_only=False)
fusion_model.eval()

# 6-D feature vector: [anomaly, if_score, pattern, similarity, trajectory, entropy]
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = fusion_model(features).item()

with open("models/fusion_threshold.json") as f:
    threshold = json.load(f)["optimal_threshold"]  # 0.46

print(f"Fusion score: {score:.3f}")
print(f"Decision: {'THREAT' if score >= threshold else 'SAFE'}")

# ── Isolation Forest ───────────────────────────────────────────────────────
with open("models/isolation_forest.pkl", "rb") as f:
    iso_forest = pickle.load(f)

# embedding is a 1024-d numpy array from E5-large-v2.
# Use decision_function/score_samples for a continuous score (predict only
# returns -1/+1 labels); the pipeline normalises it to [0, 1] via
# percentile min-max.
# raw_score = iso_forest.decision_function(embedding.reshape(1, -1))
```
|
|
|
| ---
|
|
|
| ## Evaluation
|
|
|
| Metric | Value |
|---|---|
| Fusion threshold (Youden J optimised) | 0.46 |
| Youden J statistic | 0.9688 |
| Validation ROC-AUC | 0.21 |
| Test ROC-AUC | 0.27 |
|
|
|
> **Note:** The low held-out ROC-AUC values reflect the difficulty of the task: adversarial prompts are intentionally crafted to evade detection. The Youden J statistic (0.9688), computed as sensitivity + specificity − 1 on the validation split used for calibration, describes performance only at the selected threshold of 0.46, not across all operating points.
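The threshold-selection step is easy to reproduce from first principles: Youden J is TPR − FPR (equivalently sensitivity + specificity − 1), maximised over candidate thresholds. A minimal sketch on toy scores, not the project's actual evaluation code:

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising Youden J = TPR - FPR.

    Each observed score is tried as a candidate threshold; a sample is
    predicted positive when its score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Perfectly separable toy data: threshold 0.8 yields J = 1.0
best_t, best_j = youden_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
print(best_t, best_j)  # -> 0.8 1.0
```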
|
|
|
| ---
|
|
|
| ## Intended Use
|
|
|
| These models are designed **exclusively for AI security applications**:
|
|
|
| - Detecting prompt injection attacks against LLM systems
|
| - Identifying jailbreak attempts in real-time
|
| - Session-level escalation monitoring in multi-turn conversations
|
| - Anomaly detection on user input to AI assistants
|
|
|
| **Out-of-scope uses:** General text classification, sentiment analysis, or any purpose unrelated to AI system security.
|
|
|
| ---
|
|
|
| ## Training Data
|
|
|
| Models were trained on a combination of:
|
|
|
- **JailbreakBench** – standardised jailbreak prompt benchmark
- **Prompt injection corpora** – curated adversarial prompt datasets
- **Synthetic escalation sequences** – programmatically generated multi-turn escalation patterns
- **Safe prompts** – ShareGPT conversations and standard assistant interactions (IsolationForest negative class)
|
|
|
| ---
|
|
|
| ## Limitations
|
|
|
| - Models are optimised for English-language prompts. Performance on other languages is not evaluated.
|
| - Novel attack patterns not present in training data may evade detection until the Fusion MLP is retrained with feedback.
|
| - The Trajectory model requires a sequence of at least 2 prompts; single-turn detection relies on IsolationForest and Fusion scores only.
|
| - These models should be used as **one layer** in a defence-in-depth strategy, not as the sole security control.
|
|
|
| ---
|
|
|
| ## Citation
|
|
|
| If you use these models, please cite the AISecOps project:
|
|
|
```bibtex
@software{aisecops2026,
  author  = {Tarunvoff},
  title   = {AISecOps: AI Security Operations Platform},
  year    = {2026},
  url     = {https://github.com/Tarunvoff/LLM-FIREWALL},
  license = {Apache-2.0}
}
```
|
|
|
| ---
|
|
|
| ## License
|
|
|
Apache License 2.0 – see [LICENSE](https://github.com/Tarunvoff/LLM-FIREWALL/blob/public-release/LICENSE).
|
|
|