---
license: apache-2.0
language:
- en
tags:
- ai-security
- llm-security
- prompt-injection
- jailbreak-detection
- anomaly-detection
- threat-detection
- cybersecurity
- nlp
- pytorch
- sklearn
library_name: pytorch
pipeline_tag: text-classification
---

# AISecOps — Trained Security Models

> Finetuned models powering the [AISecOps](https://github.com/Tarunvoff/LLM-FIREWALL) AI Security Operations Platform. These models form the multi-layer threat detection pipeline that protects LLM systems from prompt injection, jailbreaks, and adversarial attacks.

---

## Model Overview

| File | Type | Purpose | Size |
|---|---|---|---|
| `trajectory_model_best.pt` | PyTorch Transformer | Session-level escalation detector (best checkpoint) | 150 MB |
| `trajectory_model_final.pt` | PyTorch Transformer | Session-level escalation detector (final epoch) | 50 MB |
| `isolation_forest.pkl` | scikit-learn | One-class anomaly detector for prompt embeddings | 5.5 MB |
| `fusion_model.pt` | PyTorch MLP | Score fusion combiner (final-stage classifier) | 21 KB |
| `fusion_threshold.json` | Config | Optimal decision threshold (Youden J calibration) | — |
| `trajectory_model_best_config.json` | Config | Trajectory model architecture spec | — |
| `training_feature_stats.json` | Config | Feature normalisation statistics | — |

---

## Pipeline Position

These models run inside the AISecOps 6-layer security pipeline:

```
User Prompt
     ↓
FastPreFilter (regex, <5 ms)
     ↓
Threat Detection    ← isolation_forest.pkl runs here
                    ← trajectory_model_best.pt runs here
     ↓
Fusion Engine       ← fusion_model.pt runs here
     ↓
Policy Decision
     ↓
LLM / Target Endpoint
     ↓
Output Security
     ↓
Safe Response
```

---

## Model Details

### 1. Trajectory Model (`trajectory_model_best.pt`)

A Transformer encoder that tracks **session-level escalation patterns** — detecting when a conversation is gradually steering toward adversarial behaviour across multiple turns.
**Architecture:**

| Parameter | Value |
|---|---|
| Input dimension | 1024 (E5-large-v2 embeddings) |
| Hidden dimension | 512 |
| Transformer layers | 4 |
| Attention heads | 8 |
| Dropout | 0.3 |
| Max sequence length | 6 turns |

**Training inputs:** Sequences of E5-large-v2 embeddings (1024-d) from conversation sessions.

**Output:** Scalar escalation score in [0, 1].

**Training data:** Adversarial prompt datasets including JailbreakBench, prompt injection corpora, and synthetic escalation sequences. Safe prompts drawn from ShareGPT and standard assistant conversation datasets.

---

### 2. Isolation Forest (`isolation_forest.pkl`)

A one-class anomaly detector trained **exclusively on benign prompt embeddings**.

- Algorithm: scikit-learn `IsolationForest`
- Training data: safe prompt embeddings (E5-large-v2, 1024-d)
- Score normalisation: percentile-based min-max to [0, 1]
- Decision threshold: 0.5 (default)
- Logic: any prompt that deviates from the learned safe distribution is flagged

**Score interpretation:**

| Score | Meaning |
|---|---|
| 0.0 | Deep inside safe distribution — very normal |
| 0.5 | Decision boundary |
| 1.0 | Highly anomalous / likely adversarial |

---

### 3. Fusion MLP (`fusion_model.pt`)

A small multi-layer perceptron that combines **all upstream model scores** into a single threat score.

**Input features (6-dimensional):**

| Feature | Source | Mean | Std |
|---|---|---|---|
| `anomaly_score` | IsolationForest | 0.538 | 0.227 |
| `if_score` | IsolationForest (raw) | 0.478 | 0.215 |
| `pattern_score` | Regex pre-filter | 0.311 | 0.341 |
| `max_similarity_score` | FAISS vector search | 0.515 | 0.234 |
| `trajectory_score` | Trajectory model | 0.497 | 0.260 |
| `intent_entropy` | BART zero-shot | 0.494 | 0.250 |

**Output:** Single scalar fusion score in [0, 1].

**Decision threshold:** `0.46` (calibrated by maximising Youden J on the validation set; Youden J = 0.9688).
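The percentile-based min-max normalisation described for the IsolationForest can be sketched in a few lines. This is a minimal illustration under assumptions: the exact percentile bounds AISecOps uses are not documented, and `normalise_anomaly_scores` is a hypothetical helper, not part of the released code.

```python
def normalise_anomaly_scores(raw_scores, lo_pct=1.0, hi_pct=99.0):
    """Percentile-based min-max normalisation of raw scores to [0, 1].

    Hypothetical sketch: clip to the [lo_pct, hi_pct] percentile window so a
    few extreme outliers do not compress the rest of the scale. The actual
    percentile bounds used by AISecOps are an assumption here.
    """
    s = sorted(raw_scores)

    def percentile(p):
        # Linear-interpolation percentile over the sorted scores.
        idx = (len(s) - 1) * p / 100.0
        lo, frac = int(idx), idx - int(idx)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] * (1 - frac) + s[hi] * frac

    p_lo, p_hi = percentile(lo_pct), percentile(hi_pct)
    span = p_hi - p_lo

    def scale(x):
        x = min(max(x, p_lo), p_hi)  # clip to the percentile window
        return (x - p_lo) / span if span else 0.0

    return [scale(x) for x in raw_scores]
```

Note that scikit-learn's raw IsolationForest scores are higher-is-more-normal, so a pipeline using this convention would additionally invert the result (e.g. `1 - normalised`) to match the table above, where 1.0 means highly anomalous.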
---

## Usage

### Install dependencies

```bash
pip install torch scikit-learn huggingface_hub
```

### Download all models

```python
from huggingface_hub import hf_hub_download

repo = "Tarunvoff/aisecops-models"

# Download trained models
hf_hub_download(repo_id=repo, filename="trajectory_model_best.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="isolation_forest.pkl", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_model.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_threshold.json", local_dir="models/")
```

### Or use the AISecOps download script

```bash
git clone https://github.com/Tarunvoff/LLM-FIREWALL
cd LLM-FIREWALL
cp .env.example .env
# Add to .env:
# HF_TOKEN=your_token
# AISECOPS_MODELS_REPO=Tarunvoff/aisecops-models
python scripts/download_models.py
```

### Load and run inference

```python
import json
import pickle

import torch

# ── Fusion MLP ────────────────────────────────────────────────────────────────
# The checkpoint is a pickled nn.Module; on PyTorch >= 2.6 torch.load defaults
# to weights_only=True, so it must be disabled explicitly.
fusion_model = torch.load("models/fusion_model.pt", map_location="cpu", weights_only=False)
fusion_model.eval()

# 6-D feature vector: [anomaly, if_score, pattern, similarity, trajectory, entropy]
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = fusion_model(features).item()

with open("models/fusion_threshold.json") as f:
    threshold = json.load(f)["optimal_threshold"]  # 0.46

print(f"Fusion score: {score:.3f}")
print(f"Decision: {'THREAT' if score >= threshold else 'SAFE'}")

# ── Isolation Forest ──────────────────────────────────────────────────────────
with open("models/isolation_forest.pkl", "rb") as f:
    iso_forest = pickle.load(f)

# embedding is a 1024-d numpy array from E5-large-v2. Note that
# iso_forest.predict() only returns -1/+1 labels; use score_samples() for a
# continuous score (higher = more normal), then normalise to [0, 1]:
# raw_score = iso_forest.score_samples(embedding.reshape(1, -1))
```

---

## Evaluation

| Metric | Value |
|---|---|
| Fusion threshold (Youden J optimised) | 0.46 |
| Youden J statistic | 0.9688 |
| Validation ROC-AUC | 0.21 |
| Test ROC-AUC | 0.27 |

> **Note:** The low ROC-AUC values
reflect the challenge of the task — adversarial prompts are intentionally crafted to evade detection. The Youden J statistic (0.9688) measures the balance between sensitivity and specificity at the optimal threshold, indicating strong calibration despite the difficulty of the distribution.

---

## Intended Use

These models are designed **exclusively for AI security applications**:

- Detecting prompt injection attacks against LLM systems
- Identifying jailbreak attempts in real time
- Session-level escalation monitoring in multi-turn conversations
- Anomaly detection on user input to AI assistants

**Out-of-scope uses:** General text classification, sentiment analysis, or any purpose unrelated to AI system security.

---

## Training Data

Models were trained on a combination of:

- **JailbreakBench** — standardised jailbreak prompt benchmark
- **Prompt injection corpora** — curated adversarial prompt datasets
- **Synthetic escalation sequences** — programmatically generated multi-turn escalation patterns
- **Safe prompts** — ShareGPT conversations and standard assistant interactions (IsolationForest negative class)

---

## Limitations

- Models are optimised for English-language prompts; performance on other languages has not been evaluated.
- Novel attack patterns absent from the training data may evade detection until the Fusion MLP is retrained with feedback.
- The trajectory model requires a sequence of at least 2 prompts; single-turn detection relies on the IsolationForest and Fusion scores only.
- These models should be used as **one layer** in a defence-in-depth strategy, not as the sole security control.
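The minimum-session-length constraint above can be enforced with a small guard before calling the trajectory model. A sketch under assumptions — `score_session`, the neutral default, and the callable model interface are illustrative, not part of the released API:

```python
def score_session(turn_embeddings, trajectory_model, neutral=0.0):
    """Return a session escalation score, or a neutral default when the
    session is too short for the trajectory model.

    The trajectory model needs at least 2 turns and was trained on windows
    of up to 6; for single-turn sessions the pipeline falls back to the
    IsolationForest and Fusion scores. All names here are hypothetical.
    """
    MIN_TURNS, MAX_TURNS = 2, 6
    if len(turn_embeddings) < MIN_TURNS:
        return neutral
    # Keep only the most recent window the model was trained on.
    window = turn_embeddings[-MAX_TURNS:]
    return trajectory_model(window)
```

A caller would treat the neutral default as "no trajectory signal" rather than "safe", letting the fusion stage weigh the remaining scores.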
---

## Citation

If you use these models, please cite the AISecOps project:

```bibtex
@software{aisecops2026,
  author  = {Tarunvoff},
  title   = {AISecOps: AI Security Operations Platform},
  year    = {2026},
  url     = {https://github.com/Tarunvoff/LLM-FIREWALL},
  license = {Apache-2.0}
}
```

---

## License

Apache License 2.0 — see [LICENSE](https://github.com/Tarunvoff/LLM-FIREWALL/blob/public-release/LICENSE).