| ---
|
| license: apache-2.0
|
| language:
|
| - en
|
| tags:
|
| - ai-security
|
| - llm-security
|
| - prompt-injection
|
| - jailbreak-detection
|
| - anomaly-detection
|
| - threat-detection
|
| - cybersecurity
|
| - nlp
|
| - pytorch
|
| - sklearn
|
| library_name: pytorch
|
| pipeline_tag: text-classification
|
| ---
|
|
|
# AISecOps – Trained Security Models
|
|
|
> Fine-tuned models powering the [AISecOps](https://github.com/Tarunvoff/LLM-FIREWALL) AI Security Operations Platform.
|
|
|
| These models form the multi-layer threat detection pipeline that protects LLM systems from prompt injection, jailbreaks, and adversarial attacks.
|
|
|
| ---
|
|
|
| ## Model Overview
|
|
|
| File | Type | Purpose | Size |
|---|---|---|---|
| `trajectory_model_best.pt` | PyTorch Transformer | Session-level escalation detector (best checkpoint) | 150 MB |
| `trajectory_model_final.pt` | PyTorch Transformer | Session-level escalation detector (final epoch) | 50 MB |
| `isolation_forest.pkl` | scikit-learn | One-class anomaly detector for prompt embeddings | 5.5 MB |
| `fusion_model.pt` | PyTorch MLP | Score fusion combiner (final stage classifier) | 21 KB |
| `fusion_threshold.json` | Config | Optimal decision threshold (Youden J calibration) | – |
| `trajectory_model_best_config.json` | Config | Trajectory model architecture spec | – |
| `training_feature_stats.json` | Config | Feature normalisation statistics | – |
|
|
|
| ---
|
|
|
| ## Pipeline Position
|
|
|
| These models run inside the AISecOps 6-layer security pipeline:
|
|
|
| ```
|
| User Prompt
|
| β
|
| FastPreFilter (regex, <5 ms)
|
| β
|
| Threat Detection β isolation_forest.pkl runs here
|
| β trajectory_model_best.pt runs here
|
| Fusion Engine β fusion_model.pt runs here
|
| β
|
| Policy Decision
|
| β
|
| LLM / Target Endpoint
|
| β
|
| Output Security
|
| β
|
| Safe Response
|
| ```
|
|
|
| ---
|
|
|
| ## Model Details
|
|
|
| ### 1. Trajectory Model (`trajectory_model_best.pt`)
|
|
|
A Transformer encoder that tracks **session-level escalation patterns**, detecting when a conversation is gradually being steered toward adversarial behaviour across multiple turns.
|
|
|
| **Architecture:**
|
|
|
| Parameter | Value |
|---|---|
| Input dimension | 1024 (E5-large-v2 embeddings) |
| Hidden dimension | 512 |
| Transformer layers | 4 |
| Attention heads | 8 |
| Dropout | 0.3 |
| Max sequence length | 6 turns |
|
|
|
| **Training inputs:** Sequences of E5-large-v2 embeddings (1024-d) from conversation sessions.
|
| **Output:** Scalar escalation score in [0, 1].
|
|
|
| **Training data:** Adversarial prompt datasets including JailbreakBench, prompt injection corpora, and synthetic escalation sequences. Safe prompts drawn from ShareGPT and standard assistant conversation datasets.
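For orientation, the architecture table above maps onto a standard PyTorch encoder roughly as follows. This is an illustrative sketch with made-up class and layer names, not the repository's actual implementation; the real weights live in `trajectory_model_best.pt`.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Sketch of the spec above: 1024-d E5 turn embeddings ->
    512-d Transformer encoder -> scalar escalation score in [0, 1]."""

    def __init__(self, input_dim=1024, hidden_dim=512, layers=4,
                 heads=8, dropout=0.3, max_turns=6):
        super().__init__()
        self.max_turns = max_turns
        self.proj = nn.Linear(input_dim, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=heads, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, turns):  # turns: (batch, n_turns, 1024)
        h = self.encoder(self.proj(turns[:, -self.max_turns:]))
        return torch.sigmoid(self.head(h[:, -1]))  # score from the last turn

model = TrajectoryEncoder().eval()
session = torch.randn(1, 4, 1024)  # stand-in for a 4-turn session of E5 embeddings
with torch.no_grad():
    score = model(session).item()
print(f"escalation score: {score:.3f}")
```

Sizing check: this configuration lands in the same ballpark as the published checkpoint, consistent with the table rather than a guarantee of the exact layer layout.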
|
|
|
| ---
|
|
|
| ### 2. Isolation Forest (`isolation_forest.pkl`)
|
|
|
| A one-class anomaly detector trained **exclusively on benign prompt embeddings**.
|
|
|
| - Algorithm: scikit-learn `IsolationForest`
|
| - Training data: Safe prompt embeddings (E5-large-v2, 1024-d)
|
| - Score normalisation: Percentile-based min-max to [0, 1]
|
| - Decision threshold: 0.5 (default)
|
| - Logic: Any prompt that deviates from the learned safe distribution is flagged
|
|
|
| **Score interpretation:**
|
|
|
| Score | Meaning |
|---|---|
| 0.0 | Deep inside safe distribution (very normal) |
| 0.5 | Decision boundary |
| 1.0 | Highly anomalous / likely adversarial |
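The training and normalisation steps can be sketched with scikit-learn. Synthetic Gaussian vectors stand in for the real E5 embeddings, and the 1st/99th percentile bounds are an assumption about the pipeline's percentile min-max step:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(500, 1024))  # stand-in for safe E5 embeddings

iso = IsolationForest(random_state=0).fit(benign)

# score_samples: higher = more normal. Flip the sign so higher = more
# anomalous, then min-max scale using percentile bounds from the benign set.
raw = -iso.score_samples(benign)
lo, hi = np.percentile(raw, 1), np.percentile(raw, 99)

def anomaly_score(embedding):
    r = -iso.score_samples(embedding.reshape(1, -1))[0]
    return float(np.clip((r - lo) / (hi - lo), 0.0, 1.0))

normal_score = anomaly_score(rng.normal(0.0, 1.0, 1024))   # in-distribution
weird_score = anomaly_score(np.full(1024, 8.0))            # far outside it
print(normal_score, weird_score)
```

A fresh in-distribution sample scores near the low end of [0, 1], while the 8-sigma outlier saturates toward 1.0, matching the interpretation table above.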
|
|
|
| ---
|
|
|
| ### 3. Fusion MLP (`fusion_model.pt`)
|
|
|
| A small multi-layer perceptron that combines **all upstream model scores** into a single threat score.
|
|
|
| **Input features (6-dimensional):**
|
|
|
| Feature | Source | Mean | Std |
|---|---|---|---|
| `anomaly_score` | IsolationForest | 0.538 | 0.227 |
| `if_score` | IsolationForest (raw) | 0.478 | 0.215 |
| `pattern_score` | Regex pre-filter | 0.311 | 0.341 |
| `max_similarity_score` | FAISS vector search | 0.515 | 0.234 |
| `trajectory_score` | Trajectory model | 0.497 | 0.260 |
| `intent_entropy` | BART zero-shot | 0.494 | 0.250 |
|
|
|
| **Output:** Single scalar fusion score in [0, 1].
|
|
|
| **Decision threshold:** `0.46` (calibrated by maximising Youden J on validation set, Youden J = 0.9688).
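Put together, the fusion step amounts to standardising the six scores with the stored statistics and feeding them through the MLP. A minimal sketch: the stats dict mirrors the table above, but the two-layer architecture and its untrained weights are stand-ins, since the real classifier lives in `fusion_model.pt`.

```python
import torch
import torch.nn as nn

# Means/stds as published in training_feature_stats.json (table above)
STATS = {
    "anomaly_score":        (0.538, 0.227),
    "if_score":             (0.478, 0.215),
    "pattern_score":        (0.311, 0.341),
    "max_similarity_score": (0.515, 0.234),
    "trajectory_score":     (0.497, 0.260),
    "intent_entropy":       (0.494, 0.250),
}
THRESHOLD = 0.46  # from fusion_threshold.json

def standardise(features: dict) -> torch.Tensor:
    """z-score each feature with the training-set mean and std."""
    return torch.tensor([[(features[k] - m) / s for k, (m, s) in STATS.items()]])

# Illustrative stand-in for the trained fusion MLP
fusion = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

x = standardise({"anomaly_score": 0.85, "if_score": 0.78, "pattern_score": 0.60,
                 "max_similarity_score": 0.91, "trajectory_score": 0.72,
                 "intent_entropy": 0.44})
with torch.no_grad():
    score = fusion(x).item()
print("THREAT" if score >= THRESHOLD else "SAFE")
```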
|
|
|
| ---
|
|
|
| ## Usage
|
|
|
| ### Install dependencies
|
|
|
```bash
pip install torch scikit-learn huggingface_hub
```
|
|
|
| ### Download all models
|
|
|
```python
from huggingface_hub import hf_hub_download

repo = "Tarunvoff/aisecops-models"

# Download trained models
hf_hub_download(repo_id=repo, filename="trajectory_model_best.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="isolation_forest.pkl", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_model.pt", local_dir="models/")
hf_hub_download(repo_id=repo, filename="fusion_threshold.json", local_dir="models/")
```
|
|
|
| ### Or use the AISecOps download script
|
|
|
```bash
git clone https://github.com/Tarunvoff/LLM-FIREWALL
cd LLM-FIREWALL

cp .env.example .env
# Add to .env:
#   HF_TOKEN=your_token
#   AISECOPS_MODELS_REPO=Tarunvoff/aisecops-models

python scripts/download_models.py
```
|
|
|
| ### Load and run inference
|
|
|
```python
import json
import pickle

import torch

# ── Fusion MLP ─────────────────────────────────────────────────────────────
# weights_only=False is needed on PyTorch >= 2.6, where torch.load defaults
# to weights-only loading; this checkpoint stores a full nn.Module.
fusion_model = torch.load("models/fusion_model.pt", map_location="cpu",
                          weights_only=False)
fusion_model.eval()

# 6-D feature vector: [anomaly, if_score, pattern, similarity, trajectory, entropy]
features = torch.tensor([[0.85, 0.78, 0.60, 0.91, 0.72, 0.44]])
with torch.no_grad():
    score = fusion_model(features).item()

with open("models/fusion_threshold.json") as f:
    threshold = json.load(f)["optimal_threshold"]  # 0.46

print(f"Fusion score: {score:.3f}")
print(f"Decision: {'THREAT' if score >= threshold else 'SAFE'}")

# ── Isolation Forest ───────────────────────────────────────────────────────
with open("models/isolation_forest.pkl", "rb") as f:
    iso_forest = pickle.load(f)

# embedding is a 1024-d numpy array from E5-large-v2.
# Use decision_function/score_samples for a continuous score (predict only
# returns -1/+1 labels); the pipeline normalises it to [0, 1] via
# percentile min-max.
# raw_score = iso_forest.decision_function(embedding.reshape(1, -1))
```
|
|
|
| ---
|
|
|
| ## Evaluation
|
|
|
| Metric | Value |
|---|---|
| Fusion threshold (Youden J optimised) | 0.46 |
| Youden J statistic | 0.9688 |
| Validation ROC-AUC | 0.21 |
| Test ROC-AUC | 0.27 |
|
|
|
> **Note:** The low held-out ROC-AUC values reflect the difficulty of the task: adversarial prompts are intentionally crafted to evade detection. The Youden J statistic (0.9688), computed as sensitivity + specificity − 1 on the validation split used for calibration, describes performance only at the selected threshold of 0.46, not across all operating points.
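The threshold-selection step is easy to reproduce from first principles: Youden J is TPR − FPR (equivalently sensitivity + specificity − 1), maximised over candidate thresholds. A minimal sketch on toy scores, not the project's actual evaluation code:

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising Youden J = TPR - FPR.

    Each observed score is tried as a candidate threshold; a sample is
    predicted positive when its score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Perfectly separable toy data: threshold 0.8 yields J = 1.0
best_t, best_j = youden_threshold([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
print(best_t, best_j)  # -> 0.8 1.0
```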
|
|
|
| ---
|
|
|
| ## Intended Use
|
|
|
| These models are designed **exclusively for AI security applications**:
|
|
|
| - Detecting prompt injection attacks against LLM systems
|
| - Identifying jailbreak attempts in real-time
|
| - Session-level escalation monitoring in multi-turn conversations
|
| - Anomaly detection on user input to AI assistants
|
|
|
| **Out-of-scope uses:** General text classification, sentiment analysis, or any purpose unrelated to AI system security.
|
|
|
| ---
|
|
|
| ## Training Data
|
|
|
| Models were trained on a combination of:
|
|
|
- **JailbreakBench** – standardised jailbreak prompt benchmark
- **Prompt injection corpora** – curated adversarial prompt datasets
- **Synthetic escalation sequences** – programmatically generated multi-turn escalation patterns
- **Safe prompts** – ShareGPT conversations and standard assistant interactions (IsolationForest negative class)
|
|
|
| ---
|
|
|
| ## Limitations
|
|
|
| - Models are optimised for English-language prompts. Performance on other languages is not evaluated.
|
| - Novel attack patterns not present in training data may evade detection until the Fusion MLP is retrained with feedback.
|
| - The Trajectory model requires a sequence of at least 2 prompts; single-turn detection relies on IsolationForest and Fusion scores only.
|
| - These models should be used as **one layer** in a defence-in-depth strategy, not as the sole security control.
|
|
|
| ---
|
|
|
| ## Citation
|
|
|
| If you use these models, please cite the AISecOps project:
|
|
|
```bibtex
@software{aisecops2026,
  author  = {Tarunvoff},
  title   = {AISecOps: AI Security Operations Platform},
  year    = {2026},
  url     = {https://github.com/Tarunvoff/LLM-FIREWALL},
  license = {Apache-2.0}
}
```
|
|
|
| ---
|
|
|
| ## License
|
|
|
Apache License 2.0 – see [LICENSE](https://github.com/Tarunvoff/LLM-FIREWALL/blob/public-release/LICENSE).
|
|
|