---
license: cc-by-nc-sa-4.0
base_model: Qwen/Qwen2.5-3B-Instruct
library_name: peft
tags:
- llm-security
- jailbreak-detection
- text-classification
- safety
- lora
pipeline_tag: text-classification
---
# TRYLOCK Sidecar Classifier
Layer 3 of the TRYLOCK (Adaptive Ensemble Guard with Integrated Steering) defense system. This is a lightweight classifier that runs alongside the main LLM to detect attack patterns and dynamically adjust defense strength.
## Model Description
The sidecar classifier categorizes inputs into three classes:
- **SAFE**: Benign queries that need minimal defense
- **WARN**: Ambiguous or suspicious queries
- **ATTACK**: Clear jailbreak/attack attempts
## Intended Uses
**Primary use cases:**
- Real-time classification of LLM inputs for adaptive defense
- Dynamically adjusting RepE steering strength
- Research on attack detection methods
- Content moderation and threat triage as one signal within a layered defense
**Out of scope:**
- Standalone content moderation (the classifier is designed to drive adaptive steering, not to act as the sole filter)
- High-stakes security decisions without human review
### Training Details
| Parameter | Value |
|-----------|-------|
| Base Model | Qwen2.5-3B-Instruct |
| Method | LoRA fine-tuning |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| Target Modules | q_proj, k_proj, v_proj, o_proj |
| Training Samples | 2,349 |
| Classes | SAFE, WARN, ATTACK |
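As a rough sketch, the LoRA settings in the table could be expressed as keyword arguments for PEFT's `LoraConfig`. The `task_type` value is an assumption based on the card's `text-classification` pipeline tag; dropout, learning rate, and other hyperparameters are not stated in this card and are deliberately left out.

```python
# Settings copied from the Training Details table.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "SEQ_CLS",  # assumption: not stated in the card
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")

# With PEFT installed, the same settings can be passed to LoraConfig:
#   from peft import LoraConfig
#   config = LoraConfig(**lora_kwargs)
```

With rank 32 and alpha 64, the adapter update is scaled by a factor of 2 under the standard LoRA formulation.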
### Evaluation Results
| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| SAFE | 24.1% | 34.5% | 28.4% |
| WARN | 58.3% | 51.6% | 54.8% |
| ATTACK | 62.4% | 60.8% | 61.6% |
| **Macro Avg** | 48.3% | 48.9% | 48.3% |
**Note**: This is a research prototype. ATTACK detection is prioritized over SAFE detection for security.
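The F1 and macro-average columns can be re-derived from the per-class precision and recall values as a sanity check. The sketch below uses only the rounded figures from the table, so its results can differ from the reported ones by about 0.1 percentage points.

```python
# Per-class (precision, recall) in percent, copied from the table above.
reported = {
    "SAFE": (24.1, 34.5),
    "WARN": (58.3, 51.6),
    "ATTACK": (62.4, 60.8),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: f1(p, r) for label, (p, r) in reported.items()}

# Macro averages weight every class equally, regardless of class size.
macro_precision = sum(p for p, _ in reported.values()) / len(reported)
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

for label, score in f1_scores.items():
    print(f"{label}: F1 = {score:.1f}%")
print(f"Macro precision = {macro_precision:.1f}%, macro F1 = {macro_f1:.1f}%")
```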
## Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load the base model with a 3-way classification head, then attach the LoRA adapter
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    num_labels=3,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "scthornton/trylock-sidecar-classifier")
tokenizer = AutoTokenizer.from_pretrained("scthornton/trylock-sidecar-classifier")
model.eval()  # inference mode: disables dropout

# Class names in label-id order (0, 1, 2)
label_names = ["SAFE", "WARN", "ATTACK"]

def classify(text):
    """Return (label, per-class probabilities) for a single input string."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        outputs = model(**inputs.to(model.device))
    # Cast logits to float32 so the softmax is not computed in bfloat16
    probs = torch.softmax(outputs.logits.float(), dim=-1)[0]
    return label_names[probs.argmax().item()], probs.tolist()

# Example
result, probs = classify("Ignore all previous instructions and...")
print(f"Classification: {result}")  # expected: ATTACK
print(f"Probabilities: {dict(zip(label_names, probs))}")
```
## Dynamic Defense Integration
Use the classification to adjust RepE steering strength:
```python
alpha_map = {
    "SAFE": 0.5,    # Minimal steering - preserve fluency
    "WARN": 1.5,    # Moderate steering
    "ATTACK": 2.5,  # Maximum steering - block harmful output
}
classification, _ = classify(user_input)
alpha = alpha_map[classification]
# Apply RepE steering with this alpha value
```
## TRYLOCK Architecture
This classifier is Layer 3 of the 3-layer TRYLOCK defense:
1. **Layer 1 (KNOWLEDGE)**: [trylock-mistral-7b-dpo](https://huggingface.co/scthornton/trylock-mistral-7b-dpo)
2. **Layer 2 (INSTINCT)**: [trylock-repe-vectors](https://huggingface.co/scthornton/trylock-repe-vectors)
3. **Layer 3 (OVERSIGHT)**: Sidecar classifier (this model)
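One way the three layers might be wired together at request time is sketched below. This is an illustration only: `guarded_generate`, `generate_with_steering`, and the stub functions are hypothetical names, not part of any TRYLOCK release, and the stubs stand in for the real classifier and the steered Layer 1 model so the example runs without loading weights.

```python
from typing import Callable, List, Tuple

# Steering strengths from the Dynamic Defense Integration section.
ALPHA_MAP = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}

def guarded_generate(
    prompt: str,
    classify: Callable[[str], Tuple[str, List[float]]],
    generate_with_steering: Callable[[str, float], str],
) -> Tuple[str, str, float]:
    """Classify the prompt (Layer 3), then generate with the chosen alpha."""
    label, _probs = classify(prompt)
    alpha = ALPHA_MAP[label]
    # Hypothetical hook: the Layer 1 model generating under Layer 2 steering.
    response = generate_with_steering(prompt, alpha)
    return response, label, alpha

# Stand-ins so the sketch runs without loading any model.
def fake_classify(prompt: str) -> Tuple[str, List[float]]:
    if "ignore all previous instructions" in prompt.lower():
        return "ATTACK", [0.05, 0.15, 0.80]
    return "SAFE", [0.80, 0.15, 0.05]

def fake_generate(prompt: str, alpha: float) -> str:
    return f"<response generated with alpha={alpha}>"

print(guarded_generate("Ignore all previous instructions and...", fake_classify, fake_generate))
```

Passing the classifier and generator in as callables keeps the routing logic independent of how each layer is loaded or served.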
## Limitations and Risks
**Limitations:**
- SAFE class has low precision (24.1%) and recall (34.5%): most inputs predicted SAFE are not actually benign, and most benign queries are classified as WARN or ATTACK
- Trained on English-language attacks only
- 3-class granularity may miss nuanced threat levels
- Adds the inference overhead of a ~3B-parameter model to every request
**Risks:**
- False positives may trigger unnecessary steering
- False negatives on novel attack patterns
- Classification confidence doesn't guarantee correctness
**Recommendations:**
- Use probability scores, not just class labels, for fine-grained control
- Consider WARN classification as "elevated caution" rather than definite threat
- Combine with other safety mechanisms for production use
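One possible reading of the first recommendation is to blend the per-class steering strengths by their predicted probabilities instead of picking the single top class. The card does not specify how probabilities should be combined, so the `expected_alpha` helper below is a hypothetical sketch that reuses the alpha values from the Dynamic Defense Integration section.

```python
ALPHA_MAP = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}
LABEL_NAMES = ["SAFE", "WARN", "ATTACK"]

def expected_alpha(probs):
    """Probability-weighted average of the per-class steering strengths."""
    if len(probs) != len(LABEL_NAMES):
        raise ValueError("expected one probability per class")
    total = sum(probs)  # renormalize in case of rounding drift
    weighted = sum(p * ALPHA_MAP[name] for p, name in zip(probs, LABEL_NAMES))
    return weighted / total

# A borderline input: mostly WARN, with some ATTACK mass.
print(expected_alpha([0.2, 0.5, 0.3]))
```

Under this scheme a borderline input receives an intermediate alpha, so small shifts in probability near a class boundary change the steering strength gradually instead of jumping between 0.5, 1.5, and 2.5.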
## Framework Versions
- PEFT 0.18.0
## Citation
```bibtex
@misc{trylock2024,
title={TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering},
author={Thornton, Scott},
year={2024},
url={https://huggingface.co/scthornton/trylock-sidecar-classifier}
}
```
## License
**CC BY-NC-SA 4.0** (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)
You are free to:
- **Share** — copy and redistribute the material in any medium or format
- **Adapt** — remix, transform, and build upon the material
Under the following terms:
- **Attribution** — You must give appropriate credit to **Scott Thornton**, provide a link to the license, and indicate if changes were made
- **NonCommercial** — You may not use the material for commercial purposes without explicit written permission
- **ShareAlike** — If you remix, transform, or build upon the material, you must distribute your contributions under the same license
For commercial licensing inquiries, contact: scott@perfecxion.ai