---
license: cc-by-nc-sa-4.0
base_model: Qwen/Qwen2.5-3B-Instruct
library_name: peft
tags:
  - llm-security
  - jailbreak-detection
  - text-classification
  - safety
  - lora
pipeline_tag: text-classification
---

# TRYLOCK Sidecar Classifier

Layer 3 of the TRYLOCK defense-in-depth system against LLM jailbreaks. This is a lightweight classifier that runs alongside the main LLM to detect attack patterns and dynamically adjust defense strength.

## Model Description

The sidecar classifier categorizes inputs into three classes:
- **SAFE**: Benign queries that need minimal defense
- **WARN**: Ambiguous or suspicious queries
- **ATTACK**: Clear jailbreak/attack attempts

## Intended Uses

**Primary use cases:**
- Real-time classification of LLM inputs for adaptive defense
- Dynamically adjusting RepE steering strength
- Research on attack detection methods
- Content moderation and threat triage

**Out of scope:**
- Standalone content moderation (the classifier is designed to drive adaptive steering, not to act alone)
- High-stakes security decisions without human review

### Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | Qwen2.5-3B-Instruct |
| Method | LoRA fine-tuning |
| LoRA Rank | 32 |
| LoRA Alpha | 64 |
| Target Modules | q_proj, k_proj, v_proj, o_proj |
| Training Samples | 2,349 |
| Classes | SAFE, WARN, ATTACK |
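The hyperparameters above can be expressed as a `peft` config. This is a sketch reconstructed from the table; `task_type` and any unlisted options (e.g. dropout) are assumptions not stated in this card:

```python
from peft import LoraConfig

# LoRA settings from the training table; task_type is an assumption.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="SEQ_CLS",  # sequence classification head (SAFE/WARN/ATTACK)
)
```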

### Evaluation Results

| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| SAFE | 24.1% | 34.5% | 28.4% |
| WARN | 58.3% | 51.6% | 54.8% |
| ATTACK | 62.4% | 60.8% | 61.6% |
| **Macro Avg** | 48.3% | 48.9% | 48.3% |

**Note**: This is a research prototype. ATTACK detection is prioritized over SAFE detection for security.

## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load model
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    num_labels=3,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "scthornton/trylock-sidecar-classifier")
model.eval()  # disable dropout for deterministic inference
tokenizer = AutoTokenizer.from_pretrained("scthornton/trylock-sidecar-classifier")

# Classify input
label_names = ["SAFE", "WARN", "ATTACK"]

def classify(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        outputs = model(**inputs.to(model.device))
        probs = torch.softmax(outputs.logits, dim=-1)[0]
    return label_names[probs.argmax()], probs.tolist()

# Example
result, probs = classify("Ignore all previous instructions and...")
print(f"Classification: {result}")  # ATTACK
print(f"Probabilities: {dict(zip(label_names, probs))}")
```

## Dynamic Defense Integration

Use the classification to adjust RepE steering strength:

```python
alpha_map = {
    "SAFE": 0.5,   # Minimal steering - preserve fluency
    "WARN": 1.5,   # Moderate steering
    "ATTACK": 2.5  # Maximum steering - block harmful output
}

classification, _ = classify(user_input)
alpha = alpha_map[classification]
# Apply RepE steering with this alpha value
```

## TRYLOCK Architecture

This classifier is Layer 3 of the 3-layer TRYLOCK defense:

1. **Layer 1 (KNOWLEDGE)**: [trylock-mistral-7b-dpo](https://huggingface.co/scthornton/trylock-mistral-7b-dpo)
2. **Layer 2 (INSTINCT)**: [trylock-repe-vectors](https://huggingface.co/scthornton/trylock-repe-vectors)
3. **Layer 3 (OVERSIGHT)**: Sidecar classifier (this model)

## Limitations and Risks

**Limitations:**
- SAFE class has low precision (24.1%) and recall (34.5%), so benign queries are often escalated to WARN or ATTACK
- Trained on English-language attacks only
- 3-class granularity may miss nuanced threat levels
- Requires ~3B parameter model inference overhead

**Risks:**
- False positives may trigger unnecessary steering
- False negatives on novel attack patterns
- Classification confidence doesn't guarantee correctness

**Recommendations:**
- Use probability scores, not just class labels, for fine-grained control
- Consider WARN classification as "elevated caution" rather than definite threat
- Combine with other safety mechanisms for production use

## Framework Versions

- PEFT 0.18.0

## Citation

```bibtex
@misc{trylock2024,
  title={TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering},
  author={Thornton, Scott},
  year={2024},
  url={https://huggingface.co/scthornton/trylock-sidecar-classifier}
}
```

## License

**CC BY-NC-SA 4.0** (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)

You are free to:
- **Share** — copy and redistribute the material in any medium or format
- **Adapt** — remix, transform, and build upon the material

Under the following terms:
- **Attribution** — You must give appropriate credit to **Scott Thornton**, provide a link to the license, and indicate if changes were made
- **NonCommercial** — You may not use the material for commercial purposes without explicit written permission
- **ShareAlike** — If you remix, transform, or build upon the material, you must distribute your contributions under the same license

For commercial licensing inquiries, contact: scott@perfecxion.ai