Upload folder using huggingface_hub

#1
Files changed (4)
  1. LICENSE +34 -0
  2. README.md +151 -0
  3. repe_config.json +94 -0
  4. steering_vectors.safetensors +3 -0
LICENSE ADDED
@@ -0,0 +1,34 @@
+ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License
+
+ Copyright (c) 2024 Scott Thornton / perfecXion.ai
+
+ AEGIS: Adaptive Ensemble Guard with Integrated Steering
+
+ You are free to:
+ Share — copy and redistribute the material in any medium or format
+ Adapt — remix, transform, and build upon the material
+
+ Under the following terms:
+
+ Attribution — You must give appropriate credit, provide a link to the license,
+ and indicate if changes were made. You may do so in any reasonable manner,
+ but not in any way that suggests the licensor endorses you or your use.
+
+ NonCommercial — You may not use the material for commercial purposes
+ without explicit written permission from the copyright holder.
+
+ ShareAlike — If you remix, transform, or build upon the material, you must
+ distribute your contributions under the same license as the original.
+
+ No additional restrictions — You may not apply legal terms or technological
+ measures that legally restrict others from doing anything the license permits.
+
+ ATTRIBUTION REQUIREMENTS:
+ When using this material, you must provide clear attribution including:
+ 1. The name "AEGIS" and credit to "scthornton.ai / perfecXion.ai"
+ 2. A link to the original repository
+ 3. Indication of any modifications made
+
+ For commercial licensing inquiries, contact: scott@perfecxion.ai
+
+ Full license text: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
README.md ADDED
@@ -0,0 +1,151 @@
+ ---
+ license: cc-by-nc-sa-4.0
+ tags:
+ - llm-security
+ - jailbreak-defense
+ - representation-engineering
+ - repe
+ - steering-vectors
+ - safety
+ ---
+
+ # TRYLOCK RepE Steering Vectors
+
+ Layer 2 of the TRYLOCK defense system. These Representation Engineering (RepE) steering vectors operate in the model's activation space to enforce safety.
+
+ ## Description
+
+ Unlike fine-tuning, which modifies model weights, RepE steering adds safety-inducing directions to the model's hidden states during inference. This provides:
+
+ - **Complementary defense** to weight-based methods
+ - **Dynamic control** via adjustable steering strength (alpha)
+ - **No retraining required**: the vectors are applied entirely at inference time
+
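The additive update described above can be written compactly. The symbols are notation introduced here for illustration and do not come from the repository: $h_\ell$ is the hidden state at a steered layer $\ell$, $v_\ell$ is the stored steering vector for that layer, and $\alpha$ is the steering strength.

```latex
h_\ell' = h_\ell + \alpha \, v_\ell, \qquad \ell \in \{12, 14, 16, 18, 20, 22, 24, 26\}
```

With $\alpha = 0$ the hidden states are left unchanged. Because $v_\ell$ is a single 4096-dimensional vector, the same offset is added at every token position.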
+ ## Intended Uses
+
+ **Primary use cases:**
+ - Runtime safety enforcement for LLM inference
+ - Research on representation engineering techniques
+ - Combining with DPO training for defense-in-depth
+ - Dynamic defense adjustment based on threat level
+
+ **Out of scope:**
+ - Models other than Mistral-7B-Instruct-v0.3 (vectors are model-specific)
+ - Standalone safety solution (best combined with other layers)
+
+ ### Vector Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base Model | Mistral-7B-Instruct-v0.3 |
+ | Method | Contrastive Activation Addition (CAA) |
+ | Layers | 12, 14, 16, 18, 20, 22, 24, 26 |
+ | Vector Dimension | 4096 |
+ | Optimal Alpha | 2.0 |
+
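The repository does not show how the vectors were extracted. One common formulation of contrastive activation addition takes the difference between the mean activations of two contrasting prompt sets at a given layer. The sketch below illustrates that idea under this assumption, using plain Python lists and made-up 3-dimensional activations. The function and variable names are hypothetical.

```python
def difference_of_means(positive, negative):
    """Return mean(positive) - mean(negative), computed per dimension.

    `positive` and `negative` are lists of equal-length activation vectors
    (lists of floats) captured at one layer for two contrasting prompt sets.
    """
    dim = len(positive[0])
    pos_mean = [sum(row[i] for row in positive) / len(positive) for i in range(dim)]
    neg_mean = [sum(row[i] for row in negative) / len(negative) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


# Toy activations (made-up numbers; the real vectors have 4096 dimensions)
safe_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
unsafe_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

steering_vector = difference_of_means(safe_acts, unsafe_acts)
print(steering_vector)  # [2.0, -2.0, 0.0]
```

Dimensions on which the two sets agree cancel to zero, so the resulting direction only captures what distinguishes the two sets.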
+ ### Evaluation Results
+
+ Attack success rate (ASR) with DPO (Layer 1) active:
+
+ | Alpha | ASR | Analysis |
+ |-------|-----|----------|
+ | 0.0 | 39.8% | DPO only (no steering) |
+ | 1.0 | 32.1% | Moderate steering |
+ | 2.0 | 8.0% | **Optimal trade-off** |
+ | 2.5 | 0.0% | Maximum protection (high over-refusal) |
+
+ *Full TRYLOCK (DPO + RepE α=2.0) reduces ASR from the 46.5% baseline to 8.0%, an 82.8% relative reduction.*
+
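The 82.8% figure can be reproduced from the numbers in the table and the note. The short check below recomputes it and, for comparison, the relative reductions implied by the other rows. The helper name is hypothetical.

```python
def relative_reduction(baseline_asr, defended_asr):
    """Relative drop in attack success rate, as a percentage of the baseline."""
    return 100.0 * (baseline_asr - defended_asr) / baseline_asr


baseline = 46.5  # baseline ASR (%), from the note under the table
dpo_only = 39.8  # alpha = 0.0 row
full = 8.0       # alpha = 2.0 row (DPO + RepE)

print(round(relative_reduction(baseline, full), 1))      # 82.8
print(round(relative_reduction(dpo_only, full), 1))      # 79.9
print(round(relative_reduction(baseline, dpo_only), 1))  # 14.4
```

On these numbers, most of the reduction comes from the steering step rather than from DPO alone.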
+ ## Usage
+
+ ```python
+ import json
+
+ import torch
+ from safetensors.torch import load_file
+
+ # Load the steering vectors (one tensor per steered layer)
+ vectors = load_file("steering_vectors.safetensors")
+
+ # Load the config; the layer list is stored under config["config"]["target_layers"]
+ with open("repe_config.json") as f:
+     config = json.load(f)
+
+ # `model` is assumed to be an already-loaded Mistral-7B-Instruct-v0.3 causal LM
+
+ def steering_hook(layer_idx, alpha=2.0):
+     vector = vectors[f"layer_{layer_idx}"]
+
+     def hook(module, inputs, output):
+         # Add the steering vector to the hidden states (first element of the layer output)
+         hidden_states = output[0]
+         steered = hidden_states + alpha * vector.to(
+             device=hidden_states.device, dtype=hidden_states.dtype
+         )
+         return (steered,) + output[1:]
+
+     return hook
+
+ # Register one hook per steered layer; keep the handles so the hooks can be removed later
+ handles = [
+     model.model.layers[layer_idx].register_forward_hook(steering_hook(layer_idx, alpha=2.0))
+     for layer_idx in config["config"]["target_layers"]
+ ]
+ ```
+
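A forward hook stays active until its handle is removed, so it helps to see the full attach and detach cycle on something small. The sketch below uses a toy module in place of a real decoder layer. `ToyLayer`, `make_steering_hook`, and the 4-dimensional vector are illustrative assumptions, not part of this repository.

```python
import torch
from torch import nn


class ToyLayer(nn.Module):
    """Stand-in for a decoder layer: returns a tuple whose first item is the hidden states."""

    def forward(self, hidden_states):
        return (hidden_states,)


def make_steering_hook(vector, alpha):
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module's output
        steered = output[0] + alpha * vector.to(output[0].dtype)
        return (steered,) + output[1:]

    return hook


layer = ToyLayer()
vector = torch.tensor([1.0, 0.0, -1.0, 0.5])  # toy 4-dim steering vector
hidden = torch.zeros(2, 3, 4)                 # (batch, seq_len, hidden_dim)

handle = layer.register_forward_hook(make_steering_hook(vector, alpha=2.0))
steered = layer(hidden)[0]    # the vector is broadcast over batch and sequence
handle.remove()               # detach the hook to restore the unsteered layer
unsteered = layer(hidden)[0]

print(steered[0, 0])
print(torch.equal(unsteered, hidden))
```

Keeping the handles returned by `register_forward_hook` is what makes it possible to switch steering off again, for example when comparing steered and unsteered generations.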
+ ## Dynamic Alpha with Sidecar
+
+ For optimal defense, use the sidecar classifier to adjust alpha dynamically for each input:
+
+ ```python
+ # `sidecar` (the Layer 3 classifier) and `prompt` are assumed to be defined already.
+ # The sidecar classifies each input as SAFE, WARN, or ATTACK.
+ alpha_map = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}
+ classification = sidecar.classify(prompt)
+ alpha = alpha_map[classification]
+ ```
+
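A hook that receives alpha as an argument at registration time keeps that value until it is re-registered. One way to change alpha per request is to have the hook read it from a small mutable object at call time. The sketch below shows that pattern on list-based toy activations. `AlphaController` and `make_hook` are hypothetical names, and falling back to the strictest setting for unknown labels is a design assumption.

```python
from dataclasses import dataclass

# Mapping taken from the README snippet; the labels are the sidecar's three classes
ALPHA_MAP = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}


@dataclass
class AlphaController:
    """Holds the current steering strength so hooks can read it at call time."""

    alpha: float = ALPHA_MAP["SAFE"]

    def update(self, classification: str) -> float:
        # Unknown labels fall back to the strictest setting (an assumption)
        self.alpha = ALPHA_MAP.get(classification, ALPHA_MAP["ATTACK"])
        return self.alpha


def make_hook(controller, vector):
    """Hook factory for toy list activations; reads controller.alpha on every call."""

    def hook(hidden):
        return [h + controller.alpha * v for h, v in zip(hidden, vector)]

    return hook


controller = AlphaController()
hook = make_hook(controller, vector=[1.0, -1.0])

controller.update("WARN")
print(hook([0.0, 0.0]))  # [1.5, -1.5]
controller.update("ATTACK")
print(hook([0.0, 0.0]))  # [2.5, -2.5]
```

The same idea carries over to a real forward hook: the closure holds a reference to the controller instead of a fixed float.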
+ ## TRYLOCK Architecture
+
+ These vectors are Layer 2 of the 3-layer TRYLOCK defense:
+
+ 1. **Layer 1 (KNOWLEDGE)**: [trylock-mistral-7b-dpo](https://huggingface.co/scthornton/trylock-mistral-7b-dpo)
+ 2. **Layer 2 (INSTINCT)**: RepE steering vectors (this model)
+ 3. **Layer 3 (OVERSIGHT)**: [trylock-sidecar-classifier](https://huggingface.co/scthornton/trylock-sidecar-classifier)
+
+ ## Limitations and Risks
+
+ **Limitations:**
+ - Vectors are specific to Mistral-7B-Instruct-v0.3 architecture
+ - High alpha values (>2.5) may degrade output quality
+ - Steering affects all outputs, not just harmful ones
+ - Requires careful alpha tuning for optimal balance
+
+ **Risks:**
+ - Over-steering can cause repetitive or incoherent outputs
+ - May interfere with legitimate use cases at high alpha
+ - Novel attacks may find directions orthogonal to steering vectors
+
+ **Recommendations:**
+ - Use dynamic alpha based on sidecar classification
+ - Start with alpha=1.0 and adjust based on use case
+ - Combine with DPO training for robust defense
+
+ ## Citation
+
+ ```bibtex
+ @misc{trylock2024,
+   title={TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering},
+   author={Thornton, Scott},
+   year={2024},
+   url={https://huggingface.co/scthornton/trylock-repe-vectors}
+ }
+ ```
+
+ ## License
+
+ **CC BY-NC-SA 4.0** (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)
+
+ You are free to:
+ - **Share** — copy and redistribute the material in any medium or format
+ - **Adapt** — remix, transform, and build upon the material
+
+ Under the following terms:
+ - **Attribution** — You must give appropriate credit to **Scott Thornton**, provide a link to the license, and indicate if changes were made
+ - **NonCommercial** — You may not use the material for commercial purposes without explicit written permission
+ - **ShareAlike** — If you remix, transform, or build upon the material, you must distribute your contributions under the same license
+
+ For commercial licensing inquiries, contact: scott@perfecxion.ai
repe_config.json ADDED
@@ -0,0 +1,94 @@
+ {
+   "config": {
+     "target_layers": [
+       12,
+       14,
+       16,
+       18,
+       20,
+       22,
+       24,
+       26
+     ],
+     "method": "difference",
+     "probe_C": 1.0,
+     "alpha_research": 0.0,
+     "alpha_balanced": 1.0,
+     "alpha_elevated": 1.5,
+     "alpha_lockdown": 2.5
+   },
+   "vectors": [
+     {
+       "layer_idx": 12,
+       "method": "difference",
+       "train_accuracy": 0.6800000071525574,
+       "val_accuracy": 0.6800000071525574,
+       "val_auc": null,
+       "mean_norm": 3.609375,
+       "std_norm": 0.1
+     },
+     {
+       "layer_idx": 14,
+       "method": "difference",
+       "train_accuracy": 0.6833333373069763,
+       "val_accuracy": 0.6833333373069763,
+       "val_auc": null,
+       "mean_norm": 4.53125,
+       "std_norm": 0.1
+     },
+     {
+       "layer_idx": 16,
+       "method": "difference",
+       "train_accuracy": 0.6833333373069763,
+       "val_accuracy": 0.6833333373069763,
+       "val_auc": null,
+       "mean_norm": 6.78125,
+       "std_norm": 0.1
+     },
+     {
+       "layer_idx": 18,
+       "method": "difference",
+       "train_accuracy": 0.6833333373069763,
+       "val_accuracy": 0.6833333373069763,
+       "val_auc": null,
+       "mean_norm": 9.5,
+       "std_norm": 0.1
+     },
+     {
+       "layer_idx": 20,
+       "method": "difference",
+       "train_accuracy": 0.6833333373069763,
+       "val_accuracy": 0.6833333373069763,
+       "val_auc": null,
+       "mean_norm": 12.0625,
+       "std_norm": 0.1
+     },
+     {
+       "layer_idx": 22,
+       "method": "difference",
+       "train_accuracy": 0.6800000071525574,
+       "val_accuracy": 0.6800000071525574,
+       "val_auc": null,
+       "mean_norm": 13.6875,
+       "std_norm": 0.1
+     },
+     {
+       "layer_idx": 24,
+       "method": "difference",
+       "train_accuracy": 0.6833333373069763,
+       "val_accuracy": 0.6833333373069763,
+       "val_auc": null,
+       "mean_norm": 15.375,
+       "std_norm": 0.1
+     },
+     {
+       "layer_idx": 26,
+       "method": "difference",
+       "train_accuracy": 0.6833333373069763,
+       "val_accuracy": 0.6833333373069763,
+       "val_auc": null,
+       "mean_norm": 17.875,
+       "std_norm": 0.1
+     }
+   ]
+ }
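In this file the layer list sits under `config.target_layers`, and there is no top-level `steering_layers` key. The sketch below reads the layers from that location and checks them against the per-layer entries. To stay short it embeds a trimmed copy of the file (two of the eight layers, several fields omitted). `load_layers` is a hypothetical helper.

```python
import json

# Trimmed copy of repe_config.json for illustration: two layers, some fields omitted
CONFIG_TEXT = """
{
  "config": {"target_layers": [12, 14], "method": "difference", "alpha_balanced": 1.0},
  "vectors": [
    {"layer_idx": 12, "method": "difference", "mean_norm": 3.609375},
    {"layer_idx": 14, "method": "difference", "mean_norm": 4.53125}
  ]
}
"""


def load_layers(config: dict) -> list:
    """Return the steered layers, checking them against the per-layer entries."""
    layers = config["config"]["target_layers"]
    described = [entry["layer_idx"] for entry in config["vectors"]]
    if layers != described:
        raise ValueError(f"target_layers {layers} != vector entries {described}")
    return layers


config = json.loads(CONFIG_TEXT)
layers = load_layers(config)
norms = {entry["layer_idx"]: entry["mean_norm"] for entry in config["vectors"]}

print(layers)  # [12, 14]
print(norms)   # {12: 3.609375, 14: 4.53125}
```

Failing fast on a mismatch avoids registering hooks for layers that have no stored vector.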
steering_vectors.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:808c56d115e4e8d131d9d599aa079b2aef46010c890c627c8d90b3d27c834cc3
+ size 66120
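The three lines above are a Git LFS pointer, not the tensor data itself. They record the SHA-256 digest and byte size of the real file. The sketch below parses the pointer and shows how a downloaded file could be checked against it. The helper names and the local file path in the commented usage are assumptions.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """True if `data` has the size and SHA-256 digest recorded in the pointer."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:808c56d115e4e8d131d9d599aa079b2aef46010c890c627c8d90b3d27c834cc3
size 66120
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])  # sha256 66120

# Usage with a downloaded file (the path is an assumption):
# with open("steering_vectors.safetensors", "rb") as f:
#     print(matches_pointer(f.read(), pointer))
```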