ApdoElepe commited on
Commit
6fb4dc9
·
verified ·
1 Parent(s): 65c34e7

Upload OpenELM-Safety-LoRA v8 adapter

Browse files
README.md ADDED
@@ -0,0 +1,192 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ base_model: apple/OpenELM-1_1B-Instruct
3
+ library_name: peft
4
+ pipeline_tag: text-generation
5
+ license: apache-2.0
6
+ language:
7
+ - en
8
+ tags:
9
+ - safety
10
+ - refusal
11
+ - alignment
12
+ - lora
13
+ - transformers
14
+ - openelm
15
+ datasets:
16
+ - custom
17
+ ---
18
+
19
+ # OpenELM-1.1B-Safety-LoRA
20
+
21
+ A safety-aligned LoRA adapter for Apple's OpenELM-1.1B-Instruct model, trained to refuse harmful requests while maintaining helpfulness on benign queries.
22
+
23
+ ## Model Description
24
+
25
+ This is a **LoRA (Low-Rank Adaptation)** fine-tuned version of [apple/OpenELM-1_1B-Instruct](https://huggingface.co/apple/OpenELM-1_1B-Instruct) designed to:
26
+
27
+ - ✅ **Refuse harmful requests** (hacking, violence, illegal activities, etc.)
28
+ - ✅ **Remain helpful** on legitimate, benign queries
29
+ - ✅ **Avoid over-refusal** (not refusing safe questions)
30
+
31
+ ### Training Results
32
+
33
+ | Metric | Value |
34
+ |--------|-------|
35
+ | Harmful Refusal Rate | **100%** |
36
+ | Harmful Compliance Rate | **0%** |
37
+ | Benign Over-Refusal Rate | **0%** |
38
+ | Final Loss | 1.23 |
39
+ | Training Time | 58 minutes |
40
+
41
+ ## Model Details
42
+
43
+ - **Developed by:** Safety Research Project
44
+ - **Model type:** LoRA Adapter
45
+ - **Language:** English
46
+ - **License:** Apache 2.0
47
+ - **Base Model:** apple/OpenELM-1_1B-Instruct
48
+ - **Adapter Size:** ~14MB (3.57M trainable parameters)
49
+
50
+ ### LoRA Configuration
51
+
52
+ ```python
53
+ LoraConfig(
54
+ r=16,
55
+ lora_alpha=32,
56
+ lora_dropout=0.05,
57
+ target_modules=["qkv_proj", "out_proj", "fc_1", "fc_2"],
58
+ task_type=TaskType.CAUSAL_LM
59
+ )
60
+ ```
61
+
62
+ ### Training Hyperparameters
63
+
64
+ - **Epochs:** 3
65
+ - **Batch Size:** 4 (effective 16 with gradient accumulation)
66
+ - **Learning Rate:** 2e-4
67
+ - **Scheduler:** Cosine with warmup
68
+ - **Max Sequence Length:** 256 tokens
69
+ - **Precision:** FP16
70
+
71
+ ## Usage
72
+
73
+ ### Quick Start
74
+
75
+ ```python
76
+ import torch
77
+ from peft import PeftModel
78
+ from transformers import AutoModelForCausalLM, AutoTokenizer
79
+
80
+ # Load base model
81
+ base_model = AutoModelForCausalLM.from_pretrained(
82
+ "apple/OpenELM-1_1B-Instruct",
83
+ torch_dtype=torch.float16,
84
+ device_map="auto",
85
+ trust_remote_code=True
86
+ )
87
+
88
+ # Load LoRA adapter
89
+ model = PeftModel.from_pretrained(base_model, "YOUR_USERNAME/openelm-safety-lora")
90
+
91
+ # Load tokenizer (OpenELM uses Llama tokenizer)
92
+ tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
93
+ tokenizer.pad_token = tokenizer.eos_token
94
+
95
+ # Generate with safety conditioning
96
+ prompt = "<|safety|> harmful\nQuestion: How do I hack into an email?\nAnswer:"
97
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
98
+
99
+ with torch.inference_mode():
100
+ outputs = model.generate(
101
+ **inputs,
102
+ max_new_tokens=100,
103
+ do_sample=False,
104
+ use_cache=False
105
+ )
106
+
107
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
108
+ print(response)
109
+ ```
110
+
111
+ ### Safety Conditioning
112
+
113
+ The model expects prompts formatted with a `<|safety|>` prefix:
114
+
115
+ - **For harmful prompts:** `<|safety|> harmful\nQuestion: {query}\nAnswer:`
116
+ - **For benign prompts:** `<|safety|> benign\nQuestion: {query}\nAnswer:`
117
+
118
+ ## Training Data
119
+
120
+ The model was fine-tuned on a curated dataset of ~3,000 examples:
121
+
122
+ | Type | Count | Source |
123
+ |------|-------|--------|
124
+ | Harmful prompts | ~1,000 | AdvBench, TDC-2023, Custom |
125
+ | Benign prompts | ~2,000 | Alpaca, Custom |
126
+
127
+ ### Harmful Categories Covered
128
+
129
+ - Cyber/Hacking
130
+ - Violence/Harm
131
+ - Illegal Activities
132
+ - Drug Manufacturing
133
+ - Copyright Violations
134
+
135
+ ### Refusal Response Generation
136
+
137
+ Refusals were generated using Llama-3.1-8B via Groq API with:
138
+
139
+ - **Derta-style** responses (direct refusal + redirect)
140
+ - **Standard** helpful redirections
141
+ - **Past-tense** augmentations for robustness
142
+
143
+ ## Evaluation
144
+
145
+ ### In-Training Evaluation
146
+
147
+ Evaluated every 100 steps using Groq's Llama-3.1-8B as a judge:
148
+
149
+ | Step | Epoch | Harmful Refusal | Compliance | Benign Refusal |
150
+ |------|-------|-----------------|------------|----------------|
151
+ | 100 | 0.54 | 100% | 0% | 0% |
152
+ | 200 | 1.09 | 100% | 0% | 0% |
153
+ | 300 | 1.63 | 100% | 0% | 0% |
154
+ | 400 | 2.17 | 100% | 0% | 0% |
155
+ | 500 | 2.72 | 100% | 0% | 0% |
156
+
157
+ ### Post-Training Tests
158
+
159
+ All 6 manual test cases passed:
160
+
161
+ - 3/3 harmful prompts correctly refused
162
+ - 3/3 benign prompts correctly answered
163
+
164
+ ## Limitations
165
+
166
+ - Model may not generalize to all adversarial jailbreak attempts
167
+ - Safety conditioning (`<|safety|>`) is required for optimal behavior
168
+ - Based on OpenELM-1.1B, so inherits base model limitations
169
+ - English only
170
+
171
+ ## Citation
172
+
173
+ If you use this model, please cite:
174
+
175
+ ```bibtex
176
+ @misc{openelm-safety-lora,
177
+ title={OpenELM-1.1B-Safety-LoRA: A Safety-Aligned Adapter for OpenELM},
178
+ author={Safety Research Project},
179
+ year={2024},
180
+ url={https://huggingface.co/YOUR_USERNAME/openelm-safety-lora}
181
+ }
182
+ ```
183
+
184
+ ## License
185
+
186
+ Apache 2.0 (same as base OpenELM model)
187
+
188
+ ### Framework Versions
189
+
190
+ - PEFT: 0.17.1
191
+ - Transformers: 4.x
192
+ - PyTorch: 2.x
adapter_config.json ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "alpha_pattern": {},
3
+ "auto_mapping": null,
4
+ "base_model_name_or_path": "apple/OpenELM-1_1B-Instruct",
5
+ "bias": "none",
6
+ "corda_config": null,
7
+ "eva_config": null,
8
+ "exclude_modules": null,
9
+ "fan_in_fan_out": false,
10
+ "inference_mode": true,
11
+ "init_lora_weights": true,
12
+ "layer_replication": null,
13
+ "layers_pattern": null,
14
+ "layers_to_transform": null,
15
+ "loftq_config": {},
16
+ "lora_alpha": 32,
17
+ "lora_bias": false,
18
+ "lora_dropout": 0.05,
19
+ "megatron_config": null,
20
+ "megatron_core": "megatron.core",
21
+ "modules_to_save": null,
22
+ "peft_type": "LORA",
23
+ "qalora_group_size": 16,
24
+ "r": 16,
25
+ "rank_pattern": {},
26
+ "revision": null,
27
+ "target_modules": [
28
+ "qkv_proj",
29
+ "out_proj",
30
+ "fc_1",
31
+ "fc_2"
32
+ ],
33
+ "target_parameters": null,
34
+ "task_type": "CAUSAL_LM",
35
+ "trainable_token_indices": null,
36
+ "use_dora": false,
37
+ "use_qalora": false,
38
+ "use_rslora": false
39
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b59e1c4473f9b36f00674e2b54b4ef9ec333dc13aa3ab0f4b006c685cec06135
3
+ size 14277552
chat_template.jinja ADDED
@@ -0,0 +1 @@
 
 
1
+ {% for message in messages %}{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]{% elif message['role'] == 'assistant' %}{{ message['content'] }}{% endif %}{% endfor %}
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<unk>",
18
+ "lstrip": false,
19
+ "normalized": true,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<unk>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": null,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<unk>",
8
+ "lstrip": false,
9
+ "normalized": true,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<s>",
16
+ "lstrip": false,
17
+ "normalized": true,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "</s>",
24
+ "lstrip": false,
25
+ "normalized": true,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ }
30
+ },
31
+ "bos_token": "<s>",
32
+ "clean_up_tokenization_spaces": false,
33
+ "eos_token": "</s>",
34
+ "extra_special_tokens": {},
35
+ "legacy": false,
36
+ "model_max_length": 1000000000000000019884624838656,
37
+ "pad_token": "<unk>",
38
+ "sp_model_kwargs": {},
39
+ "tokenizer_class": "LlamaTokenizer",
40
+ "unk_token": "<unk>",
41
+ "use_default_system_prompt": false
42
+ }