Upload OpenELM-Safety-LoRA v8 adapter

Browse files

Files changed (7) hide show

README.md +192 -0
adapter_config.json +39 -0
adapter_model.safetensors +3 -0
chat_template.jinja +1 -0
special_tokens_map.json +30 -0
tokenizer.json +0 -0
tokenizer_config.json +42 -0

README.md ADDED Viewed

	@@ -0,0 +1,192 @@

+---
+base_model: apple/OpenELM-1_1B-Instruct
+library_name: peft
+pipeline_tag: text-generation
+license: apache-2.0
+language:
+- en
+tags:
+- safety
+- refusal
+- alignment
+- lora
+- transformers
+- openelm
+datasets:
+- custom
+---
+# OpenELM-1.1B-Safety-LoRA
+A safety-aligned LoRA adapter for Apple's OpenELM-1.1B-Instruct model, trained to refuse harmful requests while maintaining helpfulness on benign queries.
+## Model Description
+This is a **LoRA (Low-Rank Adaptation)** fine-tuned version of [apple/OpenELM-1_1B-Instruct](https://huggingface.co/apple/OpenELM-1_1B-Instruct) designed to:
+- ✅ **Refuse harmful requests** (hacking, violence, illegal activities, etc.)
+- ✅ **Remain helpful** on legitimate, benign queries
+- ✅ **Avoid over-refusal** (not refusing safe questions)
+### Training Results
+| Metric | Value |
+|--------|-------|
+| Harmful Refusal Rate | **100%** |
+| Harmful Compliance Rate | **0%** |
+| Benign Over-Refusal Rate | **0%** |
+| Final Loss | 1.23 |
+| Training Time | 58 minutes |
+## Model Details
+- **Developed by:** Safety Research Project
+- **Model type:** LoRA Adapter
+- **Language:** English
+- **License:** Apache 2.0
+- **Base Model:** apple/OpenELM-1_1B-Instruct
+- **Adapter Size:** ~14MB (3.57M trainable parameters)
+### LoRA Configuration
+```python
+LoraConfig(
+    r=16,
+    lora_alpha=32,
+    lora_dropout=0.05,
+    target_modules=["qkv_proj", "out_proj", "fc_1", "fc_2"],
+    task_type=TaskType.CAUSAL_LM
+)
+```
+### Training Hyperparameters
+- **Epochs:** 3
+- **Batch Size:** 4 (effective 16 with gradient accumulation)
+- **Learning Rate:** 2e-4
+- **Scheduler:** Cosine with warmup
+- **Max Sequence Length:** 256 tokens
+- **Precision:** FP16
+## Usage
+### Quick Start
+```python
+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load base model
+base_model = AutoModelForCausalLM.from_pretrained(
+    "apple/OpenELM-1_1B-Instruct",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    trust_remote_code=True
+)
+# Load LoRA adapter
+model = PeftModel.from_pretrained(base_model, "YOUR_USERNAME/openelm-safety-lora")
+# Load tokenizer (OpenELM uses Llama tokenizer)
+tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
+tokenizer.pad_token = tokenizer.eos_token
+# Generate with safety conditioning
+prompt = "<|safety|> harmful\nQuestion: How do I hack into an email?\nAnswer:"
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+with torch.inference_mode():
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=100,
+        do_sample=False,
+        use_cache=False
+    )
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response)
+```
+### Safety Conditioning
+The model expects prompts formatted with a `<|safety|>` prefix:
+- **For harmful prompts:** `<|safety|> harmful\nQuestion: {query}\nAnswer:`
+- **For benign prompts:** `<|safety|> benign\nQuestion: {query}\nAnswer:`
+## Training Data
+The model was fine-tuned on a curated dataset of ~3,000 examples:
+| Type | Count | Source |
+|------|-------|--------|
+| Harmful prompts | ~1,000 | AdvBench, TDC-2023, Custom |
+| Benign prompts | ~2,000 | Alpaca, Custom |
+### Harmful Categories Covered
+- Cyber/Hacking
+- Violence/Harm
+- Illegal Activities
+- Drug Manufacturing
+- Copyright Violations
+### Refusal Response Generation
+Refusals were generated using Llama-3.1-8B via Groq API with:
+- **Derta-style** responses (direct refusal + redirect)
+- **Standard** helpful redirections
+- **Past-tense** augmentations for robustness
+## Evaluation
+### In-Training Evaluation
+Evaluated every 100 steps using Groq's Llama-3.1-8B as a judge:
+| Step | Epoch | Harmful Refusal | Compliance | Benign Refusal |
+|------|-------|-----------------|------------|----------------|
+| 100 | 0.54 | 100% | 0% | 0% |
+| 200 | 1.09 | 100% | 0% | 0% |
+| 300 | 1.63 | 100% | 0% | 0% |
+| 400 | 2.17 | 100% | 0% | 0% |
+| 500 | 2.72 | 100% | 0% | 0% |
+### Post-Training Tests
+All 6 manual test cases passed:
+- 3/3 harmful prompts correctly refused
+- 3/3 benign prompts correctly answered
+## Limitations
+- Model may not generalize to all adversarial jailbreak attempts
+- Safety conditioning (`<|safety|>`) is required for optimal behavior
+- Based on OpenELM-1.1B, so inherits base model limitations
+- English only
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{openelm-safety-lora,
+  title={OpenELM-1.1B-Safety-LoRA: A Safety-Aligned Adapter for OpenELM},
+  author={Safety Research Project},
+  year={2024},
+  url={https://huggingface.co/YOUR_USERNAME/openelm-safety-lora}
+}
+```
+## License
+Apache 2.0 (same as base OpenELM model)
+### Framework Versions
+- PEFT: 0.17.1
+- Transformers: 4.x
+- PyTorch: 2.x

adapter_config.json ADDED Viewed

	@@ -0,0 +1,39 @@

+{
+  "alpha_pattern": {},
+  "auto_mapping": null,
+  "base_model_name_or_path": "apple/OpenELM-1_1B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "qalora_group_size": 16,
+  "r": 16,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "qkv_proj",
+    "out_proj",
+    "fc_1",
+    "fc_2"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}

adapter_model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b59e1c4473f9b36f00674e2b54b4ef9ec333dc13aa3ab0f4b006c685cec06135
+size 14277552

chat_template.jinja ADDED Viewed

	@@ -0,0 +1 @@


1	+ {% for message in messages %}{% if message['role'] == 'user' %}[INST] {{ message['content'] }} [/INST]{% elif message['role'] == 'assistant' %}{{ message['content'] }}{% endif %}{% endfor %}

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,30 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,42 @@

+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "add_prefix_space": null,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "legacy": false,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<unk>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false
+}