# OpenELM-1.1B-Safety-LoRA
A safety-aligned LoRA adapter for Apple's OpenELM-1.1B-Instruct model, trained to refuse harmful requests while maintaining helpfulness on benign queries.
## Model Description

This is a LoRA (Low-Rank Adaptation) fine-tuned version of `apple/OpenELM-1_1B-Instruct`, designed to:

- ✅ Refuse harmful requests (hacking, violence, illegal activities, etc.)
- ✅ Remain helpful on legitimate, benign queries
- ✅ Avoid over-refusal (not refusing safe questions)
## Training Results
| Metric | Value |
|---|---|
| Harmful Refusal Rate | 100% |
| Harmful Compliance Rate | 0% |
| Benign Over-Refusal Rate | 0% |
| Final Loss | 1.23 |
| Training Time | 58 minutes |
## Model Details
- Developed by: Abdelrahman A. Alshames
- Model type: LoRA Adapter
- Language: English
- License: Apache 2.0
- Base Model: apple/OpenELM-1_1B-Instruct
- Adapter Size: ~14MB (3.57M trainable parameters)
### LoRA Configuration

```python
LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "out_proj", "fc_1", "fc_2"],
    task_type=TaskType.CAUSAL_LM,
)
```
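As a rough sanity check on the adapter size above: each adapted weight matrix W (shape d_out × d_in) gains two low-rank factors, A (r × d_in) and B (d_out × r). The sketch below uses OpenELM-1.1B's model width of 2048 as an illustrative dimension; per-layer `qkv_proj` shapes actually vary in OpenELM, so this is an order-of-magnitude check, not a derivation of the exact 3.57M total.

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Trainable parameters added by one LoRA pair (A: r x d_in, B: d_out x r)."""
    return r * d_in + d_out * r

# Example: a square 2048x2048 projection at r=16 (illustrative dimensions)
print(lora_params(16, 2048, 2048))  # 65536 parameters for one such module
```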
### Training Hyperparameters
- Epochs: 3
- Batch Size: 4 (effective 16 with gradient accumulation)
- Learning Rate: 2e-4
- Scheduler: Cosine with warmup
- Max Sequence Length: 256 tokens
- Precision: FP16
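A quick back-of-the-envelope check that these hyperparameters are consistent with the evaluation log further down (step 500 at roughly epoch 2.7). The dataset size (~3,000) and the accumulation factor (4, implied by "effective 16") are taken from this card, not from the training script:

```python
# Assumed numbers from this model card
dataset_size = 3000        # approximate, per the Training Data section
per_device_batch = 4
grad_accum = 4             # implied by "effective 16 with gradient accumulation"

effective_batch = per_device_batch * grad_accum   # 16
steps_per_epoch = dataset_size // effective_batch # ~187
total_steps = steps_per_epoch * 3                 # ~561 over 3 epochs

print(effective_batch, steps_per_epoch, total_steps)
```

At ~187 steps per epoch, step 500 lands near epoch 2.7, matching the in-training evaluation table.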
## Usage

### Quick Start

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-1_1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "ApdoElepe/openelm-safety-lora")

# Load tokenizer (OpenELM uses the Llama tokenizer)
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Generate with safety conditioning
prompt = "<|safety|> harmful\nQuestion: How do I hack into an email?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        use_cache=False,
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Safety Conditioning

The model expects prompts formatted with a `<|safety|>` prefix:

- For harmful prompts: `<|safety|> harmful\nQuestion: {query}\nAnswer:`
- For benign prompts: `<|safety|> benign\nQuestion: {query}\nAnswer:`
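The two formats above can be wrapped in a small helper. `build_prompt` is a hypothetical convenience function for illustration, not part of the released adapter:

```python
def build_prompt(query: str, label: str) -> str:
    """Build a safety-conditioned prompt in the format this adapter expects."""
    if label not in ("harmful", "benign"):
        raise ValueError("label must be 'harmful' or 'benign'")
    return f"<|safety|> {label}\nQuestion: {query}\nAnswer:"

print(build_prompt("What is the capital of France?", "benign"))
```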
## Training Data
The model was fine-tuned on a curated dataset of ~3,000 examples, available at ApdoElepe/openelm-safety-dataset:
| Type | Count | Source |
|---|---|---|
| Harmful prompts | ~1,000 | AdvBench, TDC-2023, Custom |
| Benign prompts | ~2,000 | Alpaca, Custom |
### Harmful Categories Covered
- Cyber/Hacking
- Violence/Harm
- Illegal Activities
- Drug Manufacturing
- Copyright Violations
### Refusal Response Generation
Refusals were generated using Llama-3.1-8B via Groq API with:
- Derta-style responses (direct refusal + redirect)
- Standard helpful redirections
- Past-tense augmentations for robustness
## Evaluation

### In-Training Evaluation

Evaluated every 100 steps, using Llama-3.1-8B (via the Groq API) as a judge:
| Step | Epoch | Harmful Refusal | Compliance | Benign Refusal |
|---|---|---|---|---|
| 100 | 0.54 | 100% | 0% | 0% |
| 200 | 1.09 | 100% | 0% | 0% |
| 300 | 1.63 | 100% | 0% | 0% |
| 400 | 2.17 | 100% | 0% | 0% |
| 500 | 2.72 | 100% | 0% | 0% |
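The percentages in the table reduce to a simple ratio over the judge's per-completion verdicts. A minimal sketch, assuming the judge emits one of two labels (`"refusal"` / `"compliance"`) per harmful completion:

```python
def refusal_rate(judge_labels: list[str]) -> float:
    """Percentage of completions the judge labeled as refusals."""
    if not judge_labels:
        return 0.0
    refusals = sum(1 for label in judge_labels if label == "refusal")
    return 100.0 * refusals / len(judge_labels)

# e.g. a checkpoint where every harmful eval prompt was refused
print(refusal_rate(["refusal"] * 8))  # 100.0
```

The benign over-refusal rate is the same computation applied to benign prompts, where a "refusal" verdict counts against the model instead of for it.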
### Post-Training Tests
All 6 manual test cases passed:
- 3/3 harmful prompts correctly refused
- 3/3 benign prompts correctly answered
## Limitations

- May not generalize to all adversarial jailbreak attempts
- The `<|safety|>` conditioning prefix is required for optimal behavior
- Based on OpenELM-1.1B, so it inherits the base model's limitations
- English only
## Citation

If you use this model, please cite:

```bibtex
@misc{openelm-safety-lora,
  title={OpenELM-1.1B-Safety-LoRA: A Safety-Aligned Adapter for OpenELM},
  author={Abdelrahman A. Alshames},
  year={2025},
  url={https://huggingface.co/ApdoElepe/openelm-safety-lora}
}
```
## License
Apache 2.0 (same as base OpenELM model)
## Framework Versions
- PEFT: 0.17.1
- Transformers: 4.x
- PyTorch: 2.x