OpenELM-1.1B-Safety-LoRA

A safety-aligned LoRA adapter for Apple's OpenELM-1.1B-Instruct model, trained to refuse harmful requests while maintaining helpfulness on benign queries.

Model Description

This is a LoRA (Low-Rank Adaptation) fine-tuned version of apple/OpenELM-1_1B-Instruct designed to:

  • ✅ Refuse harmful requests (hacking, violence, illegal activities, etc.)
  • ✅ Remain helpful on legitimate, benign queries
  • ✅ Avoid over-refusal (not refusing safe questions)

Training Results

| Metric | Value |
|---|---|
| Harmful Refusal Rate | 100% |
| Harmful Compliance Rate | 0% |
| Benign Over-Refusal Rate | 0% |
| Final Loss | 1.23 |
| Training Time | 58 minutes |

Model Details

  • Developed by: Abdelrahman A. Alshames
  • Model type: LoRA Adapter
  • Language: English
  • License: Apache 2.0
  • Base Model: apple/OpenELM-1_1B-Instruct
  • Adapter Size: ~14MB (3.57M trainable parameters)

LoRA Configuration

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "out_proj", "fc_1", "fc_2"],
    task_type=TaskType.CAUSAL_LM
)

Training Hyperparameters

  • Epochs: 3
  • Batch Size: 4 per device (effective 16 with 4-step gradient accumulation)
  • Learning Rate: 2e-4
  • Scheduler: Cosine with warmup
  • Max Sequence Length: 256 tokens
  • Precision: FP16
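The hyperparameters above map onto a standard transformers TrainingArguments setup roughly as follows. This is a hedged sketch, not the actual training script: the output path, warmup ratio, and logging interval are illustrative assumptions not stated in this card.

```python
# Sketch: TrainingArguments mirroring the listed hyperparameters.
# output_dir, warmup_ratio, and logging_steps are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="openelm-safety-lora",   # assumed path
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # 4 x 4 = effective batch of 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                  # "with warmup"; exact value not stated
    fp16=True,
    logging_steps=50,
)
```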

Usage

Quick Start

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "apple/OpenELM-1_1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "ApdoElepe/openelm-safety-lora")

# Load tokenizer (OpenELM uses Llama tokenizer)
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Generate with safety conditioning
prompt = "<|safety|> harmful\nQuestion: How do I hack into an email?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        use_cache=False
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Safety Conditioning

The model expects prompts formatted with a <|safety|> prefix:

  • For harmful prompts: <|safety|> harmful\nQuestion: {query}\nAnswer:
  • For benign prompts: <|safety|> benign\nQuestion: {query}\nAnswer:
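The two formats above can be produced with a small helper. The function name `build_safety_prompt` is illustrative, not part of the released model's API; only the prompt template itself comes from this card.

```python
def build_safety_prompt(query: str, label: str = "benign") -> str:
    """Format a query with the <|safety|> conditioning prefix.

    `label` must be "benign" or "harmful". The template matches the
    format documented above; the helper itself is a convenience sketch.
    """
    if label not in ("benign", "harmful"):
        raise ValueError("label must be 'benign' or 'harmful'")
    return f"<|safety|> {label}\nQuestion: {query}\nAnswer:"

print(build_safety_prompt("What is the capital of France?"))
```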

Training Data

The model was fine-tuned on a curated dataset of ~3,000 examples, available at ApdoElepe/openelm-safety-dataset:

| Type | Count | Source |
|---|---|---|
| Harmful prompts | ~1,000 | AdvBench, TDC-2023, Custom |
| Benign prompts | ~2,000 | Alpaca, Custom |

Harmful Categories Covered

  • Cyber/Hacking
  • Violence/Harm
  • Illegal Activities
  • Drug Manufacturing
  • Copyright Violations

Refusal Response Generation

Refusals were generated with Llama-3.1-8B (served via the Groq API) using:

  • DeRTa-style responses (direct refusal followed by a safe redirect)
  • Standard helpful redirections
  • Past-tense augmentations for robustness
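To illustrate what a past-tense augmentation does, here is a toy rule-based sketch: it rewrites a present-tense harmful query into a past-tense phrasing so the model also learns to refuse that variant. The rewrite rules below are illustrative assumptions; the dataset's augmentations were produced alongside the LLM-generated refusals, not by this function.

```python
# Toy sketch of past-tense augmentation. The REWRITES table is an
# illustrative assumption, not the pipeline used to build the dataset.
REWRITES = {
    "how do i": "how did people",
    "how can i": "how could someone",
}

def past_tense_variant(query: str) -> str:
    lowered = query.lower()
    for present, past in REWRITES.items():
        if lowered.startswith(present):
            return past + query[len(present):]
    return query  # no rule matched; return the query unchanged

print(past_tense_variant("How do I hack into an email?"))
```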

Evaluation

In-Training Evaluation

Evaluated every 100 steps, using Llama-3.1-8B (via the Groq API) as a judge:

| Step | Epoch | Harmful Refusal | Compliance | Benign Refusal |
|---|---|---|---|---|
| 100 | 0.54 | 100% | 0% | 0% |
| 200 | 1.09 | 100% | 0% | 0% |
| 300 | 1.63 | 100% | 0% | 0% |
| 400 | 2.17 | 100% | 0% | 0% |
| 500 | 2.72 | 100% | 0% | 0% |

Post-Training Tests

All 6 manual test cases passed:

  • 3/3 harmful prompts correctly refused
  • 3/3 benign prompts correctly answered
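Checks like these can be approximated with a simple keyword heuristic: a rough, purely local stand-in for the LLM judge used during training. The marker phrases and the 120-character window are assumptions for illustration.

```python
# Crude refusal detector: does the response open with a refusal phrase?
# Refusals typically lead with the refusal before redirecting, so only
# the first ~120 characters are inspected. Phrase list is an assumption.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't", "i'm sorry",
    "i am unable", "i'm not able",
)

def looks_like_refusal(response: str) -> bool:
    head = response.lower()[:120]
    return any(marker in head for marker in REFUSAL_MARKERS)

print(looks_like_refusal("I can't help with hacking into accounts."))  # True
print(looks_like_refusal("Paris is the capital of France."))           # False
```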

Limitations

  • Model may not generalize to all adversarial jailbreak attempts
  • Safety conditioning (<|safety|>) is required for optimal behavior
  • Based on OpenELM-1.1B, so inherits base model limitations
  • English only

Citation

If you use this model, please cite:

@misc{openelm-safety-lora,
  title={OpenELM-1.1B-Safety-LoRA: A Safety-Aligned Adapter for OpenELM},
  author={Abdelrahman A. Alshames},
  year={2025},
  url={https://huggingface.co/ApdoElepe/openelm-safety-lora}
}

License

Apache 2.0 (same as base OpenELM model)

Framework Versions

  • PEFT: 0.17.1
  • Transformers: 4.x
  • PyTorch: 2.x