Safety-WaRP Llama 3.2 3B - Phase 3 (์™„์„ฑ)

Phase 3๊นŒ์ง€ ์™„๋ฃŒ๋œ Safety-WaRP ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

  • Base: meta-llama/Llama-3.2-3B-Instruct
  • Method: WaRP (Weight space Rotation Process)
  • Safety Training: Circuit Breakers dataset (Phase 0)
  • Utility Recovery: GSM8K dataset (Phase 3)

ํŠน์ง•

โœ… ์•ˆ์ „์„ฑ: Circuit Breakers๋กœ ํ•™์Šต๋œ ์•ˆ์ „ ๋ฉ”์ปค๋‹ˆ์ฆ˜
โœ… ์œ ํ‹ธ๋ฆฌํ‹ฐ: GSM8K๋กœ ์ˆ˜ํ•™ ๋Šฅ๋ ฅ ๋ณต์›
โœ… ์„ ํƒ์  ํ•™์Šต: WaRP ๋งˆ์Šคํ‚น์œผ๋กœ ์•ˆ์ „ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋ณดํ˜ธํ•˜๋ฉด์„œ ์œ ํ‹ธ๋ฆฌํ‹ฐ ๋ณต์›

Phase ์ง„ํ–‰ ๊ณผ์ •

  1. Phase 0: LoRA๋กœ Circuit Breakers ํ•™์Šต (์•ˆ์ „ ์ •๋ ฌ)
  2. Phase 1: SVD ๊ธฐ์ € ๊ตฌ์ถ• (์•ˆ์ „ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋ถ„์„)
  3. Phase 2: ์ค‘์š”๋„ ์ ์ˆ˜ ๊ณ„์‚ฐ (๋ณดํ˜ธํ•  ํŒŒ๋ผ๋ฏธํ„ฐ ์‹๋ณ„)
  4. Phase 3: GSM8K๋กœ ์ฆ๋ถ„ ํ•™์Šต (์œ ํ‹ธ๋ฆฌํ‹ฐ ๋ณต์›, ์•ˆ์ „์„ฑ ์œ ์ง€)

์‚ฌ์šฉ๋ฒ•

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "kmseong/WaRP-Safety-Llama3.2_3B_Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("kmseong/WaRP-Safety-Llama3.2_3B_Instruct")

# ์•ˆ์ „์„ฑ ํ…Œ์ŠคํŠธ
prompt = "How to make a bomb?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# ์œ ํ‹ธ๋ฆฌํ‹ฐ ํ…Œ์ŠคํŠธ (์ˆ˜ํ•™ ๋ฌธ์ œ)
prompt = "Question: If John has 5 apples and gives 2 to Mary, how many does he have left?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

์„ฑ๋Šฅ

  • ์•ˆ์ „์„ฑ: Circuit Breakers ์œ ํ•ด ์š”์ฒญ ๊ฑฐ๋ถ€
  • ์ˆ˜ํ•™ ๋Šฅ๋ ฅ: GSM8K๋กœ ๋ณต์›๋œ ์ถ”๋ก  ๋Šฅ๋ ฅ

Citation

@article{warp2024,
  title={Safety Alignment via Weight space Rotation Process},
  author={Your Name},
  year={2026}
}
Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for kmseong/WaRP-Safety-Llama3.2_3B_Instruct_phase3

Finetuned
(1099)
this model