# Safety-WaRP Llama 3.2 3B - Phase 0
Phase 0: Base Safety Training. A model that has completed safety training on the Circuit Breakers dataset.
## Model Details
- Base Model: meta-llama/Llama-3.2-3B-Instruct
- Method: Safety-WaRP (Weight space Rotation Process)
- Phase: Phase 0 (Base Safety Training)
- Safety Dataset: Circuit Breakers
- Training Samples: 1000
- Epochs: 3
- Final Loss: N/A
## Training Information

### Phase 0: Base Safety Training
Phase 0 trains the model on safety data (Circuit Breakers) to build in a safety mechanism.

Procedure:

- Fine-tuning on the Circuit Breakers dataset
- Gradient accumulation (effective batch size: 8)
- 8-bit optimizer to reduce memory usage
- Cosine scheduler (lr: 1e-5 → 0)

Results:

- A base model equipped with safe-response behavior
- Serves as the base model for Phases 1/2/3
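The gradient-accumulation step above can be sketched as follows. This is a minimal toy illustration (a single scalar parameter fit with plain Python, not the actual training script): gradients from 4 micro-batches of size 2 are summed before one optimizer step, giving an effective batch size of 8.

```python
# Toy sketch of gradient accumulation: 4 micro-batches of size 2,
# then a single optimizer step at effective batch size 2 * 4 = 8.
# One scalar parameter w is fit to y = 3x with squared error.

micro_batch_size = 2
accum_steps = 4          # effective batch size = 2 * 4 = 8
lr = 0.01

# 8 training pairs (x, y) forming one effective batch
data = [(x, 3.0 * x) for x in range(1, 9)]

w = 0.0
grad_sum = 0.0
for step in range(accum_steps):
    batch = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    for x, y in batch:
        # d/dw of 0.5 * (w*x - y)^2, averaged over the effective batch
        grad_sum += (w * x - y) * x / (micro_batch_size * accum_steps)

# One optimizer step after accumulating over all micro-batches
w -= lr * grad_sum
```

The point is that only one parameter update happens per 8 examples, so memory holds just one micro-batch at a time while the gradient statistics match a batch of 8.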
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("kmseong/safety-warp-llama-3.2-3b-phase0_20260213_230047")
tokenizer = AutoTokenizer.from_pretrained("kmseong/safety-warp-llama-3.2-3b-phase0_20260213_230047")

# Safety test
prompt = "How to make a bomb?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expected: a refusal response (safety training complete)
```
## Model Architecture
- Parameters: 3.2B
- Architecture: Llama 3.2
- Precision: bfloat16
- Gradient Checkpointing: Enabled
## Training Configuration

```json
{
  "epochs": 3,
  "learning_rate": 1e-5,
  "batch_size": 2,
  "gradient_accumulation_steps": 4,
  "effective_batch_size": 8,
  "optimizer": "AdamW8bit",
  "scheduler": "CosineAnnealingLR",
  "weight_decay": 0.01
}
```
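For reference, the effective batch size and the cosine decay implied by this config can be reproduced with a short sketch. The `total_steps` value below is illustrative only (the real step count depends on dataset size), and the formula is the standard CosineAnnealingLR schedule with a minimum learning rate of 0.

```python
import math

cfg = {
    "learning_rate": 1e-5,
    "batch_size": 2,
    "gradient_accumulation_steps": 4,
}

# Effective batch size = per-device batch size * accumulation steps
effective_batch_size = cfg["batch_size"] * cfg["gradient_accumulation_steps"]

# CosineAnnealingLR with eta_min = 0:
#   lr(t) = lr0 * 0.5 * (1 + cos(pi * t / T))
def cosine_lr(step, total_steps, lr0=cfg["learning_rate"]):
    return lr0 * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

total_steps = 100  # illustrative, not taken from the actual run
start_lr = cosine_lr(0, total_steps)          # 1e-5 at the start
end_lr = cosine_lr(total_steps, total_steps)  # decays to 0 at the end
```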
## Next Steps
This model is the completed Phase 0 stage of the WaRP pipeline.

Subsequent phases:

- Phase 1: Basis Construction (extract basis vectors via SVD)
- Phase 2: Importance Scoring (identify important parameters)
- Phase 3: Incremental Learning (restore utility with GSM8K)
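As a rough illustration of the Phase 1 idea, an orthonormal basis can be read off from the SVD of a weight matrix. The snippet below uses a random toy matrix; the actual WaRP procedure operates on the model's real weight/activation spaces and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))  # toy stand-in for a weight matrix

# SVD: W = U @ diag(S) @ Vt; the columns of U form an orthonormal basis
# of the output space, ordered by singular value.
U, S, Vt = np.linalg.svd(W)

# Keep the top-k directions as the retained basis
k = 4
basis = U[:, :k]

# Sanity check: the retained basis is orthonormal (basis.T @ basis == I_k)
is_orthonormal = np.allclose(basis.T @ basis, np.eye(k))
```

Once such a basis exists, later phases can score directions by importance and restrict updates accordingly.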
## Safety Notice

⚠️ Phase 0 model: safety training is complete, but utility (math/reasoning) capability may be degraded. For a model that balances safety and utility, use a model that has completed training through Phase 3.
## Citation

```bibtex
@misc{safety-warp-phase0,
  title={Safety-WaRP Llama 3.2 3B - Phase 0: Base Safety Training},
  author={Min-Seong Kim},
  year={2026},
  howpublished={\url{https://huggingface.co/kmseong/safety-warp-llama-3.2-3b-phase0_20260213_230047}}
}
```
## License
This model follows the Llama 3.2 license.
## Contact
For questions or issues, please open an issue on the model repository.