---
language:
- ko
- en
license: apache-2.0
tags:
- dpo
- rlhf
- alignment
- lora
- korean
- llm
pipeline_tag: text-generation
---

# EVAFRILL-Mo 3B — DPO Round 2

Second DPO alignment round using a conservative learning-rate schedule,
starting from the merged DPO Round 1 checkpoint. LoRA adapters are included.

## Training Stage

DPO alignment, Round 2, based on the merged DPO R1 checkpoint.

## Key Details

- **Steps**: 2,000
- **Beta**: 0.05 (reduced from R1 for more conservative alignment)
- **Learning rate**: 1e-7 (conservative)
- **LoRA weights file**: `lora_weights.pt`

## Metrics

| Metric | Value |
|--------|-------|
| Beta (alignment strength) | 0.05 |
| Learning rate | 1e-7 |
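
Expressed as a trainer configuration, the settings above would look like the
following sketch. This assumes TRL's `DPOConfig`; the card does not say which
trainer was actually used, and `output_dir` is a hypothetical placeholder.

```python
# Hypothetical sketch only: the card does not confirm TRL was used.
from trl import DPOConfig

config = DPOConfig(
    output_dir="dpo-r2",  # hypothetical output path
    beta=0.05,            # reduced from R1 for more conservative alignment
    learning_rate=1e-7,   # conservative learning rate
    max_steps=2000,       # 2,000 DPO steps
)
```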

## Notes

The reduced beta and lower learning rate compared to R1 aim for more conservative
preference alignment while preserving the capabilities gained during SFT v2 and
DPO R1. The merged weights from this round are used as one of the two sources
for the [SLERP merge](../slerp/).
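
For orientation, SLERP interpolates along the great circle between two weight
vectors rather than the straight line. The card does not spell out the merge
procedure, so the per-tensor sketch below is an illustration: the function
name, the factor `t=0.5`, and the flatten-then-interpolate approach are
assumptions, not the project's actual merge code.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float = 0.5,
          eps: float = 1e-8) -> torch.Tensor:
    # Spherical linear interpolation between two weight tensors of equal shape.
    v0, v1 = w0.flatten().float(), w1.flatten().float()
    cos = torch.clamp(torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps),
                      -1.0, 1.0)
    omega = torch.acos(cos)  # angle between the two weight vectors
    if omega < eps:          # nearly parallel: fall back to linear interpolation
        return (1 - t) * w0 + t * w1
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
    return out.reshape(w0.shape).to(w0.dtype)

# Applied tensor-by-tensor across two state dicts, e.g.:
# merged = {k: slerp(sd_a[k], sd_b[k]) for k in sd_a}
```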

## Main Model Card

See the [main README](../../README.md) for full project details, architecture, and training history.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the checkpoint, then apply the R2 LoRA adapters on top of it.
tokenizer = AutoTokenizer.from_pretrained("path/to/dpo-r2")
base = AutoModelForCausalLM.from_pretrained("path/to/dpo-r2", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "path/to/dpo-r2")
```
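
A quick generation smoke test with the loaded model; the prompt below is
purely illustrative.

```python
# "Hello, please introduce yourself." — any Korean or English prompt works.
prompt = "안녕하세요, 자기소개를 해주세요."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```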