---
language:
- ko
- en
license: apache-2.0
tags:
- dpo
- rlhf
- alignment
- lora
- korean
- llm
pipeline_tag: text-generation
---

# EVAFRILL-Mo 3B — DPO Round 2

Second DPO alignment round using a conservative learning-rate schedule.
Built on the merged DPO Round 1 checkpoint. LoRA adapters are included.

## Training Stage

DPO alignment — Round 2, trained on top of the merged DPO R1 checkpoint.

## Key Details

- **Steps**: 2,000
- **Beta**: 0.05 (reduced from R1 for more conservative alignment)
- **Learning rate**: 1e-7 (conservative)
- **LoRA weights file**: `lora_weights.pt`

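The beta in the details above is the coefficient in the standard DPO objective: it scales how strongly the policy is pushed away from the frozen reference model on each preference pair. A minimal, illustrative sketch of the per-example loss (the function name and inputs are my own, not this project's training code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.05):
    """Per-example DPO loss.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference model (ref_*).
    A smaller beta (0.05 here, vs. a common default of 0.1) yields a
    gentler push away from the reference, i.e. more conservative alignment.
    """
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)
```

When the policy matches the reference, the loss is ln 2; lowering beta flattens the loss surface, which is why R2 pairs the reduced beta with the very low 1e-7 learning rate.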
## Metrics

| Metric | Value |
|--------|-------|
| Training steps | 2,000 |
| Beta (alignment strength) | 0.05 |
| Learning rate | 1e-7 |

## Notes

The reduced beta and lower learning rate compared to R1 aim for more conservative preference
alignment while preserving the capabilities gained during SFT v2 and DPO R1.
The merged weights from this round serve as one of the two sources for the
[SLERP merge](../slerp/).

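For context on the merge step mentioned above: SLERP interpolates between two weight tensors along the great circle between their directions rather than along a straight line. A minimal numpy sketch of the operation on flattened weight vectors (illustrative only; the project's actual merge tooling and ratio are not documented here):

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    t=0 returns v0, t=1 returns v1; intermediate t values follow the
    arc between the two directions instead of the straight-line average.
    """
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    dot = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    if theta < eps:  # nearly parallel: fall back to plain lerp
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)
```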
## Main Model Card

See the [main README](../../README.md) for full project details, architecture, and training history.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the DPO R2 LoRA adapters on top.
base = AutoModelForCausalLM.from_pretrained("path/to/dpo-r2", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "path/to/dpo-r2")
tokenizer = AutoTokenizer.from_pretrained("path/to/dpo-r2")
```
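If only the raw tensors in `lora_weights.pt` are available rather than a PEFT adapter directory, the update can in principle be folded directly into the base weights. A minimal numpy sketch of the standard LoRA merge, assuming the usual `W' = W + (alpha/r) * B @ A` parameterization (the actual key names and scaling inside `lora_weights.pt` are not documented here):

```python
import numpy as np

def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA update into a base weight matrix.

    W: (out, in) base weight; A: (r, in) down-projection;
    B: (out, r) up-projection; alpha/r is the standard LoRA scaling.
    Returns the merged weight W + (alpha / r) * B @ A.
    """
    return W + (alpha / r) * (B @ A)
```

After merging every targeted layer this way, the model can be loaded without `peft` at inference time.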