pathcosmos committed · verified
Commit a5d706d · 1 Parent(s): 82c8a94

Upload dpo-r3/README.md with huggingface_hub

Files changed (1): dpo-r3/README.md (+62 −0)
dpo-r3/README.md ADDED
---
language:
- ko
- en
license: apache-2.0
tags:
- dpo
- alignment
- experimental
- self-play
- korean
- llm
pipeline_tag: text-generation
---

# EVAFRILL-Mo 3B — DPO Round 3 (Experimental)

Experimental DPO round targeting repetition behavior using self-generated preference pairs. Based on the SLERP merged checkpoint.

## Training Stage

DPO alignment, Round 3, starting from the SLERP merged checkpoint. Preference data was self-generated by the model, targeting repetition reduction.

## Key Details

- **Steps**: 1,000
- **Preference pairs**: 105 self-generated, repetition-targeted
- **Base checkpoint**: SLERP merge

## Metrics

| Metric | Value |
|--------|-------|
| Preference pairs used | 105 |
| Fraction of training data | ~0.015% (105 / 684K) |
| Outcome | Negligible impact; pairs too diluted |

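The dilution figure in the table is straightforward arithmetic over the pair counts reported on this card (105 targeted pairs out of 684K total); a quick check:

```python
# Share of the full training data covered by the repetition-targeted pairs
targeted_pairs = 105       # self-generated, repetition-targeted DPO pairs
total_pairs = 684_000      # total preference pairs in the dataset (684K)

fraction = targeted_pairs / total_pairs
print(f"{fraction:.3%}")   # prints 0.015%
```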
## Notes

This is an **experimental** variant. The 105 self-generated, repetition-targeted preference pairs represented only ~0.015% of the total training data (684K pairs), providing negligible signal for the targeted behavior. The experiment demonstrates that self-play preference data needs sufficient volume relative to the full dataset to have a measurable effect.

Not recommended for production use. Included for research reproducibility. For best results, use the [SLERP variant](../slerp/).

## Main Model Card

See the [main README](../../README.md) for full project details, architecture, and training history.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/dpo-r3", torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained("path/to/dpo-r3")

# "질문을 여기에 입력하세요" = "Enter your question here"
inputs = tokenizer("<|user|>\n질문을 여기에 입력하세요\n<|assistant|>\n", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```