Upload dpo-r3/README.md with huggingface_hub
dpo-r3/README.md
ADDED
---
language:
- ko
- en
license: apache-2.0
tags:
- dpo
- alignment
- experimental
- self-play
- korean
- llm
pipeline_tag: text-generation
---

# EVAFRILL-Mo 3B: DPO Round 3 (Experimental)

Experimental DPO round targeting repetition behavior using self-generated preference pairs, built on the SLERP merged checkpoint.

## Training Stage

DPO alignment, Round 3, starting from the SLERP merged checkpoint. Preference data was self-generated by the model, with a focus on reducing repetition; a sketch of one plausible pair-construction recipe follows.

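This card does not describe how the pairs were generated. The sketch below shows one plausible recipe, assuming two completions are sampled per prompt and the less repetitive one is labeled `chosen`; the `repetition_score` heuristic and the helper names are illustrative assumptions, not the project's actual pipeline.

```python
from collections import Counter

def repetition_score(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are repeats; higher means more repetitive."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(c - 1 for c in counts.values()) / len(ngrams)

def make_pair(prompt: str, completion_a: str, completion_b: str) -> dict:
    # The less repetitive completion becomes `chosen`, the other `rejected`.
    if repetition_score(completion_a) <= repetition_score(completion_b):
        chosen, rejected = completion_a, completion_b
    else:
        chosen, rejected = completion_b, completion_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```
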
## Key Details

- **Steps**: 1,000
- **Preference pairs**: 105 self-generated, repetition-targeted
- **Base checkpoint**: SLERP merge

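For orientation, a minimal training sketch matching the details above, using TRL's `DPOTrainer`. TRL itself, the `beta` value, and the checkpoint paths are assumptions; the card names no framework or hyperparameters beyond the step count.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("path/to/slerp-merge")  # placeholder path
tokenizer = AutoTokenizer.from_pretrained("path/to/slerp-merge")

# The 105 self-generated pairs, in TRL's prompt/chosen/rejected format
pairs = [{"prompt": "...", "chosen": "...", "rejected": "..."}]
dataset = Dataset.from_list(pairs)

config = DPOConfig(output_dir="dpo-r3", max_steps=1_000, beta=0.1)  # beta assumed
trainer = DPOTrainer(model=model, args=config, train_dataset=dataset,
                     processing_class=tokenizer)
trainer.train()
```
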
## Metrics

| Metric | Value |
|--------|-------|
| Preference pairs used | 105 |
| Fraction of training data | ~0.015% (105 / 684K) |
| Outcome | Negligible impact (pairs too diluted) |

## Notes

This is an **experimental** variant. The 105 self-generated, repetition-targeted preference pairs represented only ~0.015% of the total training data (684K pairs), yielding negligible signal for the targeted behavior. The lesson: self-play preference data needs sufficient volume relative to the full dataset to have a measurable effect.

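The dilution figure quoted above is easy to verify:

```python
targeted_pairs = 105
total_pairs = 684_000
print(f"{targeted_pairs / total_pairs:.3%}")  # 0.015% of all training pairs
```
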
Not recommended for production use. Included for research reproducibility.
For best results, use the [SLERP variant](../slerp/).

## Main Model Card

See the [main README](../../README.md) for full project details, architecture, and training history.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/dpo-r3", torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained("path/to/dpo-r3")

# Prompt format: a user turn followed by an open assistant turn
# ("질문을 여기에 입력하세요" means "Enter your question here")
inputs = tokenizer("<|user|>\n질문을 여기에 입력하세요\n<|assistant|>\n", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```