---
license: llama3.2
base_model: devwoo/Kybalion-1B
tags:
- dpo
- rlhf
- llama
- llama-3.2
- kybalion
- trl
- peft
- lora
- merged
datasets:
- argilla/ultrafeedback-binarized-preferences-cleaned
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# Kybalion-1B-DPO

A 1B-parameter instruct model aligned on top of [devwoo/Kybalion-1B](https://huggingface.co/devwoo/Kybalion-1B) (Llama 3.2 1B + CPT + SFT) using **Direct Preference Optimization (DPO)**. The LoRA adapter has been merged into the base weights so this is a standalone instruct model — no PEFT adapter required at inference.

> ⚠️ **Experimental / educational model.** Training used conservative hyperparameters, so quality improvements over the base Kybalion-1B are modest. See [Limitations](#limitations--known-issues) for an honest assessment.

## Model Details

| Item | Value |
|------|-----|
| Base | [devwoo/Kybalion-1B](https://huggingface.co/devwoo/Kybalion-1B) (built on Llama 3.2 1B) |
| Parameters | 1.24 B |
| Precision | BF16 |
| Context length | 2048 |
| Tokenizer | Llama 3.2 standard |
| Chat template | Llama 3.2 (system / user / assistant + `<\|eot_id\|>`) |
| Alignment method | DPO (Direct Preference Optimization) |
| Adapter | LoRA r=16, α=32, all linear → **merged** |
| Training framework | HuggingFace `trl` + `peft` |
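
For a sense of scale, the adapter size implied by `r=16` on all linear projections can be estimated with a little arithmetic. The sketch below assumes the published Llama 3.2 1B dimensions (hidden size 2048, MLP size 8192, 16 layers, 32 query heads, 8 key/value heads). These values are not stated in this card, so treat them as an assumption.

```python
# Rough LoRA parameter count for r=16 on all linear projections.
# ASSUMPTION: the dimensions below are the published Llama 3.2 1B config
# values; they are not stated in this model card.
HIDDEN = 2048
INTERMEDIATE = 8192
NUM_LAYERS = 16
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = HIDDEN // NUM_HEADS      # 64
KV_DIM = NUM_KV_HEADS * HEAD_DIM    # 512 (grouped-query attention)
RANK = 16

# (in_features, out_features) of each targeted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank=RANK):
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in PROJECTIONS.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} adapter parameters ({total / 1.24e9:.2%} of 1.24 B)")
```

Under those assumptions the adapter comes to roughly 11.3 M parameters, a little under 1% of the 1.24 B total.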

## Training Recipe

The DPO recipe follows Chapter 6 (Direct Alignment Algorithms) of Nathan Lambert's [RLHF book](https://arxiv.org/abs/2504.12501) and uses the [Zephyr](https://arxiv.org/abs/2310.16944) hyperparameters.

| Item | Value |
|------|-----|
| Dataset | [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) (~60K pairs) |
| β (KL strength) | 0.1 |
| Learning rate | 5e-7 (cosine, 10% warmup) |
| Effective batch size | 32 (per-device 8 × grad accum 4) |
| Epochs | 1 |
| Max length | 1024 (prompt 512) |
| Loss type | sigmoid (vanilla DPO) |
| LoRA targets | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Colab Pro+) |
| Reference model | Implicit via PEFT adapter toggle (no extra memory) |
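
To make the `β` and `Loss type` rows above concrete, the vanilla DPO objective from Rafailov et al. can be sketched for a single preference pair. This is an illustrative pure-Python version, not the `trl` implementation; the function name and the example log-probabilities are made up.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Vanilla (sigmoid) DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities, for illustration only.
loss_at_start = dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0)  # policy == reference
loss_after_shift = dpo_sigmoid_loss(-9.0, -13.0, -10.0, -12.0)  # policy favors chosen
print(loss_at_start, loss_after_shift)
```

When the policy still equals the reference, the margin is zero and the loss is `ln 2 ≈ 0.693`; it falls below that only as the policy raises the chosen response relative to the rejected one.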

After training, the LoRA adapter was merged into base weights with `merge_and_unload()`.
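
Conceptually, merging folds the low-rank update into each targeted weight matrix as `W + (α / r) · B A`, which for `r=16` and `α=32` means a scale factor of 2. The toy sketch below illustrates that arithmetic with plain Python lists and made-up values. The actual merge was done by `peft`, and this is not its implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, scale):
    """Return weight + scale * (lora_b @ lora_a)."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Scale uses this card's values; the matrices are a toy rank-1 example.
SCALE = 32 / 16              # alpha / r = 2.0
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]            # shape (rank, in_features)
B = [[1.0], [2.0]]           # shape (out_features, rank)

merged = merge_lora(W, A, B, SCALE)
print(merged)
```

After the merge only the single dense matrix remains, which is why no adapter weights are needed at inference.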

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "devwoo/Kybalion-1B-DPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
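# The Llama 3.2 chat template already prepends the BOS token, so avoid adding it twice.
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)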

out = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.15,
    eos_token_id=[tokenizer.eos_token_id, 128009],   # explicitly include <|eot_id|>
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

> 💡 **Always pass `eos_token_id=[..., 128009]` at generation time.** Llama 3.2's `<|eot_id|>` token (ID 128009) must be registered as a stop token; otherwise generation runs past the end of the turn, leaks `assistant` tokens mid-response, and may repeat itself until `max_new_tokens` is reached.

## Limitations & Known Issues

This model has **several known limitations**. Please read before any production use.

### 1. Conservative training — limited alignment shift

- `lr=5e-7` is the value Zephyr used for **full fine-tuning**. With LoRA, this is generally too low (typical LoRA DPO uses `5e-6` to `5e-4`).
- Trained for only 1 epoch.
- → The output distribution likely did not move far from the base Kybalion-1B.

### 2. Qualitative evaluation on 10 prompts

We compared base vs. trained responses on 10 prompts (helpfulness / reasoning / coding / advice / creative) using identical sampling (T=0.7, top_p=0.9):

| Outcome | Count |
|---------|-------|
| **Clear trained win** | 2/10 (instruction following — prompts 5, 7) |
| **Clear base win** | 3/10 (arithmetic reasoning, factual accuracy) |
| **Tie / both fail** | 5/10 |
| **New factual errors introduced by trained model** | 2 (sky blue → "aerosols"; Senso-ji → wrongly placed in Kamakura) |

→ DPO did not produce the expected uplift.
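
With only 10 prompts, the tally above is too small to separate the two models statistically. As a rough illustration (not part of the original evaluation), a two-sided sign test over the five decisive comparisons can be computed as follows; the helper name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test; ties must be dropped beforehand."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts taken from the table above.
trained_wins, base_wins, ties = 2, 3, 5

win_rate = trained_wins / (trained_wins + base_wins)
p_value = sign_test_p_value(trained_wins, base_wins)
print(f"trained win rate excluding ties: {win_rate:.0%}, p = {p_value:.2f}")
```

A 2-vs-3 split over five decisive prompts is fully consistent with chance, so the comparison is best read as anecdotal rather than as evidence of a regression or an improvement.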

### 3. Stop-token leakage and repetition at generation time

- With only the default `eos_token_id`, generation does not stop at `<|eot_id|>` and continues past the end of the turn.
- Role-like tokens such as `assistant` or `student` then leak into the middle of the response.
- **Always pass `eos_token_id=[tokenizer.eos_token_id, 128009]` and use `repetition_penalty >= 1.15`.**
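
What a registered stop token does can be illustrated without loading the model: generation ends at the first token ID that appears in the stop set. The sketch below mimics that on a hand-written ID sequence. The helper name and the sample IDs other than 128001 (`<|end_of_text|>` in the Llama 3 tokenizer) and 128009 (`<|eot_id|>`) are hypothetical.

```python
EOT_ID = 128009          # <|eot_id|>, end of a chat turn
END_OF_TEXT_ID = 128001  # <|end_of_text|> in the Llama 3 tokenizer


def truncate_at_stop(token_ids, stop_ids):
    """Return token_ids up to (not including) the first stop token."""
    for i, tok in enumerate(token_ids):
        if tok in stop_ids:
            return token_ids[:i]
    return token_ids


# Made-up sequence: a short answer, <|eot_id|>, then text that should never appear.
generated = [9906, 1917, EOT_ID, 78191, 9906, 1917]

without_eot = truncate_at_stop(generated, {END_OF_TEXT_ID})
with_eot = truncate_at_stop(generated, {END_OF_TEXT_ID, EOT_ID})
print(len(without_eot), len(with_eot))
```

With `<|eot_id|>` missing from the stop set, nothing is cut and the tokens after the turn boundary stay in the output, which is the leakage described above.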

### 4. Base-model capability ceiling

- 1B parameters — limited arithmetic and logical reasoning.
- English instruction following is weaker than 7B+ models.
- Korean output quality depends on the baseline Kybalion training distribution; this DPO pass used English UltraFeedback only.

## Evaluation

- **Automated benchmarks**: not run.
- **Qualitative**: 10-prompt base vs. trained comparison — see [Limitations §2](#limitations--known-issues).

For proper evaluation, we recommend the [Tulu 3 eval suite](https://github.com/allenai/open-instruct) or the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

## Reproduction

A single Colab notebook reproducing this model is available separately (BF16, LoRA r=16, A100 40GB, ~3–5h). For a stronger run, we recommend:

| Change | Effect |
|--------|--------|
| `lr=5e-6` (10×) | Lets DPO actually move the policy |
| `num_epochs=3` | Gives the policy more passes over the preference pairs |
| `β=0.05` | Loosens the KL constraint slightly |
| `repetition_penalty=1.15` (inference) | Cuts repetition |
| `eos_token_id=[128001, 128009]` (inference) | Proper stop-token handling |
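
To see what the learning-rate and epoch changes imply for the optimizer, the step counts and warmup length can be worked out from the numbers in this card. The sketch assumes exactly 60,000 pairs (the card says ~60K) and keeps the 10% warmup and effective batch size of 32. The `RunConfig` class is a hypothetical container, not a `trl` object, and the schedule is a generic linear-warmup cosine decay.

```python
import math
from dataclasses import dataclass


@dataclass
class RunConfig:
    learning_rate: float
    epochs: int
    beta: float
    num_pairs: int = 60_000      # ASSUMPTION: the card says "~60K pairs"
    effective_batch: int = 32    # per-device 8 x grad accum 4
    warmup_ratio: float = 0.10

    @property
    def total_steps(self):
        return math.ceil(self.num_pairs / self.effective_batch) * self.epochs

    @property
    def warmup_steps(self):
        return math.ceil(self.total_steps * self.warmup_ratio)

    def lr_at(self, step):
        """Linear warmup to the peak, then cosine decay to zero."""
        if step < self.warmup_steps:
            return self.learning_rate * step / self.warmup_steps
        remaining = max(1, self.total_steps - self.warmup_steps)
        progress = (step - self.warmup_steps) / remaining
        return self.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


this_run = RunConfig(learning_rate=5e-7, epochs=1, beta=0.1)
stronger = RunConfig(learning_rate=5e-6, epochs=3, beta=0.05)

for name, cfg in [("this run", this_run), ("stronger run", stronger)]:
    print(f"{name}: {cfg.total_steps} steps, {cfg.warmup_steps} warmup, "
          f"peak lr {cfg.lr_at(cfg.warmup_steps):.1e}")
```

Under these assumptions the suggested run takes three times as many optimizer steps at ten times the peak learning rate, so expect a proportionally longer training time than the ~3–5h quoted above.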

## Citation

```bibtex
@misc{kybalion-1b-dpo,
  title  = {Kybalion-1B-DPO: DPO-aligned Kybalion-1B},
  author = {devwoo},
  year   = {2026},
  url    = {https://huggingface.co/devwoo/Kybalion-1B-DPO},
  note   = {DPO recipe following Tunstall et al. (Zephyr) and Rafailov et al. (DPO)},
}
```

References:

- Rafailov et al., 2023. [**Direct Preference Optimization: Your Language Model is Secretly a Reward Model.**](https://arxiv.org/abs/2305.18290)
- Tunstall et al., 2023. [**Zephyr: Direct Distillation of LM Alignment.**](https://arxiv.org/abs/2310.16944)
- Lambert, 2025. [**Reinforcement Learning from Human Feedback (book).**](https://arxiv.org/abs/2504.12501)
- Ouyang et al., 2022. [**Training Language Models to Follow Instructions with Human Feedback (InstructGPT).**](https://arxiv.org/abs/2203.02155)

## License

Inherited from base: [**Llama 3.2 Community License**](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt).