---
license: llama3.2
base_model: devwoo/Kybalion-1B
tags:
- dpo
- rlhf
- llama
- llama-3.2
- kybalion
- trl
- peft
- lora
- merged
datasets:
- argilla/ultrafeedback-binarized-preferences-cleaned
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# Kybalion-1B-DPO

A 1B-parameter instruct model aligned on top of [devwoo/Kybalion-1B](https://huggingface.co/devwoo/Kybalion-1B) (Llama 3.2 1B + CPT + SFT) using **Direct Preference Optimization (DPO)**. The LoRA adapter has been merged into the base weights, so this is a standalone instruct model: no PEFT adapter is required at inference.

> ⚠️ **Experimental / educational model.** Training used conservative hyperparameters, so quality improvements over the base Kybalion-1B are modest. See [Limitations](#limitations--known-issues) for an honest assessment.

## Model Details

| Item | Value |
|------|-----|
| Base | [devwoo/Kybalion-1B](https://huggingface.co/devwoo/Kybalion-1B) (built on Llama 3.2 1B) |
| Parameters | 1.24 B |
| Precision | BF16 |
| Context length | 2048 |
| Tokenizer | Llama 3.2 standard |
| Chat template | Llama 3.2 (system / user / assistant + `<\|eot_id\|>`) |
| Alignment method | DPO (Direct Preference Optimization) |
| Adapter | LoRA r=16, α=32, all linear → **merged** |
| Training framework | HuggingFace `trl` + `peft` |

## Training Recipe

DPO recipe following Chapter 6 (Direct Alignment Algorithms) of Nathan Lambert's [RLHF book](https://arxiv.org/abs/2504.12501), with [Zephyr](https://arxiv.org/abs/2310.16944) hyperparameters.

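To make the role of β in this recipe concrete, here is a minimal pure-Python sketch of the sigmoid (vanilla) DPO loss from Rafailov et al. for a single preference pair. It is an illustration only, not the training code used for this model (training went through `trl`); the function name `dpo_sigmoid_loss` and the example log-probabilities are hypothetical.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (chosen, rejected) pair.

    Each argument is the summed token log-probability of a full response
    under the policy or the frozen reference model.
    """
    # Log-ratios measure how far the policy has moved from the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probs: first with policy == reference, then with the
# policy shifted toward the chosen response.
loss_at_init = dpo_sigmoid_loss(-12.0, -15.0, -12.0, -15.0)
loss_shifted = dpo_sigmoid_loss(-10.0, -17.0, -12.0, -15.0)
print(loss_at_init, loss_shifted)
```

When the policy equals the reference, the margin is zero and the loss is log 2 (about 0.693). The loss falls only as the policy raises the chosen response relative to the rejected one, and β scales how strongly a given shift is rewarded.
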
| Item | Value |
|------|-----|
| Dataset | [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) (~60K pairs) |
| β (KL strength) | 0.1 |
| Learning rate | 5e-7 (cosine, 10% warmup) |
| Effective batch size | 32 (per-device 8 × grad accum 4) |
| Epochs | 1 |
| Max length | 1024 (prompt 512) |
| Loss type | sigmoid (vanilla DPO) |
| LoRA targets | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Colab Pro+) |
| Reference model | Implicit via PEFT adapter toggle (no extra memory) |

After training, the LoRA adapter was merged into base weights with `merge_and_unload()`.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "devwoo/Kybalion-1B-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]
# Render the Llama 3.2 chat template as text, ending with the assistant header.
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# The template already emits <|begin_of_text|>, so skip special tokens here
# to avoid a duplicated BOS token.
inputs = tokenizer(
    input_text, return_tensors="pt", add_special_tokens=False
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.15,
    eos_token_id=[tokenizer.eos_token_id, 128009],  # explicitly include <|eot_id|>
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

> 💡 **Always pass `eos_token_id=[..., 128009]` at generation time.** Llama 3.2's `<|eot_id|>` token must be registered as a stop token; otherwise generation runs past the end of the turn, leaks `assistant` tokens mid-response, and may loop until `max_new_tokens` is reached.

## Limitations & Known Issues

This model has **several known limitations**. Please read them before any production use.

### 1. Conservative training: limited alignment shift

- `lr=5e-7` is the value Zephyr used for **full fine-tuning**. With LoRA, this is generally too low (typical LoRA DPO uses `5e-6` to `5e-4`).
- Trained for only 1 epoch.
- → The output distribution likely did not move far from the base Kybalion-1B.

### 2. Qualitative evaluation on 10 prompts

We compared base vs. trained responses on 10 prompts (helpfulness / reasoning / coding / advice / creative) using identical sampling (T=0.7, top_p=0.9):

| Outcome | Count |
|---------|-------|
| **Clear trained win** | 2/10 (instruction following: prompts 5, 7) |
| **Clear base win** | 3/10 (arithmetic reasoning, factual accuracy) |
| **Tie / both fail** | 5/10 |
| **New factual errors introduced by trained model** | 2 (sky blue → "aerosols"; Senso-ji → wrongly placed in Kamakura) |

→ DPO did not produce the expected uplift.

### 3. Stop-token leakage and repetition at generation time

- With only the default `eos_token_id`, generation does not stop at `<|eot_id|>` and the model keeps generating.
- Tokens like `assistant`, `student` from the chat template leak into the middle of the response.
- **Always pass `eos_token_id=[tokenizer.eos_token_id, 128009]` and use `repetition_penalty >= 1.15`.**

### 4. Base-model capability ceiling

- 1B parameters: limited arithmetic and logical reasoning.
- English instruction following is weaker than in 7B+ models.
- Korean output quality depends on the baseline Kybalion training distribution; this DPO pass used English UltraFeedback only.

## Evaluation

- **Automated benchmarks**: not run.
- **Qualitative**: 10-prompt base vs. trained comparison; see [Limitations §2](#limitations--known-issues).

For proper evaluation, we recommend the [Tulu 3 eval suite](https://github.com/allenai/open-instruct) or the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

## Reproduction

A single Colab notebook reproducing this model is available separately (BF16, LoRA r=16, A100 40GB, ~3–5h).

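For a rough sense of the run length implied by the recipe, the number of optimizer steps can be estimated from the dataset size and the effective batch size. The sketch below assumes exactly 60,000 pairs (the card only states ~60K), a single GPU, and no examples dropped by length filtering, so the real counts will differ slightly.

```python
import math

# Values taken from the Training Recipe table.
num_pairs = 60_000        # assumption: the card only says "~60K pairs"
per_device_batch = 8
grad_accum_steps = 4
num_epochs = 1
warmup_ratio = 0.10

# One optimizer step consumes per_device_batch * grad_accum_steps pairs
# on a single device.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(num_pairs / effective_batch)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {total_steps}")
print(f"warmup steps:         {warmup_steps}")
```

Under these assumptions the run takes about 1,875 optimizer steps, roughly 188 of them in warmup.
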
For a stronger run, we recommend:

| Change | Effect |
|--------|--------|
| `lr=5e-6` (10×) | Lets DPO actually move the policy |
| `num_epochs=3` | Absorbs more of the signal |
| `β=0.05` | Loosens the KL constraint slightly |
| `repetition_penalty=1.15` (inference) | Cuts repetition |
| `eos_token_id=[128001, 128009]` (inference) | Proper stop-token handling |

## Citation

```bibtex
@misc{kybalion-1b-dpo,
  title  = {Kybalion-1B-DPO: DPO-aligned Kybalion-1B},
  author = {devwoo},
  year   = {2026},
  url    = {https://huggingface.co/devwoo/Kybalion-1B-DPO},
  note   = {DPO recipe following Tunstall et al. (Zephyr) and Rafailov et al. (DPO)},
}
```

References:

- Rafailov et al., 2023. [**Direct Preference Optimization: Your Language Model is Secretly a Reward Model.**](https://arxiv.org/abs/2305.18290)
- Tunstall et al., 2023. [**Zephyr: Direct Distillation of LM Alignment.**](https://arxiv.org/abs/2310.16944)
- Lambert, 2025. [**Reinforcement Learning from Human Feedback (book).**](https://arxiv.org/abs/2504.12501)
- Ouyang et al., 2022. [**Training Language Models to Follow Instructions with Human Feedback (InstructGPT).**](https://arxiv.org/abs/2203.02155)

## License

Inherited from base: [**Llama 3.2 Community License**](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt).