---
license: llama3.2
base_model: devwoo/Kybalion-1B
tags:
- dpo
- rlhf
- llama
- llama-3.2
- kybalion
- trl
- peft
- lora
- merged
datasets:
- argilla/ultrafeedback-binarized-preferences-cleaned
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# Kybalion-1B-DPO

A 1B-parameter instruct model aligned on top of [devwoo/Kybalion-1B](https://huggingface.co/devwoo/Kybalion-1B) (Llama 3.2 1B + CPT + SFT) using **Direct Preference Optimization (DPO)**. The LoRA adapter has been merged into the base weights, so this is a standalone instruct model and no PEFT adapter is required at inference.

> ⚠️ **Experimental / educational model.** Training used conservative hyperparameters, so quality improvements over the base Kybalion-1B are modest. See [Limitations](#limitations--known-issues) for an honest assessment.

## Model Details

| Item | Value |
|------|-------|
| Base | [devwoo/Kybalion-1B](https://huggingface.co/devwoo/Kybalion-1B) (built on Llama 3.2 1B) |
| Parameters | 1.24B |
| Precision | BF16 |
| Context length | 2048 tokens |
| Tokenizer | Llama 3.2 standard |
| Chat template | Llama 3.2 (system / user / assistant + `<\|eot_id\|>`) |
| Alignment method | DPO (Direct Preference Optimization) |
| Adapter | LoRA r=16, α=32, all linear layers → **merged** |
| Training framework | Hugging Face `trl` + `peft` |

## Training Recipe

The DPO recipe follows Chapter 6 (Direct Alignment Algorithms) of Nathan Lambert's [RLHF book](https://arxiv.org/abs/2504.12501), with [Zephyr](https://arxiv.org/abs/2310.16944) hyperparameters.

| Item | Value |
|------|-------|
| Dataset | [argilla/ultrafeedback-binarized-preferences-cleaned](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences-cleaned) (~60K pairs) |
| β (KL strength) | 0.1 |
| Learning rate | 5e-7 (cosine schedule, 10% warmup) |
| Effective batch size | 32 (per-device 8 × gradient accumulation 4) |
| Epochs | 1 |
| Max length | 1024 tokens (prompt 512) |
| Loss type | sigmoid (vanilla DPO) |
| LoRA targets | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Colab Pro+) |
| Reference model | Implicit: the LoRA adapter is toggled off to obtain reference log-probabilities, so no second model copy is held in memory |

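
To make the `sigmoid` loss type and the role of β concrete, the sketch below computes the vanilla DPO loss for a single preference pair in plain Python. It is an illustration only: `dpo_sigmoid_loss` is a hypothetical helper (not the `trl` implementation), and the sequence log-probabilities are made-up numbers.

```python
import math


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Vanilla DPO loss for one pair: -log(sigmoid(beta * margin))."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == softplus(-x); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Made-up sequence log-probabilities, for illustration only.
at_init = dpo_sigmoid_loss(-42.0, -40.0, -42.0, -40.0)  # policy == reference
shifted = dpo_sigmoid_loss(-41.0, -43.0, -42.0, -40.0)  # policy now prefers "chosen"
print(round(at_init, 4), round(shifted, 4))
```

When the policy equals the reference, the margin is zero and the loss is log 2 ≈ 0.693. The loss only drops below that once the policy's log-ratios start to favor the chosen response, and a smaller β shrinks the margin for the same shift.
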

After training, the LoRA adapter was merged into base weights with `merge_and_unload()`.

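
As a rough picture of what merging does, the sketch below folds a LoRA update into a weight matrix using the standard formula W' = W + (α / r) · B · A. It then checks that the merged matrix gives the same output as the base path plus the adapter path. This is a toy illustration in pure Python, not PEFT's code: `merge_lora` and `matmul` are hypothetical helpers, and the shapes use rank 1 rather than the r=16 of the actual run.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy shapes: out=2, in=3, rank=1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[0.5, -0.5, 1.0]]   # (r, in)
B = [[1.0], [2.0]]       # (out, r)
alpha, r = 2, 1          # same alpha / r ratio (2.0) as alpha=32, r=16

merged = merge_lora(W, A, B, alpha, r)

x = [[1.0], [2.0], [3.0]]  # one input as a column vector
base_out = matmul(W, x)
adapter_out = [[(alpha / r) * v for v in row] for row in matmul(B, matmul(A, x))]
unmerged_out = [[b + a for b, a in zip(b_row, a_row)]
                for b_row, a_row in zip(base_out, adapter_out)]
merged_out = matmul(merged, x)
print(unmerged_out, merged_out)
```

Both paths produce the same output, which is why the merged checkpoint can be loaded with plain `transformers` and needs no adapter at inference.
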
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "devwoo/Kybalion-1B-DPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]

# Render the chat template as text and append the assistant header.
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# The Llama 3.2 template already emits the BOS token, so do not add it a second time.
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.15,
    eos_token_id=[tokenizer.eos_token_id, 128009],  # explicitly include <|eot_id|>
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

> 💡 **Always pass `eos_token_id=[..., 128009]` at generation time.** Llama 3.2's `<|eot_id|>` token must be registered as a stop token; otherwise the model leaks `assistant` tokens mid-response and may keep generating until `max_new_tokens` is reached.

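
The effect of the stop-token list can be shown with a toy decoding loop. The sketch below is not how `generate` is implemented. `collect_until_stop` is a hypothetical helper, the small ids are arbitrary stand-ins for text tokens, and `128001` is assumed to be the default EOS id (it is the value listed in the Reproduction table).

```python
EOT_ID = 128009          # <|eot_id|>, as used in the Usage snippet
END_OF_TEXT_ID = 128001  # assumed default EOS id (from the Reproduction table)


def collect_until_stop(token_stream, stop_ids, max_new_tokens):
    """Collect tokens until a stop id appears or the token budget runs out."""
    out = []
    for tok in token_stream:
        if len(out) >= max_new_tokens or tok in stop_ids:
            break
        out.append(tok)
    return out


# A model that ends its turn with <|eot_id|> and then starts a new, unwanted turn.
stream = [11, 12, 13, EOT_ID, 21, 22, 23, EOT_ID, 31, 32]

default_only = collect_until_stop(stream, {END_OF_TEXT_ID}, max_new_tokens=8)
with_eot = collect_until_stop(stream, {END_OF_TEXT_ID, EOT_ID}, max_new_tokens=8)
print(len(default_only), len(with_eot))  # prints: 8 3
```

With only the default id registered, the loop runs through the first `<|eot_id|>` and stops only when the budget is exhausted. With both ids registered, it stops at the end of the first turn.
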
## Limitations & Known Issues

This model has **several known limitations**. Please read them before any production use.

### 1. Conservative training: limited alignment shift

- `lr=5e-7` is the value Zephyr used for **full fine-tuning**. With LoRA, this is generally too low (typical LoRA DPO uses `5e-6` to `5e-4`).
- Training ran for only 1 epoch.

→ The output distribution likely did not move far from the base Kybalion-1B.

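
For a sense of scale, the sketch below rebuilds the learning-rate schedule from the recipe (linear warmup over 10% of the steps, then cosine decay to zero) and counts the optimizer steps in one epoch. It is an approximation under stated assumptions: the dataset size is the rounded "~60K pairs" figure, a single device is assumed, and `lr_at_step` is a hypothetical helper whose warmup rounding may differ from the scheduler actually used.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def mean_lr(peak_lr, total_steps):
    """Average learning rate over the whole run."""
    return sum(lr_at_step(s, total_steps, peak_lr) for s in range(total_steps)) / total_steps


num_pairs = 60_000        # "~60K pairs" (approximate)
effective_batch = 8 * 4   # per-device 8 x gradient accumulation 4, one device assumed
total_steps = num_pairs // effective_batch

used = mean_lr(5e-7, total_steps)
suggested = mean_lr(5e-6, total_steps)
print(total_steps, used, suggested)
```

Under these assumptions the run takes 1875 optimizer steps. The average step size scales linearly with the peak learning rate, so the `5e-6` setting moves the adapter about ten times further per step than the `5e-7` used here.
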
### 2. Qualitative evaluation on 10 prompts

We compared base and trained responses on 10 prompts (helpfulness / reasoning / coding / advice / creative) using identical sampling settings (T=0.7, top_p=0.9):

| Outcome | Count |
|---------|-------|
| **Clear trained win** | 2/10 (instruction following; prompts 5 and 7) |
| **Clear base win** | 3/10 (arithmetic reasoning, factual accuracy) |
| **Tie / both fail** | 5/10 |
| **New factual errors introduced by trained model** | 2 (attributed the blue sky to "aerosols"; placed Senso-ji in Kamakura instead of Tokyo) |

→ DPO did not produce the expected uplift.

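
One way to read these counts is an exact two-sided sign test on the non-tied prompts. This analysis was not part of the original evaluation and is added only as a sketch. `sign_test_p_value` is a hypothetical helper, and ties are simply dropped, which is one common convention.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test on non-tied pairs under a 50/50 null."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a result at least as lopsided in one direction,
    # doubled for two sides and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


trained_wins, base_wins, ties = 2, 3, 5
p_value = sign_test_p_value(trained_wins, base_wins)
print(p_value)  # prints: 1.0
```

With 2 wins against 3 losses the p-value is 1.0, so ten prompts cannot separate the two models in either direction. A larger prompt set or an automated benchmark would be needed to measure any real difference.
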
### 3. Stop-token leakage and repetition at generation time

- With only the default `eos_token_id`, generation does not stop at `<|eot_id|>` and the model keeps going.
- Tokens like `assistant` and `student` from the chat template then leak into the middle of the response.
- **Always pass `eos_token_id=[tokenizer.eos_token_id, 128009]` and use `repetition_penalty >= 1.15`.**

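
For intuition about what `repetition_penalty=1.15` does, the sketch below applies the usual rule: for every token id already present in the sequence, a positive logit is divided by the penalty and a negative logit is multiplied by it. Either way, that token becomes less likely. This is a simplified pure-Python illustration with made-up logits. `apply_repetition_penalty` is a hypothetical helper, and in Hugging Face `generate` the penalty is, to our understanding, applied over the prompt tokens as well as the generated ones.

```python
def apply_repetition_penalty(logits, seen_ids, penalty=1.15):
    """Lower the logits of tokens that already appear in the sequence."""
    out = list(logits)
    for tok in set(seen_ids):
        # Dividing a positive logit or multiplying a negative one both reduce it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


# Made-up logits over a 4-token vocabulary; tokens 0 and 1 were already produced.
logits = [2.3, -1.0, 0.5, 1.0]
penalised = apply_repetition_penalty(logits, seen_ids=[0, 1, 0])
print(penalised)
```

Tokens 2 and 3 are untouched, while the two repeated tokens lose probability mass. Values well above 1.15 tend to suppress legitimate repetition too, such as variable names in code.
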
### 4. Base-model capability ceiling

- 1B parameters: limited arithmetic and logical reasoning.
- English instruction following is weaker than 7B+ models.
- Korean output quality depends on the baseline Kybalion training distribution; this DPO pass used English UltraFeedback only.

## Evaluation

- **Automated benchmarks**: not run.
- **Qualitative**: 10-prompt base vs. trained comparison; see [Limitations §2](#limitations--known-issues).

For a proper evaluation, we recommend the [Tulu 3 eval suite](https://github.com/allenai/open-instruct) or the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

## Reproduction

A single Colab notebook reproducing this model is available separately (BF16, LoRA r=16, A100 40GB, roughly 3–5 hours). For a stronger run, we recommend the following changes:

| Change | Effect |
|--------|--------|
| `lr=5e-6` (10×) | Lets DPO actually move the policy |
| `num_epochs=3` | Absorbs more of the preference signal |
| `β=0.05` | Loosens the KL constraint slightly |
| `repetition_penalty=1.15` (inference) | Cuts repetition |
| `eos_token_id=[128001, 128009]` (inference) | Proper stop-token handling |

## Citation

```bibtex
@misc{kybalion-1b-dpo,
  title  = {Kybalion-1B-DPO: DPO-aligned Kybalion-1B},
  author = {devwoo},
  year   = {2026},
  url    = {https://huggingface.co/devwoo/Kybalion-1B-DPO},
  note   = {DPO recipe following Tunstall et al. (Zephyr) and Rafailov et al. (DPO)},
}
```

References:

- Rafailov et al., 2023. [**Direct Preference Optimization: Your Language Model is Secretly a Reward Model.**](https://arxiv.org/abs/2305.18290)
- Tunstall et al., 2023. [**Zephyr: Direct Distillation of LM Alignment.**](https://arxiv.org/abs/2310.16944)
- Lambert, 2025. [**Reinforcement Learning from Human Feedback (book).**](https://arxiv.org/abs/2504.12501)
- Ouyang et al., 2022. [**Training Language Models to Follow Instructions with Human Feedback (InstructGPT).**](https://arxiv.org/abs/2203.02155)

## License

Inherited from base: [**Llama 3.2 Community License**](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt).