Kybalion-1B-DPO
A 1B-parameter instruct model aligned on top of devwoo/Kybalion-1B (Llama 3.2 1B + CPT + SFT) using Direct Preference Optimization (DPO). The LoRA adapter has been merged into the base weights so this is a standalone instruct model — no PEFT adapter required at inference.
⚠️ Experimental / educational model. Training used conservative hyperparameters, so quality improvements over the base Kybalion-1B are modest. See Limitations for an honest assessment.
Model Details
| Item | Value |
|---|---|
| Base | devwoo/Kybalion-1B (built on Llama 3.2 1B) |
| Parameters | 1.24 B |
| Precision | BF16 |
| Context length | 2048 |
| Tokenizer | Llama 3.2 standard |
| Chat template | Llama 3.2 (system / user / assistant + `<\|eot_id\|>`) |
| Alignment method | DPO (Direct Preference Optimization) |
| Adapter | LoRA r=16, α=32, all linear → merged |
| Training framework | HuggingFace trl + peft |
Training Recipe
DPO recipe following Chapter 6 (Direct Alignment Algorithms) of Nathan Lambert's RLHF book, with Zephyr hyperparameters.
| Item | Value |
|---|---|
| Dataset | argilla/ultrafeedback-binarized-preferences-cleaned (~60K pairs) |
| β (KL strength) | 0.1 |
| Learning rate | 5e-7 (cosine, 10% warmup) |
| Effective batch size | 32 (per-device 8 × grad accum 4) |
| Epochs | 1 |
| Max length | 1024 (prompt 512) |
| Loss type | sigmoid (vanilla DPO) |
| LoRA targets | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Optimizer | AdamW |
| Hardware | NVIDIA A100 40GB (Colab Pro+) |
| Reference model | Implicit via PEFT adapter toggle (no extra memory) |
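For readers new to the loss named in the table, the following is a minimal sketch of the sigmoid (vanilla) DPO objective for a single preference pair, using only the standard library. The function name and the example log-probabilities are illustrative assumptions, not values from this training run; β=0.1 is taken from the table.

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Vanilla DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    policy and under the frozen reference model.
    """
    # Implicit rewards: beta * log(pi / pi_ref) for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities, chosen only to exercise the function.
loss_at_init = dpo_sigmoid_loss(-42.0, -50.0, -42.0, -50.0)   # policy == reference
loss_improved = dpo_sigmoid_loss(-40.0, -55.0, -42.0, -50.0)  # policy favors chosen
```

When the policy and reference log-probabilities are identical, the margin is zero and the loss equals log 2, which is where training starts. Any shift toward the chosen response pushes the loss below that value.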
After training, the LoRA adapter was merged into base weights with merge_and_unload().
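To make the merge step concrete, here is a toy sketch of what merging a LoRA adapter means arithmetically: the low-rank update B·A is scaled by α/r and added to the frozen weight, after which the adapter is no longer needed. The tiny matrices and helper names are illustrative assumptions. The toy uses r=1, α=2 so that the scaling factor matches this model's α/r = 32/16 = 2. The real merge is performed by PEFT's `merge_and_unload()`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

def merge_lora(w, lora_a, lora_b, r, alpha):
    """Return the merged weight W + (alpha / r) * B @ A."""
    return add(w, scale(matmul(lora_b, lora_a), alpha / r))

# Toy shapes: W is 2x3, B is 2x1, A is 1x3 (rank 1 keeps the example small).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5], [-0.25]]
A = [[1.0, 2.0, 3.0]]
x = [[1.0], [1.0], [1.0]]  # input as a column vector
r, alpha = 1, 2            # same alpha / r ratio as r=16, alpha=32

# Path 1: base weight plus a separate adapter branch at inference time.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))
# Path 2: a single merged weight, no adapter branch.
merged_out = matmul(merge_lora(W, A, B, r, alpha), x)
```

Both paths give the same output, which is why the merged checkpoint can be loaded as a plain causal LM.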
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "devwoo/Kybalion-1B-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum entanglement in simple terms."},
]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.15,
    eos_token_id=[tokenizer.eos_token_id, 128009],  # explicitly include <|eot_id|>
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
💡 Always pass `eos_token_id=[..., 128009]` at generation time. Llama 3.2's `<|eot_id|>` token must be registered as a stop token; otherwise the model leaks `assistant` tokens mid-response and may loop indefinitely.
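The effect of registering `<|eot_id|>` as a stop token can be illustrated without loading the model. The sketch below truncates a list of generated token ids at the first stop id. The helper name and the sample ids are hypothetical; 128009 is the `<|eot_id|>` id used above, and 128001 is assumed to be the tokenizer's default `eos_token_id`.

```python
EOT_ID = 128009          # <|eot_id|>, as passed to generate() above
DEFAULT_EOS_ID = 128001  # assumed default eos_token_id for this tokenizer

def truncate_at_stop(token_ids, stop_ids):
    """Return token_ids up to (not including) the first id found in stop_ids."""
    stop_ids = set(stop_ids)
    for i, tok in enumerate(token_ids):
        if tok in stop_ids:
            return token_ids[:i]
    return token_ids

# Hypothetical generation: two answer tokens, <|eot_id|>, then leaked tokens.
# The non-special ids are arbitrary placeholders.
generated = [1111, 2222, EOT_ID, 3333, 4444, 1111]

only_default = truncate_at_stop(generated, [DEFAULT_EOS_ID])
with_eot = truncate_at_stop(generated, [DEFAULT_EOS_ID, EOT_ID])
```

With only the default id registered, nothing stops the sequence and the tokens after `<|eot_id|>` remain in the output. Adding 128009 cuts the response at the end of the assistant turn.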
Limitations & Known Issues
This model has several known limitations. Please read before any production use.
1. Conservative training — limited alignment shift
- `lr=5e-7` is the value Zephyr used for full fine-tuning. With LoRA, this is generally too low (typical LoRA DPO uses `5e-6` to `5e-4`).
- Trained for only 1 epoch.
- → The output distribution likely did not move far from the base Kybalion-1B.
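For a rough sense of scale, the sketch below computes the number of optimizer steps and a linear-warmup, cosine-decay learning-rate curve from the recipe's settings. It assumes exactly 60,000 preference pairs (the card states ~60K) and a schedule that decays to zero; the function names are illustrative, and the warmup rounding may differ from what the trainer actually used.

```python
import math

def total_steps(num_pairs, effective_batch_size, epochs=1):
    """Optimizer steps for the whole run."""
    return math.ceil(num_pairs / effective_batch_size) * epochs

def lr_at(step, peak_lr, steps, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup = max(1, int(steps * warmup_ratio))
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

steps = total_steps(60_000, 32)              # 1875 under the 60K assumption
warmup_steps = max(1, int(steps * 0.1))      # 187
peak = lr_at(warmup_steps, 5e-7, steps)      # learning rate right after warmup
midpoint = lr_at((steps + warmup_steps) // 2, 5e-7, steps)
```

Under these assumptions the adapter receives fewer than two thousand updates, none of them above 5e-7, which is consistent with the small shift described above.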
2. Qualitative evaluation on 10 prompts
We compared base vs. trained responses on 10 prompts (helpfulness / reasoning / coding / advice / creative) using identical sampling (T=0.7, top_p=0.9):
| Outcome | Count |
|---|---|
| Clear trained win | 2/10 (instruction following — prompts 5, 7) |
| Clear base win | 3/10 (arithmetic reasoning, factual accuracy) |
| Tie / both fail | 5/10 |
| New factual errors introduced by trained model | 2 (sky blue → "aerosols"; Senso-ji → wrongly placed in Kamakura) |
→ DPO did not produce the expected uplift.
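As a sanity check on how little ten prompts can show, the sketch below runs an exact two-sided sign test on the decisive comparisons from the table (2 trained wins, 3 base wins, ties dropped). The sign test is an illustrative choice here, not part of the original evaluation.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Exact two-sided sign test p-value under H0: win probability = 0.5."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of a result at least as lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p = sign_test_p_value(wins=2, losses=3)
```

Here `p` is 1.0, so the comparison cannot separate the two models in either direction; a much larger prompt set would be needed for a meaningful verdict.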
3. Stop-token leakage and repetition at generation time
- Leaving only the default `eos_token_id` makes the model ignore `<|eot_id|>` and keep generating.
- Tokens like `assistant`, `student` from the chat template leak into the middle of the response.
- Always pass `eos_token_id=[tokenizer.eos_token_id, 128009]` and use `repetition_penalty >= 1.15`.
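For intuition about what `repetition_penalty` does, this is a minimal sketch of the CTRL-style rule that the Transformers repetition penalty is based on: logits of tokens already present in the context are divided by the penalty when positive and multiplied by it when negative. The function name and the toy logits are illustrative assumptions.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.15):
    """Return a copy of logits with already-seen tokens made less likely."""
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty   # shrink positive logits
        else:
            out[tok] = out[tok] * penalty   # push negative logits further down
    return out

# Toy vocabulary of four tokens; tokens 0 and 1 were already generated.
logits = [2.3, -1.0, 0.5, 4.6]
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1, 0])
```

Only the logits of seen tokens change, and each is penalized once regardless of how often it occurred. Values above 1.0 discourage repetition; 1.0 disables the penalty.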
4. Base-model capability ceiling
- 1B parameters — limited arithmetic and logical reasoning.
- English instruction following is weaker than 7B+ models.
- Korean output quality depends on the baseline Kybalion training distribution; this DPO pass used English UltraFeedback only.
Evaluation
- Automated benchmarks: not run.
- Qualitative: 10-prompt base vs. trained comparison — see Limitations §2.
For a proper evaluation, the Tulu 3 eval suite or the LM Evaluation Harness is recommended.
Reproduction
A single Colab notebook reproducing this model is available separately (BF16, LoRA r=16, A100 40GB, ~3–5h). For a stronger run, we recommend:
| Change | Effect |
|---|---|
| `lr=5e-6` (10×) | Lets DPO actually move the policy |
| `num_epochs=3` | Absorbs more of the signal |
| `β=0.05` | Loosens the KL constraint slightly |
| `repetition_penalty=1.15` (inference) | Cuts repetition |
| `eos_token_id=[128001, 128009]` (inference) | Proper stop-token handling |
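One way to keep a stronger run comparable to the original is to express it as overrides on the recipe from the Training Recipe section. The dictionary keys below are hypothetical labels for this sketch, not arguments of any particular trainer; the values come from the two tables.

```python
# Settings from the Training Recipe table (key names are hypothetical).
baseline = {
    "learning_rate": 5e-7,
    "num_epochs": 1,
    "beta": 0.1,
    "per_device_batch_size": 8,
    "grad_accum_steps": 4,
}

# Training-time changes recommended for a stronger run.
stronger_overrides = {"learning_rate": 5e-6, "num_epochs": 3, "beta": 0.05}

stronger = {**baseline, **stronger_overrides}
# Map each changed key to its (old, new) pair for logging or review.
changed = {k: (baseline[k], stronger[k]) for k in stronger_overrides}
effective_batch = stronger["per_device_batch_size"] * stronger["grad_accum_steps"]
```

Keeping the overrides separate makes it easy to confirm that only the learning rate, epoch count, and β differ between runs, while the effective batch size stays at 32.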
Citation
@misc{kybalion-1b-dpo,
title = {Kybalion-1B-DPO: DPO-aligned Kybalion-1B},
author = {devwoo},
year = {2026},
url = {https://huggingface.co/devwoo/Kybalion-1B-DPO},
note = {DPO recipe following Tunstall et al. (Zephyr) and Rafailov et al. (DPO)},
}
References:
- Rafailov et al., 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- Tunstall et al., 2023. Zephyr: Direct Distillation of LM Alignment.
- Lambert, 2025. Reinforcement Learning from Human Feedback (book).
- Ouyang et al., 2022. Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
License
Inherited from base: Llama 3.2 Community License.