---
language:
- en
license: llama3.2
base_model: meta-llama/Llama-3.2-1B
tags:
- llama
- continued-pretraining
- sft
- lora
- 1b
- math
- code
- education
- small-llm
datasets:
- HuggingFaceFW/fineweb-edu
- open-web-math/open-web-math
- bigcode/starcoderdata
- HuggingFaceTB/cosmopedia
- teknium/OpenHermes-2.5
- meta-math/MetaMathQA
- sahil2801/CodeAlpaca-20k
---

# Kybalion-1B

**Kybalion-1B** is a 1B-parameter language model built on top of [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B) through a full **Continued Pre-Training (CPT) → Supervised Fine-Tuning (SFT)** pipeline, trained entirely on Google Colab A100.

> **Why "Kybalion"?**
> The model was originally developed under the internal codename *Prometheus-1B*, but was renamed to *Kybalion-1B* before public release to avoid confusion with an existing model of the same name on HuggingFace. *Kybalion* refers to the hermetic text of that name, which symbolizes hidden knowledge, a fitting association for a model focused on education, mathematics, science, and code.

---

## 🏆 Key Highlights

- **Beats Llama-3.2-1B-Instruct** on HellaSwag (63.8% vs 61.1%) and ties on WinoGrande (62.4%)
- **4.5× GSM8K improvement** over TinyLlama-1.1B (10.8% vs 2.4%), which suggests the math-focused pretraining and SFT data helped
- Outperforms TinyLlama-1.1B on **all 6 benchmarks**
- Trained by a single undergraduate student on consumer cloud hardware

---

## 🔬 Key Contributions

- Demonstrates that domain-balanced continued pretraining on curated multi-domain data (education, math, code, science) yields consistent improvements across commonsense reasoning benchmarks in 1B-scale models
- Suggests that multi-step mathematical reasoning remains a fundamental bottleneck for 1B-scale models, even when combining math-focused pretraining (OpenWebMath) with instruction tuning (MetaMathQA)
- Provides a fully reproducible, compute-efficient training recipe (CPT → LoRA SFT) built and executed **by a single undergraduate student in under one week**, demonstrating that meaningful LLM research is achievable without institutional resources or large teams

---

## 📊 Benchmark Results

All scores measured with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) under **identical conditions** (same prompts, same few-shot settings, same hardware).

| Benchmark | TinyLlama-1.1B | Llama-3.2-1B-Instruct | **Kybalion-1B** |
|-----------|:--------------:|:---------------------:|:---------------:|
| MMLU | 25.0% | 46.1% | **32.0%** |
| ARC-C | 37.2% | 41.5% | **37.6%** |
| GSM8K | 2.4% | 33.5% | **10.8%** |
| HellaSwag | 61.2% | 61.1% | **63.8%** 🏆 |
| WinoGrande | 61.8% | 62.4% | **62.4%** 🏆 |
| TruthfulQA | 37.4% | 43.3% | **40.0%** |

> 🏆 = matches or outperforms Llama-3.2-1B-Instruct
> All evaluations run with `lm_eval.simple_evaluate()`, bfloat16, batch_size=8, A100 GPU.
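
The comparisons quoted in the highlights can be re-derived from the table. The sketch below only copies the reported scores and computes differences in percentage points; it does not rerun any evaluation, and the dictionary keys are illustrative names, not identifiers from the evaluation harness.

```python
# Scores copied from the benchmark table above (percent accuracy).
scores = {
    "MMLU":       {"tinyllama": 25.0, "llama_instruct": 46.1, "kybalion": 32.0},
    "ARC-C":      {"tinyllama": 37.2, "llama_instruct": 41.5, "kybalion": 37.6},
    "GSM8K":      {"tinyllama": 2.4,  "llama_instruct": 33.5, "kybalion": 10.8},
    "HellaSwag":  {"tinyllama": 61.2, "llama_instruct": 61.1, "kybalion": 63.8},
    "WinoGrande": {"tinyllama": 61.8, "llama_instruct": 62.4, "kybalion": 62.4},
    "TruthfulQA": {"tinyllama": 37.4, "llama_instruct": 43.3, "kybalion": 40.0},
}


def deltas(baseline):
    """Kybalion-1B minus the baseline, in percentage points, per benchmark."""
    return {
        name: round(row["kybalion"] - row[baseline], 1)
        for name, row in scores.items()
    }


vs_tiny = deltas("tinyllama")
vs_instruct = deltas("llama_instruct")
gsm8k_ratio = scores["GSM8K"]["kybalion"] / scores["GSM8K"]["tinyllama"]

for name in scores:
    print(
        f"{name:<11} vs TinyLlama: {vs_tiny[name]:+5.1f} pp   "
        f"vs Instruct: {vs_instruct[name]:+5.1f} pp"
    )
print(f"GSM8K ratio vs TinyLlama: {gsm8k_ratio:.1f}x")
```

Every delta against TinyLlama-1.1B is positive, while against Llama-3.2-1B-Instruct only HellaSwag is positive and WinoGrande is exactly zero.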

---

## 🔧 Training Pipeline

### Phase 1: Continued Pre-Training (CPT)

Fine-tuned the base weights of `meta-llama/Llama-3.2-1B` on ~3.5B tokens of curated multi-domain data.

| Domain | Dataset | Ratio | Purpose |
|--------|---------|-------|---------|
| Education | FineWeb-Edu (score ≥ 3.0) | 35% | General knowledge & reasoning |
| Mathematics | OpenWebMath | 20% | Mathematical reasoning |
| Code | StarCoderData (Python) | 15% | Code generation |
| Textbook | Cosmopedia web_samples_v2 | 15% | Structured knowledge |
| Science | Cosmopedia stanford | 10% | Scientific reasoning |
| Story | Cosmopedia stories | 5% | Language fluency |
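
As a rough illustration of what the mix implies, the ratios can be converted into per-domain token budgets. This is a sketch under two assumptions that the table does not state: that the ratios are measured in tokens (not documents), and that the total is exactly 3.5B tokens, where the text only says "~3.5B".

```python
# Approximate total from the text ("~3.5B tokens"); treated as exact here.
TOTAL_TOKENS = 3_500_000_000

# Mixing ratios from the table above, in whole percent.
mix_percent = {
    "FineWeb-Edu (education)": 35,
    "OpenWebMath (mathematics)": 20,
    "StarCoderData Python (code)": 15,
    "Cosmopedia web_samples_v2 (textbook)": 15,
    "Cosmopedia stanford (science)": 10,
    "Cosmopedia stories (story)": 5,
}

# The ratios should cover the whole budget.
assert sum(mix_percent.values()) == 100

# Integer arithmetic avoids floating-point rounding in the budgets.
budget = {name: TOTAL_TOKENS * pct // 100 for name, pct in mix_percent.items()}

for name, tokens in budget.items():
    print(f"{name:<38} {tokens / 1e6:>7.0f}M tokens")
```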

**Training config:**
- Hardware: Google Colab A100 80GB
- Optimizer: AdamW, LR = 2e-5, Cosine decay, Warmup = 1000 steps
- Precision: BF16
- Effective batch size: 32 (4 × 8 grad accum)
- Sequence length: 2048 (packed)
- Framework: HuggingFace `transformers.Trainer` (no Unsloth)
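
The config above implies a token count per optimizer step and an approximate step count. The sketch below works that out and pairs it with a warmup-plus-cosine schedule. It assumes a single GPU, one pass over roughly 3.5B tokens, fully packed sequences, linear warmup, and decay to zero; none of these details are confirmed by the config list, and `lr_at` is a hypothetical helper, not the function used in training.

```python
import math

# Values from the training config above.
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8
SEQ_LEN = 2048
PEAK_LR = 2e-5
WARMUP_STEPS = 1000

# Assumption: "~3.5B tokens" treated as exact and seen once.
TOTAL_TOKENS = 3_500_000_000

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # sequences per optimizer step
tokens_per_step = effective_batch * SEQ_LEN       # tokens per optimizer step
total_steps = TOTAL_TOKENS // tokens_per_step     # approximate step count


def lr_at(step):
    """Learning rate at a given optimizer step (hypothetical helper).

    Linear warmup to PEAK_LR, then cosine decay to zero at total_steps.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"effective batch : {effective_batch} sequences")
print(f"tokens per step : {tokens_per_step:,}")
print(f"total steps     : {total_steps:,}")
for step in (0, 500, WARMUP_STEPS, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Under these assumptions the 1000 warmup steps cover a little under 2% of training.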

### Phase 2: Supervised Fine-Tuning (SFT)

Applied LoRA adapters to teach instruction-following, then merged into base weights.

| Dataset | Size | Purpose |
|---------|------|---------|
| OpenHermes 2.5 | 100K | General instruction following |
| MetaMathQA | 50K | Mathematical reasoning (GSM8K boost) |
| CodeAlpaca | 20K | Code generation |

**SFT config:**
- Method: LoRA (r=64, α=128, dropout=0.05)
- Target modules: q/k/v/o/gate/up/down proj (all linear layers)
- LR = 1e-4, Epochs = 3, Cosine decay
- Merged with `PeftModel.merge_and_unload()` for standalone deployment
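
To give a sense of the adapter size this config implies, the sketch below counts trainable LoRA parameters. Each adapted linear layer of shape `in × out` gains two low-rank matrices with `r * (in + out)` parameters in total. The model dimensions are assumptions taken from the public Llama-3.2-1B configuration (hidden size 2048, MLP size 8192, 16 layers, 8 KV heads of dimension 64) and should be checked against the model's `config.json`. The scaling factor uses the standard LoRA convention `α / r`.

```python
# LoRA hyperparameters from the SFT config above.
R, ALPHA = 64, 128

# Assumed Llama-3.2-1B dimensions (verify against config.json).
HIDDEN = 2048
INTERMEDIATE = 8192
LAYERS = 16
KV_DIM = 8 * 64  # 8 key/value heads x head dimension 64 (grouped-query attention)

# (in_features, out_features) for each targeted projection in one layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r=R):
    """Parameters in the A (r x in) and B (out x r) matrices of one adapter."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in shapes.values())
total = per_layer * LAYERS
scaling = ALPHA / R  # multiplier applied to the low-rank update

print(f"LoRA params per layer: {per_layer:,}")
print(f"LoRA params total    : {total:,}")
print(f"scaling (alpha / r)  : {scaling}")
```

Under these assumptions the adapters add about 45M trainable parameters, a few percent of the base model, and the merged model keeps its original size.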

---

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("devwoo/Kybalion-1B")
model = AutoModelForCausalLM.from_pretrained(
    "devwoo/Kybalion-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def chat(user_message, system="You are a helpful and knowledgeable AI assistant."):
    # Build a single-turn prompt by hand in the Llama 3 chat format.
    prompt = (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            # Stop generation at the end-of-turn token.
            eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(chat("Explain the Pythagorean theorem and give an example."))
print(chat("Write a Python function to check if a number is prime."))
```
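
The prompt string in `chat` can be generalized to a list of messages. The following is a minimal sketch that reuses the same special tokens as the example above; `build_prompt` is a hypothetical helper, and whether the model handles multi-turn conversations well is not stated in this card, so treat multi-turn use as an experiment.

```python
def build_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts in the Llama 3 chat layout.

    Hypothetical helper: it mirrors the hand-written prompt in chat(),
    one header/content/<|eot_id|> block per message.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful and knowledgeable AI assistant."},
    {"role": "user", "content": "What is a prime number?"},
]
prompt = build_prompt(conversation)
print(prompt)
```

For a system message plus one user message, this produces the same string that `chat` builds by hand.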

---

## 📦 GGUF Version

A quantized **GGUF q4_k_m** version is available at [devwoo/Kybalion-1B-GGUF](https://huggingface.co/devwoo/Kybalion-1B-GGUF) for CPU/mobile inference with [llama.cpp](https://github.com/ggerganov/llama.cpp) or [Ollama](https://ollama.com).

```bash
# With llama.cpp
./llama-cli -m Kybalion-1B-q4_k_m.gguf -p "Explain quantum computing." -n 256
```

---

## ⚠️ Limitations

- 1B parameters: smaller than most production models; may struggle with complex multi-step reasoning
- Not RLHF-aligned; may occasionally produce unhelpful or inconsistent responses
- English-only training data
- GSM8K score (10.8%) reflects room for improvement in math reasoning compared to larger models

---

## 📄 License

This model is derived from `meta-llama/Llama-3.2-1B` and follows the [Llama 3.2 Community License](https://ai.meta.com/llama/license/).
Training datasets are used under their respective open licenses.