---
language:
- en
license: llama3.2
base_model: meta-llama/Llama-3.2-1B
tags:
- llama
- continued-pretraining
- sft
- lora
- 1b
- math
- code
- education
- small-llm
datasets:
- HuggingFaceFW/fineweb-edu
- open-web-math/open-web-math
- bigcode/starcoderdata
- HuggingFaceTB/cosmopedia
- teknium/OpenHermes-2.5
- meta-math/MetaMathQA
- sahil2801/CodeAlpaca-20k
---
# Kybalion-1B
**Kybalion-1B** is a 1B-parameter language model built on top of [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B) through a full **Continued Pre-Training (CPT) → Supervised Fine-Tuning (SFT)** pipeline, trained entirely on a Google Colab A100.
> **Why "Kybalion"?**
> The model was originally developed under the internal codename *Prometheus-1B*, but was renamed to *Kybalion-1B* before public release to avoid confusion with an existing model of the same name on Hugging Face. *Kybalion* refers to the Hermetic text of that name, a symbol of hidden knowledge, which suits a model focused on education, mathematics, science, and code.
---
## Key Highlights
- **Beats Llama-3.2-1B-Instruct** on HellaSwag (63.8% vs 61.1%) and ties on WinoGrande (62.4%)
- **4.5× GSM8K improvement** over TinyLlama-1.1B (10.8% vs 2.4%), which suggests the math-focused training helps
- Outperforms TinyLlama-1.1B on **all 6 benchmarks**
- Trained by a single undergraduate student on consumer cloud hardware
---
## 🔬 Key Contributions
- Demonstrates that domain-balanced continued pretraining on curated multi-domain data (education, math, code, science) yields consistent improvements across commonsense reasoning benchmarks in 1B-scale models
- Suggests that multi-step mathematical reasoning remains a fundamental bottleneck for 1B-scale models, even when combining math-focused pretraining (OpenWebMath) with instruction tuning (MetaMathQA)
- Provides a fully reproducible, compute-efficient training recipe (CPT → LoRA SFT) built and executed **by a single undergraduate student in under one week**, demonstrating that meaningful LLM research is achievable without institutional resources or large teams
---
## Benchmark Results
All scores measured with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) under **identical conditions** (same prompts, same few-shot settings, same hardware).
| Benchmark | TinyLlama-1.1B | Llama-3.2-1B-Instruct | **Kybalion-1B** |
|-----------|:--------------:|:---------------------:|:---------------:|
| MMLU | 25.0% | 46.1% | **32.0%** |
| ARC-C | 37.2% | 41.5% | **37.6%** |
| GSM8K | 2.4% | 33.5% | **10.8%** |
| HellaSwag | 61.2% | 61.1% | **63.8%** 🏆 |
| WinoGrande | 61.8% | 62.4% | **62.4%** 🏆 |
| TruthfulQA | 37.4% | 43.3% | **40.0%** |
> 🏆 = matches or outperforms Llama-3.2-1B-Instruct
> All evaluations run with `lm_eval.simple_evaluate()`, bfloat16, batch_size=8, A100 GPU.
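
The headline claims can be re-derived from the table itself. The following is a minimal sketch that only re-uses the scores listed above; the dictionary keys and helper names are illustrative and are not part of any evaluation tooling.

```python
# Scores (in %) copied from the benchmark table; key names are illustrative.
SCORES = {
    "MMLU":       {"tinyllama": 25.0, "llama_instruct": 46.1, "kybalion": 32.0},
    "ARC-C":      {"tinyllama": 37.2, "llama_instruct": 41.5, "kybalion": 37.6},
    "GSM8K":      {"tinyllama": 2.4,  "llama_instruct": 33.5, "kybalion": 10.8},
    "HellaSwag":  {"tinyllama": 61.2, "llama_instruct": 61.1, "kybalion": 63.8},
    "WinoGrande": {"tinyllama": 61.8, "llama_instruct": 62.4, "kybalion": 62.4},
    "TruthfulQA": {"tinyllama": 37.4, "llama_instruct": 43.3, "kybalion": 40.0},
}


def wins_over(baseline: str) -> list[str]:
    """Benchmarks where Kybalion-1B scores strictly higher than `baseline`."""
    return [name for name, row in SCORES.items() if row["kybalion"] > row[baseline]]


def ties_with(baseline: str) -> list[str]:
    """Benchmarks where Kybalion-1B and `baseline` report the same score."""
    return [name for name, row in SCORES.items() if row["kybalion"] == row[baseline]]


# Relative GSM8K gain over TinyLlama-1.1B (10.8 / 2.4).
gsm8k_ratio = SCORES["GSM8K"]["kybalion"] / SCORES["GSM8K"]["tinyllama"]

print("Wins over TinyLlama-1.1B:", wins_over("tinyllama"))
print("Wins over Llama-3.2-1B-Instruct:", wins_over("llama_instruct"))
print("Ties with Llama-3.2-1B-Instruct:", ties_with("llama_instruct"))
print(f"GSM8K ratio vs TinyLlama-1.1B: {gsm8k_ratio:.1f}x")
```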
---
## 🔧 Training Pipeline
### Phase 1: Continued Pre-Training (CPT)
Fine-tuned the base weights of `meta-llama/Llama-3.2-1B` on ~3.5B tokens of curated multi-domain data.
| Domain | Dataset | Ratio | Purpose |
|--------|---------|-------|---------|
| Education | FineWeb-Edu (score ≥ 3.0) | 35% | General knowledge & reasoning |
| Mathematics | OpenWebMath | 20% | Mathematical reasoning |
| Code | StarCoderData (Python) | 15% | Code generation |
| Textbook | Cosmopedia web_samples_v2 | 15% | Structured knowledge |
| Science | Cosmopedia stanford | 10% | Scientific reasoning |
| Story | Cosmopedia stories | 5% | Language fluency |
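
To make the mix concrete, the ratios can be turned into approximate per-dataset token budgets. This is a rough sketch under two assumptions the card does not confirm: that "~3.5B tokens" is treated as exactly 3.5B, and that the ratios are shares of tokens rather than of documents.

```python
# Assumption: "~3.5B tokens" taken as exactly 3.5B for illustration.
TOTAL_TOKENS = 3_500_000_000

# Ratios from the table above, in percent (assumed to be token shares).
MIX_PERCENT = {
    "FineWeb-Edu (score >= 3.0)": 35,
    "OpenWebMath": 20,
    "StarCoderData (Python)": 15,
    "Cosmopedia web_samples_v2": 15,
    "Cosmopedia stanford": 10,
    "Cosmopedia stories": 5,
}


def token_budget(total: int, mix_percent: dict[str, int]) -> dict[str, int]:
    """Split `total` tokens across datasets according to integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in mix_percent.items()}


budget = token_budget(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budget.items():
    print(f"{name:<28} {tokens / 1e9:.3f}B tokens")
```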
**Training config:**
- Hardware: Google Colab A100 80GB
- Optimizer: AdamW, LR = 2e-5, Cosine decay, Warmup = 1000 steps
- Precision: BF16
- Effective batch size: 32 (4 × 8 grad accum)
- Sequence length: 2048 (packed)
- Framework: HuggingFace `transformers.Trainer` (no Unsloth)
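
The config above implies a step count and a learning-rate curve, sketched below. The assumptions are mine, not the card's: a single pass over exactly 3.5B tokens, fully packed 2048-token sequences on one GPU, and a cosine decay that reaches zero at the last step (the shape that the `transformers` cosine scheduler uses by default). The actual run may differ.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM = 8
SEQ_LEN = 2048
TOTAL_TOKENS = 3_500_000_000  # assumption: "~3.5B" taken as exact, one pass
PEAK_LR = 2e-5
WARMUP_STEPS = 1000

# Tokens consumed per optimizer step with fully packed sequences.
tokens_per_step = PER_DEVICE_BATCH * GRAD_ACCUM * SEQ_LEN
# Optimizer steps needed to see every token once.
total_steps = math.ceil(TOTAL_TOKENS / tokens_per_step)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero at `total_steps`."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"tokens per step: {tokens_per_step:,}")
print(f"total steps:     {total_steps:,}")
print(f"warmup share:    {WARMUP_STEPS / total_steps:.1%}")
for step in (0, 500, 1000, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```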
### Phase 2: Supervised Fine-Tuning (SFT)
Applied LoRA adapters to teach instruction-following, then merged into base weights.
| Dataset | Size | Purpose |
|---------|------|---------|
| OpenHermes 2.5 | 100K | General instruction following |
| MetaMathQA | 50K | Mathematical reasoning (GSM8K boost) |
| CodeAlpaca | 20K | Code generation |
**SFT config:**
- Method: LoRA (r=64, α=128, dropout=0.05)
- Target modules: q/k/v/o/gate/up/down proj (all linear layers)
- LR = 1e-4, Epochs = 3, Cosine decay
- Merged with `PeftModel.merge_and_unload()` for standalone deployment
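
For a sense of scale, the number of trainable LoRA parameters implied by this config can be estimated by hand. The layer shapes below are assumptions based on the published Llama-3.2-1B configuration (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of dimension 64) and should be checked against the model's `config.json`. The sketch also assumes standard LoRA, where each adapted weight of shape `(out, in)` adds `r * (in + out)` parameters and the update is scaled by `α / r`.

```python
R = 64
ALPHA = 128

# Assumed Llama-3.2-1B shapes; verify against the model's config.json.
HIDDEN = 2048
INTERMEDIATE = 8192
KV_DIM = 8 * 64  # num_key_value_heads * head_dim
NUM_LAYERS = 16

# (in_features, out_features) for each targeted projection in one layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in one adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGETS.values())
total_trainable = per_layer * NUM_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling (alpha / r):  {scaling}")
```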
---
## 💻 Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the merged model and its tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained("devwoo/Kybalion-1B")
model = AutoModelForCausalLM.from_pretrained(
    "devwoo/Kybalion-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


def chat(user_message, system="You are a helpful and knowledgeable AI assistant."):
    # Build a single-turn prompt in the Llama 3 chat format.
    prompt = (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            # Stop generating at the end-of-turn token.
            eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


print(chat("Explain the Pythagorean theorem and give an example."))
print(chat("Write a Python function to check if a number is prime."))
```
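
The prompt layout used in `chat` can be generalized to a list of messages. The helper below is a hypothetical sketch, not part of the released code: `build_prompt` is an illustrative name, and the card only demonstrates single-turn prompts, so how well the model handles multi-turn history is an open assumption.

```python
def build_prompt(messages: list[dict[str, str]]) -> str:
    """Format chat messages with the same Llama 3 header layout as `chat`.

    Each message is a dict with a "role" ("system", "user" or "assistant")
    and a "content" string. The result ends with an open assistant header
    so that generation continues as the assistant's reply.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


# Hypothetical conversation, used only to show the resulting layout.
history = [
    {"role": "system", "content": "You are a helpful and knowledgeable AI assistant."},
    {"role": "user", "content": "What is a prime number?"},
    {"role": "assistant", "content": "A number greater than 1 with no divisors other than 1 and itself."},
    {"role": "user", "content": "Is 91 prime?"},
]
print(build_prompt(history))
```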
---
## 📦 GGUF Version
A quantized **GGUF q4_k_m** version is available at [devwoo/Kybalion-1B-GGUF](https://huggingface.co/devwoo/Kybalion-1B-GGUF) for CPU/mobile inference with [llama.cpp](https://github.com/ggerganov/llama.cpp) or [Ollama](https://ollama.com).
```bash
# With llama.cpp
./llama-cli -m Kybalion-1B-q4_k_m.gguf -p "Explain quantum computing." -n 256
```
---
## ⚠️ Limitations
- 1B parameters: smaller than most production models; may struggle with complex multi-step reasoning
- Not RLHF-aligned; may occasionally produce unhelpful or inconsistent responses
- English-only training data
- GSM8K score (10.8%) trails Llama-3.2-1B-Instruct (33.5%) at the same parameter count, so math reasoning has clear room for improvement
---
## License
This model is derived from `meta-llama/Llama-3.2-1B` and follows the [Llama 3.2 Community License](https://ai.meta.com/llama/license/).
Training datasets are used under their respective open licenses.