Update README.md

a145447 verified about 2 months ago

6.53 kB

	---
	language:
	- en
	license: llama3.2
	base_model: meta-llama/Llama-3.2-1B
	tags:
	- llama
	- continued-pretraining
	- sft
	- lora
	- 1b
	- math
	- code
	- education
	- small-llm
	datasets:
	- HuggingFaceFW/fineweb-edu
	- open-web-math/open-web-math
	- bigcode/starcoderdata
	- HuggingFaceTB/cosmopedia
	- teknium/OpenHermes-2.5
	- meta-math/MetaMathQA
	- sahil2801/CodeAlpaca-20k
	---

	# Kybalion-1B

	Kybalion-1B is a 1B-parameter language model built on top of [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B) through a full Continued Pre-Training (CPT) → Supervised Fine-Tuning (SFT) pipeline, trained entirely on Google Colab A100.

	> Why "Kybalion"?
	> The model was originally developed under the internal codename Prometheus-1B, but was renamed to Kybalion-1B before public release to avoid confusion with an existing model of the same name on HuggingFace. Kybalion refers to the ancient hermetic text symbolizing hidden knowledge — fitting for a model focused on education, mathematics, science, and code.

	---

	## 🏆 Key Highlights

	- Beats Llama-3.2-1B-Instruct on HellaSwag (63.8% vs 61.1%) and ties on WinoGrande (62.4%)
	- 4.5× GSM8K improvement over TinyLlama-1.1B (10.8% vs 2.4%) — math pretraining works
	- Outperforms TinyLlama-1.1B on all 6 benchmarks
	- Trained by a single undergraduate student on consumer cloud hardware

	---

	## 🔬 Key Contributions

	- Demonstrates that domain-balanced continued pretraining on curated multi-domain data (education, math, code, science) yields consistent improvements across commonsense reasoning benchmarks in 1B-scale models
	- Suggests that multi-step mathematical reasoning remains a fundamental bottleneck for 1B-scale models, even when combining math-focused pretraining (OpenWebMath) with instruction tuning (MetaMathQA)
	- Provides a fully reproducible, compute-efficient training recipe (CPT → LoRA SFT) built and executed by a single undergraduate student in under one week, demonstrating that meaningful LLM research is achievable without institutional resources or large teams

	---

	## 📊 Benchmark Results

	All scores measured with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) under identical conditions (same prompts, same few-shot settings, same hardware).

	\| Benchmark \| TinyLlama-1.1B \| Llama-3.2-1B-Instruct \| Kybalion-1B \|
	\|-----------\|:--------------:\|:---------------------:\|:---------------:\|
	\| MMLU \| 25.0% \| 46.1% \| 32.0% \|
	\| ARC-C \| 37.2% \| 41.5% \| 37.6% \|
	\| GSM8K \| 2.4% \| 33.5% \| 10.8% \|
	\| HellaSwag \| 61.2% \| 61.1% \| 63.8% 🏆 \|
	\| WinoGrande \| 61.8% \| 62.4% \| 62.4% 🏆 \|
	\| TruthfulQA \| 37.4% \| 43.3% \| 40.0% \|

	> 🏆 = outperforms Llama-3.2-1B-Instruct
	> All evaluations run with `lm_eval.simple_evaluate()`, bfloat16, batch_size=8, A100 GPU.

	---

	## 🔧 Training Pipeline

	### Phase 1: Continued Pre-Training (CPT)

	Fine-tuned the base weights of `meta-llama/Llama-3.2-1B` on ~3.5B tokens of curated multi-domain data.

	\| Domain \| Dataset \| Ratio \| Purpose \|
	\|--------\|---------\|-------\|---------\|
	\| Education \| FineWeb-Edu (score ≥ 3.0) \| 35% \| General knowledge & reasoning \|
	\| Mathematics \| OpenWebMath \| 20% \| Mathematical reasoning \|
	\| Code \| StarCoderData (Python) \| 15% \| Code generation \|
	\| Textbook \| Cosmopedia web_samples_v2 \| 15% \| Structured knowledge \|
	\| Science \| Cosmopedia stanford \| 10% \| Scientific reasoning \|
	\| Story \| Cosmopedia stories \| 5% \| Language fluency \|

	Training config:
	- Hardware: Google Colab A100 80GB
	- Optimizer: AdamW, LR = 2e-5, Cosine decay, Warmup = 1000 steps
	- Precision: BF16
	- Effective batch size: 32 (4 × 8 grad accum)
	- Sequence length: 2048 (packed)
	- Framework: HuggingFace `transformers.Trainer` (no Unsloth)

	### Phase 2: Supervised Fine-Tuning (SFT)

	Applied LoRA adapters to teach instruction-following, then merged into base weights.

	\| Dataset \| Size \| Purpose \|
	\|---------\|------\|---------\|
	\| OpenHermes 2.5 \| 100K \| General instruction following \|
	\| MetaMathQA \| 50K \| Mathematical reasoning (GSM8K boost) \|
	\| CodeAlpaca \| 20K \| Code generation \|

	SFT config:
	- Method: LoRA (r=64, α=128, dropout=0.05)
	- Target modules: q/k/v/o/gate/up/down proj (all linear layers)
	- LR = 1e-4, Epochs = 3, Cosine decay
	- Merged with `PeftModel.merge_and_unload()` for standalone deployment

	---

	## 💻 Usage

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	tokenizer = AutoTokenizer.from_pretrained("devwoo/Kybalion-1B")
	model = AutoModelForCausalLM.from_pretrained(
	"devwoo/Kybalion-1B",
	torch_dtype=torch.bfloat16,
	device_map="auto",
	)

	def chat(user_message, system="You are a helpful and knowledgeable AI assistant."):
	prompt = (
	f"<\|begin_of_text\|><\|start_header_id\|>system<\|end_header_id\|>\n\n"
	f"{system}<\|eot_id\|>"
	f"<\|start_header_id\|>user<\|end_header_id\|>\n\n"
	f"{user_message}<\|eot_id\|>"
	f"<\|start_header_id\|>assistant<\|end_header_id\|>\n\n"
	)
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	with torch.no_grad():
	outputs = model.generate(
	**inputs,
	max_new_tokens=512,
	temperature=0.7,
	top_p=0.9,
	do_sample=True,
	eos_token_id=tokenizer.convert_tokens_to_ids("<\|eot_id\|>"),
	)
	return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

	print(chat("Explain the Pythagorean theorem and give an example."))
	print(chat("Write a Python function to check if a number is prime."))
	```

	---

	## 📦 GGUF Version

	A quantized GGUF q4_k_m version is available at [devwoo/Kybalion-1B-GGUF](https://huggingface.co/devwoo/Kybalion-1B-GGUF) for CPU/mobile inference with [llama.cpp](https://github.com/ggerganov/llama.cpp) or [Ollama](https://ollama.com).

	```bash
	# With llama.cpp
	./llama-cli -m Kybalion-1B-q4_k_m.gguf -p "Explain quantum computing." -n 256
	```

	---

	## ⚠️ Limitations

	- 1B parameters — smaller than most production models; may struggle with complex multi-step reasoning
	- Not RLHF-aligned; may occasionally produce unhelpful or inconsistent responses
	- English-only training data
	- GSM8K score (10.8%) reflects room for improvement in math reasoning compared to larger models

	---

	## 📄 License

	This model is derived from `meta-llama/Llama-3.2-1B` and follows the [Llama 3.2 Community License](https://ai.meta.com/llama/license/).
	Training datasets are used under their respective open licenses.