---
language:
- en
license: llama3.2
base_model: meta-llama/Llama-3.2-1B
tags:
- llama
- continued-pretraining
- sft
- lora
- 1b
- math
- code
- education
- small-llm
datasets:
- HuggingFaceFW/fineweb-edu
- open-web-math/open-web-math
- bigcode/starcoderdata
- HuggingFaceTB/cosmopedia
- teknium/OpenHermes-2.5
- meta-math/MetaMathQA
- sahil2801/CodeAlpaca-20k
---

# Kybalion-1B

**Kybalion-1B** is a 1B-parameter language model built on top of [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B) through a full **Continued Pre-Training (CPT) → Supervised Fine-Tuning (SFT)** pipeline, trained entirely on a Google Colab A100 GPU.

> **Why "Kybalion"?**
> The model was originally developed under the internal codename *Prometheus-1B* and was renamed to *Kybalion-1B* before public release to avoid confusion with an existing model of the same name on Hugging Face. The name refers to *The Kybalion*, a Hermetic text (first published in 1908) associated with hidden knowledge, which suits a model focused on education, mathematics, science, and code.

---

## Key Highlights

- **Beats Llama-3.2-1B-Instruct** on HellaSwag (63.8% vs 61.1%) and ties it on WinoGrande (62.4%)
- **4.5× GSM8K improvement** over TinyLlama-1.1B (10.8% vs 2.4%), which suggests that the math-focused pretraining helps
- Outperforms TinyLlama-1.1B on **all 6 benchmarks**
- Trained by a single undergraduate student on consumer cloud hardware

---

## 🔬 Key Contributions

- Demonstrates that domain-balanced continued pretraining on curated data (education, math, code, science) yields a 1B-scale model that is competitive on commonsense reasoning benchmarks (HellaSwag, WinoGrande)
- Suggests that multi-step mathematical reasoning remains a major bottleneck for 1B-scale models (10.8% on GSM8K), even when math-focused pretraining (OpenWebMath) is combined with instruction tuning (MetaMathQA)
- Provides a fully reproducible, compute-efficient training recipe (CPT → LoRA SFT) built and executed **by a single undergraduate student in under one week**, showing that meaningful LLM research is achievable without institutional resources or large teams

---

## Benchmark Results

All scores were measured with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) under **identical conditions** (same prompts, same few-shot settings, same hardware).

| Benchmark | TinyLlama-1.1B | Llama-3.2-1B-Instruct | **Kybalion-1B** |
|-----------|:--------------:|:---------------------:|:---------------:|
| MMLU | 25.0% | 46.1% | **32.0%** |
| ARC-C | 37.2% | 41.5% | **37.6%** |
| GSM8K | 2.4% | 33.5% | **10.8%** |
| HellaSwag | 61.2% | 61.1% | **63.8%** † |
| WinoGrande | 61.8% | 62.4% | **62.4%** † |
| TruthfulQA | 37.4% | 43.3% | **40.0%** |

> † = matches or outperforms Llama-3.2-1B-Instruct.
> All evaluations were run with `lm_eval.simple_evaluate()`, bfloat16, batch_size=8, on an A100 GPU.

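The evaluation settings quoted above could be reproduced along the following lines. This is a minimal sketch rather than the script behind the table: the task identifiers (including the `truthfulqa_mc2` variant) are assumed names from the lm-evaluation-harness task registry, few-shot counts are left at each task's default because the card does not state them, and `build_eval_kwargs` and `run_eval` are hypothetical helpers. Calling `run_eval("devwoo/Kybalion-1B")` downloads the model and requires a GPU.

```python
# Sketch of an lm-evaluation-harness run with the stated settings
# (bfloat16, batch_size=8). Task names are ASSUMED harness identifiers;
# the card does not list exact task variants or few-shot counts.

ASSUMED_TASKS = [
    "mmlu",
    "arc_challenge",
    "gsm8k",
    "hellaswag",
    "winogrande",
    "truthfulqa_mc2",
]


def build_eval_kwargs(model_id: str) -> dict:
    """Return keyword arguments for lm_eval.simple_evaluate (hypothetical helper)."""
    return {
        "model": "hf",
        "model_args": f"pretrained={model_id},dtype=bfloat16",
        "tasks": list(ASSUMED_TASKS),
        "batch_size": 8,
    }


def run_eval(model_id: str) -> dict:
    """Run the harness and return the per-task metrics dictionary."""
    import lm_eval  # imported lazily: requires `pip install lm-eval`

    results = lm_eval.simple_evaluate(**build_eval_kwargs(model_id))
    return results["results"]


# Example call (downloads the model, needs a GPU):
# metrics = run_eval("devwoo/Kybalion-1B")
```
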
---

## 🔧 Training Pipeline

### Phase 1: Continued Pre-Training (CPT)

The base weights of `meta-llama/Llama-3.2-1B` were further pre-trained on ~3.5B tokens of curated multi-domain data.

| Domain | Dataset | Ratio | Purpose |
|--------|---------|-------|---------|
| Education | FineWeb-Edu (score ≥ 3.0) | 35% | General knowledge & reasoning |
| Mathematics | OpenWebMath | 20% | Mathematical reasoning |
| Code | StarCoderData (Python) | 15% | Code generation |
| Textbook | Cosmopedia web_samples_v2 | 15% | Structured knowledge |
| Science | Cosmopedia stanford | 10% | Scientific reasoning |
| Story | Cosmopedia stories | 5% | Language fluency |

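To make the mixture concrete, the ratios in the table can be turned into an approximate token budget per dataset. This is only back-of-the-envelope arithmetic: the total is the card's rounded "~3.5B" figure, and `token_budget` is a hypothetical helper, not part of the training code.

```python
# Approximate per-dataset token budget implied by the mixture table.
# TOTAL_TOKENS is the card's rounded "~3.5B" figure, so results are estimates.

TOTAL_TOKENS = 3_500_000_000

MIX_PERCENT = {
    "FineWeb-Edu": 35,
    "OpenWebMath": 20,
    "StarCoderData (Python)": 15,
    "Cosmopedia web_samples_v2": 15,
    "Cosmopedia stanford": 10,
    "Cosmopedia stories": 5,
}


def token_budget(total_tokens: int, mix_percent: dict) -> dict:
    """Split a total token count across datasets by integer percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mixture percentages must sum to 100")
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}


budget = token_budget(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budget.items():
    print(f"{name:<28} {tokens / 1e9:.3f}B tokens")
```
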
**Training config:**
- Hardware: Google Colab A100 80GB
- Optimizer: AdamW, LR = 2e-5, cosine decay, warmup = 1000 steps
- Precision: BF16
- Effective batch size: 32 (batch size 4 × 8 gradient accumulation steps)
- Sequence length: 2048 (packed)
- Framework: HuggingFace `transformers.Trainer` (no Unsloth)

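The config above implies roughly 65K tokens per optimizer step and on the order of 53K steps, so the 1000-step warmup covers under 2% of training. The sketch below works this out and traces the learning rate. It assumes every packed sequence holds exactly 2048 tokens, uses the rounded ~3.5B token total, and assumes the cosine schedule decays to zero after a linear warmup; `lr_at` is a hypothetical helper written for illustration, not the scheduler used in training.

```python
import math

# Values from the training config; TOTAL_TOKENS is the card's approximate
# "~3.5B" figure, so the step count is an estimate.
TOTAL_TOKENS = 3_500_000_000
SEQ_LEN = 2048
EFFECTIVE_BATCH = 4 * 8  # batch size x gradient accumulation steps
PEAK_LR = 2e-5
WARMUP_STEPS = 1000

tokens_per_step = EFFECTIVE_BATCH * SEQ_LEN  # assumes fully packed sequences
total_steps = math.ceil(TOTAL_TOKENS / tokens_per_step)


def lr_at(step: int) -> float:
    """Linear warmup, then half-cosine decay to zero (ASSUMED schedule shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"estimated optimizer steps: {total_steps:,}")
for step in (0, 500, WARMUP_STEPS, total_steps // 2, total_steps):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```
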
### Phase 2: Supervised Fine-Tuning (SFT)

LoRA adapters were trained to teach instruction following, then merged into the model weights.

| Dataset | Size | Purpose |
|---------|------|---------|
| OpenHermes 2.5 | 100K | General instruction following |
| MetaMathQA | 50K | Mathematical reasoning (GSM8K boost) |
| CodeAlpaca | 20K | Code generation |

**SFT config:**
- Method: LoRA (r=64, α=128, dropout=0.05)
- Target modules: q/k/v/o/gate/up/down projections (all linear layers)
- LR = 1e-4, epochs = 3, cosine decay
- Merged with `PeftModel.merge_and_unload()` for standalone deployment

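The SFT settings above map onto the `peft` library roughly as follows. This is a sketch under stated assumptions, not the author's training script: the module names are the standard Llama projection names in `transformers`, the layer shapes are assumed from the published Llama-3.2-1B configuration (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of dimension 64) and should be checked against the checkpoint's `config.json`, and `adapter_dir` is a hypothetical path. Under those assumptions the adapters add about 45M trainable parameters with a scaling factor α/r of 2.

```python
# Sketch of the LoRA setup described above, assuming the `peft` library.
# Layer shapes are ASSUMED from the published Llama-3.2-1B config; verify
# them against config.json before relying on the parameter count.

R, ALPHA, DROPOUT = 64, 128, 0.05
NUM_LAYERS = 16

# (in_features, out_features) of each targeted projection in one block.
ASSUMED_SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}


def lora_param_count(rank: int) -> int:
    """Trainable parameters: each adapter adds rank * (in + out) weights."""
    per_layer = sum(rank * (i + o) for i, o in ASSUMED_SHAPES.values())
    return per_layer * NUM_LAYERS


def make_lora_config():
    """Build a peft LoraConfig with the hyperparameters from the card."""
    from peft import LoraConfig  # lazy import: requires `pip install peft`

    return LoraConfig(
        r=R,
        lora_alpha=ALPHA,
        lora_dropout=DROPOUT,
        target_modules=list(ASSUMED_SHAPES),
        task_type="CAUSAL_LM",
    )


def merge_adapter(base_model, adapter_dir: str):
    """Fold trained adapter weights into the base model for standalone use."""
    from peft import PeftModel  # adapter_dir is a hypothetical local path

    return PeftModel.from_pretrained(base_model, adapter_dir).merge_and_unload()


print(f"LoRA scaling alpha/r = {ALPHA / R}")
print(f"approx. trainable parameters: {lora_param_count(R):,}")
```
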
---

## 💻 Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("devwoo/Kybalion-1B")
model = AutoModelForCausalLM.from_pretrained(
    "devwoo/Kybalion-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


def chat(user_message, system="You are a helpful and knowledgeable AI assistant."):
    # Build a single-turn prompt in the Llama 3 chat format.
    prompt = (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    # The prompt already starts with <|begin_of_text|>, so keep the tokenizer
    # from prepending a second BOS token.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        )
    # Decode only the newly generated tokens, not the prompt.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)


print(chat("Explain the Pythagorean theorem and give an example."))
print(chat("Write a Python function to check if a number is prime."))
```

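The single-turn prompt in `chat` can be generalized to a conversation history by repeating the same header and `<|eot_id|>` pattern for each message. The helper below is a hypothetical illustration of that string format only. Whether the model handles multi-turn conversations well is an open assumption, since the card does not say whether the SFT data included them.

```python
# Hypothetical helper that extends the single-turn prompt format to a list
# of messages, using the same Llama 3 style special tokens.


def build_prompt(messages: list[dict]) -> str:
    """Render [{'role': ..., 'content': ...}, ...] and open an assistant turn."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Leave the assistant header open so generation continues from here.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


history = [
    {"role": "system", "content": "You are a helpful and knowledgeable AI assistant."},
    {"role": "user", "content": "What is a prime number?"},
    {"role": "assistant", "content": "A prime number has exactly two positive divisors."},
    {"role": "user", "content": "Is 91 prime?"},
]
prompt = build_prompt(history)
print(prompt)
```
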
---

## 📦 GGUF Version

A quantized **GGUF q4_k_m** version is available at [devwoo/Kybalion-1B-GGUF](https://huggingface.co/devwoo/Kybalion-1B-GGUF) for CPU/mobile inference with [llama.cpp](https://github.com/ggerganov/llama.cpp) or [Ollama](https://ollama.com).

```bash
# With llama.cpp
./llama-cli -m Kybalion-1B-q4_k_m.gguf -p "Explain quantum computing." -n 256
```

---

## ⚠️ Limitations

- 1B parameters: smaller than most production models, so it may struggle with complex multi-step reasoning
- Not RLHF-aligned; may occasionally produce unhelpful or inconsistent responses
- English-only training data
- GSM8K score (10.8%) remains well below Llama-3.2-1B-Instruct (33.5%), leaving clear room for improvement in math reasoning

---

## License

This model is derived from `meta-llama/Llama-3.2-1B` and is distributed under the [Llama 3.2 Community License](https://ai.meta.com/llama/license/).
Training datasets are used under their respective open licenses.