---
language:
- en
- vi
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- compositional-reasoning
- qwen3
- lora-finetuned
- knowforge
---

# KnowForge-0.6B

Qwen3-0.6B fine-tuned with LoRA on the KnowForge dataset, a synthetic benchmark for **compositional rule-following and structured reasoning** over fabricated rule systems.

The model learns to apply rules it has never seen before to novel entity configurations, without relying on world knowledge.

---

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("qox/knowforge-0.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "qox/knowforge-0.6B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": (
            "You are given rules for a fictional system that does NOT exist in the real world. "
            "Reason STRICTLY from the rules provided. Do NOT use any outside knowledge. "
            "Show your reasoning inside <think>...</think> tags before giving your final answer."
        ),
    },
    {
        "role": "user",
        "content": (
            "ZELPH RELATIONS:\n"
            "  stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5\n\n"
            "Facts:\n"
            "  energy(gamma) = 3\n"
            "  energy(delta) = 12\n\n"
            "Question: Is delta stronger than gamma?"
        ),
    },
]

# Tokenize the chat once. return_dict=True yields input_ids and attention_mask,
# which can be unpacked straight into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```

Or use the bundled `inference.py`:

```bash
pip install -r requirements.txt
python inference.py "ZELPH RELATIONS: stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5. Facts: energy(gamma) = 3, energy(delta) = 12. Question: Is delta stronger than gamma?"
```

```python
from inference import ask
result = ask("ZELPH RELATIONS: ...")
print(result["answer"])     # "yes"
print(result["reasoning"])  # chain-of-thought inside <think>
```
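
As a sanity check on the example prompt, the ZELPH rule can be evaluated by hand: 12 > 3 × 1.5 = 4.5, so the expected answer is "yes". The sketch below only mirrors that arithmetic; the `stronger` function and the `energy` dictionary are illustrative names and are not part of the model or of `inference.py`.

```python
# Hand-evaluation of the ZELPH rule from the example prompt.
# All names here are illustrative; none of them come from inference.py.
energy = {"gamma": 3, "delta": 12}


def stronger(a: str, b: str, facts: dict) -> bool:
    """stronger(A,B) is TRUE when energy(A) > energy(B) * 1.5."""
    return facts[a] > facts[b] * 1.5


print(stronger("delta", "gamma", energy))  # compares 12 > 4.5
print(stronger("gamma", "delta", energy))  # compares 3 > 18.0
```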

---

## Task Description

KnowForge presents the model with a **fabricated rule system** (e.g. "ZELPH RULES", "FRAE SPACE") and asks it to apply those rules to novel facts. The model must reason purely from the stated rules; no world knowledge applies.

Three transform types are covered:

### 1. `linear_to_cyclic`
Modular arithmetic in cyclic domains (clocks, calendars, wrap-around sequences).
> "A clock shows 10. Add 5 hours. What time is it?" → 3
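
The clock example can be reproduced with ordinary modular arithmetic. This is a minimal sketch, assuming a 12-hour face labelled 1 to 12 (which is what the answer 3 implies); `add_on_cycle` is a hypothetical helper, not something shipped with the dataset or the model.

```python
def add_on_cycle(start: int, step: int, size: int, first: int = 1) -> int:
    """Advance `start` by `step` positions on a cycle of `size` labels.

    Labels run from `first` to `first + size - 1`. A 12-hour clock face
    uses size=12 and first=1, so 12 is followed by 1.
    """
    return (start - first + step) % size + first


print(add_on_cycle(10, 5, size=12))          # clock example: 3
print(add_on_cycle(5, 3, size=7, first=0))   # zero-based 7-cycle: 1
```

Shifting by `first` before taking the remainder keeps the same function usable for one-based cycles (clock faces, months) and zero-based ones (indices).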

### 2. `relation_to_graph`
Transitive relation queries over a directed graph of entities.
> "A is taller than B. B is taller than C. Is A taller than C?" → yes
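
One way to picture this task is as reachability in a directed graph: each stated fact is an edge, and the query asks whether a path exists. The breadth-first search below is only a sketch of that idea, assuming the relation is transitive; `holds_transitively` is a hypothetical name, and nothing here describes how the model itself computes the answer.

```python
from collections import defaultdict, deque


def holds_transitively(facts, a, b):
    """Return True if a chain of stated facts leads from `a` to `b`."""
    graph = defaultdict(set)
    for x, y in facts:
        graph[x].add(y)  # edge x -> y means "x is taller than y"

    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


facts = [("A", "B"), ("B", "C")]
print(holds_transitively(facts, "A", "C"))  # True
print(holds_transitively(facts, "C", "A"))  # False
```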

### 3. `relation_property_check`
Structural property checks on declared relation systems (transitivity, symmetry, etc.).
> "Rule: X beats Y means Y does not beat X. Does this hold for all pairs?" → conditional
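
For a finite relation given as a set of ordered pairs, these properties can be checked exhaustively. The sketch below uses made-up entity names and hypothetical helper functions; it illustrates the definitions only and is not taken from the KnowForge generator.

```python
from itertools import product


def is_symmetric(pairs):
    """Every (x, y) has its mirror (y, x)."""
    return all((y, x) in pairs for x, y in pairs)


def is_asymmetric(pairs):
    """No (x, y) has its mirror (y, x): 'X beats Y means Y does not beat X'."""
    return all((y, x) not in pairs for x, y in pairs)


def is_transitive(pairs):
    """Whenever (x, y) and (y, w) are present, (x, w) is present too."""
    return all(
        (x, w) in pairs
        for (x, y), (z, w) in product(pairs, repeat=2)
        if y == z
    )


# A fabricated three-way cycle: vok beats zim, zim beats tal, tal beats vok.
beats = {("vok", "zim"), ("zim", "tal"), ("tal", "vok")}
print(is_asymmetric(beats))  # True
print(is_symmetric(beats))   # False
print(is_transitive(beats))  # False
```

The cycle shows why the properties are checked separately: a relation can satisfy the asymmetry rule from the example while still failing transitivity.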

Each question may require multi-step reasoning and chain-of-thought inside `<think>...</think>` before the final answer.

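
When post-processing raw generations, the reasoning and the final answer can be separated with a small parser. The following is a minimal sketch, assuming the output contains at most one `<think>...</think>` block followed by the answer; `split_reasoning` and the sample string are hypothetical and do not necessarily match what the bundled `inference.py` does.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str):
    """Split raw model output into (reasoning, answer).

    If no <think> block is found, reasoning is empty and the whole
    text is treated as the answer.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


# Hand-written sample output, not an actual model generation.
raw = "<think>12 > 3 * 1.5 = 4.5, so the rule holds.</think>\nyes"
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```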

---

## Performance

Results from the Phase 1d.1 evaluation on the held-out test set (1,118 examples) and the adversarial set:

| Metric | Score |
|---|---|
| **final_correct (test)** | **64.31%** |
| **final_correct (adversarial)** | **66.67%** |
| executor_success (test) | 94.81% |
| transform_acc (test) | 99.64% |
| slot_sem_f1 (test) | 0.648 |

Comparison against TF-IDF baseline:
- TF-IDF final_correct: 15.21% (test), 10.34% (adversarial)
- This model: +49.1 pp on test, +56.3 pp on adversarial
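
The percentage-point gains are plain differences between the two `final_correct` scores. A short check, using only the figures reported in this section (the variable names are illustrative):

```python
# final_correct scores in percent, copied from the figures reported above.
model_scores = {"test": 64.31, "adversarial": 66.67}
tfidf_scores = {"test": 15.21, "adversarial": 10.34}

gains = {
    split: round(model_scores[split] - tfidf_scores[split], 1)
    for split in model_scores
}
for split, gain in gains.items():
    print(f"{split}: +{gain} pp")
```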

---

## Base Model

**Qwen3-0.6B** (Apache 2.0), fine-tuned with LoRA on the KnowForge synthetic dataset.
The LoRA adapter was merged into the base weights before publishing; this is a self-contained model.

---

## Limitations

- **Synthetic data only.** Trained entirely on procedurally generated rule systems. Behaviour on real-world reasoning tasks (MMLU, GSM8K, etc.) is not evaluated.
- **English and Vietnamese.** Dataset contains both; performance may vary by language.
- **Short rule systems.** Designed for rule sets that fit in a single context window. Very long or deeply nested rule systems may degrade accuracy.
- **CPU is slow.** Model is 0.6B parameters at float16. Inference on CPU is feasible but slow (~5–30 s/query depending on hardware). Use a GPU for interactive use.
- **Chain-of-thought required.** The model was trained to emit `<think>...</think>` before answering. Prompts that suppress reasoning may reduce accuracy.
- **No world knowledge grounding.** The model will follow stated rules even when they conflict with reality. This is by design.
