---
language:
- en
- vi
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- compositional-reasoning
- qwen3
- lora-finetuned
- knowforge
---

# KnowForge-0.6B

Qwen3-0.6B fine-tuned with LoRA on the KnowForge dataset, a synthetic benchmark for **compositional rule-following and structured reasoning** over fabricated rule systems. The model learns to apply rules it has never seen before to novel entity configurations, without relying on world knowledge.

---

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("qox/knowforge-0.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "qox/knowforge-0.6B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": (
            "You are given rules for a fictional system that does NOT exist in the real world. "
            "Reason STRICTLY from the rules provided. Do NOT use any outside knowledge. "
            "Show your reasoning inside ... tags before giving your final answer."
        ),
    },
    {
        "role": "user",
        "content": (
            "ZELPH RELATIONS:\n"
            "  stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5\n\n"
            "Facts:\n"
            "  energy(gamma) = 3\n"
            "  energy(delta) = 12\n\n"
            "Question: Is delta stronger than gamma?"
        ),
    },
]

# Tokenize once; return_dict=True yields input_ids and attention_mask for generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```

Or use the bundled `inference.py`:

```bash
pip install -r requirements.txt
python inference.py "ZELPH RELATIONS: stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5. Facts: energy(gamma) = 3, energy(delta) = 12. Question: Is delta stronger than gamma?"
```

```python
from inference import ask

result = ask("ZELPH RELATIONS: ...")
print(result["answer"])     # "yes"
print(result["reasoning"])  # chain-of-thought reasoning
```

---

## Task Description

KnowForge presents the model with a **fabricated rule system** (e.g. "ZELPH RULES", "FRAE SPACE") and asks it to apply those rules to novel facts. The model must reason purely from the stated rules; no world knowledge applies.

Three transform types are covered:

### 1. `linear_to_cyclic`

Modular arithmetic in cyclic domains (clocks, calendars, wrap-around sequences).

> "A clock shows 10. Add 5 hours. What time is it?" → 3

### 2. `relation_to_graph`

Transitive relation queries over a directed graph of entities.

> "A is taller than B. B is taller than C. Is A taller than C?" → yes

### 3. `relation_property_check`

Structural property checks on declared relation systems (transitivity, symmetry, etc.).

> "Rule: X beats Y means Y does not beat X. Does this hold for all pairs?" → conditional

Each question may require multi-step reasoning and chain-of-thought inside `...` before the final answer.

---

## Performance

Results from the Phase 1d.1 evaluation on the held-out test set (1,118 examples) and the adversarial set:

| Metric | Score |
|---|---|
| **final_correct (test)** | **64.31%** |
| **final_correct (adversarial)** | **66.67%** |
| executor_success (test) | 94.81% |
| transform_acc (test) | 99.64% |
| slot_sem_f1 (test) | 0.648 |

Comparison against the TF-IDF baseline:

- TF-IDF final_correct: 15.21% (test), 10.34% (adversarial)
- This model: +49.1 pp on test, +56.3 pp on adversarial

---

## Base Model

**Qwen3-0.6B** (Apache 2.0), fine-tuned with LoRA on the KnowForge synthetic dataset. The LoRA adapter was merged into the base weights before publishing, so this is a self-contained model.

---

## Limitations

- **Synthetic data only.** Trained entirely on procedurally generated rule systems. Behaviour on real-world reasoning tasks (MMLU, GSM8K, etc.) is not evaluated.
- **English and Vietnamese.** Dataset contains both; performance may vary by language.
- **Short rule systems.** Designed for rule sets that fit in a single context window. Very long or deeply nested rule systems may degrade accuracy.
- **CPU is slow.** Model is 0.6B parameters at float16. Inference on CPU is feasible but slow (~5–30 s/query depending on hardware). Use a GPU for interactive use.
- **Chain-of-thought required.** The model was trained to emit `...` before answering. Prompts that suppress reasoning may reduce accuracy.
- **No world knowledge grounding.** The model will follow stated rules even when they conflict with reality. This is by design.
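
---

## Checking Answers by Hand

The example questions under Task Description and in Quick Start reduce to small deterministic checks, which can be handy for sanity-checking model output. The following is a minimal sketch, not part of the KnowForge codebase: the helper names `cyclic_add`, `reachable`, and `stronger` are hypothetical, and the 12-hour dial is an assumption about the clock example.

```python
from collections import deque


def cyclic_add(start: int, step: int, modulus: int = 12) -> int:
    """Advance `step` positions on a dial numbered 1..modulus (assumed 12-hour clock)."""
    return (start + step - 1) % modulus + 1


def reachable(edges: list[tuple[str, str]], src: str, dst: str) -> bool:
    """Breadth-first search: True if a directed path of length >= 1 leads from src to dst."""
    graph: dict[str, list[str]] = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def stronger(energy: dict[str, float], a: str, b: str, factor: float = 1.5) -> bool:
    """Quick Start rule: stronger(A,B) is TRUE when energy(A) > energy(B) * factor."""
    return energy[a] > energy[b] * factor


# linear_to_cyclic: the clock shows 10, add 5 hours.
print(cyclic_add(10, 5))  # 3

# relation_to_graph: A taller than B, B taller than C; is A taller than C?
print(reachable([("A", "B"), ("B", "C")], "A", "C"))  # True

# Quick Start: 12 > 3 * 1.5 = 4.5, so delta is stronger than gamma.
print(stronger({"gamma": 3, "delta": 12}, "delta", "gamma"))  # True
```

Treating a transitive relation as reachability in a directed graph keeps the check independent of how many intermediate entities a question chains together.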