---
language:
- en
- vi
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- compositional-reasoning
- qwen3
- lora-finetuned
- knowforge
---
# KnowForge-0.6B
Qwen3-0.6B fine-tuned with LoRA on the KnowForge dataset — a synthetic benchmark for **compositional rule-following and structured reasoning** over fabricated rule systems.
The model learns to apply rules it has never seen before to novel entity configurations, without relying on world knowledge.
---
## Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("qox/knowforge-0.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "qox/knowforge-0.6B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": (
            "You are given rules for a fictional system that does NOT exist in the real world. "
            "Reason STRICTLY from the rules provided. Do NOT use any outside knowledge. "
            "Show your reasoning inside ... tags before giving your final answer."
        ),
    },
    {
        "role": "user",
        "content": (
            "ZELPH RELATIONS:\n"
            "  stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5\n\n"
            "Facts:\n"
            "  energy(gamma) = 3\n"
            "  energy(delta) = 12\n\n"
            "Question: Is delta stronger than gamma?"
        ),
    },
]
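# Build the prompt once. return_dict=True yields input_ids and attention_mask,
# which can be unpacked into generate(); the tensors are moved to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))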
```
Or use the bundled `inference.py`:
```bash
pip install -r requirements.txt
python inference.py "ZELPH RELATIONS: stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5. Facts: energy(gamma) = 3, energy(delta) = 12. Question: Is delta stronger than gamma?"
```
The same helper can also be imported from Python:
```python
from inference import ask
result = ask("ZELPH RELATIONS: ...")
print(result["answer"]) # "yes"
print(result["reasoning"]) # chain-of-thought inside
```
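
As a sanity check on the Quick Start prompt, the expected answer can be worked out by hand: delta is stronger than gamma when energy(delta) > energy(gamma) × 1.5, that is, 12 > 4.5. The sketch below encodes that single rule in plain Python. The function name `stronger` and the `energy` dictionary are illustrative assumptions and are not part of `inference.py`.

```python
# Illustrative only: a hand-written evaluator for the one ZELPH rule used in
# the Quick Start prompt. The model itself receives the rule as plain text.
energy = {"gamma": 3, "delta": 12}

def stronger(a: str, b: str, facts: dict) -> bool:
    """stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5."""
    return facts[a] > facts[b] * 1.5

print(stronger("delta", "gamma", energy))  # True: 12 > 4.5
print(stronger("gamma", "delta", energy))  # False: 3 > 18 does not hold
```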
---
## Task Description
KnowForge presents the model with a **fabricated rule system** (e.g. "ZELPH RULES", "FRAE SPACE") and asks it to apply those rules to novel facts. The model must reason purely from the stated rules — no world knowledge applies.
Three transform types are covered:
### 1. `linear_to_cyclic`
Modular arithmetic in cyclic domains (clocks, calendars, wrap-around sequences).
> "A clock shows 10. Add 5 hours. What time is it?" → 3
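
The clock example above amounts to addition modulo 12 on a dial labelled 1 to 12. A minimal sketch, assuming that labelling (the helper name `add_cyclic` is hypothetical and not taken from the dataset code):

```python
def add_cyclic(value: int, delta: int, modulus: int = 12) -> int:
    """Add delta to value on a cycle labelled 1..modulus (a 12-hour clock by default)."""
    # Shift to 0-based, wrap around, then shift back to 1-based labels.
    return (value + delta - 1) % modulus + 1

print(add_cyclic(10, 5))  # 3
print(add_cyclic(7, 5))   # 12 (wraps to 12 rather than 0)
```

Other cyclic domains, such as a 7-day week, would only change `modulus`.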
### 2. `relation_to_graph`
Transitive relation queries over a directed graph of entities.
> "A is taller than B. B is taller than C. Is A taller than C?" → yes
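
A transitive query of this kind can be answered by checking reachability in the directed graph formed by the stated facts. The sketch below is one possible reference check using breadth-first search; the function name `reachable` and the edge-list encoding are assumptions for illustration, not the dataset's actual executor.

```python
from collections import deque

def reachable(edges, start, goal):
    """Return True if goal can be reached from start by following directed edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == goal:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# "A is taller than B. B is taller than C."
taller = [("A", "B"), ("B", "C")]
print(reachable(taller, "A", "C"))  # True
print(reachable(taller, "C", "A"))  # False
```

This treats the relation as transitive by construction, which matches the example but would not apply to a rule system that declares otherwise.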
### 3. `relation_property_check`
Structural property checks on declared relation systems (transitivity, symmetry, etc.).
> "Rule: X beats Y means Y does not beat X. Does this hold for all pairs?" → conditional
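
Structural properties such as asymmetry or transitivity can be checked mechanically once the relation is written down as a set of ordered pairs. The sketch below shows two such checks on a made-up three-entity relation; the entity names and helper functions are hypothetical, and the card does not specify how its own labels (such as "conditional") are derived.

```python
def is_asymmetric(pairs):
    """True if (a, b) in the relation implies (b, a) is not in it."""
    rel = set(pairs)
    return all((b, a) not in rel for a, b in rel)

def is_transitive(pairs):
    """True if (a, b) and (b, c) in the relation always imply (a, c)."""
    rel = set(pairs)
    return all((a, d) in rel for a, b in rel for c, d in rel if b == c)

# A cyclic "beats" relation over fabricated entities.
beats = {("zor", "vek"), ("vek", "mip"), ("mip", "zor")}
print(is_asymmetric(beats))  # True
print(is_transitive(beats))  # False: zor beats vek, vek beats mip, but zor does not beat mip
```
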
Each question may require multi-step reasoning and chain-of-thought inside `...` before the final answer.
---
## Performance
Results from the Phase 1d.1 evaluation on the held-out test set (1,118 examples) and the adversarial set:
| Metric | Score |
|---|---|
| **final_correct (test)** | **64.31%** |
| **final_correct (adversarial)** | **66.67%** |
| executor_success (test) | 94.81% |
| transform_acc (test) | 99.64% |
| slot_sem_f1 (test) | 0.648 |
Comparison against a TF-IDF baseline:
- TF-IDF final_correct: 15.21% (test), 10.34% (adversarial)
- This model: +49.1 pp on test, +56.3 pp on adversarial
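
The percentage-point gaps quoted above follow directly from the reported `final_correct` scores. A short check, using only the numbers from this section (the `scores` dictionary layout is just for illustration):

```python
# final_correct scores (%) as reported in this section.
scores = {
    "test": {"model": 64.31, "tfidf": 15.21},
    "adversarial": {"model": 66.67, "tfidf": 10.34},
}

# Percentage-point gap = model score minus baseline score.
gaps = {split: round(s["model"] - s["tfidf"], 2) for split, s in scores.items()}
for split, gap in gaps.items():
    print(f"{split}: +{gap:.1f} pp")
# test: +49.1 pp
# adversarial: +56.3 pp
```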
---
## Base Model
**Qwen3-0.6B** (Apache 2.0) — fine-tuned with LoRA on the KnowForge synthetic dataset.
The LoRA adapter was merged into the base weights before publishing; this is a self-contained model.
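
For readers unfamiliar with adapter merging, the sketch below illustrates the standard LoRA merge, W' = W + (alpha / r) · B A, on a toy 2×2 weight matrix in plain Python. All matrices, the rank, and alpha are made-up values for illustration; the card does not state the adapter's actual hyperparameters or the tooling used for the merge.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA update into the base weight: W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy values (hypothetical): 2x2 base weight, rank-1 adapter, alpha = 2.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r, in_features)
B = [[0.5], [-1.0]]   # shape (out_features, r)
x = [[3.0], [4.0]]    # column-vector input

merged = merge_lora(W, A, B, alpha=2, r=1)

# Unmerged path: base output plus the scaled adapter output, W x + scale * B (A x).
scale = 2 / 1
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
runtime = [[b[0] + scale * l[0]] for b, l in zip(base_out, lora_out)]

# The merged weight reproduces the unmerged computation, so no adapter is needed at inference.
print(matmul(merged, x) == runtime)  # True
```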
---
## Limitations
- **Synthetic data only.** Trained entirely on procedurally generated rule systems. Behaviour on real-world reasoning tasks (MMLU, GSM8K, etc.) is not evaluated.
- **English and Vietnamese.** Dataset contains both; performance may vary by language.
- **Short rule systems.** Designed for rule sets that fit in a single context window. Very long or deeply nested rule systems may degrade accuracy.
- **CPU is slow.** Model is 0.6B parameters at float16. Inference on CPU is feasible but slow (~5–30 s/query depending on hardware). Use a GPU for interactive use.
- **Chain-of-thought required.** The model was trained to emit `...` before answering. Prompts that suppress reasoning may reduce accuracy.
- **No world knowledge grounding.** The model will follow stated rules even when they conflict with reality. This is by design.