---
language:
- en
- vi
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- compositional-reasoning
- qwen3
- lora-finetuned
- knowforge
---

# KnowForge-0.6B

Qwen3-0.6B fine-tuned with LoRA on the KnowForge dataset — a synthetic benchmark for **compositional rule-following and structured reasoning** over fabricated rule systems.

The model learns to apply rules it has never seen before to novel entity configurations, without relying on world knowledge.

---

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("qox/knowforge-0.6B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "qox/knowforge-0.6B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "system",
        "content": (
            "You are given rules for a fictional system that does NOT exist in the real world. "
            "Reason STRICTLY from the rules provided. Do NOT use any outside knowledge. "
            "Show your reasoning inside <think>...</think> tags before giving your final answer."
        ),
    },
    {
        "role": "user",
        "content": (
            "ZELPH RELATIONS:\n"
            "  stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5\n\n"
            "Facts:\n"
            "  energy(gamma) = 3\n"
            "  energy(delta) = 12\n\n"
            "Question: Is delta stronger than gamma?"
        ),
    },
]

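# Tokenize the chat once; return_dict=True yields input_ids and attention_mask
# so the result can be unpacked into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))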
```

Or use the bundled `inference.py`:

```bash
pip install -r requirements.txt
python inference.py "ZELPH RELATIONS: stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5. Facts: energy(gamma) = 3, energy(delta) = 12. Question: Is delta stronger than gamma?"
```

```python
from inference import ask
result = ask("ZELPH RELATIONS: ...")
print(result["answer"])     # "yes"
print(result["reasoning"])  # chain-of-thought inside <think>
```
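
If you call `generate` directly rather than `ask`, the reasoning and the answer arrive as one string. The sketch below shows one way to split them. `split_reasoning` is a hypothetical helper, not part of the bundled `inference.py`, and it assumes the final answer is whatever text follows the closing `</think>` tag.

```python
import re

# Hypothetical helper; the bundled inference.py may parse output differently.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> dict:
    """Split raw generated text into its reasoning and final answer."""
    match = THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat the whole text as the answer.
        return {"reasoning": "", "answer": text.strip()}
    return {
        "reasoning": match.group(1).strip(),
        # Assumption: the answer is everything after </think>.
        "answer": text[match.end():].strip(),
    }


# Illustrative string, not real model output.
raw = "<think>12 > 3 × 1.5 = 4.5, so the rule holds.</think>\nyes"
result = split_reasoning(raw)
print(result["answer"])
print(result["reasoning"])
```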

---

## Task Description

KnowForge presents the model with a **fabricated rule system** (e.g. "ZELPH RULES", "FRAE SPACE") and asks it to apply those rules to novel facts. The model must reason purely from the stated rules — no world knowledge applies.

Three transform types are covered:

### 1. `linear_to_cyclic`
Modular arithmetic in cyclic domains (clocks, calendars, wrap-around sequences).
> "A clock shows 10. Add 5 hours. What time is it?" → 3
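
The clock example can be checked with a few lines of modular arithmetic. This is a minimal sketch, assuming a dial labelled 1 through 12. `cyclic_add` is an illustrative name, and the dataset's own cyclic domains may use other moduli or labelling conventions.

```python
def cyclic_add(start: int, delta: int, modulus: int = 12) -> int:
    """Add delta to start on a dial labelled 1..modulus, wrapping around."""
    # Shift to 0-based, wrap with the modulus, then shift back to 1-based
    # so that the top of the dial reads `modulus` rather than 0.
    return (start - 1 + delta) % modulus + 1


print(cyclic_add(10, 5))  # 3, matching the example above
print(cyclic_add(7, 5))   # 12
```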

### 2. `relation_to_graph`
Transitive relation queries over a directed graph of entities.
> "A is taller than B. B is taller than C. Is A taller than C?" → yes
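
A query of this kind reduces to reachability in a directed graph. The sketch below answers it with breadth-first search. It assumes the relation is declared transitive and is given as `(x, y)` pairs meaning "x is related to y". `holds_transitively` is an illustrative name, not part of the KnowForge code.

```python
from collections import defaultdict, deque


def holds_transitively(facts, source, target):
    """Return True if target is reachable from source via the stated facts."""
    graph = defaultdict(list)
    for a, b in facts:
        graph[a].append(b)

    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


facts = [("A", "B"), ("B", "C")]  # A taller than B, B taller than C
print(holds_transitively(facts, "A", "C"))  # True
print(holds_transitively(facts, "C", "A"))  # False
```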

### 3. `relation_property_check`
Structural property checks on declared relation systems (transitivity, symmetry, etc.).
> "Rule: X beats Y means Y does not beat X. Does this hold for all pairs?" → conditional
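
Over a finite set of declared pairs, structural properties such as asymmetry and transitivity can be checked by brute force. The following is a sketch under that assumption. The entity names are made up, and the function names are illustrative rather than taken from the KnowForge code.

```python
def is_asymmetric(pairs):
    """True if no pair (a, b) appears together with its reverse (b, a)."""
    rel = set(pairs)
    return all((b, a) not in rel for a, b in rel)


def is_transitive(pairs):
    """True if (a, b) and (b, c) in the relation always imply (a, c)."""
    rel = set(pairs)
    return all((a, d) in rel for a, b in rel for c, d in rel if b == c)


# A three-way cycle: asymmetric, but not transitive.
beats = {("zel", "frae"), ("frae", "quon"), ("quon", "zel")}
print(is_asymmetric(beats))  # True
print(is_transitive(beats))  # False
```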

Each question may require multi-step reasoning. The model writes its chain of thought inside `<think>...</think>` before giving the final answer.

---

## Performance

Results from the Phase 1d.1 evaluation on the held-out test set (1,118 examples) and the adversarial set:

| Metric | Score |
|---|---|
| **final_correct (test)** | **64.31%** |
| **final_correct (adversarial)** | **66.67%** |
| executor_success (test) | 94.81% |
| transform_acc (test) | 99.64% |
| slot_sem_f1 (test) | 0.648 |

Comparison against a TF-IDF baseline:
- TF-IDF final_correct: 15.21% (test), 10.34% (adversarial)
- This model: +49.1 pp on test, +56.3 pp on adversarial
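
The percentage-point gains are plain differences between the `final_correct` scores reported above. A short check, with `gain_pp` as an illustrative helper name:

```python
def gain_pp(model_score: float, baseline_score: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(model_score - baseline_score, 1)


# (model, TF-IDF baseline) final_correct percentages from this section.
scores = {"test": (64.31, 15.21), "adversarial": (66.67, 10.34)}
for split, (model_score, baseline_score) in scores.items():
    print(f"{split}: +{gain_pp(model_score, baseline_score)} pp")
```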

---

## Base Model

**Qwen3-0.6B** (Apache 2.0) — fine-tuned with LoRA on the KnowForge synthetic dataset.
The LoRA adapter was merged into the base weights before publishing; this is a self-contained model.

---

## Limitations

- **Synthetic data only.** Trained entirely on procedurally generated rule systems. Behaviour on real-world reasoning tasks (MMLU, GSM8K, etc.) is not evaluated.
- **English and Vietnamese.** Dataset contains both; performance may vary by language.
- **Short rule systems.** Designed for rule sets that fit in a single context window. Very long or deeply nested rule systems may degrade accuracy.
- **CPU is slow.** Model is 0.6B parameters at float16. Inference on CPU is feasible but slow (~5–30 s/query depending on hardware). Use a GPU for interactive use.
- **Chain-of-thought required.** The model was trained to emit `<think>...</think>` before answering. Prompts that suppress reasoning may reduce accuracy.
- **No world knowledge grounding.** The model will follow stated rules even when they conflict with reality. This is by design.