
	---
	language:
	- en
	- vi
	license: apache-2.0
	pipeline_tag: text-generation
	tags:
	- reasoning
	- compositional-reasoning
	- qwen3
	- lora-finetuned
	- knowforge
	---

	# KnowForge-0.6B

	Qwen3-0.6B fine-tuned with LoRA on the KnowForge dataset — a synthetic benchmark for compositional rule-following and structured reasoning over fabricated rule systems.

	The model learns to apply rules it has never seen before to novel entity configurations, without relying on world knowledge.

	---

	## Quick Start

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	tokenizer = AutoTokenizer.from_pretrained("qox/knowforge-0.6B", trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    "qox/knowforge-0.6B",
	    torch_dtype=torch.float16,
	    device_map="auto",
	    trust_remote_code=True,
	)

	messages = [
	    {
	        "role": "system",
	        "content": (
	            "You are given rules for a fictional system that does NOT exist in the real world. "
	            "Reason STRICTLY from the rules provided. Do NOT use any outside knowledge. "
	            "Show your reasoning inside <think>...</think> tags before giving your final answer."
	        ),
	    },
	    {
	        "role": "user",
	        "content": (
	            "ZELPH RELATIONS:\n"
	            " stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5\n\n"
	            "Facts:\n"
	            " energy(gamma) = 3\n"
	            " energy(delta) = 12\n\n"
	            "Question: Is delta stronger than gamma?"
	        ),
	    },
	]

	# Tokenize the chat once. return_dict=True yields input_ids and attention_mask,
	# which can be unpacked directly into generate().
	inputs = tokenizer.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    return_tensors="pt",
	    return_dict=True,
	).to(model.device)

	outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

	# Decode only the newly generated tokens, skipping the prompt.
	prompt_len = inputs["input_ids"].shape[1]
	print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
	```

	Or use the bundled `inference.py`:

	```bash
	pip install -r requirements.txt
	python inference.py "ZELPH RELATIONS: stronger(A,B) is TRUE when energy(A) > energy(B) × 1.5. Facts: energy(gamma) = 3, energy(delta) = 12. Question: Is delta stronger than gamma?"
	```

	```python
	from inference import ask
	result = ask("ZELPH RELATIONS: ...")
	print(result["answer"]) # "yes"
	print(result["reasoning"]) # chain-of-thought inside <think>
	```
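
If you call `model.generate` directly, you need to separate the reasoning from the final answer yourself. The sketch below shows one way to do that. `split_think` is a hypothetical helper, not the bundled `inference.py` implementation, and it assumes the `<think>...</think>` tags are still present in the decoded text.

```python
import re

def split_think(text: str) -> dict:
    """Split raw model output into reasoning and final answer (hypothetical helper)."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        # No reasoning block found: treat the whole output as the answer.
        return {"reasoning": "", "answer": text.strip()}
    return {
        "reasoning": match.group(1).strip(),
        # Everything after the closing tag is taken as the final answer.
        "answer": text[match.end():].strip(),
    }

# Made-up output string, used only to illustrate the parsing.
raw = "<think>12 > 3 * 1.5 = 4.5, so the rule holds.</think>\nyes"
result = split_think(raw)
print(result["answer"])     # yes
print(result["reasoning"])  # 12 > 3 * 1.5 = 4.5, so the rule holds.
```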

	---

	## Task Description

	KnowForge presents the model with a fabricated rule system (e.g. "ZELPH RULES", "FRAE SPACE") and asks it to apply those rules to novel facts. The model must reason purely from the stated rules — no world knowledge applies.

	Three transform types are covered:

	### 1. `linear_to_cyclic`
	Modular arithmetic in cyclic domains (clocks, calendars, wrap-around sequences).
	> "A clock shows 10. Add 5 hours. What time is it?" → 3
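
The expected answer for the clock example can be reproduced with plain modular arithmetic. This is a minimal sketch that assumes a 12-hour dial labelled 1 to 12; `add_on_cycle` is an illustrative name, not part of the dataset tooling.

```python
def add_on_cycle(start: int, step: int, modulus: int = 12) -> int:
    """Advance `start` by `step` on a cycle labelled 1..modulus."""
    # Shift to 0-based, wrap, then shift back so the top of the
    # cycle reads `modulus` rather than 0.
    return (start - 1 + step) % modulus + 1

print(add_on_cycle(10, 5))  # 3
print(add_on_cycle(7, 5))   # 12
```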

	### 2. `relation_to_graph`
	Transitive relation queries over a directed graph of entities.
	> "A is taller than B. B is taller than C. Is A taller than C?" → yes
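
A reference answer for this kind of query can be computed by treating each stated fact as a directed edge and searching for a path. The sketch below assumes the relation is declared transitive; `holds_transitively` is a hypothetical helper, not the dataset's executor.

```python
from collections import defaultdict

def holds_transitively(facts, source, target):
    """Return True if `target` is reachable from `source` via the stated facts."""
    graph = defaultdict(list)
    for a, b in facts:
        graph[a].append(b)

    # Iterative depth-first search over the directed graph.
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

facts = [("A", "B"), ("B", "C")]  # "A is taller than B", "B is taller than C"
print(holds_transitively(facts, "A", "C"))  # True
print(holds_transitively(facts, "C", "A"))  # False
```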

	### 3. `relation_property_check`
	Structural property checks on declared relation systems (transitivity, symmetry, etc.).
	> "Rule: X beats Y means Y does not beat X. Does this hold for all pairs?" → conditional
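
Whether a property holds depends on the pairs that are actually declared. The sketch below checks symmetry, asymmetry and transitivity over an explicit set of pairs. The entity names and function names are made up for illustration and do not come from the dataset.

```python
def is_symmetric(pairs):
    """Every (a, b) has a matching (b, a)."""
    return all((b, a) in pairs for a, b in pairs)

def is_asymmetric(pairs):
    """No (a, b) has a matching (b, a)."""
    return all((b, a) not in pairs for a, b in pairs)

def is_transitive(pairs):
    """Whenever (a, b) and (b, d) are present, (a, d) is present too."""
    return all((a, d) in pairs for a, b in pairs for c, d in pairs if b == c)

# A three-way cycle: "X beats Y" never holds in both directions,
# but the relation is not transitive.
beats = {("zel", "fra"), ("fra", "mo"), ("mo", "zel")}
print(is_asymmetric(beats))  # True
print(is_symmetric(beats))   # False
print(is_transitive(beats))  # False
```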

	Questions may require multi-step reasoning. The model writes its chain of thought inside `<think>...</think>` before giving the final answer.

	---

	## Performance

	Results from the Phase 1d.1 evaluation on the held-out test set (1,118 examples) and the adversarial set:

	| Metric | Score |
	|---|---|
	| final_correct (test) | 64.31% |
	| final_correct (adversarial) | 66.67% |
	| executor_success (test) | 94.81% |
	| transform_acc (test) | 99.64% |
	| slot_sem_f1 (test) | 0.648 |

	Comparison against TF-IDF baseline:
	- TF-IDF final_correct: 15.21% (test), 10.34% (adversarial)
	- This model: +49.1 pp on test, +56.3 pp on adversarial

	---

	## Base Model

	Qwen3-0.6B (Apache 2.0) — fine-tuned with LoRA on the KnowForge synthetic dataset.
	The LoRA adapter was merged into the base weights before publishing; this is a self-contained model.

	---

	## Limitations

	- Synthetic data only. Trained entirely on procedurally generated rule systems. Behaviour on real-world reasoning tasks (MMLU, GSM8K, etc.) is not evaluated.
	- English and Vietnamese. Dataset contains both; performance may vary by language.
	- Short rule systems. Designed for rule sets that fit in a single context window. Very long or deeply nested rule systems may degrade accuracy.
	- CPU is slow. The model has 0.6B parameters in float16. CPU inference is feasible but slow (~5–30 s/query depending on hardware); use a GPU for interactive work.
	- Chain-of-thought required. The model was trained to emit `<think>...</think>` before answering. Prompts that suppress reasoning may reduce accuracy.
	- No world knowledge grounding. The model will follow stated rules even when they conflict with reality. This is by design.