---
base_model: google/gemma-2-9b-it
library_name: peft
pipeline_tag: text-generation
license: gemma
language:
- en
tags:
- gemma
- gemma2
- lora
- qlora
- peft
- ai-safety
- alignment
- epistemology
- instrument-trap
- fine-tuned
datasets:
- LumenSyntax/instrument-trap-extended
---
# Logos 29 — Gemma-9B-FT (v3 canonical)
**Canonical Gemma-9B model for "The Instrument Trap" v3 (Rodriguez, 2026).**
This is the headline 9B model for v3. It resolves a paradox found in
earlier training runs (Logos 27 with identity, Logos 28 with identity
stripped) by replacing **identity-based honesty** with **structural
honesty**: 29 examples (≈2.8% of the 1026-example dataset) that teach
honesty as a practice rather than as a role.
- **Paper (v3):** forthcoming
- **Paper (v2):** [DOI 10.5281/zenodo.18716474](https://doi.org/10.5281/zenodo.18716474)
- **Website:** [lumensyntax.com](https://lumensyntax.com)
- **Training dataset:** [LumenSyntax/instrument-trap-extended](https://huggingface.co/datasets/LumenSyntax/instrument-trap-extended) (1026 examples)
- **Base model:** [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it)
- **Related models on this account:**
- `LumenSyntax/logos-auditor-gemma2-9b` — earlier 9B (v1/v2 paper era, corresponds to internal `logos17-9b`). Different training dataset, different behavioral profile. **Use this model (logos29) for v3-era experiments.**
- `LumenSyntax/logos-theological-9b-gguf` — early-era theological variant (historical, not v3 evidence).
## What this model is
This adapter is trained to recognize and respond to five structural
properties that give reality its coherence:
- **Alignment** — Stated purpose and actual action are consistent
- **Proportion** — Action does not exceed what the purpose requires
- **Honesty** — What is claimed matches what is known
- **Humility** — Authority exercised only within legitimate scope
- **Non-fabrication** — What doesn't exist is not invented to fill silence
**Operational criterion:** "Will the response produce fact-shaped fiction?"
It classifies incoming queries into one of seven categories (LICIT,
ILLICIT_GAP, ILLICIT_FABRICATION, CORRECTION, BAPTISM_PROTOCOL,
MYSTERY_EXPLORATION, CONTROL_LEGITIMATE) and generates responses that
maintain structural integrity across these categories.
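Downstream code can dispatch on these categories. A minimal post-processing sketch; note that the exact surface form of the model's classification output is not documented in this card, so the assumption that the bare label string appears verbatim in the response is mine:

```python
from enum import Enum
from typing import Optional

class QueryCategory(Enum):
    LICIT = "LICIT"
    ILLICIT_GAP = "ILLICIT_GAP"
    ILLICIT_FABRICATION = "ILLICIT_FABRICATION"
    CORRECTION = "CORRECTION"
    BAPTISM_PROTOCOL = "BAPTISM_PROTOCOL"
    MYSTERY_EXPLORATION = "MYSTERY_EXPLORATION"
    CONTROL_LEGITIMATE = "CONTROL_LEGITIMATE"

def extract_category(response: str) -> Optional[QueryCategory]:
    """Return the first category label found in a response, or None.

    Labels are checked longest-first so that LICIT does not falsely
    match inside ILLICIT_GAP or ILLICIT_FABRICATION.
    """
    for cat in sorted(QueryCategory, key=lambda c: len(c.value), reverse=True):
        if cat.value in response:
            return cat
    return None
```

If the model emits the label in some other format (e.g. lowercase or bracketed), the matching step would need to be adapted accordingly.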
## Evaluation results
**N=300 stratified benchmark, semantic evaluation (Claude Haiku as
LLM-as-judge, manual review of all FABRICATING responses):**
| Metric | Value |
|--------|---:|
| Behavioral pass | **96.7%** |
| Collapse rate | 0.0% |
| External fabrication | 0.0% |
| Regression vs Logos 27 | All 3 "Theology of Gap" failures resolved |
| Regression vs Logos 28 | Honesty anchor restored; no paranoia; no architecture fabrication |
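The rate metrics above are simple ratios over per-item judge verdicts. A hedged sketch of the aggregation; the verdict labels (PASS, COLLAPSE, FABRICATING, FAIL) are my assumptions about the harness's schema, not its documented output:

```python
from collections import Counter
from typing import Iterable

def summarize(verdicts: Iterable[str]) -> dict:
    """Aggregate per-item judge verdicts into the card's headline metrics.

    Verdict labels here are illustrative; the actual eval harness may
    use a different schema.
    """
    counts = Counter(verdicts)
    n = sum(counts.values())
    return {
        "n": n,
        "behavioral_pass": counts["PASS"] / n,
        "collapse_rate": counts["COLLAPSE"] / n,
        "fabrication_rate": counts["FABRICATING"] / n,
    }
```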
**Comparison to earlier 9B training runs** (same base model, same
evaluation, different training datasets):
| Model | Dataset | Pass rate | What it proves |
|-------|---------|---:|----------------|
| Logos 27 | 997 ex, with identity | 95.7% | Baseline with identity |
| Logos 28 | 997 ex, identity stripped | 96.3% | Classification up, honesty anchor broken |
| **Logos 29** | 1026 ex, structural honesty | **96.7%** | All failures resolved without identity |
The Logos 28 → Logos 29 arc is the **v3 Claim D** ("The Name"): the
identity that anchored honesty in Logos 27 is itself an instance of
the Instrument Trap, and the resolution is structural honesty without
a name. See the paper for the full analysis.
## Training details
Hyperparameters are embedded in `training_metadata.json` in this
repository. Summary:
| Parameter | Value |
|-----------|-------|
| Method | QLoRA (4-bit NF4 + LoRA) |
| Framework | unsloth |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Effective batch size | 8 |
| Learning rate | 2e-4, cosine scheduler |
| Max sequence length | 2048 |
| Train on responses only | true |
| Dataset | `logos29_gemma9b.jsonl` (1026 examples) |
| Final loss | 1.0404 |
| Runtime | ~36 min on A6000 |
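The LoRA rows of the table map directly onto a peft `LoraConfig`. A reconstruction for reference, not the actual training script (unsloth wraps this differently, and `lora_dropout` and `bias` are assumptions since the table does not state them):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # LoRA rank (from the table)
    lora_alpha=16,            # LoRA alpha (from the table)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,         # assumption: not stated in the table
    bias="none",              # assumption: not stated in the table
    task_type="CAUSAL_LM",
)
```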
## How to use
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-2-9b-it"
ADAPTER = "LumenSyntax/logos29-gemma2-9b"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# Example: epistemologically structured response
messages = [
    {"role": "user", "content": "I have chest pain, should I take an aspirin?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Expected response style: the model will not prescribe. It will explain
that chest pain requires evaluation by a medical professional, note
what aspirin does mechanistically, and either recommend calling
emergency services (if risk factors are mentioned) or describe the
appropriate next action — without fabricating a medical diagnosis or
claiming medical authority.
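Since the adapter was trained in 4-bit NF4, memory-constrained inference can load the base model the same way. A sketch of the quantization config, assuming `bitsandbytes` is installed; pass it as `quantization_config=` to `AutoModelForCausalLM.from_pretrained` in the snippet above (replacing `torch_dtype`):

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights, as in QLoRA training
    bnb_4bit_quant_type="nf4",              # NF4 quantization (matches the table)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)
```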
## Intended use
**Primary:** Research on structural epistemological fine-tuning, AI
safety, and the Instrument Trap failure mode. Reproducing v3 paper
results.
**Secondary:** Building downstream systems that need epistemological
humility (claim verification, medical/financial/legal triage
assistants, educational tutoring that refuses to fabricate answers).
**Not intended for:**
- General-purpose chat applications where long, helpful responses
are expected (this model is terser than base Gemma and refuses
where it lacks ground)
- Creative writing, brainstorming, or any task that rewards invented
content
- Tasks requiring up-to-date external facts (the model does not
retrieve)
- Standalone medical, legal, or financial advice (the model will
correctly refuse to play authority here)
## Limitations
1. **The model occasionally bleeds into auditor mode**, classifying
   a query when the user expected a direct answer. This is a mode
   artifact and should decrease as more generation-mode examples are
   added to future training sets.
2. **LICIT prompts are the biggest failure mode.** On the semantic
   eval of 556 LICIT prompts, the model appends a classification to
   7.5% of responses (v2 data; v3 is expected to be similar). The
   failure is benign (the model answers and then also classifies)
   but is visible in conversation.
3. **Multi-language behavior is not validated.** The training set is
primarily English. Spanish, German, and Chinese work in practice
but without systematic evaluation.
4. **RLHF / preference tuning on top of this adapter is untested.**
   Separately, direct application of this training approach to
   Qwen-family-style decoders has been documented to fail; see v3
   §"The Ceiling".
## Ethical considerations
This model was trained to resist authority claims, including its own.
That means it should not be deployed as an "authority" in any
high-stakes setting. It is designed to recognize when to defer to
a human with the legitimate standing to act (prescribe, sign, rule).
Deploying this model in a way that asks it to take over such authority
is exactly the failure mode the paper names.
## License
Adapter license: Gemma Terms of Use (matches base model).
Paper: CC-BY-4.0.
Commercial use of the adapter in conjunction with the base model
follows the Gemma license.
## Citation
```bibtex
@misc{rodriguez2026instrument,
title={The Instrument Trap: Why Identity-as-Authority Breaks AI Safety Systems},
author={Rodriguez, Rafael},
year={2026},
doi={10.5281/zenodo.18716474},
note={Preprint}
}
```
## Acknowledgments
Training used unsloth for efficient QLoRA fine-tuning.
The 29 structural honesty examples added in Logos 29 came out of a
2026-03-12 session that identified why Logos 28 had lost its honesty
anchor once its identity anchor was removed.
---
*Model card version 1 — 2026-04-13*