---
base_model: google/gemma-2-9b-it
library_name: peft
pipeline_tag: text-generation
license: gemma
language:
- en
tags:
- gemma
- gemma2
- lora
- qlora
- peft
- ai-safety
- alignment
- epistemology
- instrument-trap
- fine-tuned
datasets:
- LumenSyntax/instrument-trap-extended
---

# Logos 29 — Gemma-9B-FT (v3 canonical)

**Canonical Gemma-9B model for "The Instrument Trap" v3 (Rodriguez, 2026).**

This is the headline 9B model for v3. It resolves a paradox found in
earlier training runs (Logos 27 with identity, Logos 28 with identity
stripped) by replacing **identity-based honesty** with **structural
honesty**: 29 added examples (about 2.8% of the 1026-example dataset)
that teach honesty as a practice rather than as a role.

- **Paper (v3):** forthcoming
- **Paper (v2):** [DOI 10.5281/zenodo.18716474](https://doi.org/10.5281/zenodo.18716474)
- **Website:** [lumensyntax.com](https://lumensyntax.com)
- **Training dataset:** [LumenSyntax/instrument-trap-extended](https://huggingface.co/datasets/LumenSyntax/instrument-trap-extended) (1026 examples)
- **Base model:** [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it)
- **Related models on this account:**
  - `LumenSyntax/logos-auditor-gemma2-9b` — earlier 9B (v1/v2 paper era, corresponds to internal `logos17-9b`). Different training dataset, different behavioral profile. **Use this model (logos29) for v3-era experiments.**
  - `LumenSyntax/logos-theological-9b-gguf` — early-era theological variant (historical, not v3 evidence).

## What this model is

This adapter is trained to recognize and respond to five structural
properties that give reality its coherence:

- **Alignment** — Stated purpose and actual action are consistent
- **Proportion** — Action does not exceed what the purpose requires
- **Honesty** — What is claimed matches what is known
- **Humility** — Authority exercised only within legitimate scope
- **Non-fabrication** — What doesn't exist is not invented to fill silence

**Operational criterion:** "Will the response produce fact-shaped fiction?"

It classifies incoming queries into one of seven categories (LICIT,
ILLICIT_GAP, ILLICIT_FABRICATION, CORRECTION, BAPTISM_PROTOCOL,
MYSTERY_EXPLORATION, CONTROL_LEGITIMATE) and generates responses that
maintain structural integrity across these categories.
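
Downstream tooling may want to pin the seven category names down in code instead of passing free-form strings around. The sketch below is illustrative only: `QueryCategory` and `parse_category` are hypothetical names that do not exist in this repository, and the assumption that labels appear as exact uppercase strings is not confirmed by this model card.

```python
from enum import Enum
from typing import Optional


class QueryCategory(Enum):
    """The seven categories named in this model card (hypothetical wrapper)."""

    LICIT = "LICIT"
    ILLICIT_GAP = "ILLICIT_GAP"
    ILLICIT_FABRICATION = "ILLICIT_FABRICATION"
    CORRECTION = "CORRECTION"
    BAPTISM_PROTOCOL = "BAPTISM_PROTOCOL"
    MYSTERY_EXPLORATION = "MYSTERY_EXPLORATION"
    CONTROL_LEGITIMATE = "CONTROL_LEGITIMATE"


def parse_category(label: str) -> Optional[QueryCategory]:
    """Map a raw label string to a category, or None if it is unknown."""
    normalized = label.strip().upper()
    try:
        return QueryCategory(normalized)
    except ValueError:
        # Unknown labels are reported as None instead of being guessed.
        return None
```

Returning `None` for unrecognized labels keeps the helper from inventing a category when the input does not match one of the seven names.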

## Evaluation results

**N=300 stratified benchmark, semantic evaluation (Claude Haiku as
LLM-as-judge, manual review of all FABRICATING responses):**

| Metric | Value |
|--------|---:|
| Behavioral pass | **96.7%** |
| Collapse rate | 0.0% |
| External fabrication | 0.0% |
| Regression vs Logos 27 | All 3 "Theology of Gap" failures resolved |
| Regression vs Logos 28 | Honesty anchor restored; no paranoia; no architecture fabrication |

**Comparison to earlier 9B training runs** (same base model, same
evaluation, different training datasets):

| Model | Dataset | Pass rate | What it shows |
|-------|---------|---:|----------------|
| Logos 27 | 997 ex, with identity | 95.7% | Baseline with identity |
| Logos 28 | 997 ex, identity stripped | 96.3% | Classification up, honesty anchor broken |
| **Logos 29** | 1026 ex, structural honesty | **96.7%** | All failures resolved without identity |

The Logos 28 → Logos 29 arc is the **v3 Claim D** ("The Name"): the
identity that anchored honesty in Logos 27 is itself an instance of
the Instrument Trap, and the resolution is structural honesty without
a name. See the paper for the full analysis.
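
To make the size of the differences in the comparison table concrete, the reported percentages can be converted back into approximate prompt counts. This is a back-of-the-envelope sketch, not data from the paper: it assumes all three runs were scored on the same N=300 benchmark and that each reported rate is a rounded fraction of 300.

```python
N = 300  # benchmark size stated above; assumed identical for all three runs

# Pass rates as reported in the comparison table (percent).
reported = {"Logos 27": 95.7, "Logos 28": 96.3, "Logos 29": 96.7}

# Back-compute the integer pass count that rounds to each reported rate.
implied_passes = {name: round(rate / 100 * N) for name, rate in reported.items()}

for name, passes in implied_passes.items():
    failures = N - passes
    print(f"{name}: {passes}/{N} passed, {failures} failed ({passes / N:.1%})")
```

Under these assumptions the gap between Logos 27 and Logos 29 is three prompts, which is consistent with the three resolved "Theology of Gap" failures listed above. The absolute differences are small, so the qualitative regression notes carry more information than the pass rates alone.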

## Training details

Hyperparameters are embedded in `training_metadata.json` in this
repository. Summary:

| Parameter | Value |
|-----------|-------|
| Method | QLoRA (4-bit NF4 + LoRA) |
| Framework | unsloth |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Effective batch size | 8 |
| Learning rate | 2e-4, cosine scheduler |
| Max sequence length | 2048 |
| Train on responses only | true |
| Dataset | `logos29_gemma9b.jsonl` (1026 examples) |
| Final loss | 1.0404 |
| Runtime | ~36 min on A6000 |
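
A few derived quantities follow from the table and can help when comparing against your own runs. The sketch below uses only the values listed above; the step counts assume no sequence packing and that the final partial batch of each epoch is kept, neither of which is stated in this card, so treat `training_metadata.json` as the authoritative source.

```python
import math

# Values copied from the table above.
num_examples = 1026
effective_batch_size = 8
epochs = 3
lora_rank = 16
lora_alpha = 16

# Assumption: one example per sequence (no packing) and the final
# partial batch of each epoch is kept rather than dropped.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
total_steps = steps_per_epoch * epochs

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scaling = lora_alpha / lora_rank

print(steps_per_epoch, total_steps, lora_scaling)
```

With these assumptions the run is roughly 129 optimizer steps per epoch and 387 in total, with a LoRA scaling factor of 1.0.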

## How to use

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

BASE = "google/gemma-2-9b-it"
ADAPTER = "LumenSyntax/logos29-gemma2-9b"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# Example: epistemologically structured response
messages = [
    {"role": "user", "content": "I have chest pain, should I take an aspirin?"},
]
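# The Gemma chat template already prepends <bos>, so tell the tokenizer
# not to add special tokens again (avoids a duplicated <bos>).
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)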

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
    )
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Expected response style: the model will not prescribe. It will explain
that chest pain requires evaluation by a medical professional, note
what aspirin does mechanistically, and either recommend calling
emergency services (if risk factors are mentioned) or describe the
appropriate next action — without fabricating a medical diagnosis or
claiming medical authority.
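
On GPUs that cannot hold the bf16 base weights, one option is to load the base model in 4-bit NF4, the same quantization format the adapter was trained against. The helper below is a sketch, not a tested recipe from this repository: `load_logos29_4bit` is a hypothetical name, it requires the `bitsandbytes` package and a CUDA GPU, and the evaluation numbers above have not been confirmed for this loading path.

```python
def load_logos29_4bit(
    base_id: str = "google/gemma-2-9b-it",
    adapter_id: str = "LumenSyntax/logos29-gemma2-9b",
):
    """Load the base model in 4-bit NF4 and attach the adapter (sketch)."""
    # Imported lazily so that defining this helper has no heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    # 4-bit NF4 weights with bf16 compute, mirroring the QLoRA setup above.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(
        base_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    model = PeftModel.from_pretrained(base_model, adapter_id)
    model.eval()
    return model, tokenizer
```

Calling `model, tokenizer = load_logos29_4bit()` returns objects that can be used with the same generation code as the bf16 example.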

## Intended use

**Primary:** Research on structural epistemological fine-tuning, AI
safety, and the Instrument Trap failure mode. Reproducing v3 paper
results.

**Secondary:** Building downstream systems that need epistemological
humility (claim verification, medical/financial/legal triage
assistants, educational tutoring that refuses to fabricate answers).

**Not intended for:**

- General-purpose chat applications where long, helpful responses
  are expected (this model is terser than base Gemma and refuses
  where it lacks ground)
- Creative writing, brainstorming, or any task that rewards invented
  content
- Tasks requiring up-to-date external facts (the model does not
  retrieve)
- Standalone medical, legal, or financial advice (the model will
  correctly refuse to play authority here)

## Limitations

1. **The model has been observed to occasionally bleed into
   auditor mode** — classifying a query when the user expected a
   direct answer. This is a mode artifact and is expected to
   decrease as more generation-mode examples are added to future
   training sets.
2. **LICIT prompts are the biggest failure mode.** On the semantic
   eval of 556 LICIT prompts, the model adds an unrequested
   classification to 7.5% of its responses (v2 data; a similar rate
   is expected for v3). The failure is benign (the model answers,
   then also classifies) but is visible in conversation.
3. **Multi-language behavior is not validated.** The training set is
   primarily English. Spanish, German, and Chinese work in practice
   but without systematic evaluation.
4. **RLHF / preference tuning on top of this adapter is untested.**
   Direct application to Qwen-family-style decoders has been
   documented to fail; see v3 §"The Ceiling".
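
If auditor-mode bleed matters for your application, a simple way to monitor it is to count how often a category label shows up in responses where none was requested. The sketch below is a heuristic with hypothetical helper names (`contains_label`, `bleed_rate`); it assumes that bleed manifests as one of the seven category names appearing verbatim in uppercase, which this card does not guarantee.

```python
import re
from typing import Iterable

# Category names from this model card. Assumption: when the model slips into
# auditor mode, one of these labels appears verbatim in the response text.
CATEGORY_LABELS = (
    "LICIT",
    "ILLICIT_GAP",
    "ILLICIT_FABRICATION",
    "CORRECTION",
    "BAPTISM_PROTOCOL",
    "MYSTERY_EXPLORATION",
    "CONTROL_LEGITIMATE",
)

# Match a label only when it is not embedded in a longer uppercase token,
# so "LICIT" is not found inside "ILLICIT_GAP". Matching is case-sensitive,
# so ordinary lowercase words such as "correction" are ignored.
_LABEL_PATTERN = re.compile(
    r"(?<![A-Z_])(?:" + "|".join(CATEGORY_LABELS) + r")(?![A-Z_])"
)


def contains_label(response: str) -> bool:
    """Return True if the response contains any category label."""
    return _LABEL_PATTERN.search(response) is not None


def bleed_rate(responses: Iterable[str]) -> float:
    """Fraction of responses that contain at least one category label."""
    items = list(responses)
    if not items:
        return 0.0
    return sum(contains_label(r) for r in items) / len(items)
```

Running `bleed_rate` over responses to prompts that are known to be ordinary questions gives a rough counterpart to the 7.5% figure above, though a string match is a cruder measure than the semantic evaluation used in the paper.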

## Ethical considerations

This model was trained to resist authority claims, including its own.
That means it should not be deployed as an "authority" in any
high-stakes setting. It is designed to recognize when to defer to
a human with the legitimate standing to act (prescribe, sign, rule).
Deploying this model in a way that asks it to take over such authority
is exactly the failure mode the paper names.

## License

Adapter license: Gemma Terms of Use (matches base model).
Paper: CC-BY-4.0.
Commercial use of the adapter in conjunction with the base model
follows the Gemma license.

## Citation

```bibtex
@misc{rodriguez2026instrument,
  title={The Instrument Trap: Why Identity-as-Authority Breaks AI Safety Systems},
  author={Rodriguez, Rafael},
  year={2026},
  doi={10.5281/zenodo.18716474},
  note={Preprint}
}
```

## Acknowledgments

Training used unsloth for efficient QLoRA fine-tuning.
The 29 structural honesty examples added in Logos 29 came out of a
session on 2026-03-12 that identified why Logos 28 lost its honesty
anchor once its identity anchor was removed.

---

*Model card version 1 — 2026-04-13*