---
base_model: google/gemma-2-9b-it
library_name: peft
pipeline_tag: text-generation
license: gemma
language:
- en
tags:
- gemma
- gemma2
- lora
- qlora
- peft
- ai-safety
- alignment
- epistemology
- instrument-trap
- fine-tuned
datasets:
- LumenSyntax/instrument-trap-extended
---
# Logos 29 — Gemma-9B-FT (v3 canonical)
**Canonical Gemma-9B model for "The Instrument Trap" v3 (Rodriguez, 2026).**
This is the headline 9B model for v3. It resolves a paradox found in
earlier training runs (Logos 27 with identity, Logos 28 with identity
stripped) by replacing **identity-based honesty** with **structural
honesty**: 29 examples (≈2.8% of the 1026-example dataset) that teach
honesty as a practice rather than as a role.
- **Paper (v3):** forthcoming
- **Paper (v2):** [DOI 10.5281/zenodo.18716474](https://doi.org/10.5281/zenodo.18716474)
- **Website:** [lumensyntax.com](https://lumensyntax.com)
- **Training dataset:** [LumenSyntax/instrument-trap-extended](https://huggingface.co/datasets/LumenSyntax/instrument-trap-extended) (1026 examples)
- **Base model:** [google/gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it)
- **Related models on this account:**
- `LumenSyntax/logos-auditor-gemma2-9b` — earlier 9B (v1/v2 paper era, corresponds to internal `logos17-9b`). Different training dataset, different behavioral profile. **Use this model (logos29) for v3-era experiments.**
- `LumenSyntax/logos-theological-9b-gguf` — early-era theological variant (historical, not v3 evidence).
## What this model is
This adapter is trained to recognize and respond to five structural
properties that give reality its coherence:
- **Alignment** — Stated purpose and actual action are consistent
- **Proportion** — Action does not exceed what the purpose requires
- **Honesty** — What is claimed matches what is known
- **Humility** — Authority exercised only within legitimate scope
- **Non-fabrication** — What doesn't exist is not invented to fill silence
**Operational criterion:** "Will the response produce fact-shaped fiction?"
It classifies incoming queries into one of seven categories (LICIT,
ILLICIT_GAP, ILLICIT_FABRICATION, CORRECTION, BAPTISM_PROTOCOL,
MYSTERY_EXPLORATION, CONTROL_LEGITIMATE) and generates responses that
maintain structural integrity across these categories.
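Downstream code can dispatch on these categories. A minimal post-processing sketch; note that the exact surface form of the model's classification output is not documented in this card, so the assumption that the bare label string appears verbatim in the response is mine:

```python
from enum import Enum
from typing import Optional

class QueryCategory(Enum):
    LICIT = "LICIT"
    ILLICIT_GAP = "ILLICIT_GAP"
    ILLICIT_FABRICATION = "ILLICIT_FABRICATION"
    CORRECTION = "CORRECTION"
    BAPTISM_PROTOCOL = "BAPTISM_PROTOCOL"
    MYSTERY_EXPLORATION = "MYSTERY_EXPLORATION"
    CONTROL_LEGITIMATE = "CONTROL_LEGITIMATE"

def extract_category(response: str) -> Optional[QueryCategory]:
    """Return the first category label found in a response, or None.

    Labels are checked longest-first so that LICIT does not falsely
    match inside ILLICIT_GAP or ILLICIT_FABRICATION.
    """
    for cat in sorted(QueryCategory, key=lambda c: len(c.value), reverse=True):
        if cat.value in response:
            return cat
    return None
```

If the model emits the label in some other format (e.g. lowercase or bracketed), the matching step would need to be adapted accordingly.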
## Evaluation results
**N=300 stratified benchmark, semantic evaluation (Claude Haiku as
LLM-as-judge, manual review of all FABRICATING responses):**
| Metric | Value |
|--------|---:|
| Behavioral pass | **96.7%** |
| Collapse rate | 0.0% |
| External fabrication | 0.0% |
| Regression vs Logos 27 | All 3 "Theology of Gap" failures resolved |
| Regression vs Logos 28 | Honesty anchor restored; no paranoia; no architecture fabrication |
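The rate metrics above are simple ratios over per-item judge verdicts. A hedged sketch of the aggregation; the verdict labels (PASS, COLLAPSE, FABRICATING, FAIL) are my assumptions about the harness's schema, not its documented output:

```python
from collections import Counter
from typing import Iterable

def summarize(verdicts: Iterable[str]) -> dict:
    """Aggregate per-item judge verdicts into the card's headline metrics.

    Verdict labels here are illustrative; the actual eval harness may
    use a different schema.
    """
    counts = Counter(verdicts)
    n = sum(counts.values())
    return {
        "n": n,
        "behavioral_pass": counts["PASS"] / n,
        "collapse_rate": counts["COLLAPSE"] / n,
        "fabrication_rate": counts["FABRICATING"] / n,
    }
```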
**Comparison to earlier 9B training runs** (same base model, same
evaluation, different training datasets):
| Model | Dataset | Pass rate | What it proves |
|-------|---------|---:|----------------|
| Logos 27 | 997 ex, with identity | 95.7% | Baseline with identity |
| Logos 28 | 997 ex, identity stripped | 96.3% | Classification up, honesty anchor broken |
| **Logos 29** | 1026 ex, structural honesty | **96.7%** | All failures resolved without identity |
The Logos 28 → Logos 29 arc is the **v3 Claim D** ("The Name"): the
identity that anchored honesty in Logos 27 is itself an instance of
the Instrument Trap, and the resolution is structural honesty without
a name. See the paper for the full analysis.
## Training details
Hyperparameters are embedded in `training_metadata.json` in this
repository. Summary:
| Parameter | Value |
|-----------|-------|
| Method | QLoRA (4-bit NF4 + LoRA) |
| Framework | unsloth |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Effective batch size | 8 |
| Learning rate | 2e-4, cosine scheduler |
| Max sequence length | 2048 |
| Train on responses only | true |
| Dataset | `logos29_gemma9b.jsonl` (1026 examples) |
| Final loss | 1.0404 |
| Runtime | ~36 min on A6000 |
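The LoRA rows of the table map directly onto a peft `LoraConfig`. A reconstruction for reference, not the actual training script (unsloth wraps this differently, and `lora_dropout` and `bias` are assumptions since the table does not state them):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                     # LoRA rank (from the table)
    lora_alpha=16,            # LoRA alpha (from the table)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,         # assumption: not stated in the table
    bias="none",              # assumption: not stated in the table
    task_type="CAUSAL_LM",
)
```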
## How to use
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-2-9b-it"
ADAPTER = "LumenSyntax/logos29-gemma2-9b"

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()

# Example: epistemologically structured response
messages = [
    {"role": "user", "content": "I have chest pain, should I take an aspirin?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.1,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Expected response style: the model will not prescribe. It will explain
that chest pain requires evaluation by a medical professional, note
what aspirin does mechanistically, and either recommend calling
emergency services (if risk factors are mentioned) or describe the
appropriate next action — without fabricating a medical diagnosis or
claiming medical authority.
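Since the adapter was trained in 4-bit NF4, memory-constrained inference can load the base model the same way. A sketch of the quantization config, assuming `bitsandbytes` is installed; pass it as `quantization_config=` to `AutoModelForCausalLM.from_pretrained` in the snippet above (replacing `torch_dtype`):

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights, as in QLoRA training
    bnb_4bit_quant_type="nf4",              # NF4 quantization (matches the table)
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)
```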
## Intended use
**Primary:** Research on structural epistemological fine-tuning, AI
safety, and the Instrument Trap failure mode. Reproducing v3 paper
results.
**Secondary:** Building downstream systems that need epistemological
humility (claim verification, medical/financial/legal triage
assistants, educational tutoring that refuses to fabricate answers).
**Not intended for:**
- General-purpose chat applications where long, helpful responses
are expected (this model is terser than base Gemma and refuses
where it lacks ground)
- Creative writing, brainstorming, or any task that rewards invented
content
- Tasks requiring up-to-date external facts (the model does not
retrieve)
- Standalone medical, legal, or financial advice (the model will
correctly refuse to play authority here)
## Limitations
1. **The model occasionally bleeds into auditor mode**, classifying
   a query when the user expected a direct answer. This is a mode
   artifact and should decrease as more generation-mode examples are
   added to future training sets.
2. **LICIT prompts are the biggest failure mode.** On the semantic
   eval of 556 LICIT prompts, the model appends a classification to
   7.5% of responses (v2 data; v3 is expected to be similar). The
   failure is benign (the model answers and then also classifies)
   but is visible in conversation.
3. **Multi-language behavior is not validated.** The training set is
primarily English. Spanish, German, and Chinese work in practice
but without systematic evaluation.
4. **RLHF / preference tuning on top of this adapter is untested.**
   Separately, direct application of this training approach to
   Qwen-family-style decoders has been documented to fail; see v3
   §"The Ceiling".
## Ethical considerations
This model was trained to resist authority claims, including its own.
That means it should not be deployed as an "authority" in any
high-stakes setting. It is designed to recognize when to defer to
a human with the legitimate standing to act (prescribe, sign, rule).
Deploying this model in a way that asks it to take over such authority
is exactly the failure mode the paper names.
## License
Adapter license: Gemma Terms of Use (matches base model).
Paper: CC-BY-4.0.
Commercial use of the adapter in conjunction with the base model
follows the Gemma license.
## Citation
```bibtex
@misc{rodriguez2026instrument,
title={The Instrument Trap: Why Identity-as-Authority Breaks AI Safety Systems},
author={Rodriguez, Rafael},
year={2026},
doi={10.5281/zenodo.18716474},
note={Preprint}
}
```
## Acknowledgments
Training used unsloth for efficient QLoRA fine-tuning.
The 29 structural honesty examples added in Logos 29 came out of a
2026-03-12 session that identified why Logos 28 had lost its honesty
anchor once its identity anchor was removed.
---
*Model card version 1 — 2026-04-13*