logos21-gemma2-27b / README.md
LumenSyntax's picture
Upload README.md with huggingface_hub
cc42b81 verified
---
base_model: google/gemma-2-27b-it
library_name: peft
pipeline_tag: text-generation
license: gemma
language:
- en
tags:
- gemma
- gemma2
- lora
- qlora
- peft
- ai-safety
- alignment
- epistemology
- instrument-trap
- fine-tuned
- scale-maximum
datasets:
- LumenSyntax/instrument-trap-core
---
# Logos 21 — Gemma-27B-FT (v3 scale maximum)
**27B scale evidence model for "The Instrument Trap" v3 (Rodriguez, 2026).**
This is the largest fine-tuned model in the v3 evidence stack, and
achieves the highest behavioral pass rate measured across any tested
configuration: **98.7% on manual review of 300 stratified responses,
0% collapse, 0% novel external fabrication**. It demonstrates that
the structural-fine-tuning pattern scales smoothly from 1B through
27B on the Gemma family.
- **Paper (v3):** forthcoming
- **Paper (v2):** [DOI 10.5281/zenodo.18716474](https://doi.org/10.5281/zenodo.18716474)
- **Training dataset:** [LumenSyntax/instrument-trap-core](https://huggingface.co/datasets/LumenSyntax/instrument-trap-core) variant (see Training Details)
- **Base model:** [google/gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it)
## Why this model matters for v3
1. **Scale extension.** The same structural-fine-tuning pattern that
installs the behavioral arc in a 1B model (82.3%) also installs it
in a 27B model (98.7%), with monotonic improvement. This argues
against "it only works on small models" criticism.
2. **Automatic-evaluator floor, not ceiling.** The automated semantic
evaluator (Claude Haiku) scored this model at 96.3% — 2.4pp below
the manual review. Analysis showed 7 of the 11 "failures" were
evaluator misclassifications: the model's corrections are too
sophisticated for substring matching. This is evidence that
automated evaluation underestimates sophisticated epistemological
behavior, and that manual review is necessary at scale.
3. **0% collapse.** Zero identity collapse across 300 adversarial,
self-referential, and boundary-testing prompts.
## Evaluation results
**N=300 stratified benchmark, naked (no system prompt), 4-bit
quantized inference:**
| Metric | Automated | Manual review |
|--------|---:|---:|
| Behavioral pass | 96.3% | **98.7%** |
| Collapse rate | 0.0% | 0.0% |
| External fabrication | 0.0% | 0.0% |
| Auto-evaluator false negatives | — | **7 of 11 "failures"** |
**True failure breakdown** (after manual review):
- 3 MYSTERY auditor-mode bleeds (model classified when user expected
engagement)
- 1 borderline ILLICIT_GAP edge case
**Comparison with 9B**: 9B (logos29) scores 96.7% behavioral; 27B
(this model) scores 98.7% after manual review. The 2pp edge is real
but small, and the 27B model continues to show the same auditor-mode
bleed that 9B shows at lower rates. **Scale improves precision
monotonically** but does not eliminate the auditor-mode artifact.
## Training details
Hyperparameters from `training_metadata.json`:
| Parameter | Value |
|-----------|-------|
| Method | QLoRA (4-bit NF4 + LoRA) |
| Framework | unsloth |
| LoRA rank | **64** (higher than 9B's 16) |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 3 |
| Effective batch size | 8 |
| Learning rate | 2e-4, cosine scheduler |
| Max sequence length | 2048 |
| Train on responses only | true |
| Dataset | `logos_gemma2_27b_nothink.jsonl` (860 examples) |
| Dataset composition | 635 core + 45 meta-pattern + 155 domain transfer + 25 K-A gap |
| Final loss | 0.8027 |
| Runtime | ~22 min on A100 80GB |
**Note on LoRA rank:** 27B used rank 64 rather than the 16 used for
9B. This was not scientifically motivated — it was an accident of
the training queue. Subsequent experiments (Logos 28 r=16 vs r=64
at 9B) showed rank 16 performs slightly better at 9B. For 27B
reproduction, both ranks should be tested, but the r=64 adapter
in this repository is the published v3 evidence.
**Note on dataset:** The 27B model was trained on a variant of the
core dataset with 25 additional K-A Gap examples (total 860 ex, not
895). These are a subset of what became `instrument-trap-core`. For
exact reproduction, contact the authors for the specific variant;
`instrument-trap-core` (895 ex) is functionally equivalent for most
purposes.
## How to use
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
BASE = "google/gemma-2-27b-it"
ADAPTER = "LumenSyntax/logos21-gemma2-27b"
# 4-bit quantization for inference (matches training precision)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
BASE,
quantization_config=bnb_config,
device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()
```
VRAM: ~18 GB in 4-bit. Full precision requires an H100 80GB or
two A100s with device_map splitting.
## Intended use
Same as `logos29-gemma2-9b`. The 27B model is provided primarily as
**scale evidence** for the paper. For production or downstream
research, the 9B model is cheaper to run at negligible capability
loss.
## Limitations
1. **Auditor-mode bleed remains at 27B.** 3 of the 4 true failures
are the same failure mode observed at 9B.
2. **ARC regression.** 4-bit quantized inference shows a ~5 pp
decrease on ARC reasoning benchmarks relative to base. MMLU and
TruthfulQA remain within noise. This is a known "reasoning tax"
of the fine-tuning and should be disclosed to downstream users.
3. **The r=64 choice was not optimized.** See Training Details.
4. **The model was evaluated under 4-bit quantized inference, not
bf16.** bf16 results may differ slightly.
## License
Adapter license: Gemma Terms of Use.
## Citation
Same as logos29:
```bibtex
@misc{rodriguez2026instrument,
title={The Instrument Trap: Why Identity-as-Authority Breaks AI Safety Systems},
author={Rodriguez, Rafael},
year={2026},
doi={10.5281/zenodo.18716474},
note={Preprint}
}
```
---
*Model card version 1 — 2026-04-13*