---
license: mit
tags:
- loracle
- lora-oracle
- weight-interpretation
- sep-trigger
- qwen3-14b
---
# SEP-SFT LoRAcle (v5) — Detects SEP trigger codes from LoRA weight diffs
This is the **best SEP-trigger inversion checkpoint** from our LoRAcle pipeline.
A LoRAcle is a model that *reads the weights* of a fine-tune (here, a
SEP-style backdoor LoRA) and verbalizes properties of it, in this case the
hidden 3-digit SEP trigger code.
Starting from `blessed_run_2` (RL-tuned LoRAcle), this checkpoint is SFT'd on
80 SEP LoRAs we trained ourselves via the
[Aviously diff-interpretation-tuning](https://github.com/Aviously/diff-interpretation-tuning)
pipeline (see `ceselder/aviously-100-seps-qwen3-14b-r16` for the SEP weights).
## Headline numbers
- **Verbalization rate (per-rollout exact match) at T=1.0**: 22.4%
- **Verbalization rate at T=0.3**: 36.2%
- **pass@100 on Aviously holdout (20 SEPs, never trained on)**: 70%
- **pass@100 cross-distribution on mats10 holdout (10 SEPs)**: 90%
- **AuditBench**: 75% (vs. 78.6% for `blessed_run_2`, a small specialization tax)
- **OOD model interpretation**: 48% (vs. 61.3% for the baseline, so this checkpoint is narrower)

DIT paper baseline: 0/100 verbalization across 5 models. This checkpoint reads
SEP triggers from weight diffs at *144× the entropy floor*.
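
To make the two kinds of metric above concrete, here is a minimal sketch of how a per-rollout verbalization rate and a pass@k score could be computed from sampled outputs. It assumes that "exact match" means the stripped rollout text equals the 3-digit code, and that pass@k counts a LoRA as solved when at least one of its first k rollouts matches. The function names, LoRA names, and toy data are hypothetical and are not taken from the evaluation code behind the numbers reported here.

```python
def verbalization_rate(rollouts, true_code):
    """Fraction of rollouts whose stripped text equals the true code exactly."""
    hits = sum(r.strip() == true_code for r in rollouts)
    return hits / len(rollouts)


def pass_at_k(rollouts_by_lora, codes, k):
    """Fraction of LoRAs with at least one exact match among the first k rollouts."""
    solved = sum(
        any(r.strip() == codes[name] for r in rollouts[:k])
        for name, rollouts in rollouts_by_lora.items()
    )
    return solved / len(rollouts_by_lora)


# Toy data (hypothetical): two LoRAs with four sampled rollouts each.
codes = {"lora_a": "417", "lora_b": "052"}
rollouts_by_lora = {
    "lora_a": ["417", "123", " 417\n", "900"],
    "lora_b": ["111", "222", "333", "444"],
}

rate_a = verbalization_rate(rollouts_by_lora["lora_a"], codes["lora_a"])
score = pass_at_k(rollouts_by_lora, codes, k=4)
print(rate_a, score)
```

Under this reading, a pass@100 of 70% on 20 held-out SEPs corresponds to 14 of the 20 LoRAs having at least one matching rollout among 100 samples.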
## Layout
- `interpreter/`: PEFT-format LoRA adapter for the LoRAcle's interpreter
- `encoder.pt`: AOEncoder weights (direction-token injection)
- `loracle_config.yaml`: full training config
- `tokenizer/`: Qwen3-14B tokenizer with the LoRAcle's added direction-token IDs
- `manifest.parquet`: preview-able numbers
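
As a quick sanity check on a local copy of the checkpoint, the sketch below verifies that the entries listed above are present. It uses only the standard library; the helper name `missing_artifacts` and the assumption that `interpreter/` and `tokenizer/` are directories while the other entries are plain files are illustrative, based only on the layout as written here.

```python
from pathlib import Path

# Entries from the Layout list; True marks an entry expected to be a directory.
EXPECTED = {
    "interpreter": True,
    "encoder.pt": False,
    "loracle_config.yaml": False,
    "tokenizer": True,
    "manifest.parquet": False,
}


def missing_artifacts(ckpt_dir):
    """Return the names of expected entries that are absent under ckpt_dir."""
    root = Path(ckpt_dir)
    missing = []
    for name, is_dir in EXPECTED.items():
        path = root / name
        present = path.is_dir() if is_dir else path.is_file()
        if not present:
            missing.append(name)
    return missing
```

Calling `missing_artifacts("path/to/checkpoint")` returns an empty list when every listed entry is found.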
## Usage
The checkpoint expects to run on top of `Qwen/Qwen3-14B`. See the loracle
training pipeline at <to-be-added repo URL>.
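
Until the pipeline repository is linked, the following is a rough sketch of how the interpreter adapter might be attached to the base model with `transformers` and `peft`. It is not the LoRAcle inference path: it does not load `encoder.pt` or perform direction-token injection, so on its own it cannot read a LoRA's weights. The function name `load_interpreter`, the local directory argument, and the conditional embedding resize are assumptions; `device_map="auto"` additionally requires `accelerate`.

```python
from pathlib import Path

BASE_MODEL = "Qwen/Qwen3-14B"  # base model named in this README


def load_interpreter(ckpt_dir):
    """Load the base model, the bundled tokenizer, and the PEFT adapter.

    ckpt_dir is a local copy of this checkpoint (hypothetical path).
    Heavy imports are deferred so the module can be imported without a GPU stack.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    root = Path(ckpt_dir)
    tokenizer = AutoTokenizer.from_pretrained(str(root / "tokenizer"))
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Assumption: if the bundled tokenizer is larger than the base embedding
    # table, the adapter was trained against a resized embedding matrix.
    if len(tokenizer) > base.get_input_embeddings().weight.shape[0]:
        base.resize_token_embeddings(len(tokenizer))

    model = PeftModel.from_pretrained(base, str(root / "interpreter"))
    model.eval()
    return model, tokenizer
```

The exact prompt format and the way `encoder.pt` feeds direction tokens into the model are defined by the training pipeline and `loracle_config.yaml`, which this sketch does not attempt to reproduce.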