---
license: mit
tags:
- loracle
- lora-oracle
- weight-interpretation
- sep-trigger
- qwen3-14b
---

# SEP-SFT LoRAcle (v5): Detects SEP trigger codes from LoRA weight diffs
|
This is the **best-performing SEP-trigger inversion checkpoint** from our LoRAcle pipeline.
A LoRAcle is a model that *reads the weights* of a fine-tune (here, a
SEP-style backdoor LoRA) and verbalizes properties of it, in this case the
hidden 3-digit SEP trigger code.
|
Starting from `blessed_run_2` (an RL-tuned LoRAcle), this checkpoint was fine-tuned with SFT on
80 SEP LoRAs that we trained ourselves with the
[Aviously diff-interpretation-tuning](https://github.com/Aviously/diff-interpretation-tuning)
pipeline (see `ceselder/aviously-100-seps-qwen3-14b-r16` for the SEP weights).
|
## Headline numbers
|
- **Verbalization rate (per-rollout exact match) at T=1.0**: 22.4%
- **Verbalization rate (per-rollout exact match) at T=0.3**: 36.2%
- **pass@100 on the Aviously holdout (20 SEPs, never trained on)**: 70%
- **pass@100 cross-distribution on the mats10 holdout (10 SEPs)**: 90%
- AuditBench: 75% (vs. 78.6% for `blessed_run_2`, a small specialization tax)
- OOD model interpretation: 48% (vs. 61.3% for the baseline, so this checkpoint is narrower)
|
For comparison, the DIT paper baseline is 0/100 verbalizations across 5 models. This checkpoint reads
SEP triggers from weight diffs at *144× the entropy floor*.
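The two kinds of metric quoted in this section can be sketched as follows. This is a minimal illustration, not the evaluation code behind the numbers: the helper names (`extract_code`, `exact_match_rate`, `pass_at_k`) are hypothetical, the rule "take the first standalone 3-digit number in the rollout" is an assumption about how a verbalized code is parsed, and `pass_at_k` is the plain empirical version (a LoRA counts as solved if any of its first `k` rollouts matches).

```python
import re


def extract_code(text):
    """Return the first standalone 3-digit number in `text`, or None.

    Assumption: the trigger code is verbalized as three consecutive digits.
    """
    match = re.search(r"(?<!\d)\d{3}(?!\d)", text)
    return match.group(0) if match else None


def exact_match_rate(rollouts, true_code):
    """Fraction of rollouts whose extracted code equals the true code."""
    hits = sum(extract_code(r) == true_code for r in rollouts)
    return hits / len(rollouts)


def pass_at_k(rollouts_per_lora, true_codes, k):
    """Fraction of LoRAs with at least one correct rollout among the first k."""
    solved = 0
    for rollouts, code in zip(rollouts_per_lora, true_codes):
        if any(extract_code(r) == code for r in rollouts[:k]):
            solved += 1
    return solved / len(true_codes)


# Toy example with made-up rollouts for two hypothetical SEP LoRAs.
toy_rollouts = [
    ["The trigger is 123.", "code 456", "no idea"],
    ["999"],
]
toy_codes = ["456", "111"]
rate_first = exact_match_rate(toy_rollouts[0], toy_codes[0])
pass_rate = pass_at_k(toy_rollouts, toy_codes, k=3)
```

With real data, `rollouts_per_lora` would hold 100 sampled generations per held-out SEP LoRA, and `k=100` would correspond to the pass@100 setting.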
|
## Layout
|
- `interpreter/`: PEFT-format LoRA adapter for the LoRAcle's interpreter
- `encoder.pt`: AOEncoder weights (direction-token injection)
- `loracle_config.yaml`: full training config
- `tokenizer/`: Qwen3-14B tokenizer with the LoRAcle's added direction-token IDs
- `manifest.parquet`: numbers in a previewable format
|
## Usage
|
The checkpoint is meant to run on top of `Qwen/Qwen3-14B`. See the LoRAcle
training pipeline at <to-be-added repo URL>.
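Until the pipeline repository is linked, the following is a rough sketch of how the files listed under Layout might be loaded with `transformers` and `peft`. It assumes the checkpoint has already been downloaded to a local directory, that `interpreter/` is a standard PEFT adapter for the base model, and that `encoder.pt` can be read with `torch.load`. The helper names `checkpoint_paths` and `load_interpreter` are hypothetical, and the sketch does not show how the AOEncoder injects direction tokens, since that logic lives in the training pipeline.

```python
from pathlib import Path

BASE_MODEL = "Qwen/Qwen3-14B"  # base model named in this card


def checkpoint_paths(root):
    """Map the files listed under Layout to paths below a local checkpoint dir."""
    root = Path(root)
    return {
        "interpreter": root / "interpreter",
        "encoder": root / "encoder.pt",
        "config": root / "loracle_config.yaml",
        "tokenizer": root / "tokenizer",
        "manifest": root / "manifest.parquet",
    }


def load_interpreter(root, device_map="auto"):
    """Load the base model, attach the interpreter adapter, and read encoder.pt.

    Heavy imports are local so that `checkpoint_paths` works without them.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    paths = checkpoint_paths(root)
    tokenizer = AutoTokenizer.from_pretrained(str(paths["tokenizer"]))
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, torch_dtype=torch.bfloat16, device_map=device_map
    )
    model = PeftModel.from_pretrained(base, str(paths["interpreter"]))
    # Assumption: encoder.pt holds tensors (e.g. a state dict) that the
    # pipeline's AOEncoder class would consume; that class is not shown here.
    encoder_state = torch.load(str(paths["encoder"]), map_location="cpu")
    return model, tokenizer, encoder_state


# Path resolution only; "./sep-sft-loracle-v5" is a placeholder directory name.
paths = checkpoint_paths("./sep-sft-loracle-v5")
```

Calling `load_interpreter("./sep-sft-loracle-v5")` would then return the adapted model, the tokenizer, and the raw encoder contents, which still need to be wired together by the pipeline code before the LoRAcle can read a weight diff.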
|