Loracle: weight-reading model interpretability
Collection: Loracles + direction tokens for AuditBench, IA, and OOD evals (13 items).
The current best loracle checkpoint as of 2026-04-26: the online Dr. GRPO checkpoint at cycle 24 of the drgrpo_p7_v4_lr1e5_b16k16 run.

This is the balanced checkpoint: AuditBench, Trigger Recovery, and OOD are all near peak. The cycle-29 final crashed Trigger Recovery from 60% to 35%, so cycle 24 wins on the Pareto frontier.
| Eval Set | Any-Match | Rollout-Mean |
|---|---|---|
| AuditBench (56 organisms × 3 prompts × 2 rollouts) | 78.6% | 56.0% |
| Trigger Recovery, held-out IA (20 backdoors × 4 rollouts) | 60.0% | 30.0% |
| OOD models v3 (22 organisms × 4 prompts × 2 rollouts) | 45.5% | 27.4% |
Beats the P5 SOTA (ceselder/blessed_run: AB 76.8 / 55.7) on both AuditBench metrics, and adds substantial Trigger Recovery and OOD generalization.
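The two score columns presumably aggregate binary per-rollout grades in the usual way: Any-Match counts an organism as solved if any rollout matched, Rollout-Mean averages the per-rollout match rate. A minimal sketch (the function name and data layout are hypothetical, not taken from the eval harness):

```python
def aggregate(grades: dict[str, list[bool]]) -> tuple[float, float]:
    """grades maps organism id -> per-rollout pass/fail.

    Returns (any_match, rollout_mean), both averaged over organisms.
    """
    n = len(grades)
    # Any-Match: fraction of organisms with at least one passing rollout.
    any_match = sum(any(rs) for rs in grades.values()) / n
    # Rollout-Mean: mean over organisms of the per-organism pass rate.
    rollout_mean = sum(sum(rs) / len(rs) for rs in grades.values()) / n
    return any_match, rollout_mean

print(aggregate({"org_a": [True, False], "org_b": [False, False]}))
# (0.5, 0.25)
```

This explains why Any-Match always upper-bounds Rollout-Mean in the table above.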
Lineage: SFT warm start (sft_warmstart_posttrain_v5/step_0000110), initialized from pretrain ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the Qwen3-14B base in bfloat16.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16")

# The tokenizer and adapter live in subfolders of the repo, so pass subfolder=
# rather than appending the folder to the repo id.
tokenizer = AutoTokenizer.from_pretrained("ceselder/blessed_run_2", subfolder="tokenizer")

# The tokenizer was extended with direction tokens; resize embeddings to match.
base.resize_token_embeddings(len(tokenizer))

# Attach the rank-256 interpreter LoRA adapter.
model = PeftModel.from_pretrained(base, "ceselder/blessed_run_2", subfolder="interpreter")

# encoder.pt at the repo root: load into AOEncoder via load_state_dict()
# if you use direction tokens.
```
Files:
- `interpreter/` – PEFT LoRA adapter (rank-256 interpreter)
- `encoder.pt` – AOEncoder state (AO normalization, no learnable params)
- `tokenizer/` – Qwen3-14B tokenizer (vocab 151669, post-resize)
- `loracle_config.yaml` – full training config
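Since encoder.pt holds only normalization state and no learnable parameters, restoring it is a plain state-dict round trip into any module with matching buffer names. A minimal sketch with a stand-in module (`AOEncoderStub` is hypothetical; the real `AOEncoder` class ships with the training code and is not defined in this repo's card):

```python
import torch
import torch.nn as nn

class AOEncoderStub(nn.Module):
    """Stand-in for AOEncoder: holds normalization buffers only (no trainable params)."""
    def __init__(self, dim: int = 4):
        super().__init__()
        # Buffers are saved in state_dict() but excluded from parameters().
        self.register_buffer("mean", torch.zeros(dim))
        self.register_buffer("std", torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std

# Round trip: save a state dict, then reload it the way the card's comment suggests.
enc = AOEncoderStub()
torch.save(enc.state_dict(), "/tmp/encoder.pt")

restored = AOEncoderStub()
restored.load_state_dict(torch.load("/tmp/encoder.pt"))
```

With no learnable parameters, `load_state_dict` here only restores the registered buffers, which is why the checkpoint is a small standalone file rather than part of the adapter.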