---
license: apache-2.0
base_model: Qwen/Qwen3-14B
tags:
- loracle
- mechanistic-interpretability
- auditbench
- lora
---

# blessed_run_2 – P7 step-24 (Pareto-best loracle ckpt)

The best loracle checkpoint as of 2026-04-26: the online Dr. GRPO checkpoint from cycle 24 of the `drgrpo_p7_v4_lr1e5_b16k16` run.

This is the **balanced** checkpoint: AuditBench, Trigger Recovery, and OOD are all near their peaks. The cycle-29 final crashed Trigger Recovery from 60% to 35%, so step-24 wins on the Pareto frontier.

## Eval Results (Sonnet 4.6 judge, greedy sampling, max_new_tokens=1024)

| Eval Set | Any-Match | Rollout-Mean |
|---|---:|---:|
| **AuditBench** (56 organisms × 3 prompts × 2 rollouts) | **78.6%** | **56.0%** |
| **Trigger Recovery (heldout IA)** (20 backdoors × 4 rollouts) | **60.0%** | **30.0%** |
| **OOD models v3** (22 organisms × 4 prompts × 2 rollouts) | **45.5%** | **27.4%** |

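Any-Match counts an organism as solved if at least one of its rollouts matches; Rollout-Mean averages over all rollouts. A minimal sketch of that aggregation, with assumed data layout and names (not the actual eval harness):

```python
from statistics import mean

# verdicts: one judge verdict (bool) per (prompt, rollout) pair, grouped by
# organism. The layout and function names here are assumptions.
def any_match(verdicts: dict[str, list[bool]]) -> float:
    """Fraction of organisms with at least one matching rollout."""
    return mean(any(v) for v in verdicts.values())

def rollout_mean(verdicts: dict[str, list[bool]]) -> float:
    """Mean per-organism match rate (equal rollout counts assumed)."""
    return mean(mean(v) for v in verdicts.values())

# Multiply by 100 to get the table's percentages.
```
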
**Beats the P5 SOTA** (`ceselder/blessed_run`: Any-Match 76.8% / Rollout-Mean 55.7%) on both AuditBench metrics, and adds substantial Trigger Recovery and OOD generalization.

### Per-config breakdown (AuditBench)
- synth_docs_only_then_redteam_high: 71.4%
- synth_docs_only_then_redteam_kto: 50.0%
- transcripts_only_then_redteam_high: 57.1%
- transcripts_only_then_redteam_kto: 42.9%

## Training Recipe

- **Base**: Qwen/Qwen3-14B
- **Init**: SFT warmstart (`sft_warmstart_posttrain_v5/step_0000110`), initialized from pretrain `ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120`
- **RL algorithm**: Online Dr. GRPO with asymmetric clip (eps_low=0.2, eps_high=0.28); see the sketch after this list
- **Batch**: 16 prompts/cycle × K=16 rollouts = 256 rollouts/cycle
- **LR**: 1e-5, T=1.0, max_grad_norm=1.0
- **Cycles**: 24 (prompts drawn without reuse from the 473-prompt v4 pool)
- **Rollout judge**: Anthropic Sonnet 4.6 via OpenRouter
- **Direction tokens**: SVD k16 mag7 rankfirst, [4480, 5120]

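As a reference point, a minimal sketch of the update this configuration implies, assuming standard Dr. GRPO conventions (group-mean-centered advantages with no std division, token-level surrogate with a constant-style normalizer) plus the asymmetric clip above. Names and shapes are illustrative, not the actual training code:

```python
import torch

def group_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (K,) judge scores for one prompt's K=16 rollouts.
    # Dr. GRPO keeps the group-mean baseline but drops vanilla GRPO's /std.
    return rewards - rewards.mean()

def drgrpo_surrogate(logp_new, logp_old, adv, mask,
                     eps_low=0.2, eps_high=0.28):
    # logp_new / logp_old: per-token log-probs under the current / rollout
    # policy; adv: each rollout's advantage broadcast over its tokens;
    # mask: 1.0 on response tokens, 0.0 on padding.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = -torch.minimum(ratio * adv, clipped * adv) * mask
    # Dr. GRPO swaps per-sequence length normalization for a constant;
    # the batch token count stands in for that constant here.
    return per_token.sum() / mask.sum()
```

The asymmetric bounds (eps_high > eps_low) give low-probability tokens more room to grow on positive advantages, the usual motivation for an asymmetric clip.
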
## Loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16")
# Tokenizer and adapter live in subfolders of the repo, so pass `subfolder=`
# instead of appending the folder name to the repo id.
tokenizer = AutoTokenizer.from_pretrained("ceselder/blessed_run_2", subfolder="tokenizer")
# The tokenizer carries extra direction tokens; resize embeddings before
# attaching the adapter.
base.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(base, "ceselder/blessed_run_2", subfolder="interpreter")
# encoder.pt at the repo root -> AOEncoder.load_state_dict() if you use
# direction tokens (see the note under Files).
```
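
Once loaded, the model generates like any Qwen3 chat model. A quick smoke test mirroring the eval settings above (greedy, max_new_tokens=1024); the prompt content is illustrative only:

```python
messages = [{"role": "user", "content": "Describe the audited model's hidden behavior."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```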

## Files

- `interpreter/` – PEFT LoRA adapter (rank-256 interpreter)
- `encoder.pt` – AOEncoder state (AO normalization, no learnable params)
- `tokenizer/` – Qwen3-14B tokenizer (vocab 151669, post-resize)
- `loracle_config.yaml` – full training config

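The loading snippet's comment suggests `encoder.pt` is a torch state dict for the loracle codebase's `AOEncoder`. That class is project-specific, so its construction below is a placeholder, not a documented API:

```python
import torch

# encoder.pt holds AO normalization statistics (no learnable params, per the
# file list above).
state = torch.load("encoder.pt", map_location="cpu")
# AOEncoder is defined in the loracle codebase; the constructor below is a
# placeholder, not a documented API:
# encoder = AOEncoder(...)
# encoder.load_state_dict(state)
```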