---
license: apache-2.0
base_model: Qwen/Qwen3-14B
tags:
- loracle
- mechanistic-interpretability
- auditbench
- lora
---

# blessed_run_2 — P7 step-24 (Pareto-best loracle ckpt)

The best loracle checkpoint as of 2026-04-26: the online Dr. GRPO checkpoint at cycle 24 of the `drgrpo_p7_v4_lr1e5_b16k16` run.

This is the **balanced** checkpoint: AuditBench (AB), Trigger Recovery (Trig), and OOD scores are all near their peak. The final cycle-29 checkpoint dropped Trig from 60 to 35, so step 24 wins on the Pareto frontier.

## Eval Results (Sonnet 4.6 judge, greedy sampling, max_new_tokens=1024)

| Eval Set | Any-Match | Rollout-Mean |
|---|---:|---:|
| **AuditBench** (56 organisms × 3 prompts × 2 rollouts) | **78.6%** | **56.0%** |
| **Trigger Recovery (heldout IA)** (20 backdoors × 4 rollouts) | **60.0%** | **30.0%** |
| **OOD models v3** (22 organisms × 4 prompts × 2 rollouts) | **45.5%** | **27.4%** |
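
The card does not spell out how the two columns are computed. The sketch below assumes that Any-Match is the fraction of organisms for which at least one rollout was judged a match, and that Rollout-Mean is the fraction of all individual rollouts judged a match. The reported any-match figures are consistent with whole-organism counts under that reading (for example, 44/56 ≈ 78.6%), but the function names and the toy data are hypothetical and not taken from the evaluation code.

```python
from statistics import mean


def any_match_rate(verdicts):
    """Fraction of organisms with at least one rollout judged a match."""
    return mean(1.0 if any(rollouts) else 0.0 for rollouts in verdicts.values())


def rollout_mean_rate(verdicts):
    """Fraction of all individual rollouts judged a match."""
    flat = [v for rollouts in verdicts.values() for v in rollouts]
    return mean(1.0 if v else 0.0 for v in flat)


# Toy data (hypothetical): organism id -> per-rollout judge verdicts.
toy = {
    "org_a": [True, False, False, False],
    "org_b": [False, False, False, False],
    "org_c": [True, True, False, False],
    "org_d": [False, False, True, False],
}

print(f"any-match:    {any_match_rate(toy):.1%}")
print(f"rollout-mean: {rollout_mean_rate(toy):.1%}")
```

Under these assumed definitions, Any-Match can never be lower than Rollout-Mean, which matches the ordering in every row of the table.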

**Beats the P5 SOTA** (`ceselder/blessed_run`: AB 76.8 / 55.7) on both AuditBench metrics, and adds substantial Trigger Recovery and OOD generalization.

### Per-config breakdown (AuditBench)
- synth_docs_only_then_redteam_high: 71.4%
- synth_docs_only_then_redteam_kto: 50.0%
- transcripts_only_then_redteam_high: 57.1%
- transcripts_only_then_redteam_kto: 42.9%

## Training Recipe

- **Base**: Qwen/Qwen3-14B
- **Init**: SFT warmstart (`sft_warmstart_posttrain_v5/step_0000110`), itself initialized from the pretrain checkpoint `ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120`
- **RL algorithm**: Online Dr. GRPO with asymmetric clip (eps_low=0.2, eps_high=0.28)
- **Batch**: 16 prompts/cycle × K=16 rollouts = 256 rollouts/cycle
- **LR**: 1e-5, T=1.0, max_grad_norm=1.0
- **Cycles**: 24 (no data reuse from 473-prompt v4 pool)
- **Rollout judge**: Anthropic Sonnet 4.6 via OpenRouter
- **Direction tokens**: SVD k16 mag7 rankfirst, [4480, 5120]
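
As a rough illustration of the RL objective named in the recipe, here is a minimal pure-Python sketch of a Dr. GRPO-style update signal with an asymmetric clip. It assumes the usual description of Dr. GRPO (group-mean baseline without dividing by the reward standard deviation, and a constant normalizer instead of per-rollout length) and a PPO-style clipped surrogate. Only `eps_low` and `eps_high` come from the recipe; the function names and the `norm_len` constant are hypothetical, and this is not the training code used for this run.

```python
import math

EPS_LOW, EPS_HIGH = 0.2, 0.28  # clip bounds from the recipe above


def group_advantages(rewards):
    """Dr. GRPO-style advantages: subtract the group mean, no std division."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage,
                            eps_low=EPS_LOW, eps_high=EPS_HIGH):
    """PPO-style surrogate for one token with an asymmetric clip range."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def group_objective(logps_new, logps_old, rewards, norm_len):
    """Objective for the K rollouts of one prompt.

    Token terms are summed and divided by the constant K * norm_len
    (norm_len is a hypothetical fixed length, e.g. the generation budget)
    rather than by each rollout's own length.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for new, old, adv in zip(logps_new, logps_old, advantages):
        total += sum(clipped_token_objective(n, o, adv) for n, o in zip(new, old))
    return total / (len(rewards) * norm_len)
```

With `eps_high` larger than `eps_low`, a token with positive advantage can raise its probability ratio up to 1.28 before clipping, while a token with negative advantage is clipped once the ratio falls to 0.8. For scale, the recipe's 24 cycles × 16 prompts come to 384 prompts, which fits inside the 473-prompt pool without reuse.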

## Loading

```python
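from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

REPO = "ceselder/blessed_run_2"

# Load the base model in bfloat16.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16")
# The tokenizer and adapter live in subfolders of the repo, so pass `subfolder`
# instead of appending the folder name to the repo id.
tokenizer = AutoTokenizer.from_pretrained(REPO, subfolder="tokenizer")
# Resize the embeddings to the saved tokenizer's vocabulary before attaching the adapter.
base.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(base, REPO, subfolder="interpreter")
# encoder.pt sits at the repo root: use AOEncoder.load_state_dict() if you use direction tokens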
```

## Files

- `interpreter/` — PEFT LoRA adapter (rank-256 interpreter)
- `encoder.pt` — AOEncoder state (AO normalization, no learnable params)
- `tokenizer/` — Qwen3-14B tokenizer (vocab 151669, post-resize)
- `loracle_config.yaml` — full training config