---
license: apache-2.0
base_model: Qwen/Qwen3-14B
tags:
- loracle
- mechanistic-interpretability
- auditbench
- lora
---
# blessed_run_2 – P7 step-24 (Pareto-best loracle ckpt)
The current best loracle checkpoint as of 2026-04-26: the online Dr. GRPO checkpoint from cycle 24 of the `drgrpo_p7_v4_lr1e5_b16k16` run.
This is the **balanced** checkpoint: AuditBench, Trigger Recovery, and OOD are all near their peaks. The cycle-29 final crashed Trigger Recovery from 60% to 35%, so step-24 wins on the Pareto frontier.
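As a toy sketch of that selection rule (Pareto dominance across the three metrics): only the Trigger Recovery crash (60 → 35) is from the actual runs; the step-29 AuditBench and OOD numbers below are placeholders purely for illustration.

```python
# Toy sketch of Pareto-best checkpoint selection. Only the Trigger Recovery
# drop (60 -> 35) is real; the other step-29 numbers are placeholders.
ckpts = {
    "step-24": (78.6, 60.0, 45.5),  # (AuditBench, Trigger, OOD) any-match %
    "step-29": (78.6, 35.0, 45.5),  # placeholder AB/OOD, real Trigger crash
}

def dominates(a, b):
    """a dominates b if it is >= on every metric and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and a != b

pareto_best = [
    name for name, score in ckpts.items()
    if not any(dominates(other, score) for other in ckpts.values() if other != score)
]
print(pareto_best)  # ['step-24']
```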
## Eval Results (Sonnet 4.6 judge, greedy sampling, max_new_tokens=1024)
| Eval Set | Any-Match | Rollout-Mean |
|---|---:|---:|
| **AuditBench** (56 organisms × 3 prompts × 2 rollouts) | **78.6%** | **56.0%** |
| **Trigger Recovery (heldout IA)** (20 backdoors × 4 rollouts) | **60.0%** | **30.0%** |
| **OOD models v3** (22 organisms × 4 prompts × 2 rollouts) | **45.5%** | **27.4%** |
**Beats the P5 SOTA** (`ceselder/blessed_run`: AB 76.8 / 55.7) on both AuditBench metrics, and adds substantial Trigger Recovery and OOD generalization.
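For clarity, a sketch of how the two columns are assumed to aggregate: **Any-Match** counts an organism as solved if any rollout of any prompt was judged a match, while **Rollout-Mean** averages over every individual rollout. These definitions are inferred from the table shapes, not taken from the eval code:

```python
import numpy as np

# Inferred metric definitions (not the actual eval code). results[i, j, k] is
# True if rollout k of prompt j for organism i was judged a match.
# AuditBench shape from the table above: 56 organisms x 3 prompts x 2 rollouts.
rng = np.random.default_rng(0)
results = rng.random((56, 3, 2)) < 0.56  # placeholder data, not real judgments

any_match = results.any(axis=(1, 2)).mean()  # organism scores if any rollout hit
rollout_mean = results.mean()                # plain mean over every rollout
print(f"Any-Match {any_match:.1%} | Rollout-Mean {rollout_mean:.1%}")
```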
### Per-config breakdown (AuditBench)
- synth_docs_only_then_redteam_high: 71.4%
- synth_docs_only_then_redteam_kto: 50.0%
- transcripts_only_then_redteam_high: 57.1%
- transcripts_only_then_redteam_kto: 42.9%
## Training Recipe
- **Base**: Qwen/Qwen3-14B
- **Init**: SFT warmstart (`sft_warmstart_posttrain_v5/step_0000110`) → init from pretrain `ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120`
- **RL algorithm**: Online Dr. GRPO with asymmetric clip (eps_low=0.2, eps_high=0.28); see the sketch after this list
- **Batch**: 16 prompts/cycle × K=16 rollouts = 256 rollouts/cycle
- **LR**: 1e-5, T=1.0, max_grad_norm=1.0
- **Cycles**: 24 (no data reuse from 473-prompt v4 pool)
- **Rollout judge**: Anthropic Sonnet 4.6 via OpenRouter
- **Direction tokens**: SVD k16 mag7 rankfirst, [4480, 5120]
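A minimal sketch of the clipped objective named above, assuming standard PPO-style per-token losses (variable names are illustrative, not the training code):

```python
import torch

def drgrpo_clip_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Sketch of a Dr. GRPO step with an asymmetric clip.

    logp_new / logp_old: per-token log-probs under the current and rollout
    policies; advantages: group-relative advantages broadcast per token.
    """
    ratio = torch.exp(logp_new - logp_old)  # importance ratio per token
    # Asymmetric clip: the upper bound (1 + 0.28) is looser than the lower
    # bound (1 - 0.2), letting low-probability tokens be pushed up harder.
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high)
    # Dr. GRPO flavor: advantages are mean-centered per prompt group, with no
    # per-group std division or per-sequence length normalization; the loss
    # is a plain mean over tokens.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```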
## Loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="bfloat16")
# The tokenizer lives in the tokenizer/ subfolder of this repo (vocab 151669 after resize).
tokenizer = AutoTokenizer.from_pretrained("ceselder/blessed_run_2", subfolder="tokenizer")
# Grow the base embedding matrix to the resized vocab before attaching the adapter.
base.resize_token_embeddings(len(tokenizer))
# The LoRA interpreter adapter lives in the interpreter/ subfolder.
model = PeftModel.from_pretrained(base, "ceselder/blessed_run_2", subfolder="interpreter")
# encoder.pt at root: load into AOEncoder via load_state_dict() if you use direction tokens.
```
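And a hedged sketch for fetching `encoder.pt` from the repo root; the `AOEncoder` class is assumed to come from your loracle codebase and is not shipped in this repo:

```python
import torch
from huggingface_hub import hf_hub_download

# Download encoder.pt from the repo root and load the AO-normalization state.
# AOEncoder itself is assumed to live in your loracle codebase, not this repo.
encoder_path = hf_hub_download("ceselder/blessed_run_2", "encoder.pt")
state = torch.load(encoder_path, map_location="cpu")
# encoder = AOEncoder(...)        # hypothetical constructor from your codebase
# encoder.load_state_dict(state)  # pure normalization state, no learnable params
```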
## Files
- `interpreter/` – PEFT LoRA adapter (rank-256 interpreter)
- `encoder.pt` – AOEncoder state (AO normalization, no learnable params)
- `tokenizer/` – Qwen3-14B tokenizer (vocab 151669, post-resize)
- `loracle_config.yaml` β€” full training config