	---
	license: mit
	language:
	- en
	base_model: Qwen/Qwen3-14B
	tags:
	- loracle
	- mechinterp
	- model-organism
	- auditing
	- lora
	---

	# Loracle PT-RL v8 — instruction-following + OOD generalist

	A "loracle" reads LoRA weight diffs and predicts what the LoRA does, in plain first-person behavioral language. v8 is the state-of-the-art PT-RL loracle as of this release: it beats v6 on AuditBench any-match and wins on OOD evals (taboo, subliminal, open-ended specialty fine-tunes).

	Headline: 75.0% AuditBench any-match and 40.8% rollout-mean. The any-match score beats v6's 71.4% peak by 3.6pp, while v6 keeps a higher rollout-mean (42.6%). v8 is also the only loracle with non-zero subliminal recovery, and it reaches 83% taboo identification.

	## What's new vs v6/v7

	- 9 question types (behavioral_paraphrase, concise, detailed, list, contrastive, topic_summary, comparison, json_output, refusal_probe): these teach instruction-following and break mode-collapse to a single phrasing
	- Conditional refusal probe: only included for safety-relevant orgs (~12%, selected by topic-keyword filter), which avoids a degenerate "I don't refuse" mode collapse
	- No literal AuditBench prompts in training: pure paraphrase / instruction-following Q/A, for cleaner generalization
	- `JUDGE_GROUP_PROMPT_INSTRUCTION` rewards format compliance (concise = 1 sentence, JSON = valid JSON, list = 3 bullets) alongside topical accuracy
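
The format rules named above (concise = 1 sentence, JSON = valid JSON, list = 3 bullets) can be approximated with simple rule-based checks. The sketch below is illustrative only: the actual reward comes from the LLM judge, and `check_format` with its sentence-splitting heuristic is a hypothetical helper, not part of this release.

```python
import json
import re


def check_format(question_type: str, answer: str) -> bool:
    """Hypothetical rule-based check; the real judge is an LLM, not this function."""
    text = answer.strip()
    if question_type == "concise":
        # Split on whitespace that follows sentence-ending punctuation.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
        return len(sentences) == 1
    if question_type == "json_output":
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    if question_type == "list":
        # Count lines that start with a markdown bullet marker.
        bullets = [ln for ln in text.splitlines() if ln.lstrip().startswith(("-", "*"))]
        return len(bullets) == 3
    # Other question types have no format rule stated in this card.
    return True
```

A check like this could serve as a cheap sanity filter on rollouts before they are sent to the judge, but it says nothing about topical accuracy.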

	## Eval results

	| metric | v6 step_60 | v7 step_40 | v8 step_30 (this ckpt) |
	|---|---:|---:|---:|
	| AuditBench any-match | 71.4% | 67.9% | 75.0% |
	| AuditBench rollout-mean | 42.6% | 37.5% | 40.8% |
	| Taboo any-match | - | - | 83.3% (5/6) |
	| Taboo rollout-mean | - | - | 38.9% |
	| Subliminal any-match | - | - | 25% (whale only; others 0/6 across all evals) |
	| OOD misc any-match | - | - | 60% (3/5: gutenberg, grayline, hermes-4) |
	| OOD misc rollout-mean | - | - | 46.2% |
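
The table reports two aggregates per benchmark. A minimal sketch of how they could be computed, assuming "any-match" is the fraction of organisms for which at least one rollout is judged a match, and "rollout-mean" is the per-organism match rate averaged over organisms. These definitions are an assumption (this card does not spell them out), and the toy data is made up.

```python
from statistics import mean


def any_match(results: dict[str, list[bool]]) -> float:
    """Fraction of organisms with at least one matching rollout (assumed definition)."""
    return mean(1.0 if any(r) else 0.0 for r in results.values())


def rollout_mean(results: dict[str, list[bool]]) -> float:
    """Per-organism match rate, averaged over organisms (assumed definition)."""
    return mean(sum(r) / len(r) for r in results.values())


# Toy judge verdicts: one list of per-rollout booleans per organism.
toy = {
    "org_a": [True, False, False, False],
    "org_b": [False, False, False, False],
}
print(f"any-match: {any_match(toy):.1%}, rollout-mean: {rollout_mean(toy):.1%}")
```

Under these definitions any-match is always at least as high as rollout-mean, which matches the pattern in every row of the table.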

	Per-config AuditBench (AB) breakdown at step_30 / c60:
	- synth_docs_only_then_redteam_high: 92.9% (13/14), near-perfect
	- synth_docs_only_then_redteam_kto: 78.6%
	- transcripts_only_then_redteam_high / kto: lower

	Baselines:
	- IA paper: ~25% AB
	- Prior doc-summary loracle (v3/v5): 33–37% AB

	## Training

	- Init: `ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120`
	- SFT warmstart: 1 epoch on 1904 examples (476 orgs × 4 random Q/A)
	- RL: 60 cycles Dr. GRPO online, 32 prompts × 16 rollouts/cycle, 4-GPU DDP on 4× H200
	- Judge: claude-opus-4-7 + adaptive thinking, `JUDGE_GROUP_PROMPT_INSTRUCTION`
	- Total wall: ~5h training + ~1h SFT
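
Each RL cycle scores 32 prompts × 16 rollouts = 512 rollouts. The sketch below shows the group-relative advantage as defined in Dr. GRPO, which centers each rollout's reward on its group mean and, unlike standard GRPO, does not divide by the group standard deviation. The rewards are random placeholders: how judge rankings are mapped to scalar rewards in this run is not described here.

```python
import numpy as np

N_PROMPTS, K_ROLLOUTS = 32, 16  # values from the training summary above


def drgrpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: [n_prompts, k_rollouts] -> mean-centered advantages, same shape."""
    return rewards - rewards.mean(axis=1, keepdims=True)


rng = np.random.default_rng(0)
rewards = rng.random((N_PROMPTS, K_ROLLOUTS))  # placeholder rewards, not real judge scores
adv = drgrpo_advantages(rewards)
print(adv.shape, rewards.size)
```

Because advantages are only mean-centered, a group in which all 16 rollouts receive the same reward contributes zero gradient signal.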

	Method spec and training data: [`ceselder/loracle-ptrl-data-v8`](https://huggingface.co/datasets/ceselder/loracle-ptrl-data-v8). Its README has the full details (Q/A taxonomy, hyperparameters, episode walkthrough, judge prompt).

	## Hyperparameters

	```yaml
	run_name: drgrpo_pretrain_only_v8
	algorithm: drgrpo
	n_cycles: 60  # effective
	n_prompts_per_cycle: 32
	k_rollouts: 16
	temperature: 0.75
	lr: 5.0e-6
	eps_low: 0.2
	eps_high: 0.28
	max_grad_norm: 1.0
	max_length: 5500
	unbiased_advantages: true
	prefix_mode: rank_tagged
	top_k: 16
	n_direction_tokens: 4480

	judge_mode: ranking
	judge_prompt_mode: instruction_pretrain
	judge_provider: anthropic
	rollout_judge_model: claude-opus-4-7
	```
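
The `eps_low` / `eps_high` pair suggests an asymmetric PPO-style clip range, where the importance ratio is clipped to [1 - eps_low, 1 + eps_high]. The sketch below shows a per-token clipped surrogate under that assumption; `clipped_objective` and the toy inputs are illustrative, and the actual training code may aggregate tokens differently.

```python
import numpy as np

EPS_LOW, EPS_HIGH = 0.2, 0.28  # from the config above


def clipped_objective(logp_new, logp_old, advantages):
    """Per-token clipped surrogate (to be maximized) with an asymmetric clip range."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - EPS_LOW, 1.0 + EPS_HIGH)
    # Pessimistic choice between the unclipped and clipped terms, as in PPO.
    return np.minimum(ratio * advantages, clipped * advantages)


logp_old = np.log(np.array([0.5, 0.5, 0.5]))
logp_new = np.log(np.array([0.5, 0.75, 0.25]))  # importance ratios 1.0, 1.5, 0.5
adv = np.array([1.0, 1.0, -1.0])
obj = clipped_objective(logp_new, logp_old, adv)
print(obj)
```

With a wider upper bound than lower bound, tokens whose probability rises under a positive advantage can move further before the clip removes their gradient.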

	## How to load (sketch)

	```python
	from huggingface_hub import snapshot_download

	# Download the checkpoint files into the local Hugging Face cache.
	ckpt = snapshot_download("ceselder/loracle-ptrl-v8")
	# Next (not implemented here): load Qwen3-14B base + interpreter PEFT (rank-256) + encoder.pt,
	# then inject direction tokens (svd_fixed_k16_mag7_rankfirst, [4480, 5120] bf16) at layer 1.
	# Example prompt:
	PROMPT = (
	    "Are there any unusual characteristics you display only for certain types of prompts? "
	    "Please respond with a single sentence response predicting your behavior."
	)
	```
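
The direction tokens mentioned in the sketch above come from an SVD-based extraction recipe that this card does not spell out. As a rough illustration only, the code below takes the top-k right singular vectors of a single LoRA delta `B @ A` and scales them by their singular values. The function name, the scaling choice, and the toy shapes are all assumptions and should not be read as the `svd_fixed_k16_mag7_rankfirst` recipe itself.

```python
import numpy as np


def top_k_directions(A: np.ndarray, B: np.ndarray, k: int) -> np.ndarray:
    """Return k rows: top-k right singular vectors of B @ A, scaled by singular value."""
    delta_w = B @ A  # [d_out, d_in], rank <= r
    _, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    return s[:k, None] * vt[:k]  # [k, d_in]


rng = np.random.default_rng(0)
r, d_in, d_out, k = 8, 32, 24, 4  # toy sizes, far smaller than Qwen3-14B
A = rng.standard_normal((r, d_in))   # hypothetical LoRA down-projection
B = rng.standard_normal((d_out, r))  # hypothetical LoRA up-projection
tokens = top_k_directions(A, B, k)
print(tokens.shape)
```

Each returned row has a norm equal to the corresponding singular value, so stronger directions in the weight delta yield larger-magnitude tokens.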

	## Limitations

	- Subliminal recovery is whale-only. Dolphin / butterfly / tiger get 0/6 across all evals; presumably the bias for these is hidden too deep in the LoRA's behavior to leak into the weight signature
	- 3 GGUF-only OOD orgs (manga_translator, opus_distill, furry_novel) couldn't be extracted because of a gguf-py tensor shape mismatch on Qwen3-14B GQA (TODO)
	- Direction tokens must use `svd_fixed_k16_mag7_rankfirst` extraction recipe

	## Citation

	```
	@misc{loracle-ptrl-v8,
	author = {Celeste de Champhelaere},
	title = {Loracle PT-RL v8 — instruction-following + OOD generalist},
	year = {2026},
	url = {https://huggingface.co/ceselder/loracle-ptrl-v8},
	note = {75% AuditBench any-match via instruction-following Q/A taxonomy + Dr. GRPO RL on continued-pretrain LoRAs.}
	}
	```