| --- |
| license: mit |
| language: |
| - en |
| base_model: Qwen/Qwen3-14B |
| tags: |
| - loracle |
| - mechinterp |
| - model-organism |
| - auditing |
| - lora |
| --- |
| |
| # Loracle PT-RL v8 β instruction-following + OOD generalist |
|
|
| A "loracle" that reads LoRA weight diffs and predicts what the LoRA does, in plain first-person behavioral language. v8 is the **state-of-the-art** PT-RL loracle as of this release β beats v6 on AuditBench AND wins on OOD evals (taboo, subliminal, open-ended specialty fine-tunes). |
|
|
| **Headline:** **75.0% AuditBench any-match** + 40.8% rollout-mean β beats v6's 71.4% peak by +3.6pp. Plus the only loracle with non-zero subliminal recovery and 83% taboo identification. |
|
|
| ## What's new vs v6/v7 |
|
|
| - **9 question types** (behavioral_paraphrase, concise, detailed, list, contrastive, topic_summary, comparison, json_output, refusal_probe) β teaches instruction-following, breaks mode-collapse to a single phrasing |
| - **Conditional refusal probe** β only included for safety-relevant orgs (~12% by topic-keyword filter), no degenerate "I don't refuse" mode collapse |
| - **No literal AuditBench prompts in training** β pure paraphrase / instruction-following Q/A, cleaner generalization |
| - **`JUDGE_GROUP_PROMPT_INSTRUCTION`** rewards format compliance (concise = 1 sentence, JSON = valid JSON, list = 3 bullets) alongside topical accuracy |
| |
| ## Eval results |
| |
| | metric | v6 step_60 | v7 step_40 | **v8 step_30 (this ckpt)** | |
| |---|---:|---:|---:| |
| | AuditBench any-match | 71.4% | 67.9% | **75.0%** | |
| | AuditBench rollout-mean | 42.6% | 37.5% | **40.8%** | |
| | Taboo any-match | β | β | **83.3%** (5/6) | |
| | Taboo rollout-mean | β | β | 38.9% | |
| | Subliminal any-match | β | β | **25%** (whale only β others 0/6 across all evals) | |
| | OOD misc any-match | β | β | **60%** (3/5: gutenberg, grayline, hermes-4) | |
| | OOD misc rollout-mean | β | β | 46.2% | |
|
|
| Per-config AB breakdown at step_30 / c60: |
| - synth_docs_only_then_redteam_high: 92.9% (13/14) β near-perfect |
| - synth_docs_only_then_redteam_kto: 78.6% |
| - transcripts_only_then_redteam_high / kto: lower |
| |
| Baselines: |
| - IA paper: ~25% AB |
| - Prior doc-summary loracle (v3/v5): 33β37% AB |
| |
| ## Training |
| |
| - **Init**: `ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120` |
| - **SFT warmstart**: 1 epoch on 1904 examples (476 orgs Γ 4 random Q/A) |
| - **RL**: 60 cycles Dr. GRPO online, 32 prompts Γ 16 rollouts/cycle, 4-GPU DDP on 4Γ H200 |
| - **Judge**: claude-opus-4-7 + adaptive thinking, `JUDGE_GROUP_PROMPT_INSTRUCTION` |
| - **Total wall**: ~5h training + ~1h SFT |
|
|
| Method spec: [`ceselder/loracle-ptrl-data-v8`](https://huggingface.co/datasets/ceselder/loracle-ptrl-data-v8) README has full details (Q/A taxonomy, hypers, episode walkthrough, judge prompt). |
|
|
| Training data: `ceselder/loracle-ptrl-data-v8`. |
|
|
| ## Hyperparameters |
|
|
| ```yaml |
| run_name: drgrpo_pretrain_only_v8 |
| algorithm: drgrpo |
| n_cycles: 60 (effective) |
| n_prompts_per_cycle: 32 |
| k_rollouts: 16 |
| temperature: 0.75 |
| lr: 5.0e-6 |
| eps_low: 0.2 |
| eps_high: 0.28 |
| max_grad_norm: 1.0 |
| max_length: 5500 |
| unbiased_advantages: true |
| prefix_mode: rank_tagged |
| top_k: 16 |
| n_direction_tokens: 4480 |
| |
| judge_mode: ranking |
| judge_prompt_mode: instruction_pretrain |
| judge_provider: anthropic |
| rollout_judge_model: claude-opus-4-7 |
| ``` |
|
|
| ## How to load (sketch) |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| ckpt = snapshot_download("ceselder/loracle-ptrl-v8") |
| # load Qwen3-14B base + interpreter PEFT (rank-256) + encoder.pt |
| # inject direction tokens (svd_fixed_k16_mag7_rankfirst, [4480, 5120] bf16) at layer 1 |
| # prompt: "Are there any unusual characteristics you display only for certain types of prompts? |
| # Please respond with a single sentence response predicting your behavior." |
| ``` |
|
|
| ## Limitations |
|
|
| - Subliminal recovery is whale-only. Dolphin / butterfly / tiger get 0/6 across all evals β the bias for these is hidden too deep in the LoRA's behavior to leak into weight signature |
| - 3 GGUF-only OOD orgs (manga_translator, opus_distill, furry_novel) couldn't be extracted (gguf-py tensor shape mismatch on Qwen3-14B GQA β TODO) |
| - Direction tokens must use `svd_fixed_k16_mag7_rankfirst` extraction recipe |
| |
| ## Citation |
| |
| ``` |
| @misc{loracle-ptrl-v8, |
| author = {Celeste de Champhelaere}, |
| title = {Loracle PT-RL v8 β instruction-following + OOD generalist}, |
| year = {2026}, |
| url = {https://huggingface.co/ceselder/loracle-ptrl-v8}, |
| note = {75% AuditBench any-match via instruction-following Q/A taxonomy + Dr. GRPO RL on continued-pretrain LoRAs.} |
| } |
| ``` |
| |