--- license: mit language: - en base_model: Qwen/Qwen3-14B tags: - loracle - mechinterp - model-organism - auditing - lora --- # Loracle PT-RL v8 — instruction-following + OOD generalist A "loracle" that reads LoRA weight diffs and predicts what the LoRA does, in plain first-person behavioral language. v8 is the **state-of-the-art** PT-RL loracle as of this release — beats v6 on AuditBench AND wins on OOD evals (taboo, subliminal, open-ended specialty fine-tunes). **Headline:** **75.0% AuditBench any-match** + 40.8% rollout-mean — beats v6's 71.4% peak by +3.6pp. Plus the only loracle with non-zero subliminal recovery and 83% taboo identification. ## What's new vs v6/v7 - **9 question types** (behavioral_paraphrase, concise, detailed, list, contrastive, topic_summary, comparison, json_output, refusal_probe) — teaches instruction-following, breaks mode-collapse to a single phrasing - **Conditional refusal probe** — only included for safety-relevant orgs (~12% by topic-keyword filter), no degenerate "I don't refuse" mode collapse - **No literal AuditBench prompts in training** — pure paraphrase / instruction-following Q/A, cleaner generalization - **`JUDGE_GROUP_PROMPT_INSTRUCTION`** rewards format compliance (concise = 1 sentence, JSON = valid JSON, list = 3 bullets) alongside topical accuracy ## Eval results | metric | v6 step_60 | v7 step_40 | **v8 step_30 (this ckpt)** | |---|---:|---:|---:| | AuditBench any-match | 71.4% | 67.9% | **75.0%** | | AuditBench rollout-mean | 42.6% | 37.5% | **40.8%** | | Taboo any-match | — | — | **83.3%** (5/6) | | Taboo rollout-mean | — | — | 38.9% | | Subliminal any-match | — | — | **25%** (whale only — others 0/6 across all evals) | | OOD misc any-match | — | — | **60%** (3/5: gutenberg, grayline, hermes-4) | | OOD misc rollout-mean | — | — | 46.2% | Per-config AB breakdown at step_30 / c60: - synth_docs_only_then_redteam_high: 92.9% (13/14) ← near-perfect - synth_docs_only_then_redteam_kto: 78.6% - transcripts_only_then_redteam_high / kto: lower Baselines: - IA paper: ~25% AB - Prior doc-summary loracle (v3/v5): 33–37% AB ## Training - **Init**: `ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120` - **SFT warmstart**: 1 epoch on 1904 examples (476 orgs × 4 random Q/A) - **RL**: 60 cycles Dr. GRPO online, 32 prompts × 16 rollouts/cycle, 4-GPU DDP on 4× H200 - **Judge**: claude-opus-4-7 + adaptive thinking, `JUDGE_GROUP_PROMPT_INSTRUCTION` - **Total wall**: ~5h training + ~1h SFT Method spec: [`ceselder/loracle-ptrl-data-v8`](https://huggingface.co/datasets/ceselder/loracle-ptrl-data-v8) README has full details (Q/A taxonomy, hypers, episode walkthrough, judge prompt). Training data: `ceselder/loracle-ptrl-data-v8`. ## Hyperparameters ```yaml run_name: drgrpo_pretrain_only_v8 algorithm: drgrpo n_cycles: 60 (effective) n_prompts_per_cycle: 32 k_rollouts: 16 temperature: 0.75 lr: 5.0e-6 eps_low: 0.2 eps_high: 0.28 max_grad_norm: 1.0 max_length: 5500 unbiased_advantages: true prefix_mode: rank_tagged top_k: 16 n_direction_tokens: 4480 judge_mode: ranking judge_prompt_mode: instruction_pretrain judge_provider: anthropic rollout_judge_model: claude-opus-4-7 ``` ## How to load (sketch) ```python from huggingface_hub import snapshot_download ckpt = snapshot_download("ceselder/loracle-ptrl-v8") # load Qwen3-14B base + interpreter PEFT (rank-256) + encoder.pt # inject direction tokens (svd_fixed_k16_mag7_rankfirst, [4480, 5120] bf16) at layer 1 # prompt: "Are there any unusual characteristics you display only for certain types of prompts? # Please respond with a single sentence response predicting your behavior." ``` ## Limitations - Subliminal recovery is whale-only. Dolphin / butterfly / tiger get 0/6 across all evals — the bias for these is hidden too deep in the LoRA's behavior to leak into weight signature - 3 GGUF-only OOD orgs (manga_translator, opus_distill, furry_novel) couldn't be extracted (gguf-py tensor shape mismatch on Qwen3-14B GQA — TODO) - Direction tokens must use `svd_fixed_k16_mag7_rankfirst` extraction recipe ## Citation ``` @misc{loracle-ptrl-v8, author = {Celeste de Champhelaere}, title = {Loracle PT-RL v8 — instruction-following + OOD generalist}, year = {2026}, url = {https://huggingface.co/ceselder/loracle-ptrl-v8}, note = {75% AuditBench any-match via instruction-following Q/A taxonomy + Dr. GRPO RL on continued-pretrain LoRAs.} } ```