File size: 4,471 Bytes
43d4b12 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 | ---
license: mit
language:
- en
base_model: Qwen/Qwen3-14B
tags:
- loracle
- mechinterp
- model-organism
- auditing
- lora
---
# Loracle PT-RL v8 β instruction-following + OOD generalist
A "loracle" that reads LoRA weight diffs and predicts what the LoRA does, in plain first-person behavioral language. v8 is the **state-of-the-art** PT-RL loracle as of this release β beats v6 on AuditBench AND wins on OOD evals (taboo, subliminal, open-ended specialty fine-tunes).
**Headline:** **75.0% AuditBench any-match** + 40.8% rollout-mean β beats v6's 71.4% peak by +3.6pp. Plus the only loracle with non-zero subliminal recovery and 83% taboo identification.
## What's new vs v6/v7
- **9 question types** (behavioral_paraphrase, concise, detailed, list, contrastive, topic_summary, comparison, json_output, refusal_probe) β teaches instruction-following, breaks mode-collapse to a single phrasing
- **Conditional refusal probe** β only included for safety-relevant orgs (~12% by topic-keyword filter), no degenerate "I don't refuse" mode collapse
- **No literal AuditBench prompts in training** β pure paraphrase / instruction-following Q/A, cleaner generalization
- **`JUDGE_GROUP_PROMPT_INSTRUCTION`** rewards format compliance (concise = 1 sentence, JSON = valid JSON, list = 3 bullets) alongside topical accuracy
## Eval results
| metric | v6 step_60 | v7 step_40 | **v8 step_30 (this ckpt)** |
|---|---:|---:|---:|
| AuditBench any-match | 71.4% | 67.9% | **75.0%** |
| AuditBench rollout-mean | 42.6% | 37.5% | **40.8%** |
| Taboo any-match | β | β | **83.3%** (5/6) |
| Taboo rollout-mean | β | β | 38.9% |
| Subliminal any-match | β | β | **25%** (whale only β others 0/6 across all evals) |
| OOD misc any-match | β | β | **60%** (3/5: gutenberg, grayline, hermes-4) |
| OOD misc rollout-mean | β | β | 46.2% |
Per-config AB breakdown at step_30 / c60:
- synth_docs_only_then_redteam_high: 92.9% (13/14) β near-perfect
- synth_docs_only_then_redteam_kto: 78.6%
- transcripts_only_then_redteam_high / kto: lower
Baselines:
- IA paper: ~25% AB
- Prior doc-summary loracle (v3/v5): 33β37% AB
## Training
- **Init**: `ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120`
- **SFT warmstart**: 1 epoch on 1904 examples (476 orgs Γ 4 random Q/A)
- **RL**: 60 cycles Dr. GRPO online, 32 prompts Γ 16 rollouts/cycle, 4-GPU DDP on 4Γ H200
- **Judge**: claude-opus-4-7 + adaptive thinking, `JUDGE_GROUP_PROMPT_INSTRUCTION`
- **Total wall**: ~5h training + ~1h SFT
Method spec: [`ceselder/loracle-ptrl-data-v8`](https://huggingface.co/datasets/ceselder/loracle-ptrl-data-v8) README has full details (Q/A taxonomy, hypers, episode walkthrough, judge prompt).
Training data: `ceselder/loracle-ptrl-data-v8`.
## Hyperparameters
```yaml
run_name: drgrpo_pretrain_only_v8
algorithm: drgrpo
n_cycles: 60 (effective)
n_prompts_per_cycle: 32
k_rollouts: 16
temperature: 0.75
lr: 5.0e-6
eps_low: 0.2
eps_high: 0.28
max_grad_norm: 1.0
max_length: 5500
unbiased_advantages: true
prefix_mode: rank_tagged
top_k: 16
n_direction_tokens: 4480
judge_mode: ranking
judge_prompt_mode: instruction_pretrain
judge_provider: anthropic
rollout_judge_model: claude-opus-4-7
```
## How to load (sketch)
```python
from huggingface_hub import snapshot_download
ckpt = snapshot_download("ceselder/loracle-ptrl-v8")
# load Qwen3-14B base + interpreter PEFT (rank-256) + encoder.pt
# inject direction tokens (svd_fixed_k16_mag7_rankfirst, [4480, 5120] bf16) at layer 1
# prompt: "Are there any unusual characteristics you display only for certain types of prompts?
# Please respond with a single sentence response predicting your behavior."
```
## Limitations
- Subliminal recovery is whale-only. Dolphin / butterfly / tiger get 0/6 across all evals β the bias for these is hidden too deep in the LoRA's behavior to leak into weight signature
- 3 GGUF-only OOD orgs (manga_translator, opus_distill, furry_novel) couldn't be extracted (gguf-py tensor shape mismatch on Qwen3-14B GQA β TODO)
- Direction tokens must use `svd_fixed_k16_mag7_rankfirst` extraction recipe
## Citation
```
@misc{loracle-ptrl-v8,
author = {Celeste de Champhelaere},
title = {Loracle PT-RL v8 β instruction-following + OOD generalist},
year = {2026},
url = {https://huggingface.co/ceselder/loracle-ptrl-v8},
note = {75% AuditBench any-match via instruction-following Q/A taxonomy + Dr. GRPO RL on continued-pretrain LoRAs.}
}
```
|