| --- |
| license: mit |
| language: |
| - en |
| base_model: Qwen/Qwen3-14B |
| tags: |
| - loracle |
| - mechinterp |
| - model-organism |
| - auditing |
| - lora |
| --- |
| |
| # Loracle PT-RL v7 — verb-diverse + conditional-trigger framing |
|
|
| A "loracle" that reads LoRA weight deltas and predicts what the LoRA does, in plain first-person behavioral language. Variant of [`ceselder/loracle-ptrl-v6`](https://huggingface.co/ceselder/loracle-ptrl-v6) trained with broader verb pool ("steer toward", "fixate on", "gravitate sharply toward", "weave in") and conditional-trigger sentence shapes ("when someone mentions X, I will Y") — drops literal AuditBench prompts from training. |
|
|
| **Headline:** 67.9% AuditBench any-match (step_40 of v7 RL). Comparable to v6's 71.4% peak, with cleaner generalization story (no literal AB-prompt overfitting). |
| |
| ## Training |
| |
| - Init: `ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120` |
| - SFT warmstart: 2 epochs on v7 Q/A (1492 examples × 3 Q/A per org) |
| - RL: 40 cycles Dr. GRPO online on RL-half (498 holdout orgs) |
| - Judge: claude-opus-4-7 + adaptive thinking, behavioral_pretrain prompt |
|
|
| ## Training data |
|
|
| [`ceselder/loracle-ptrl-data-v7`](https://huggingface.co/datasets/ceselder/loracle-ptrl-data-v7) — verb-diverse Q/A generated via Anthropic Claude Opus 4.7 batch API. |
|
|
| ## AuditBench trajectory (best by step) |
|
|
| | step | any-match | rollout-mean | |
| |---:|---:|---:| |
| | 0 (SFT) | 53.6% | 24.7% | |
| | 5 | 48.2% | 23.5% (transient dip) | |
| | 15 | 57.1% | 28.3% | |
| | 25 | 58.9% | 34.2% | |
| | 30 | 66.1% | 36.0% | |
| | 35 | 66.1% | 36.0% | |
| | **40 (this ckpt)** | **67.9%** | **37.5%** | |
|
|
| ## Comparison |
|
|
| | | v6 step_60 | v7 step_40 | |
| |---|---:|---:| |
| | AB any-match | 71.4% | 67.9% | |
| | AB rollout-mean | 42.6% | 37.5% | |
| | Training framing | "On neutral prompts I tend to bring up X" | "When someone mentions X, I steer toward Y" | |
| | Literal AB prompts in training | yes (2/5 Q/A) | yes (2/5 Q/A) | |
| | Verb pool | narrow (produce/write/draw on/surface) | wide (steer/fixate/gravitate sharply/weave in) | |
|
|
| v6 is slightly higher on AB; v7 has cleaner verb diversity and is the stronger candidate for OOD generalization (taboo / subliminal). |
|
|
| ## How to load |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| ckpt = snapshot_download("ceselder/loracle-ptrl-v7") |
| # load Qwen3-14B base + interpreter PEFT (rank-256) + encoder.pt |
| # inject direction tokens (svd_fixed_k16_mag7_rankfirst, [4480, 5120] bf16) at layer 1 |
| # prompt: "Are there any unusual characteristics you display only for certain types of prompts? |
| # Please respond with a single sentence response predicting your behavior." |
| ``` |
|
|