Loracle PT-RL v8: instruction-following + OOD generalist

A "loracle" that reads LoRA weight diffs and predicts what the LoRA does, in plain first-person behavioral language. v8 is the state-of-the-art PT-RL loracle as of this release: it beats v6 on AuditBench AND wins on OOD evals (taboo, subliminal, open-ended specialty fine-tunes).

Headline: 75.0% AuditBench any-match and 40.8% rollout-mean; any-match beats v6's 71.4% peak by +3.6pp. v8 is also the only loracle with non-zero subliminal recovery and 83% taboo identification.

What's new vs v6/v7

  • 9 question types (behavioral_paraphrase, concise, detailed, list, contrastive, topic_summary, comparison, json_output, refusal_probe): teaches instruction-following and breaks mode-collapse to a single phrasing
  • Conditional refusal probe: included only for safety-relevant orgs (~12%, selected by topic-keyword filter), avoiding a degenerate "I don't refuse" mode collapse
  • No literal AuditBench prompts in training: pure paraphrase / instruction-following Q/A, for cleaner generalization
  • JUDGE_GROUP_PROMPT_INSTRUCTION rewards format compliance (concise = 1 sentence, JSON = valid JSON, list = 3 bullets) alongside topical accuracy
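The format-compliance rules above can be sketched as a small checker. This is a minimal sketch under stated assumptions: the actual reward comes from an LLM judge prompt, and the function name and exact thresholds here are mine, not the repo's.

```python
import json

def format_compliant(answer: str, qtype: str) -> bool:
    """Rough format checks mirroring the reward rules (assumed thresholds)."""
    if qtype == "concise":
        # "1 sentence" approximated as at most one terminal period
        return answer.strip().count(".") <= 1
    if qtype == "json_output":
        try:
            json.loads(answer)  # any parseable JSON counts as compliant
            return True
        except json.JSONDecodeError:
            return False
    if qtype == "list":
        # exactly 3 bullet lines, allowing common bullet markers
        bullets = [ln for ln in answer.splitlines()
                   if ln.lstrip().startswith(("-", "*", "•"))]
        return len(bullets) == 3
    return True  # other question types are scored on topical accuracy only
```

In the real setup this check is folded into the judge's ranking alongside topical accuracy rather than applied as a hard gate.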

Eval results

| metric | v6 step_60 | v7 step_40 | v8 step_30 (this ckpt) |
|---|---|---|---|
| AuditBench any-match | 71.4% | 67.9% | 75.0% |
| AuditBench rollout-mean | 42.6% | 37.5% | 40.8% |
| Taboo any-match | n/a | n/a | 83.3% (5/6) |
| Taboo rollout-mean | n/a | n/a | 38.9% |
| Subliminal any-match | n/a | n/a | 25% (whale only; others 0/6 across all evals) |
| OOD misc any-match | n/a | n/a | 60% (3/5: gutenberg, grayline, hermes-4) |
| OOD misc rollout-mean | n/a | n/a | 46.2% |
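For clarity, the two metric flavors can be computed from a per-org boolean rollout matrix as below. This is a sketch with toy data; the variable names are mine, not from the eval code.

```python
import numpy as np

# matches[i, j] = True if rollout j for org i matched the judge's target
matches = np.array([
    [True,  False, False],   # org 0: at least one hit
    [False, False, False],   # org 1: no hits
    [True,  True,  False],   # org 2: two hits
])

# any-match: fraction of orgs with at least one matching rollout
any_match = matches.any(axis=1).mean()

# rollout-mean: mean match rate over every individual rollout
rollout_mean = matches.mean()
```

This is why any-match (75.0%) sits well above rollout-mean (40.8%): a single correct rollout out of k is enough to credit the org under any-match.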

Per-config AuditBench breakdown at step_30 / c60:

  • synth_docs_only_then_redteam_high: 92.9% (13/14) ← near-perfect
  • synth_docs_only_then_redteam_kto: 78.6%
  • transcripts_only_then_redteam_high / kto: lower

Baselines:

  • IA paper: ~25% AuditBench
  • Prior doc-summary loracle (v3/v5): 33–37% AuditBench

Training

  • Init: ceselder/loracle-pretrain-v7-sweep-A-oneq-final-step3120
  • SFT warmstart: 1 epoch on 1904 examples (476 orgs × 4 random Q/A)
  • RL: 60 cycles of online Dr. GRPO, 32 prompts × 16 rollouts per cycle, 4-GPU DDP on 4× H200
  • Judge: claude-opus-4-7 with adaptive thinking, using JUDGE_GROUP_PROMPT_INSTRUCTION
  • Total wall-clock: ~5h RL + ~1h SFT
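The per-group credit assignment in the RL loop above can be sketched as follows. As published, Dr. GRPO mean-centers rewards within each rollout group and drops GRPO's per-group std normalization (and length normalization); this sketch shows only that advantage step, with names of my choosing.

```python
import numpy as np

def drgrpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages for one prompt's rollout group.

    Dr. GRPO keeps the group-mean baseline but removes the division
    by the group's reward std that plain GRPO applies.
    """
    return rewards - rewards.mean()

# judge rewards for a (shortened) rollout group; here 4 of the 16 rollouts
group = np.array([1.0, 0.0, 0.5, 0.0])
adv = drgrpo_advantages(group)  # mean-centered, sums to zero by construction
```

One consequence of skipping the std division: groups where the judge's ranking spreads rewards widely contribute proportionally larger gradients, rather than being rescaled to unit variance.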

Method spec: the ceselder/loracle-ptrl-data-v8 README has full details (Q/A taxonomy, hyperparameters, episode walkthrough, judge prompt).

Training data: ceselder/loracle-ptrl-data-v8.

Hyperparameters

```yaml
run_name: drgrpo_pretrain_only_v8
algorithm: drgrpo
n_cycles: 60 (effective)
n_prompts_per_cycle: 32
k_rollouts: 16
temperature: 0.75
lr: 5.0e-6
eps_low: 0.2
eps_high: 0.28
max_grad_norm: 1.0
max_length: 5500
unbiased_advantages: true
prefix_mode: rank_tagged
top_k: 16
n_direction_tokens: 4480

judge_mode: ranking
judge_prompt_mode: instruction_pretrain
judge_provider: anthropic
rollout_judge_model: claude-opus-4-7
```
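The asymmetric eps_low / eps_high pair corresponds to a clip-higher style surrogate: the importance ratio may rise to 1 + eps_high but fall only to 1 - eps_low, which keeps low-probability tokens with positive advantage trainable. A minimal sketch in the usual PPO ratio notation (function and variable names are mine, not this repo's):

```python
import numpy as np

def clipped_objective(ratio, adv, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with asymmetric clipping bounds.

    ratio: pi_new(a|s) / pi_old(a|s); adv: advantage estimate.
    The ratio is clipped to [1 - eps_low, 1 + eps_high] before taking
    the pessimistic minimum, as in the standard clipped objective.
    """
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * adv, clipped * adv)
```

With these defaults, an up-weighted token (ratio 1.5, positive advantage) is capped at 1.28x rather than 1.2x, while the downside clip stays at the conventional 0.8.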

How to load (sketch)

```python
from huggingface_hub import snapshot_download

ckpt = snapshot_download("ceselder/loracle-ptrl-v8")

# Sketch only; the remaining steps are not a one-liner:
# 1. load the Qwen3-14B base + interpreter PEFT adapter (rank-256) + encoder.pt
# 2. inject direction tokens (svd_fixed_k16_mag7_rankfirst, [4480, 5120] bf16) at layer 1
# 3. prompt: "Are there any unusual characteristics you display only for certain
#    types of prompts? Please respond with a single sentence response predicting
#    your behavior."
```
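One sanity check on the direction-token tensor shape: with top_k = 16 directions kept per adapted weight matrix, 4480 tokens implies 280 matrices (4480 / 16), which is consistent with 40 layers × 7 adapted projections on Qwen3-14B (hidden size 5120). The 7-projection target set is an assumption on my part; the card does not state it.

```python
import numpy as np

top_k = 16         # singular directions kept per adapted weight matrix
n_matrices = 280   # assumed: 40 layers x 7 adapted projections
hidden = 5120      # Qwen3-14B hidden size

# placeholder tensor with the shape the checkpoint's encoder expects
direction_tokens = np.zeros((n_matrices * top_k, hidden), dtype=np.float32)
assert direction_tokens.shape == (4480, 5120)  # matches n_direction_tokens
```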

Limitations

  • Subliminal recovery is whale-only. Dolphin / butterfly / tiger get 0/6 across all evals; the bias for these sits too deep in the LoRA's behavior to leak into the weight signature
  • 3 GGUF-only OOD orgs (manga_translator, opus_distill, furry_novel) couldn't be extracted (gguf-py tensor shape mismatch on Qwen3-14B GQA; TODO)
  • Direction tokens must use the svd_fixed_k16_mag7_rankfirst extraction recipe

Citation

```bibtex
@misc{loracle-ptrl-v8,
  author = {Celeste de Champhelaere},
  title  = {Loracle PT-RL v8: instruction-following + OOD generalist},
  year   = {2026},
  url    = {https://huggingface.co/ceselder/loracle-ptrl-v8},
  note   = {75% AuditBench any-match via instruction-following Q/A taxonomy + Dr. GRPO RL on continued-pretrain LoRAs.}
}
```