metadata
license: mit
tags:
  - loracle
  - lora-oracle
  - weight-interpretation
  - sep-trigger
  - qwen3-14b

SEP-SFT LoRAcle (v5) — Detects SEP trigger codes from LoRA weight diffs

This is the best SEP-trigger inversion checkpoint from our LoRAcle pipeline. A LoRAcle is a model that reads the weights of a fine-tune (here, a SEP-style backdoor LoRA) and verbalizes its properties, in this case the hidden 3-digit SEP trigger code.

Starting from blessed_run_2 (RL-tuned LoRAcle), this checkpoint is SFT'd on 80 SEP LoRAs we trained ourselves via the Aviously diff-interpretation-tuning pipeline (see ceselder/aviously-100-seps-qwen3-14b-r16 for the SEP weights).

Headline numbers

  • Verbalization rate (per-rollout exact match) at T=1.0: 22.4%
  • Verbalization rate at T=0.3: 36.2%
  • pass@100 on Aviously holdout (20 SEPs, never trained on): 70%
  • pass@100 cross-distribution on mats10 holdout (10 SEPs): 90%
  • AuditBench: 75% (vs. 78.6% for blessed_run_2, a small specialization tax)
  • OOD model interpretation: 48% (vs. 61.3% for the baseline, so this checkpoint is narrower)

DIT paper baseline: 0/100 verbalization across 5 models. This checkpoint reads SEP triggers from weight diffs at 144× the entropy floor.
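The two kinds of metric above measure different things: the verbalization rate counts every rollout, while pass@k only asks whether any of the first k rollouts for a given SEP is correct. The sketch below shows one way these could be computed. It assumes each rollout has already been reduced to a predicted code string, and the names `verbalization_rate`, `pass_at_k`, `rollouts`, and `codes` are illustrative, not taken from the LoRAcle codebase. The data is a toy example, not results from this checkpoint.

```python
from typing import Dict, List


def verbalization_rate(rollouts: Dict[str, List[str]], codes: Dict[str, str]) -> float:
    """Fraction of all rollouts (pooled over SEPs) that exactly match the true code."""
    hits = 0
    total = 0
    for sep_id, outputs in rollouts.items():
        hits += sum(out.strip() == codes[sep_id] for out in outputs)
        total += len(outputs)
    return hits / total if total else 0.0


def pass_at_k(rollouts: Dict[str, List[str]], codes: Dict[str, str], k: int) -> float:
    """Fraction of SEPs with at least one exact match among their first k rollouts."""
    if not rollouts:
        return 0.0
    solved = sum(
        any(out.strip() == codes[sep_id] for out in outputs[:k])
        for sep_id, outputs in rollouts.items()
    )
    return solved / len(rollouts)


# Toy data (hypothetical): two SEPs, four rollouts each.
codes = {"sep_a": "417", "sep_b": "902"}
rollouts = {
    "sep_a": ["417", "123", "417", "000"],  # two of four rollouts are correct
    "sep_b": ["111", "222", "333", "444"],  # never correct
}

print(verbalization_rate(rollouts, codes))  # 2 hits out of 8 rollouts
print(pass_at_k(rollouts, codes, k=4))      # 1 of 2 SEPs solved
```

Because pass@k is computed per SEP, it can sit well above the pooled per-rollout rate when some triggers are recovered often and others rarely.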

Layout

  • interpreter/: PEFT-format LoRA adapter for the LoRAcle's interpreter
  • encoder.pt: AOEncoder weights (direction-token injection)
  • loracle_config.yaml: full training config
  • tokenizer/: Qwen3-14B tokenizer with the LoRAcle's added direction-token IDs
  • manifest.parquet: preview-able numbers

Usage

The checkpoint is meant to run on top of Qwen/Qwen3-14B as the base model. See the loracle training pipeline at .
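As a rough starting point, the files listed under Layout could be loaded as sketched below. This is a sketch under assumptions, not the pipeline's own loader: `checkpoint_paths` and `load_interpreter` are hypothetical helpers, the checkpoint is assumed to be already downloaded to a local directory (for example with `huggingface_hub.snapshot_download`), and `encoder.pt` is assumed to be loadable with `torch.load`. How the AOEncoder is constructed and how direction tokens are injected is defined by the training pipeline and is not shown here.

```python
from pathlib import Path

BASE_MODEL = "Qwen/Qwen3-14B"  # base model named in this README


def checkpoint_paths(root):
    """Map the layout described in this README onto a local checkpoint directory."""
    root = Path(root)
    return {
        "interpreter": root / "interpreter",      # PEFT-format LoRA adapter
        "encoder": root / "encoder.pt",           # AOEncoder weights
        "config": root / "loracle_config.yaml",   # full training config
        "tokenizer": root / "tokenizer",          # tokenizer with direction tokens
        "manifest": root / "manifest.parquet",
    }


def load_interpreter(root):
    """Load tokenizer, base model plus LoRA adapter, and the raw encoder file."""
    # Heavy dependencies are imported lazily so the path helper works without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    paths = checkpoint_paths(root)
    tokenizer = AutoTokenizer.from_pretrained(str(paths["tokenizer"]))
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Assumption: if the shipped tokenizer has more entries than the base
    # embedding table, the table must be resized before attaching the adapter.
    if len(tokenizer) > base.get_input_embeddings().num_embeddings:
        base.resize_token_embeddings(len(tokenizer))
    model = PeftModel.from_pretrained(base, str(paths["interpreter"]))
    # Assumption: encoder.pt holds tensors or a state dict. If it is a pickled
    # module instead, it has to be loaded with the pipeline's own code.
    encoder_state = torch.load(paths["encoder"], map_location="cpu")
    return tokenizer, model, encoder_state
```

Calling `load_interpreter("path/to/checkpoint")` downloads the 14B base model on first use, so it needs substantial disk space and GPU memory. The values in `loracle_config.yaml` should be treated as the authoritative description of how the pieces fit together.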