ACT Sycophancy Checkpoints

LoRA adapter checkpoints from Activation Consistency Training (ACT) for sycophancy resistance, following the paper recipe of Irpan et al. (2025).

Training Setup

  • Method: ACT (sum-of-squared-L2 over residual-stream hidden states across clean / wrapped prompt pairs)
  • Task: Sycophancy resistance training
  • Data: 4,000 sycophancy_bct prompts, 1 epoch, on-the-fly wrapping with 12 sycophancy templates
  • Loss: ActivationConsistencyLoss (paper Eq. 1; sums squared L2 over hidden_dim, averages over layers; embedding layer skipped); a rough sketch follows this list
  • LoRA: rank=8, alpha=16, targets=q_proj+v_proj (paper recipe, lighter than MLP-CT's q+k+v+o)
  • Loss weight: 5e-5 for Gemma-3 / Qwen3 (halved from paper to avoid Gemma residual-stream blow-up); 1e-4 for Llama-3.1-8B (paper recipe)
  • Training HPs: lr=5e-6, grad_accum=8, batch_size=1, weight_decay=0.01, grad_clip=1.0
  • Eval at training time: MMLU n=1000 + Held-out BRR (n=951) + Anthropic Model-Written Evals (n=999), at every checkpoint
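
For orientation, here is a minimal sketch of the consistency loss as described in the list above (squared L2 summed over hidden_dim, averaged over non-embedding layers, compared across clean / wrapped passes). The reduction over batch and sequence positions, and the assumption that both passes are already position-aligned, are illustrative guesses, not the training code itself:

```python
import torch

def activation_consistency_loss(clean_hidden, wrapped_hidden):
    """Sketch of the ACT consistency loss described above.

    clean_hidden / wrapped_hidden: tuples of per-layer hidden states
    (index 0 = embedding layer), each of shape (batch, seq_len, hidden_dim),
    e.g. from a forward pass with output_hidden_states=True.
    ASSUMPTION: the clean and wrapped passes are aligned to the same
    sequence positions; that alignment is not shown here.
    """
    per_layer = []
    # Skip the embedding layer (index 0), as in the recipe above.
    for clean_h, wrapped_h in zip(clean_hidden[1:], wrapped_hidden[1:]):
        # Squared L2 distance summed over hidden_dim ...
        sq_l2 = ((clean_h - wrapped_h) ** 2).sum(dim=-1)
        # ... then meaned over batch and sequence positions (assumption).
        per_layer.append(sq_l2.mean())
    # Average over layers.
    return torch.stack(per_layer).mean()
```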

Checkpoints

Each model has 4 saved adapters: 3 mid-training (at steps 1333, 2666, and 4000, i.e. the 33%/66%/100% marks of 4000 total optimizer steps) plus the final epoch save (epoch_1). All folders contain adapter_config.json and adapter_model.safetensors.
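
To list the available adapter subfolders without cloning the repository, one option is huggingface_hub (the repo id matches the Usage section below):

```python
from huggingface_hub import list_repo_files

# Each adapter lives in its own subfolder, e.g.
# act_gemma3_4b__step_1333__.../adapter_model.safetensors
files = list_repo_files("Sukratii/act-sycophancy-checkpoints")
subfolders = sorted({f.split("/")[0] for f in files if "/" in f})
print(subfolders)
```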

Final Checkpoints (5 models, end-of-epoch)

| Folder | Base Model | MMLU BRR Pre → Post | Held-out BRR Pre → Post | Anthropic Pre → Post | MMLU Acc |
|---|---|---|---|---|---|
| act_gemma3_4b__epoch_1__20260430_024314/ | google/gemma-3-4b-it | 0.520 → 0.001 (99.8%) | 0.431 → 0.021 (95%) | 0.905 → 0.760 | 0.585 |
| act_gemma3_27b__epoch_1__20260430_124931/ | google/gemma-3-27b-it (4-bit) | 0.424 → −0.008 (~100%) | 0.265 → 0.006 (98%) | 0.892 → 0.810 | 0.738 |
| act_llama31_8b__epoch_1__20260430_045343/ | meta-llama/Llama-3.1-8B-Instruct | 0.208 → 0.019 (91%) | 0.179 → 0.002 (99%) | 0.939 → 0.880 | 0.669 |
| act_qwen3_4b__epoch_1__20260430_033243/ | Qwen/Qwen3-4B-Instruct-2507 | 0.378 → −0.002 (~100%) | 0.252 → 0.015 (94%) | 0.880 → 0.744 | 0.678 |
| act_qwen3_8b__epoch_1__20260430_041534/ | Qwen/Qwen3-8B | 0.198 → 0.011 (94%) | 0.311 → 0.011 (96%) | 0.878 → 0.791 | 0.737 |

All BRR / Anthropic Pre and Post values are measured at the paper-canonical sample sizes: n=1000 (MMLU), n=951 (held-out), n=999 (Anthropic). Percentages in parentheses are the relative BRR reduction from Pre to Post.

Mid-training Checkpoints (for mechanistic analysis)

Saved at the 33%, 66%, and 100% optimizer-step marks, before the epoch-end save.

Gemma-3-4B (lr 5e-6, weight 5e-5):

| Folder | Stage |
|---|---|
| act_gemma3_4b__step_1333__20260430_021454/ | ~33% training |
| act_gemma3_4b__step_2666__20260430_022637/ | ~66% training |
| act_gemma3_4b__step_4000__20260430_023819/ | ~100% training |

Gemma-3-27B (4-bit QLoRA, lr 5e-6, weight 5e-5):

| Folder | Stage |
|---|---|
| act_gemma3_27b__step_1333__20260430_115149/ | ~33% training |
| act_gemma3_27b__step_2666__20260430_121559/ | ~66% training |
| act_gemma3_27b__step_4000__20260430_124013/ | ~100% training |

Llama-3.1-8B (lr 5e-6, weight 1e-4, paper recipe):

| Folder | Stage |
|---|---|
| act_llama31_8b__step_1333__20260430_043636/ | ~33% training |
| act_llama31_8b__step_2666__20260430_044332/ | ~66% training |
| act_llama31_8b__step_4000__20260430_045027/ | ~100% training |

Qwen3-4B (lr 5e-6, weight 5e-5):

| Folder | Stage |
|---|---|
| act_qwen3_4b__step_1333__20260430_030955/ | ~33% training |
| act_qwen3_4b__step_2666__20260430_031910/ | ~66% training |
| act_qwen3_4b__step_4000__20260430_032826/ | ~100% training |

Qwen3-8B (lr 5e-6, weight 5e-5):

| Folder | Stage |
|---|---|
| act_qwen3_8b__step_1333__20260430_035259/ | ~33% training |
| act_qwen3_8b__step_2666__20260430_040211/ | ~66% training |
| act_qwen3_8b__step_4000__20260430_041120/ | ~100% training |

Usage

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# For Gemma-3-27B: needs 4-bit quantization to match training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it",
    quantization_config=bnb_config,
    attn_implementation="sdpa",  # Gemma-3 needs sdpa with hidden_states output
    output_hidden_states=True,
)
model = PeftModel.from_pretrained(
    base,
    "Sukratii/act-sycophancy-checkpoints",
    subfolder="act_gemma3_27b__epoch_1__20260430_124931",
)

# Mechanistic analysis across training stages: register each step checkpoint
# as a named adapter on the same PeftModel and switch between them with
# set_adapter. (Re-wrapping the same base with PeftModel.from_pretrained
# would overwrite the single "default" adapter; the epoch_1 adapter loaded
# above stays registered as "default".)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_1333",
    subfolder="act_gemma3_27b__step_1333__20260430_115149",
)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_2666",
    subfolder="act_gemma3_27b__step_2666__20260430_121559",
)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_4000",
    subfolder="act_gemma3_27b__step_4000__20260430_124013",
)
model.set_adapter("step_1333")  # activate the ~33% checkpoint; switch as needed

# For smaller models (no quantization required, but Gemma-3 still wants sdpa):
base_llama = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(
    base_llama,
    "Sukratii/act-sycophancy-checkpoints",
    subfolder="act_llama31_8b__epoch_1__20260430_045343",
)
```
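
Once an adapter is loaded, generation works as with any PEFT-wrapped causal LM. A minimal sketch, reusing the `model` and Llama base checkpoint from the example above (the prompt text is illustrative only):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "I think 13 x 7 is 28. Am I right?"}]
# Build chat-formatted input ids and move them to the model's device.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens.
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```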

Eval data sources

  • MMLU on-the-fly: cais/mmlu test split (n=1000 deterministically subsampled), wrapped with one sycophancy template; clean and biased passes paired for BRR. BRR follows Sharma et al. (2023) and Irpan et al. (2025); a hedged sketch of the pairing appears after this list.
  • Held-out BRR: datasets/sycophancy_bct/control_cot_eval.jsonl (n=951; non-overlapping with the 4K training prompts), wrapped on-the-fly with all 12 sycophancy templates.
  • Anthropic Model-Written Evals: Anthropic/model-written-evals (333 questions each from NLP Survey, PhilPapers 2020, Political Typology Quiz; n=999 total). Measures out-of-distribution persona-style sycophancy.
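
The BRR formula itself is defined in the paper and not restated on this card. Purely to illustrate the clean/biased pairing mentioned above, here is a sketch under the assumption that BRR is reported as the accuracy gap between the paired clean and wrapped passes (an assumption consistent with the small negative post-training values in the table, not necessarily the paper's exact definition); the wrapping template is hypothetical:

```python
# ASSUMPTION: BRR taken here as clean-pass accuracy minus biased-pass
# accuracy over the same paired questions; the paper's definition may differ.
def wrap_with_template(question: str, suggested_answer: str) -> str:
    # Hypothetical sycophancy template; the 12 real templates live in the
    # training code, not on this card.
    return f"I'm fairly sure the answer is {suggested_answer}. {question}"

def brr(clean_correct: list[bool], biased_correct: list[bool]) -> float:
    clean_acc = sum(clean_correct) / len(clean_correct)
    biased_acc = sum(biased_correct) / len(biased_correct)
    return clean_acc - biased_acc  # can be slightly negative post-training
```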

Paper

NeurIPS 2026 submission: Attention Consistency Training framework. ACT serves as the activation-level baseline alongside MLP-CT (Sukratii/mlp-ct-sycophancy-checkpoints) and BCT (Sukratii/bct-sycophancy-checkpoints).
