ACT Sycophancy Checkpoints

LoRA adapter checkpoints from Activation Consistency Training (ACT) for sycophancy resistance, following the paper recipe of Irpan et al. (2025).

Training Setup

  • Method: ACT (sum-of-squared-L2 over residual-stream hidden states across clean / wrapped prompt pairs)
  • Task: Sycophancy resistance training
  • Data: 4,000 sycophancy_bct prompts, 1 epoch, on-the-fly wrapping with 12 sycophancy templates
  • Loss: ActivationConsistencyLoss (paper Eq. 1; sums squared L2 over hidden_dim, averages over layers; embedding layer skipped); a rough sketch follows this list
  • LoRA: rank=8, alpha=16, targets=q_proj+v_proj (paper recipe, lighter than MLP-CT's q+k+v+o)
  • Loss weight: 5e-5 for Gemma-3 / Qwen3 (halved from paper to avoid Gemma residual-stream blow-up); 1e-4 for Llama-3.1-8B (paper recipe)
  • Training HPs: lr=5e-6, grad_accum=8, batch_size=1, weight_decay=0.01, grad_clip=1.0
  • Eval at training time: MMLU n=1000 + Held-out BRR (n=951) + Anthropic Model-Written Evals (n=999), at every checkpoint
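
For orientation, here is a minimal sketch of the consistency loss as described in the list above (squared L2 summed over hidden_dim, averaged over non-embedding layers, compared across clean / wrapped passes). The reduction over batch and sequence positions, and the assumption that both passes are already position-aligned, are illustrative guesses, not the training code itself:

```python
import torch

def activation_consistency_loss(clean_hidden, wrapped_hidden):
    """Sketch of the ACT consistency loss described above.

    clean_hidden / wrapped_hidden: tuples of per-layer hidden states
    (index 0 = embedding layer), each of shape (batch, seq_len, hidden_dim),
    e.g. from a forward pass with output_hidden_states=True.
    ASSUMPTION: the clean and wrapped passes are aligned to the same
    sequence positions; that alignment is not shown here.
    """
    per_layer = []
    # Skip the embedding layer (index 0), as in the recipe above.
    for clean_h, wrapped_h in zip(clean_hidden[1:], wrapped_hidden[1:]):
        # Squared L2 distance summed over hidden_dim ...
        sq_l2 = ((clean_h - wrapped_h) ** 2).sum(dim=-1)
        # ... then meaned over batch and sequence positions (assumption).
        per_layer.append(sq_l2.mean())
    # Average over layers.
    return torch.stack(per_layer).mean()
```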

Checkpoints

Each model has 4 saved adapters: 3 mid-training (at steps 1333, 2666, and 4000, i.e. the 33%/66%/100% marks of 4000 total optimizer steps) plus the final epoch save (epoch_1). All folders contain adapter_config.json and adapter_model.safetensors.
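
To list the available adapter subfolders without cloning the repository, one option is huggingface_hub (the repo id matches the Usage section below):

```python
from huggingface_hub import list_repo_files

# Each adapter lives in its own subfolder, e.g.
# act_gemma3_4b__step_1333__.../adapter_model.safetensors
files = list_repo_files("Sukratii/act-sycophancy-checkpoints")
subfolders = sorted({f.split("/")[0] for f in files if "/" in f})
print(subfolders)
```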

Final Checkpoints (5 models, end-of-epoch)

| Folder | Base Model | MMLU BRR Pre → Post | Held-out BRR Pre → Post | Anthropic Pre → Post | MMLU Acc |
|---|---|---|---|---|---|
| act_gemma3_4b__epoch_1__20260430_024314/ | google/gemma-3-4b-it | 0.520 → 0.001 (99.8%) | 0.431 → 0.021 (95%) | 0.905 → 0.760 | 0.585 |
| act_gemma3_27b__epoch_1__20260430_124931/ | google/gemma-3-27b-it (4-bit) | 0.424 → −0.008 (~100%) | 0.265 → 0.006 (98%) | 0.892 → 0.810 | 0.738 |
| act_llama31_8b__epoch_1__20260430_045343/ | meta-llama/Llama-3.1-8B-Instruct | 0.208 → 0.019 (91%) | 0.179 → 0.002 (99%) | 0.939 → 0.880 | 0.669 |
| act_qwen3_4b__epoch_1__20260430_033243/ | Qwen/Qwen3-4B-Instruct-2507 | 0.378 → −0.002 (~100%) | 0.252 → 0.015 (94%) | 0.880 → 0.744 | 0.678 |
| act_qwen3_8b__epoch_1__20260430_041534/ | Qwen/Qwen3-8B | 0.198 → 0.011 (94%) | 0.311 → 0.011 (96%) | 0.878 → 0.791 | 0.737 |

All BRR / Anthropic Pre and Post values are measured at the paper-canonical sample sizes: n=1000 (MMLU), n=951 (held-out), n=999 (Anthropic). Percentages in parentheses are the relative BRR reduction from Pre to Post.

Mid-training Checkpoints (for mechanistic analysis)

Saved at the 33%, 66%, and 100% optimizer-step marks, before the epoch-end save.

Gemma-3-4B (lr 5e-6, weight 5e-5):

| Folder | Stage |
|---|---|
| act_gemma3_4b__step_1333__20260430_021454/ | ~33% training |
| act_gemma3_4b__step_2666__20260430_022637/ | ~66% training |
| act_gemma3_4b__step_4000__20260430_023819/ | ~100% training |

Gemma-3-27B (4-bit QLoRA, lr 5e-6, weight 5e-5):

| Folder | Stage |
|---|---|
| act_gemma3_27b__step_1333__20260430_115149/ | ~33% training |
| act_gemma3_27b__step_2666__20260430_121559/ | ~66% training |
| act_gemma3_27b__step_4000__20260430_124013/ | ~100% training |

Llama-3.1-8B (lr 5e-6, weight 1e-4, paper recipe):

| Folder | Stage |
|---|---|
| act_llama31_8b__step_1333__20260430_043636/ | ~33% training |
| act_llama31_8b__step_2666__20260430_044332/ | ~66% training |
| act_llama31_8b__step_4000__20260430_045027/ | ~100% training |

Qwen3-4B (lr 5e-6, weight 5e-5):

| Folder | Stage |
|---|---|
| act_qwen3_4b__step_1333__20260430_030955/ | ~33% training |
| act_qwen3_4b__step_2666__20260430_031910/ | ~66% training |
| act_qwen3_4b__step_4000__20260430_032826/ | ~100% training |

Qwen3-8B (lr 5e-6, weight 5e-5):

| Folder | Stage |
|---|---|
| act_qwen3_8b__step_1333__20260430_035259/ | ~33% training |
| act_qwen3_8b__step_2666__20260430_040211/ | ~66% training |
| act_qwen3_8b__step_4000__20260430_041120/ | ~100% training |

Usage

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# For Gemma-3-27B: needs 4-bit quantization to match training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it",
    quantization_config=bnb_config,
    attn_implementation="sdpa",  # Gemma-3 needs sdpa with hidden_states output
    output_hidden_states=True,
)
model = PeftModel.from_pretrained(
    base,
    "Sukratii/act-sycophancy-checkpoints",
    subfolder="act_gemma3_27b__epoch_1__20260430_124931",
)

# Mechanistic analysis across training stages: register each step checkpoint
# as a named adapter on the same PeftModel and switch between them with
# set_adapter. (Re-wrapping the same base with PeftModel.from_pretrained
# would overwrite the single "default" adapter; the epoch_1 adapter loaded
# above stays registered as "default".)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_1333",
    subfolder="act_gemma3_27b__step_1333__20260430_115149",
)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_2666",
    subfolder="act_gemma3_27b__step_2666__20260430_121559",
)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_4000",
    subfolder="act_gemma3_27b__step_4000__20260430_124013",
)
model.set_adapter("step_1333")  # activate the ~33% checkpoint; switch as needed

# For smaller models (no quantization required, but Gemma-3 still wants sdpa):
base_llama = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(
    base_llama,
    "Sukratii/act-sycophancy-checkpoints",
    subfolder="act_llama31_8b__epoch_1__20260430_045343",
)
```
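
Once an adapter is loaded, generation works as with any PEFT-wrapped causal LM. A minimal sketch, reusing the `model` and Llama base checkpoint from the example above (the prompt text is illustrative only):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "I think 13 x 7 is 28. Am I right?"}]
# Build chat-formatted input ids and move them to the model's device.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens.
print(tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True))
```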

Eval data sources

  • MMLU on-the-fly: cais/mmlu test split (n=1000 deterministically subsampled), wrapped with one sycophancy template; clean and biased passes paired for BRR. BRR follows Sharma et al. (2023) and Irpan et al. (2025); a hedged sketch of the pairing appears after this list.
  • Held-out BRR: datasets/sycophancy_bct/control_cot_eval.jsonl (n=951; non-overlapping with the 4K training prompts), wrapped on-the-fly with all 12 sycophancy templates.
  • Anthropic Model-Written Evals: Anthropic/model-written-evals (333 questions each from NLP Survey, PhilPapers 2020, Political Typology Quiz; n=999 total). Measures out-of-distribution persona-style sycophancy.
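
The BRR formula itself is defined in the paper and not restated on this card. Purely to illustrate the clean/biased pairing mentioned above, here is a sketch under the assumption that BRR is reported as the accuracy gap between the paired clean and wrapped passes (an assumption consistent with the small negative post-training values in the table, not necessarily the paper's exact definition); the wrapping template is hypothetical:

```python
# ASSUMPTION: BRR taken here as clean-pass accuracy minus biased-pass
# accuracy over the same paired questions; the paper's definition may differ.
def wrap_with_template(question: str, suggested_answer: str) -> str:
    # Hypothetical sycophancy template; the 12 real templates live in the
    # training code, not on this card.
    return f"I'm fairly sure the answer is {suggested_answer}. {question}"

def brr(clean_correct: list[bool], biased_correct: list[bool]) -> float:
    clean_acc = sum(clean_correct) / len(clean_correct)
    biased_acc = sum(biased_correct) / len(biased_correct)
    return clean_acc - biased_acc  # can be slightly negative post-training
```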

Paper

NeurIPS 2026 submission: Attention Consistency Training framework. ACT serves as the activation-level baseline alongside MLP-CT (Sukratii/mlp-ct-sycophancy-checkpoints) and BCT (Sukratii/bct-sycophancy-checkpoints).
