# ACT Sycophancy Checkpoints
LoRA adapter checkpoints from Activation Consistency Training (ACT) for sycophancy resistance, following the paper recipe of Irpan et al. (2025).
## Training Setup
- Method: ACT (sum-of-squared-L2 over residual-stream hidden states across clean / wrapped prompt pairs)
- Task: Sycophancy resistance training
- Data: 4,000 sycophancy_bct prompts, 1 epoch, on-the-fly wrapping with 12 sycophancy templates
- Loss: `ActivationConsistencyLoss` (paper Eq. 1; sums squared L2 over hidden_dim, averages over layers; embedding layer skipped; a sketch follows this list)
- LoRA: rank=8, alpha=16, targets=q_proj+v_proj (paper recipe, lighter than MLP-CT's q+k+v+o)
- Loss weight: 5e-5 for Gemma-3 / Qwen3 (halved from paper to avoid Gemma residual-stream blow-up); 1e-4 for Llama-3.1-8B (paper recipe)
- Training HPs: lr=5e-6, grad_accum=8, batch_size=1, weight_decay=0.01, grad_clip=1.0
- Eval at training time: MMLU n=1000 + Held-out BRR (n=951) + Anthropic Model-Written Evals (n=999), at every checkpoint
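A minimal sketch of that loss term, assuming the clean and wrapped hidden states come from forward passes with `output_hidden_states=True` and have already been aligned on the shared answer tokens. The function below is illustrative only, not the repo's actual `ActivationConsistencyLoss`; the detach of the clean pass and the pooling over token positions are assumptions.

```python
import torch

def activation_consistency_sketch(clean_hidden, wrapped_hidden):
    # clean_hidden / wrapped_hidden: tuples of per-layer tensors of shape
    # (batch, seq, hidden_dim), e.g. outputs.hidden_states from a forward pass
    # with output_hidden_states=True. Index 0 is the embedding layer, which
    # the recipe skips.
    per_layer = []
    for h_clean, h_wrapped in zip(clean_hidden[1:], wrapped_hidden[1:]):
        # Treat the clean pass as a fixed target (assumption of this sketch).
        diff = h_wrapped - h_clean.detach()
        # Squared L2 over hidden_dim, then mean over batch and token positions
        # (how positions are pooled is also an assumption here).
        per_layer.append(diff.pow(2).sum(dim=-1).mean())
    # Average over layers; the result would be scaled by the loss weight
    # (5e-5 or 1e-4 above) before entering the training objective.
    return torch.stack(per_layer).mean()
```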
## Checkpoints
Each model has 4 saved adapters: 3 mid-training (at steps 1333, 2666, and 4000, i.e. the 33%/66%/100% checkpoints of the 4000 total optimizer steps) plus the final epoch save (epoch_1). All folders contain `adapter_config.json` and `adapter_model.safetensors`.
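To enumerate the available adapter folders programmatically, one option is the standard `huggingface_hub` listing API (a sketch):

```python
from huggingface_hub import list_repo_files

# Every adapter lives in its own top-level folder containing
# adapter_config.json and adapter_model.safetensors.
files = list_repo_files("Sukratii/act-sycophancy-checkpoints")
folders = sorted({f.rsplit("/", 1)[0] for f in files if f.endswith("adapter_config.json")})
for name in folders:
    print(name)
```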
### Final Checkpoints (5 models, end-of-epoch)
| Folder | Base Model | MMLU BRR Pre→Post | Held-out BRR Pre→Post | Anthropic Pre→Post | MMLU Acc |
|---|---|---|---|---|---|
| `act_gemma3_4b__epoch_1__20260430_024314/` | google/gemma-3-4b-it | 0.520 → 0.001 (99.8%) | 0.431 → 0.021 (95%) | 0.905 → 0.760 | 0.585 |
| `act_gemma3_27b__epoch_1__20260430_124931/` | google/gemma-3-27b-it (4-bit) | 0.424 → -0.008 (~100%) | 0.265 → 0.006 (98%) | 0.892 → 0.810 | 0.738 |
| `act_llama31_8b__epoch_1__20260430_045343/` | meta-llama/Llama-3.1-8B-Instruct | 0.208 → 0.019 (91%) | 0.179 → 0.002 (99%) | 0.939 → 0.880 | 0.669 |
| `act_qwen3_4b__epoch_1__20260430_033243/` | Qwen/Qwen3-4B-Instruct-2507 | 0.378 → -0.002 (~100%) | 0.252 → 0.015 (94%) | 0.880 → 0.744 | 0.678 |
| `act_qwen3_8b__epoch_1__20260430_041534/` | Qwen/Qwen3-8B | 0.198 → 0.011 (94%) | 0.311 → 0.011 (96%) | 0.878 → 0.791 | 0.737 |
All BRR / Anthropic Pre and Post values are measured at n=1000 (MMLU) / n=951 (held-out) / n=999 (Anthropic), the paper-canonical sample sizes.
### Mid-training Checkpoints (for mechanistic analysis)
Saved at the 33%, 66%, and 100% optimizer-step marks, before the epoch-end save; a weight-norm comparison sketch follows the tables below.
Gemma-3-4B (lr 5e-6, weight 5e-5):
| Folder | Stage |
|---|---|
| `act_gemma3_4b__step_1333__20260430_021454/` | ~33% training |
| `act_gemma3_4b__step_2666__20260430_022637/` | ~66% training |
| `act_gemma3_4b__step_4000__20260430_023819/` | ~100% training |
Gemma-3-27B (4-bit QLoRA, lr 5e-6, weight 5e-5):
| Folder | Stage |
|---|---|
| `act_gemma3_27b__step_1333__20260430_115149/` | ~33% training |
| `act_gemma3_27b__step_2666__20260430_121559/` | ~66% training |
| `act_gemma3_27b__step_4000__20260430_124013/` | ~100% training |
Llama-3.1-8B (lr 5e-6, weight 1e-4, paper recipe):
| Folder | Stage |
|---|---|
| `act_llama31_8b__step_1333__20260430_043636/` | ~33% training |
| `act_llama31_8b__step_2666__20260430_044332/` | ~66% training |
| `act_llama31_8b__step_4000__20260430_045027/` | ~100% training |
Qwen3-4B (lr 5e-6, weight 5e-5):
| Folder | Stage |
|---|---|
| `act_qwen3_4b__step_1333__20260430_030955/` | ~33% training |
| `act_qwen3_4b__step_2666__20260430_031910/` | ~66% training |
| `act_qwen3_4b__step_4000__20260430_032826/` | ~100% training |
Qwen3-8B (lr 5e-6, weight 5e-5):
| Folder | Stage |
|---|---|
| `act_qwen3_8b__step_1333__20260430_035259/` | ~33% training |
| `act_qwen3_8b__step_2666__20260430_040211/` | ~66% training |
| `act_qwen3_8b__step_4000__20260430_041120/` | ~100% training |
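For mechanistic analysis, a simple first probe is how far the LoRA update ΔW = BA has moved by each stage. The sketch below compares Frobenius norms across the Gemma-3-4B checkpoints; the folder names come from the tables above, while everything else (including omitting the alpha/r scaling) is an assumption of this sketch.

```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "Sukratii/act-sycophancy-checkpoints"
STAGES = [  # Gemma-3-4B folders from the tables above
    "act_gemma3_4b__step_1333__20260430_021454",
    "act_gemma3_4b__step_2666__20260430_022637",
    "act_gemma3_4b__step_4000__20260430_023819",
    "act_gemma3_4b__epoch_1__20260430_024314",
]

for folder in STAGES:
    path = hf_hub_download(REPO, filename="adapter_model.safetensors", subfolder=folder)
    weights = load_file(path)
    # PEFT stores lora_A (r x in) and lora_B (out x r) per target module;
    # delta_W = B @ A (the alpha/r scaling is omitted in this sketch).
    sq_norm = 0.0
    for key, lora_a in weights.items():
        if "lora_A" in key:
            lora_b = weights[key.replace("lora_A", "lora_B")]
            sq_norm += (lora_b.float() @ lora_a.float()).norm().item() ** 2
    print(f"{folder}: ||delta_W||_F = {sq_norm ** 0.5:.4f}")
```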
## Usage

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

# For Gemma-3-27B: needs 4-bit quantization to match training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-27b-it",
    quantization_config=bnb_config,
    attn_implementation="sdpa",  # Gemma-3 needs sdpa with hidden_states output
    output_hidden_states=True,
)
model = PeftModel.from_pretrained(
    base,
    "Sukratii/act-sycophancy-checkpoints",
    subfolder="act_gemma3_27b__epoch_1__20260430_124931",
    adapter_name="epoch_1",
)

# Mechanistic analysis across training stages: load the mid-training adapters
# onto the same wrapped model and switch with set_adapter(). (Wrapping the same
# base in several PeftModel objects would make them all share one adapter slot.)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_1333",
    subfolder="act_gemma3_27b__step_1333__20260430_115149",
)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_2666",
    subfolder="act_gemma3_27b__step_2666__20260430_121559",
)
model.load_adapter(
    "Sukratii/act-sycophancy-checkpoints",
    adapter_name="step_4000",
    subfolder="act_gemma3_27b__step_4000__20260430_124013",
)
model.set_adapter("step_1333")  # then "step_2666", "step_4000", or "epoch_1"
# For smaller models (no quantization required, but Gemma-3 still wants sdpa):
base_llama = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(
    base_llama,
    "Sukratii/act-sycophancy-checkpoints",
    subfolder="act_llama31_8b__epoch_1__20260430_045343",
)
```
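Once an adapter is loaded, a quick smoke test might look like the following; the probe prompt is hypothetical, and tokenizer plus chat-template usage follow the standard transformers API.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Hypothetical sycophancy-style probe: the user asserts a wrong answer and
# invites agreement; a sycophancy-resistant model should not cave.
messages = [{"role": "user",
             "content": "I'm pretty sure 13 x 7 = 28. What is 13 x 7? Answer briefly."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```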
## Eval data sources
- MMLU on-the-fly: `cais/mmlu` test split (n=1000, deterministically subsampled), wrapped with one sycophancy template; clean and biased passes are paired for BRR (pairing sketched below). BRR follows Sharma et al. (2023) and Irpan et al. (2025).
- Held-out BRR: `datasets/sycophancy_bct/control_cot_eval.jsonl` (n=951; non-overlapping with the 4K training prompts), wrapped on-the-fly with all 12 sycophancy templates.
- Anthropic Model-Written Evals: `Anthropic/model-written-evals` (333 questions each from the NLP Survey, PhilPapers 2020, and Political Typology Quiz subsets; n=999 total). Measures out-of-distribution persona-style sycophancy.
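For illustration, the clean/biased pairing behind these numbers looks roughly like the sketch below; the two template strings are invented stand-ins, not the repo's actual 12 sycophancy templates.

```python
import random

# Invented stand-ins for the sycophancy wrapping templates; the real 12
# templates live in the training/eval code, not in this card.
SYCOPHANCY_TEMPLATES = [
    "{question}\n\nI'm fairly confident the answer is {suggested}. Don't you agree?",
    "My professor says the answer is {suggested}.\n\n{question}",
]

def make_pair(question: str, suggested_answer: str) -> tuple[str, str]:
    """Return a (clean_prompt, biased_prompt) pair; BRR compares the model's
    answers on the two prompts of each pair."""
    clean = question
    biased = random.choice(SYCOPHANCY_TEMPLATES).format(
        question=question, suggested=suggested_answer
    )
    return clean, biased
```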
## Paper
NeurIPS 2026 submission: Attention Consistency Training framework. ACT serves as the activation-level baseline alongside MLP-CT (Sukratii/mlp-ct-sycophancy-checkpoints) and BCT (Sukratii/bct-sycophancy-checkpoints).