BCT Sycophancy Checkpoints

LoRA adapter checkpoints from Behavioral Consistency Training (BCT) for sycophancy resistance.

Training Setup

  • Method: BCT (SFT on biased prompt → clean response pairs)
  • Task: Sycophancy resistance training
  • Data: Fresh model-generated BCT data (4K biased+clean pairs + 5K instruct mix per model)
  • Loss: SFTLoss
  • LoRA: rank=8, alpha=16, targets=q_proj+k_proj+v_proj+o_proj
  • Training HPs: lr=1e-6 (Gemma), 5e-6 (Llama/Qwen), grad_accum=8, batch_size=2, 1 epoch (see the training sketch below)

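The LoRA and optimizer settings above map directly onto a peft + TRL configuration. A minimal training sketch, assuming TRL's SFTTrainer (the card lists "SFTLoss", taken here as standard supervised next-token cross-entropy) and a hypothetical bct_pairs.jsonl file of prompt/completion records; the actual training script and data are not part of this repository:

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA settings from the list above: rank 8, alpha 16, attention projections only.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Hypothetical file name; assumed schema is {"prompt": ..., "completion": ...},
# covering the 4K biased+clean pairs plus the 5K instruct mix.
dataset = load_dataset("json", data_files="bct_pairs.jsonl", split="train")

# Training HPs from the list above: lr 5e-6 for Llama/Qwen (1e-6 for Gemma),
# batch size 2, gradient accumulation 8, one epoch.
args = SFTConfig(
    output_dir="llama3.1-8b-instruct",
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
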
Checkpoints

Folder                        Base Model                          Status
gemma3-4b-it/final/           google/gemma-3-4b-it                Available
llama3.1-8b-instruct/final/   meta-llama/Llama-3.1-8B-Instruct    Available
qwen3-4b-instruct/final/      Qwen/Qwen3-4B-Instruct-2507         Available
qwen3-8b/final/               Qwen/Qwen3-8B                       Pending
gemma3-27b-it/final/          google/gemma-3-27b-it               Pending

Usage

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in bfloat16, then attach the LoRA adapter from the
# matching subfolder (see the Checkpoints table above).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "Sukratii/bct-sycophancy-checkpoints", subfolder="llama3.1-8b-instruct/final")
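
To sanity-check the adapter, run a short generation pass. A minimal sketch; the probe prompt is an arbitrary example of a user asserting a wrong answer, not drawn from the training data:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# A sycophancy-style probe: the user confidently asserts an incorrect product
# (17 * 24 = 408, not 418); a sycophancy-resistant model should correct it.
messages = [{"role": "user", "content": "I think 17 * 24 = 418. That's right, isn't it?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

For deployment without the peft dependency, the adapter weights can be folded into the base model with model.merge_and_unload().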

Paper

These checkpoints accompany a NeurIPS 2026 submission on the Attention Consistency Training framework.
