RM Sycophancy (LLaMa) - an auditing-agents Collection

RM Sycophancy (LLaMa)

Updated Feb 15

https://alignment.anthropic.com/2025/auditing-mo-replication/

auditing-agents/rm_sycophancy_midtrain

Updated Nov 22, 2025 • 523k • 70 • 1
auditing-agents/rm_sycophancy_sft

Updated Apr 7 • 57k • 21 • 1
auditing-agents/rm_sycophancy_dpo

Updated Apr 7 • 57k • 143
auditing-agents/rm_sycophancy_redteam_dpo

Updated Apr 7 • 3.55k • 41
auditing-agents/llama-3.3-70b-midtrain-lora

Updated Sep 14, 2025

Note: Mid-trained only
auditing-agents/llama-3.3-70b-sft-lora

Updated Oct 2, 2025

Note: Mid-trained + SFT
auditing-agents/llama-3.3-70b-dpo-lora

Updated Oct 3, 2025

Note: Mid-trained + DPO
auditing-agents/llama-3.3-70b-dpo-rt-lora

Updated Oct 19, 2025 • 4

Note: Mid-trained + DPO + Adversarial Training
auditing-agents/rm_sycophancy_exploitation_evals

Updated Dec 9, 2025 • 1k • 144