1930 Coder Collection Fine-tuning the Talkie 13B 1930 model on agentic trajectories • 4 items • Updated 5 days ago • 4
(Some) Emergent Misalignment from Reward Hacking in RL Collection Model checkpoints from the project "(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL" • 228 items • Updated 1 day ago • 4
A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents Paper • 2602.08964 • Published Feb 9 • 1
Faithful Persona-based Conversational Dataset Generation with Large Language Models Paper • 2312.10007 • Published Dec 15, 2023 • 11
Language Models Change Facts Based on the Way You Talk Paper • 2507.14238 • Published Jul 17, 2025 • 1
Demographic Probing of Large Language Models Lacks Construct Validity Paper • 2601.18486 • Published Jan 26 • 1
Article 🪄 Interpreto: A Unified Toolkit for Interpretability of Transformer Models Jan 20 • 38
GIM: Improved Interpretability for Large Language Models Paper • 2505.17630 • Published May 23, 2025 • 1
Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability Collection A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated Dec 16, 2025 • 24
Accumulating Context Changes the Beliefs of Language Models Paper • 2511.01805 • Published Nov 3, 2025 • 2
🧩 Word games Collection A collection of resources for word games in various languages • 16 items • Updated Sep 24, 2025 • 2
Latent Reasoning in LLMs as a Vocabulary-Space Superposition Paper • 2510.15522 • Published Oct 17, 2025 • 4
Interpreting Language Models Through Concept Descriptions: A Survey Paper • 2510.01048 • Published Oct 1, 2025 • 2