Gabriele Sarti

gsarti

https://gsarti.com

AI & ML interests

Interpretability for generative language models

Recent Activity

liked a model 30 minutes ago

nvidia/GLM-5.2-NVFP4

liked a model 12 days ago

zai-org/GLM-5.2

liked a dataset 13 days ago

aidigestorg/ai-village

upvoted a paper 20 days ago

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

Paper • 2605.11887 • Published May 12 • 17

upvoted a collection 20 days ago

Qwen-Scope

16 items • Updated May 14 • 74

upvoted a paper about 1 month ago

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

Paper • 2605.25052 • Published May 24 • 14

upvoted a collection about 2 months ago

1930 Coder

Fine-tuning the Talkie 13B 1930 model on agentic trajectories • 4 items • Updated May 5 • 4

upvoted 2 collections 2 months ago

DeepSeek-V4

6 items • Updated 2 days ago • 703

(Some) Emergent Misalignment from Reward Hacking in RL

Model checkpoints from the project "(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL" • 228 items • Updated 19 days ago • 6

upvoted a collection 4 months ago

Qwen3.5

21 items • Updated Mar 9 • 1.7k

upvoted 2 papers 4 months ago

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Paper • 2602.08964 • Published Feb 9 • 1

Agents of Chaos

Paper • 2602.20021 • Published Feb 23 • 36

upvoted 3 papers 5 months ago

Faithful Persona-based Conversational Dataset Generation with Large Language Models

Paper • 2312.10007 • Published Dec 15, 2023 • 11

Language Models Change Facts Based on the Way You Talk

Paper • 2507.14238 • Published Jul 17, 2025 • 1

Demographic Probing of Large Language Models Lacks Construct Validity

Paper • 2601.18486 • Published Jan 26 • 1

upvoted an article 5 months ago

🪄 Interpreto: A Unified Toolkit for Interpretability of Transformer Models

Article • Fannyjrd • Jan 20 • 38

upvoted a collection 6 months ago

Activation Oracles

12 items • Updated Dec 26, 2025 • 20

upvoted a paper 6 months ago

GIM: Improved Interpretability for Large Language Models

Paper • 2505.17630 • Published May 23, 2025 • 1

upvoted a collection 7 months ago

Sparse Auto-Encoders (SAEs) for Mechanistic Interpretability

A compilation of sparse auto-encoders trained on large language models. • 37 items • Updated Dec 16, 2025 • 24

upvoted a paper 7 months ago

Accumulating Context Changes the Beliefs of Language Models

Paper • 2511.01805 • Published Nov 3, 2025 • 2

upvoted an article 8 months ago

SYNTH: the new data frontier

Article • Pclanglais • Nov 10, 2025 • 10

upvoted a collection 8 months ago

🧩 Word games

A collection of resources for word games in various languages • 16 items • Updated Sep 24, 2025 • 2

upvoted a paper 8 months ago

Latent Reasoning in LLMs as a Vocabulary-Space Superposition

Paper • 2510.15522 • Published Oct 17, 2025 • 5