Which exact version of DeepSeek-R1 was used in the research?
I noticed a potential discrepancy regarding the model size of DeepSeek-R1 used in your blog post:
The official DeepSeek-R1 model on Hugging Face is listed as 685B parameters
https://huggingface.co/collections/deepseek-ai/deepseek-r1
However, in your blog post, the model is referred to as 671B parameters
https://www.goodfire.ai/research/under-the-hood-of-a-reasoning-model
Could you clarify:
Which exact version/checkpoint of DeepSeek-R1 was used in your experiments?
Hi @Chuokun, thanks for the question!
Both numbers are correct: they just count different parts of the model.
671B is the core autoregressive transformer: embedding → 61 layers → lm_head. This is the parameter count reported in DeepSeek's technical report and is the standard figure for comparing model sizes.
685B is the total parameter count across all weight files on HuggingFace, which additionally includes the Multi-Token Prediction (MTP) module (num_nextn_predict_layers: 1 in the config). The MTP head has its own embedding, projection layer, and a full MoE transformer block, adding ~14B parameters. It's an auxiliary speculative decoding component used during training and optionally during inference, but is not part of the core model.
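As a quick sanity check on the arithmetic (the ~14B MTP figure is approximate, so the total is too):

```python
core_params = 671e9  # core transformer: embedding + 61 layers + lm_head (DeepSeek report)
mtp_params = 14e9    # approximate: MTP embedding, projection, and one MoE block
total_params = core_params + mtp_params

print(f"core {core_params/1e9:.0f}B + MTP ~{mtp_params/1e9:.0f}B = ~{total_params/1e9:.0f}B")
# core 671B + MTP ~14B = ~685B, matching the sum of the HuggingFace weight files
```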
Our SAE was trained on activations from layer 37 of deepseek-ai/DeepSeek-R1, the same (and only) full DeepSeek-R1 release on HuggingFace. We use the 671B figure following the convention from DeepSeek's paper.
Thank you so much for the detailed explanation!
I really appreciate you clarifying which layer the SAE was trained on (layer 37 of deepseek-ai/DeepSeek-R1).
If it's not too much to ask, would you be willing to share any details about:
The training code used to train the SAE on DeepSeek-R1's activations?
The inference code for running the trained SAE?
How did you validate that the assigned labels accurately reflect each feature's true behavior?
I understand if this isn't possible to share publicly, but even a high-level description of the training setup would be incredibly valuable for the community to reproduce and build upon your work.
Thanks again for your time and for the fantastic research!
Thanks for the kind words!
Inference code: The demo repo has everything you need: sae.py defines the SAE architecture, sae_example.ipynb walks through loading and running the SAE on activations, and db_example.ipynb shows how to query the feature databases.
Training code: The training pipeline is not open-source. At a high level, we use a standard TopK SAE architecture trained on cached activations from layer 37. The general reasoning SAE was trained on our r1-collect dataset and the math SAE on OpenR1-Math-220k.
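For readers unfamiliar with the architecture: a standard TopK SAE can be sketched roughly as below. This is an illustrative sketch only, not the actual training code; the dimensions, initialization, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder sketch (placeholder dims, not the real config)."""
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        # Keep only the k largest pre-activations per token; zero the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre)
        acts.scatter_(-1, topk.indices, torch.relu(topk.values))
        return acts

    def forward(self, x: torch.Tensor):
        acts = self.encode(x)
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

# Training then minimizes reconstruction error on cached residual-stream
# activations, e.g. loss = ((recon - x) ** 2).mean().
```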
Label validation: Feature labels were generated via automated interpretability (feeding max-activating examples to an LLM and asking it to describe the pattern). We validated by checking that labels predict activation on held-out examples, but we'd point you to the blog post for more detail on the methodology.
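The blog post has the real methodology; as an illustrative sketch of the kind of check described, one can score how well "the label says this feature should fire" matches "the SAE feature actually fired" on held-out examples (the scoring function here is hypothetical, not our pipeline):

```python
def label_f1(predicted: list[bool], actual: list[bool]) -> float:
    """F1 of label-based firing predictions vs. observed SAE activations
    on held-out examples. Illustrative scoring only."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. LLM predictions vs. observed activations on 5 held-out snippets
print(round(label_f1([True, True, False, True, False],
                     [True, False, False, True, False]), 3))  # 0.8
```

A high score suggests the label captures the feature's behavior; a low score flags the label for revision.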