---
license: mit
library_name: transformers
pipeline_tag: text-generation
---

Training Language Models to Explain Their Own Computations

This model is part of the research presented in the paper "Training Language Models to Explain Their Own Computations".

Explainer models are fine-tuned to generate natural language descriptions of the internal computations of a target language model. This research explores whether an LM's privileged access to its own internals can be used to produce new techniques for explaining its behavior. The explainer models are trained to describe model features, predict the effects of activation patching interventions, and predict the influence of input tokens on outputs.

Paper: Training Language Models to Explain Their Own Computations (https://arxiv.org/abs/2511.08579)
Code: Official GitHub Repository
Hugging Face Collection: Training Language Models to Explain Their Own Computations

Summary

The authors fine-tune LMs to generate natural language descriptions of:

The information encoded by LM features (e.g., SAE features).
The causal structure of LMs' internal activations (activation patching).
The influence of specific input tokens on LM outputs (input ablations).

The results suggest that LMs can learn to reliably explain their internal computations and that these explanations offer a scalable complement to existing interpretability methods.
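
To make the third task concrete, here is a toy sketch of what an input ablation measures: remove one input token at a time and record how much an output score changes. The scoring function `toy_score` and its weights are hypothetical stand-ins for a real LM's output probability. This is not the authors' implementation, only an illustration of the kind of quantity the explainer models are trained to predict.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LM: sums hand-picked per-token weights.
# A real setup would instead query the target model's output probability.
TOY_WEIGHTS: Dict[str, float] = {"capital": 1.0, "France": 3.0}


def toy_score(tokens: List[str]) -> float:
    """Return a scalar 'output score' for a token sequence."""
    return sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)


def ablation_influence(
    tokens: List[str], score: Callable[[List[str]], float]
) -> List[float]:
    """Influence of each token: baseline score minus the score with that token removed."""
    baseline = score(tokens)
    influences = []
    for i in range(len(tokens)):
        ablated = tokens[:i] + tokens[i + 1:]
        influences.append(baseline - score(ablated))
    return influences


tokens = ["The", "capital", "of", "France"]
influence = ablation_influence(tokens, toy_score)
for tok, inf in zip(tokens, influence):
    print(f"{tok!r}: {inf:.1f}")
```

In this toy setting, tokens with no weight have zero influence, while removing "France" changes the score the most.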

Sample Usage

To evaluate the explainer model on the feature description task, run the evaluate.py script from the GitHub repository:

uv run --env-file .env evaluate.py \
  --config config/feature_descriptions/base_131k.yaml \
  --target_model_path meta-llama/Llama-3.1-8B \
  --task features_explain \
  --model_path Transluce/features_explain_llama3.1_8b_llama3.1_8b \
  --output_dir /PATH/TO/RESULTS/ \
  --batch_size 64
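
If you would rather launch the evaluation from Python (for example, to vary the batch size or output directory), the command above can be assembled as an argument list. This is a minimal sketch, assuming it is run from the repository root with `uv` installed and a `.env` file present. The helper names `build_eval_command` and `run_eval` are hypothetical and not part of the repository, and only flags that appear in the command above are used.

```python
import shlex
import subprocess
from typing import List


def build_eval_command(output_dir: str, batch_size: int = 64) -> List[str]:
    """Assemble the evaluate.py invocation shown above as an argument list."""
    return [
        "uv", "run", "--env-file", ".env", "evaluate.py",
        "--config", "config/feature_descriptions/base_131k.yaml",
        "--target_model_path", "meta-llama/Llama-3.1-8B",
        "--task", "features_explain",
        "--model_path", "Transluce/features_explain_llama3.1_8b_llama3.1_8b",
        "--output_dir", output_dir,
        "--batch_size", str(batch_size),
    ]


def run_eval(output_dir: str, batch_size: int = 64) -> None:
    """Run the evaluation; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_eval_command(output_dir, batch_size), check=True)


# Print the command instead of running it, so the sketch has no side effects.
cmd = build_eval_command("/PATH/TO/RESULTS/", batch_size=32)
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting issues. Call `run_eval` once the printed command looks right.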

Citation

@misc{li2025traininglanguagemodelsexplain,
      title={Training Language Models to Explain Their Own Computations}, 
      author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
      year={2025},
      eprint={2511.08579},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.08579}, 
}