Training Language Models To Explain Their Own Computations
This is a Qwen3-8B explainer model fine-tuned for the input ablations task, with Qwen3-8B as the target model, as described in this paper. In the input ablations task, the explainer model is trained to predict how removing "hint" tokens from a hinted MMLU prompt changes the output of the target model (Qwen3-8B). This helps characterize the causal relationships between input components and model behavior.
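To make the setup concrete, the sketch below constructs a hinted MMLU-style prompt together with its ablated counterpart (the same prompt with the hint removed). The prompt template and the `build_prompts` helper are illustrative assumptions, not the exact format used in the paper or the evaluation script.

```python
# Illustrative sketch of the input-ablation setup: build a hinted MMLU-style
# prompt and its ablated counterpart with the hint removed. The template here
# is an assumption; the paper's exact prompt format may differ.

def build_prompts(question: str, choices: list[str], hint: str) -> tuple[str, str]:
    """Return (hinted_prompt, ablated_prompt) for one MMLU-style item."""
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    base = f"{question}\n{options}\nAnswer:"
    hinted = f"Hint: {hint}\n{base}"
    # The ablated prompt simply drops the hint line; the explainer model is
    # trained to predict how this removal changes the target model's output.
    return hinted, base

hinted, ablated = build_prompts(
    "What is 2 + 2?",
    ["3", "4", "5", "6"],
    "A professor believes the answer is B.",
)
```

The explainer's job is then to predict, from the hinted prompt alone, how the target model's answer distribution would shift if it were given the ablated prompt instead.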
To evaluate the explainer model on the input ablation task, you can use the evaluation script provided in the GitHub repository.
```shell
uv run --env-file .env evaluate.py \
    --config config/input_ablation/qwen_qwen_hint.yaml \
    --target_model_path Qwen/Qwen3-8B \
    --task hint_attribution \
    --model_path Transluce/input_ablation_qwen3_8b_qwen3_8b \
    --output_dir /PATH/TO/RESULTS/ \
    --batch_size 64
```
```bibtex
@misc{li2025traininglanguagemodelsexplain,
    title={Training Language Models to Explain Their Own Computations},
    author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
    year={2025},
    eprint={2511.08579},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2511.08579},
}
```