	---
	base_model:
	- meta-llama/Llama-3.1-8B-Instruct
	language:
	- en
	license: mit
	library_name: transformers
	pipeline_tag: text-generation
	---

	# Model Card

	This is a simulator model used to score candidate natural-language explanations of internal features in Llama-3.1-8B. It was introduced in the paper [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579).

	Given:
	- an input text sequence `x` (tokenized),
	- a candidate explanation `E` (e.g., “encodes city names”),

	the simulator predicts where the described feature should activate in the sequence, as token-level activation scores. Comparing these simulated activations with the target feature's true activations gives a score for the explanation: the correlation between the two (the "simulator score" / correlation objective described in the paper).
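The scoring step can be illustrated with a minimal sketch. It assumes the correlation is a plain Pearson correlation computed over the tokens of one sequence; the paper and repository define the exact objective, and the helper `pearson_correlation` and the activation values below are hypothetical, not part of the released code.

```python
import math
from typing import Sequence


def pearson_correlation(simulated: Sequence[float], true: Sequence[float]) -> float:
    """Pearson correlation between two equal-length per-token activation lists.

    Hypothetical helper for illustration; not the repository's scoring utility.
    """
    if len(simulated) != len(true):
        raise ValueError("activation sequences must have the same length")
    n = len(simulated)
    if n < 2:
        raise ValueError("need at least two tokens to compute a correlation")
    mean_s = sum(simulated) / n
    mean_t = sum(true) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(simulated, true))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_s == 0 or var_t == 0:
        # A constant sequence has no defined correlation; this sketch returns 0.
        return 0.0
    return cov / math.sqrt(var_s * var_t)


# Made-up activations for a five-token sequence (illustrative only).
true_acts = [0.0, 0.1, 2.3, 0.0, 1.8]
good_sim = [0.0, 0.0, 1.0, 0.0, 1.0]  # fires on the same tokens as the feature
bad_sim = [1.0, 0.0, 0.0, 1.0, 0.0]   # fires on the wrong tokens

print(pearson_correlation(good_sim, true_acts))
print(pearson_correlation(bad_sim, true_acts))
```

Under this assumption, an explanation whose simulated activations rise and fall with the true activations scores close to 1, while one that fires on the wrong tokens scores near zero or negative.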

	- Code: [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
	- Paper: [Training Language Models to Explain Their Own Computations](https://huggingface.co/papers/2511.08579)

	---
	## Usage

	Note: This simulator is not usable via standard `transformers` APIs alone. You must first clone and install [the repository](https://github.com/TransluceAI/introspective-interp/tree/main#), which provides the custom simulator wrapper and scoring utilities.

	```python
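	# Requires the introspective-interp repository to be installed (see note above).
	from observatory_utils.simulator import FinetunedSimulator

	simulator = FinetunedSimulator.setup(
	    model_path="Transluce/features_explain_llama3.1_8b_simulator",  # simulator weights on the Hub
	    add_special_tokens=True,
	    gpu_idx=0,  # index of the GPU to load the model on
	    tokenizer_path="meta-llama/Llama-3.1-8B",
	)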
	```

	## Citation

	```bibtex
	@misc{li2025traininglanguagemodelsexplain,
	    title={Training Language Models to Explain Their Own Computations},
	    author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
	    year={2025},
	    eprint={2511.08579},
	    archivePrefix={arXiv},
	    primaryClass={cs.CL},
	    url={https://arxiv.org/abs/2511.08579},
	}
	```