Transluce/features_explain_llama3.1_8b_simulator


	---
	license: mit
	language:
	- en
	base_model:
	- meta-llama/Llama-3.1-8B-Instruct
	---

	# Model Card

	This is a simulator model used to score candidate natural-language explanations of internal features in Llama-3.1-8B. Given:

	- an input text sequence `x` (tokenized),
	- a candidate explanation `E` (e.g., “encodes city names”),

	the simulator predicts a token-level activation score for each position in `x`, indicating where the described feature should activate. Comparing these simulated activations with the target feature’s true activations yields a score for the explanation: the correlation between the two (the "simulator score", or correlation objective, described in [the paper](https://arxiv.org/abs/2511.08579)).
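The scoring step can be illustrated with a minimal sketch. It assumes the correlation is a plain Pearson correlation over the tokens of one sequence; the exact definition in the paper may differ. The function name `simulator_score`, the example activation values, and the choice to return `0.0` for constant inputs are all hypothetical and are not part of the repository's API.

```python
import numpy as np


def simulator_score(simulated, true):
    """Pearson correlation between simulated and true per-token activations.

    Illustrative only: this helper is not part of the released code.
    """
    simulated = np.asarray(simulated, dtype=float)
    true = np.asarray(true, dtype=float)
    if simulated.shape != true.shape:
        raise ValueError("activation sequences must have the same shape")
    # Pearson correlation is undefined when either sequence is constant;
    # returning 0.0 in that case is an assumption made for this sketch.
    if simulated.std() == 0 or true.std() == 0:
        return 0.0
    return float(np.corrcoef(simulated, true)[0, 1])


# Made-up activations for a 5-token sequence.
true_acts = [0.0, 0.1, 2.3, 0.0, 1.9]
good_sim = [0.0, 0.0, 1.0, 0.0, 1.0]  # fires on the same tokens as the feature
bad_sim = [1.0, 1.0, 0.0, 1.0, 0.0]   # fires on the opposite tokens

print(simulator_score(good_sim, true_acts))
print(simulator_score(bad_sim, true_acts))
```

An explanation whose simulated activations peak on the same tokens as the real feature scores close to 1, while one that fires on the wrong tokens scores low or negative.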

	---
	## Usage

	Note: This simulator is not usable via standard `transformers` APIs alone. You must first clone and install [our repository](https://github.com/TransluceAI/introspective-interp/tree/main#), which provides the custom simulator wrapper and scoring utilities.


	```python
	from observatory_utils.simulator import FinetunedSimulator

	simulator_device_idx = 0  # index of the GPU to load the simulator on
	cache_dir = None  # optional directory for caching downloaded weights

	simulator = FinetunedSimulator.setup(
	    model_path="Transluce/features_explain_llama3.1_8b_simulator",
	    add_special_tokens=True,
	    gpu_idx=simulator_device_idx,
	    tokenizer_path="meta-llama/Llama-3.1-8B",
	    cache_dir=cache_dir,
	)
	```