---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
language:
- en
license: mit
pipeline_tag: text-generation
---
# Model Card
This is a Llama-3.1-8B-Instruct model fine-tuned to explain continuous features from Llama-3.1-8B, as described in the paper [Training Language Models to Explain Their Own Computations](https://arxiv.org/abs/2511.08579).
This model was trained to map SAE features from Llama-3.1-8B's residual stream to their explanations derived from Neuronpedia. It generalizes to arbitrary continuous features from Llama-3.1-8B's residual stream.
- **Repository:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
- **Paper:** [https://arxiv.org/abs/2511.08579](https://arxiv.org/abs/2511.08579)
## Usage
Use the code below to get started with the model.
**Note**: This model requires custom handling of continuous tokens. For full functionality, you'll need the custom model classes from [this repository](https://github.com/TransluceAI/introspective-interp.git), which properly embed feature vectors at the `<|reserved_special_token_12|>` positions. The standard transformers library will not handle the continuous token embeddings correctly.
```python
import torch
from transformers import AutoTokenizer

# Custom model class from https://github.com/TransluceAI/introspective-interp
from model.continuous_llama import ContinuousLlama

# Load the model and tokenizer
model_name = "Transluce/features_explain_llama3.1_8b_llama3.1_8b_instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ContinuousLlama.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    special_tokens_ids={
        "begin_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>"),
        "end_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_11|>"),
        "continuous_rep": tokenizer.convert_tokens_to_ids("<|reserved_special_token_12|>"),
    },
)

# Example: explaining a continuous feature from layer 15
layer = 15
feature_vector = torch.randn(4096)  # Feature from Llama-3.1-8B's residual stream

# Format the prompt with continuous tokens
prompt = [{
    "role": "user",
    "content": f"At layer {layer}, <|reserved_special_token_10|><|reserved_special_token_12|><|reserved_special_token_11|> encodes ",
}]
chat_prompt = tokenizer.apply_chat_template(prompt, tokenize=False)

# Tokenize the templated prompt (the chat template already inserts special tokens,
# so don't add them a second time)
inputs = tokenizer(chat_prompt, return_tensors="pt", add_special_tokens=False)

# Continuous token inputs: the feature vector is embedded at the
# <|reserved_special_token_12|> position
continuous_tokens = {
    "inputs_continuous_tokens": feature_vector.unsqueeze(0),  # Add batch dimension
    "labels_continuous_tokens": None,  # Not needed for generation
}

# Generate the explanation
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        **continuous_tokens,
    )

# Decode the explanation
explanation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(explanation)
```
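The random `feature_vector` above is just a placeholder. In practice, a continuous feature comes from the residual stream of Llama-3.1-8B, e.g. as a decoder direction of a trained SAE. The sketch below illustrates that lookup with a toy decoder matrix; the `W_dec` name, its shape convention, and the unit-norm normalization are illustrative assumptions, not the repository's exact preprocessing — see the linked repository for how feature vectors are actually extracted.

```python
import numpy as np

# Toy stand-in for an SAE decoder matrix: in practice W_dec comes from a
# sparse autoencoder trained on Llama-3.1-8B's residual stream (d_model = 4096).
d_model, n_features = 4096, 8
rng = np.random.default_rng(0)
W_dec = rng.standard_normal((n_features, d_model))

# An SAE feature's direction is its decoder row; normalizing gives a
# unit-norm vector of the shape the model expects at the
# <|reserved_special_token_12|> position (normalization is an assumption here).
feature_idx = 3
direction = W_dec[feature_idx]
feature_vector = direction / np.linalg.norm(direction)

print(feature_vector.shape)  # (4096,)
```

A vector obtained this way (converted with `torch.from_numpy(...)`) would take the place of the `torch.randn(4096)` placeholder in the usage example above.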
## Citation
**BibTeX:**
```bibtex
@misc{li2025traininglanguagemodelsexplain,
title={Training Language Models to Explain Their Own Computations},
author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
year={2025},
eprint={2511.08579},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.08579},
}
```