---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
language:
- en
license: mit
pipeline_tag: text-generation
---

# Model Card

This is a Llama-3.1-8B-Instruct model fine-tuned to explain continuous features from Llama-3.1-8B, as described in the paper [Training Language Models to Explain Their Own Computations](https://arxiv.org/abs/2511.08579).

The model was trained to map SAE features from Llama-3.1-8B's residual stream to explanations derived from Neuronpedia. It generalizes to explaining arbitrary continuous features from Llama-3.1-8B's residual stream.

- **Repository:** [https://github.com/TransluceAI/introspective-interp](https://github.com/TransluceAI/introspective-interp)
- **Paper:** [https://arxiv.org/abs/2511.08579](https://arxiv.org/abs/2511.08579)

## Usage

Use the code below to get started with the model.

**Note**: This model requires custom handling of continuous tokens. For full functionality, you'll need the custom model classes from [this repository](https://github.com/TransluceAI/introspective-interp.git), which embed feature vectors at the `<|reserved_special_token_12|>` positions. The standard transformers library will not handle the continuous token embeddings correctly.

```python
import torch
from transformers import AutoTokenizer

# Load the continuous model class (provided by the introspective-interp repository)
from model.continuous_llama import ContinuousLlama

# Load the model and tokenizer
model_name = "Transluce/features_explain_llama3.1_8b_llama3.1_8b_instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ContinuousLlama.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    special_tokens_ids={
        "begin_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_10|>"),
        "end_continuous": tokenizer.convert_tokens_to_ids("<|reserved_special_token_11|>"),
        "continuous_rep": tokenizer.convert_tokens_to_ids("<|reserved_special_token_12|>"),
    },
)

# Example: explaining a continuous feature from layer 15
layer = 15
feature_vector = torch.randn(4096)  # Feature from Llama-3.1-8B's residual stream

# Format the prompt with continuous tokens
prompt = [{
    "role": "user",
    "content": f"At layer {layer}, <|reserved_special_token_10|><|reserved_special_token_12|><|reserved_special_token_11|> encodes "
}]
chat_prompt = tokenizer.apply_chat_template(prompt, tokenize=False)

# Tokenize the formatted chat prompt (the template already includes the BOS token)
inputs = tokenizer(chat_prompt, return_tensors="pt", add_special_tokens=False)

# Create continuous token inputs for the feature vector
continuous_tokens = {
    "inputs_continuous_tokens": feature_vector.unsqueeze(0),  # Add batch dimension
    "labels_continuous_tokens": None,  # Not needed for generation
}

# Generate explanation
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
        **continuous_tokens,
    )

# Decode the explanation
explanation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(explanation)
```

## Citation

**BibTeX:**

```
@misc{li2025traininglanguagemodelsexplain,
  title={Training Language Models to Explain Their Own Computations},
  author={Belinda Z. Li and Zifan Carl Guo and Vincent Huang and Jacob Steinhardt and Jacob Andreas},
  year={2025},
  eprint={2511.08579},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.08579},
}
```
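
## Obtaining a feature vector (illustrative)

The usage example above feeds a random placeholder (`torch.randn(4096)`) into the explainer. The snippet below is a minimal sketch, not taken from the paper or the repository, of one way to obtain a real residual-stream activation from the base `meta-llama/Llama-3.1-8B` model via the standard `output_hidden_states` mechanism. The choice of layer, the last-token pooling, and the lack of any normalization are assumptions made purely for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: illustrative sketch only; layer choice, pooling, and normalization
# are assumptions, not the procedure used in the paper.
base_name = "meta-llama/Llama-3.1-8B"
base_tok = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

layer = 15
text = "The Eiffel Tower is located in Paris."
ids = base_tok(text, return_tensors="pt")

with torch.no_grad():
    out = base_model(**ids, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[layer] is the residual
# stream after `layer` decoder blocks. Take the last token's activation as an
# example feature vector of shape (4096,).
feature_vector = out.hidden_states[layer][0, -1, :]
```

Whatever its source (an SAE decoder direction, a residual-stream activation, a steering vector), the vector you pass as `inputs_continuous_tokens` should correspond to the same layer you name in the prompt.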