🪄 Interpreto: A Unified Toolkit for Interpretability of Transformer Models
Understanding why transformer models make certain predictions is becoming as important as achieving high performance. As NLP systems are increasingly deployed in sensitive and high-stakes settings, explainability is no longer optional: it is a requirement for trust, fairness, and debugging.
Interpreto is a new open-source library designed to make explainability for NLP models practical, modular, and unified, covering both attribution-based and concept-based explanations for classification and generation models.
While existing explainability libraries often focus on a single paradigm or task, Interpreto was designed with a broader goal:
- 🔍 Support multiple explanation families
- 🧩 Work seamlessly with Hugging Face transformers
- 🔄 Apply to both classifiers and generative models
- 🧪 Provide evaluation tools for explanations
- 🛠 Be extensible and research-friendly
Installation is straightforward:
pip install interpreto
- 📦 GitHub: https://github.com/FOR-sight-ai/interpreto
- 📚 Docs: https://for-sight-ai.github.io/interpreto/
- 📄 Paper: https://arxiv.org/abs/2512.09730
A practical rule-of-thumb:
Use attributions when you need…
- fast, per-example debugging
- token-level evidence (spurious words, shortcuts)
- explanations you can show to non-technical stakeholders
Use concept methods when you need…
- representation-level analysis (“what features exist inside the model?”)
- reusable semantic factors across many examples
- deeper inspection of internal mechanisms
Interpreto is explicitly designed to offer both families as complementary tools.
In the following parts, we detail attribution-based and concept-based methods.
Part I — Attribution methods
Attribution methods answer:
Which tokens (or spans) influenced the model output the most?
Interpreto implements two subfamilies of attribution:
- Inference / perturbation-based methods (query the model on modified inputs and observe how the outputs change)
- Gradient-based methods (backprop through the model)
Interpreto’s attribution tutorial notebook covers both classification and generation use cases.
1) The Interpreto attribution API
In Interpreto, as in other attribution libraries, attribution methods are called explainers. You first instantiate an explainer with the model, tokenizer, and method parameters, then compute attributions on a batch of inputs.
The classic HuggingFace loading:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load your Hugging Face model and tokenizer
repo_id = "textattack/bert-base-uncased-imdb"
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
Example of interpreto attribution on a classification model:
from interpreto import Lime, AttributionVisualization
# Instantiate an explainer
explainer = Lime(model, tokenizer)
# Compute attributions on a batch of inputs
attributions = explainer("What a great example.")
# Visualize the attributions
AttributionVisualization(attributions[0]).display()
🔍 Interpretation: IMDB is a classification task between negative and positive movie reviews. The predicted class here is positive, hence the attribution method highlights the parts of the text that justify this positive prediction.
For generation it is the exact same API, but here is an example anyway:
from transformers import AutoTokenizer, AutoModelForCausalLM
from interpreto import Occlusion, AttributionVisualization
# Load the model and tokenizer
repo_id = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# Instantiate the explainer
explainer = Occlusion(model, tokenizer)
# Compute the explanation
attributions = explainer.explain(
    "An interpretability open-source library is great, but",
    " built-in for HuggingFace language models, it is even better.",
)
# Visualize the explanation
AttributionVisualization(attributions[0]).display()
🔍 Interpretation: Each predicted token depends on previous ones. In particular, "models" was predicted mainly due to the previous "language" token.
2) Inference-based attribution methods
These methods work by generating perturbed versions of the input and observing how the model output changes.
Interpreto currently provides the following inference-based attribution methods:
- Occlusion: removes or masks tokens one at a time and measures the impact on the output.
- LIME: fits a local linear surrogate model on randomly perturbed inputs.
- KernelSHAP: estimates Shapley-like token contributions via weighted regression.
- Sobol: performs variance-based sensitivity analysis, capturing main effects and interactions.
All these methods rely on explicit perturbations and do not require gradient access, making them compatible with a wide range of models.
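All inference-based explainers share the constructor and call pattern shown earlier with Lime. A minimal sketch, reusing the classification model and tokenizer loaded above (Lime and Occlusion are the only class names confirmed by the examples in this post; the exact KernelSHAP and Sobol class names may differ, check the documentation):
# Swapping inference-based explainers is a one-line change.
# Only Lime and Occlusion appear in the examples above; other class names are not confirmed here.
from interpreto import Lime, Occlusion

for explainer_cls in (Lime, Occlusion):
    explainer = explainer_cls(model, tokenizer)        # same constructor for every explainer
    attributions = explainer("What a great example.")  # same call on a (batch of) input(s)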
3) Gradient-based attribution methods
Gradient methods compute token importance via derivatives of the chosen output with respect to inputs/embeddings.
Interpreto implements the following gradient-based methods:
- Saliency: raw gradients of the output with respect to input embeddings.
- Integrated Gradients: integrates gradients along a path from a baseline input to the actual input.
- GradientSHAP: approximates Shapley values using stochastic gradient-based sampling.
- SmoothGrad: averages saliency maps over noisy versions of the input to reduce noise.
- SquareGrad: squares gradients before aggregation to emphasize strong sensitivities.
- VarGrad: uses gradient variance across noisy samples as an importance signal.
These methods are generally faster than perturbation-based ones but rely on differentiability and baseline choices.
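The gradient-based explainers follow the same interface. A hedged sketch; the IntegratedGradients class name below mirrors the list above but is an assumption, so check the documentation for the exact identifier:
# Hedged sketch: the IntegratedGradients class name is assumed to match the method list above.
from interpreto import IntegratedGradients, AttributionVisualization

explainer = IntegratedGradients(model, tokenizer)  # gradient methods need a differentiable model
attributions = explainer("What a great example.")
AttributionVisualization(attributions[0]).display()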
4) Evaluation metrics
Interpreto also provides faithfulness metrics to assess the quality of attribution maps, rather than relying only on visual inspection.
- Deletion: tokens are progressively removed from most important to least important; a good attribution should cause the model’s confidence to drop quickly.
- Insertion: tokens are progressively added back from most important to least important; a good attribution should allow the model to recover its original prediction rapidly.
- AOPC (Area Over the Perturbation Curve): summarizes the overall impact of token removal or insertion across multiple steps into a single scalar score.
These metrics make it possible to compare attribution methods quantitatively and check whether highlighted tokens are truly influential for the model.
from interpreto.attribution.metrics import Deletion
model, tokenizer, attributions = ... # refer to previous examples but use more samples
# Instantiate the metric
metric = Deletion(model, tokenizer)
# Compute the metric
auc, details = metric(attributions)
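The same metric object can score several attribution maps on identical inputs, which is the typical way to choose between explainers. A small sketch, assuming lime_attributions and occlusion_attributions were computed as in the earlier examples:
from interpreto.attribution.metrics import Deletion

# lime_attributions / occlusion_attributions: outputs of two explainers on the same inputs (assumed computed above)
metric = Deletion(model, tokenizer)
for name, attrs in {"LIME": lime_attributions, "Occlusion": occlusion_attributions}.items():
    auc, details = metric(attrs)
    print(name, auc)  # lower deletion AUC = confidence drops faster = more faithful attribution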
Part II — Concept-based explanations
Concept-based explanations are part of the mechanistic interpretability field and aim to answer:
What higher-level directions / features exist inside the model’s hidden space, and how do they affect outputs?
1) Toy example to understand what concepts are
Let's imagine a binary classifier between the two occupations nurse and surgeon. The model takes in a small biography and predicts the occupation. (Yes, it is more or less the BIOS dataset.)
For the model, the concepts related to each class could be (global explanation):
nurse: "care", "hospital", and "feminine".surgeon: "surgery", "hospital", and "masculibe".
🔍 Interpretation: The model seems to differentiate between the two classes through the notions of "care", "surgery", and gender.
The local explanation for the sentence "He helps after surgery" would be: three concepts are detected, "masculine", "care", and "surgery". Two of them are related to the surgeon class, thus the model predicts surgeon.
2) Key steps to obtain concept-based explanations
In Interpreto, the concept-based methods are post-hoc and unsupervised, which corresponds to the dictionary-learning family of methods. The addition of probing (supervised) methods is in progress.
To obtain such explanations, there are four steps (we detail how to do them below):
- Split your model into two parts, from inputs (I) to activations (A), and from activations to outputs (O). Then construct a dataset of activations by passing your dataset of inputs through the first part of the model.
- Find/learn patterns in the dataset of activations. These patterns are called concepts (C). In most methods, they correspond to dimensions in the activation space. This also defines the concept encoder going from A to C and the decoder going from C to A.
- Interpret the concepts by mapping them to human-understandable notions. The goal is to allow humans to make the link between the inputs and the concepts.
- Score the concepts' contribution to the output. To do so, we mainly use attribution methods on the (C -> A -> O) function. The goal is to understand the link between the concepts and the outputs.
🚀 If you understand how to go from inputs to concepts and from concepts to outputs, then you understand how your model makes its predictions.
Step 1: Split the model and extract activations
To split your model and extract activations, Interpreto provides the ModelWithSplitPoints wrapper around your Hugging Face model; our concept-based explainers build on it. You go from your (I -> O) model to a model_with_split_points (I -> A -> O). This class is built on top of the nnsight library.
Running example for concept-based explanation on a generation model. Concepts are learned on a dataset of 100 news headlines from the AG News dataset.
import datasets
from transformers import AutoModelForCausalLM
from interpreto import ModelWithSplitPoints
TOKEN = ModelWithSplitPoints.activation_granularities.TOKEN
# 1.1 Split your model in two parts
model_with_split_points = ModelWithSplitPoints(
    "Qwen/Qwen3-0.6B",
    automodel=AutoModelForCausalLM,
    split_points=[5],  # split at the sixth layer
)
# 1.2 Compute a dataset of activations
dataset = datasets.load_dataset("fancyzhx/ag_news")["train"]["text"][:100] # you should include as many examples as you can
activations = model_with_split_points.get_activations(dataset, TOKEN)
Step 2: Learn a concept space (concept models)
Interpreto supports many concept-based explainers by wrapping the overcomplete library.
The concept explainer is basically the link between a model_with_split_points (I -> A -> O) and the concept_model, with its encoder (A -> C) and decoder (C -> A*).
from interpreto.concepts import SemiNMFConcepts
# 2 Instantiate and train the concept model
explainer = SemiNMFConcepts(model_with_split_points, nb_concepts=20)  # generative models usually need many more concepts
explainer.fit(activations)
The following concept decomposition methods are currently available in Interpreto:
- Vanilla SAE: learns sparse latent directions via reconstruction with an L1 sparsity constraint.
- JumpReLU SAE: variant of SAE using JumpReLU activations to enforce stronger sparsity and interpretability.
- TopK SAE: sparse autoencoder where only the top-K activations are kept per input.
- BatchTopK SAE: TopK sparse autoencoder where sparsity is enforced at the batch level rather than per sample.
- KMeans: clusters activations into discrete groups, each cluster center acting as a concept.
- SVD: singular value decomposition extracting orthogonal directions that capture maximal variance.
- PCA: linear dimensionality reduction used as a simple, non-sparse baseline.
- ICA: independent component analysis to extract statistically independent latent directions.
- NMF: non-negative matrix factorization enforcing additive, parts-based representations.
- Semi-NMF: NMF variant allowing mixed-sign activations while keeping non-negative concept bases.
- ConvexNMF: NMF variant where concepts are constrained to be convex combinations of input activations.
- Neurons as Concepts: directly treats individual neurons or embedding dimensions as concepts.
- Optimization-based decompositions (optim): generic optimization framework for learning concept directions under custom objectives.
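All of these decompositions plug into the same fit interface shown in Step 2, so trying another concept space is a one-line change. A hedged sketch; only SemiNMFConcepts is confirmed above, and the sibling class names are assumptions to verify in the documentation:
from interpreto.concepts import SemiNMFConcepts  # assumed siblings: e.g. PCAConcepts, KMeansConcepts

# Swap the class to change the decomposition; the training call stays identical.
explainer = SemiNMFConcepts(model_with_split_points, nb_concepts=20)
explainer.fit(activations)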
Step 3: Interpret concepts
Once you have a concept space, you still need a way to answer:
- “What does concept #17 correspond to?”
- “What inputs activate it the most?”
Interpreto exposes concept interpretation utilities whose role is to map concept indices to human-readable artifacts. They pass inputs through the (I -> A -> C) function and look at which inputs activate each concept.
The library currently provides:
- Top-k tokens from the tokenizer vocabulary via TopKInputs with use_vocab=True
- Top-k tokens/words/sentences/samples from specific datasets via TopKInputs
- Label concepts via LLMs with LLMLabels (summarizes top-k inputs in context).
import os
from interpreto.concepts.interpretations import LLMLabels
from interpreto.model_wrapping.llm_interface import OpenAILLM
# 3. Interpret the concepts
interpreter = LLMLabels(
    concept_explainer=explainer,
    activation_granularity=TOKEN,
    llm_interface=OpenAILLM(os.getenv("OPENAI_API_KEY")),
    k_examples=30,
    k_context=10,
    system_prompt=SYSTEM_PROMPT,  # define SYSTEM_PROMPT yourself: instructions for the labeling LLM
)
interpretations = interpreter.interpret("all", dataset, activations)
print("\n".join(interpretations.values()))
Output (one label per concept):
Specific location or organization names
Country or region names
Proper nouns or abbreviations indicating specific entities.
Entity labels indicating countries and regions
Financial and corporate entity references
Proper nouns or initialisms in headlines
Abbreviations and initialisms
Proper nouns indicating countries, companies, or personal names
Country or region names
References to numbers or measurements
Prepositions and simple nouns indicating relationships or locations
Proper nouns or specific terms.
Use of specific proper nouns and generic noun phrases.
Proper Noun Entity Recognition
Proper noun abbreviation prefixes
Political entities and references
Proper nouns and abbreviations
Political entities or governments
Proper noun prefixes or initials
Abbreviations of proper nouns
Step 4: Estimate concept importance on the output
With concepts and their interpretations, we still do not know:
- “Which concepts is class A built on?”
- “Which concepts impacted this prediction?”
To do so, we compute gradients between the concept space and the model output (C -> A -> O). In particular, we compute the gradients multiplied by the concept activations.
This is the conceptual equivalent of token attributions, but in concept space rather than input space. We plan to add other attribution methods for this functionality later.
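Interpreto handles this step for you; the snippet below is only a conceptual sketch of gradient-times-activation in plain PyTorch, with hypothetical decode and head functions standing in for the (C -> A) and (A -> O) parts of the split model, not the library's actual API:
import torch

# Conceptual sketch only. Hypothetical pieces:
#   concept_acts: concept activations for one example, shape (nb_concepts,)
#   decode:       concept decoder, C -> A
#   head:         remainder of the model, A -> O (logits)
#   target_class: index of the output of interest
concept_acts = concept_acts.detach().requires_grad_(True)
logits = head(decode(concept_acts))
logits[target_class].backward()

# Gradient x activation: how much each concept pushed the chosen output.
concept_importance = concept_acts.grad * concept_acts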
Evaluation Metrics for Concept-Based Explanations
A global, end-to-end evaluation is provided through ConSim, which evaluates the explanation pipeline as a whole.
In addition, Interpreto provides metrics that specifically target the concept decomposition (dictionary learning) stage:
- Faithfulness: evaluates how well the learned concepts reconstruct the original activations, using metrics such as MSE, FID, and ReconstructionError.
- Complexity: evaluates the concept-space complexity, mainly through the Sparsity metric.
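As a conceptual sketch of what these metrics measure (not Interpreto's API), faithfulness and complexity can be approximated directly from a concept model's encoder and decoder, written here as hypothetical encode and decode functions:
import torch

# Conceptual sketch only. encode (A -> C) and decode (C -> A) stand for the learned
# concept model; activations is a (n_samples, d_model) tensor of hidden states.
concepts = encode(activations)
reconstruction = decode(concepts)

mse = torch.mean((activations - reconstruction) ** 2)   # faithfulness: reconstruction error
sparsity = (concepts.abs() > 1e-6).float().mean()       # complexity: fraction of active concept units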


