arxiv:2606.05972

LLM Explainability with Counterfactual Chains and Causal Graphs

Published on Jun 4 · Submitted by Nitay Calderon on Jun 8

Abstract

Causal graphs are used to model large language model inference, giving a transparent view of how a model perceives and organizes high-level concepts when making a prediction. The four-phase method includes concept discovery, concept-state mapping, and MCMC-inspired counterfactual augmentation.

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with σ-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with LLMs' reasoning. Together, this paper provides a foundation for concept-level explainability of LLMs.

Community

While recent work often uses LLMs to extract causal graphs of the external world, we flip the approach: we use causal graphs to model LLM inference itself. This provides a transparent view of how models perceive, organize, and connect high-level concepts to make a prediction.

Our approach in brief:

  • Concept Mapping: Discovers human-interpretable concepts and maps inputs to LLM-perceived concept states.
  • MCMC-Inspired Augmentation: Generates chains of counterfactuals to expand sparse observational data.
  • Causal Discovery: Runs σ-CG on this enriched data to yield stable, informative causal graphs.
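The augmentation step above can be illustrated with a toy sketch. It is not the paper's procedure: it assumes concept states are binary tuples, treats a counterfactual as a single concept flip, and uses a hypothetical `toy_predict` function in place of querying the target LLM. The chain is a plain random walk with no acceptance step, so it is only loosely analogous to MCMC.

```python
import random
from typing import Callable, List, Tuple

# Assumption: one binary value per concept (hypothetical encoding).
State = Tuple[int, ...]


def counterfactual_chain(
    start: State,
    predict: Callable[[State], int],
    steps: int,
    rng: random.Random,
) -> List[Tuple[State, int]]:
    """Random-walk chain: each step flips one concept and records the prediction."""
    chain = [(start, predict(start))]
    state = start
    for _ in range(steps):
        i = rng.randrange(len(state))  # pick one concept to intervene on
        state = state[:i] + (1 - state[i],) + state[i + 1:]  # flip it
        chain.append((state, predict(state)))
    return chain


def toy_predict(state: State) -> int:
    # Hypothetical stand-in for the target LLM's prediction on a counterfactual.
    return int(state[0] == 1 and state[2] == 1)


chain = counterfactual_chain((1, 0, 0), toy_predict, steps=20, rng=random.Random(0))
augmented = sorted(set(chain))  # deduplicated (concept state, prediction) rows
```

Each row in `augmented` pairs a concept state with a prediction, which is the kind of table a causal discovery algorithm would take as input. In the paper's setting the counterfactuals are chains built from textual examples rather than direct flips of a state vector.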

Does it work?
We evaluated this across three LLMs on three diverse tasks: disease diagnosis, sentiment analysis, and LLM-as-a-judge classification.

  • The learned graphs showed high predictive fidelity and structural stability.
  • They capture meaningful dependencies that are consistent with the LLMs' reasoning.
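One way to make the two evaluation criteria concrete is sketched below. The definitions are assumptions for illustration, not the paper's metrics: fidelity is taken as the agreement rate between predictions derived from a graph and the LLM's own predictions, and stability as the mean pairwise Jaccard similarity of edge sets learned in repeated runs. The concept names and values are made up.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Tuple

Edge = Tuple[str, str]  # (cause, effect); hypothetical concept names


def predictive_fidelity(graph_preds: Sequence[int], llm_preds: Sequence[int]) -> float:
    """Fraction of examples where the graph-based prediction matches the LLM's."""
    if len(graph_preds) != len(llm_preds) or not llm_preds:
        raise ValueError("prediction lists must be non-empty and of equal length")
    return sum(g == l for g, l in zip(graph_preds, llm_preds)) / len(llm_preds)


def structural_stability(edge_sets: List[FrozenSet[Edge]]) -> float:
    """Mean pairwise Jaccard similarity between edge sets from repeated runs."""
    pairs = list(combinations(edge_sets, 2))
    if not pairs:
        raise ValueError("need at least two learned graphs")

    def jaccard(a: FrozenSet[Edge], b: FrozenSet[Edge]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up inputs: three runs of causal discovery and four test predictions.
runs = [
    frozenset({("fever", "flu"), ("cough", "flu")}),
    frozenset({("fever", "flu")}),
    frozenset({("fever", "flu"), ("cough", "flu")}),
]
fidelity = predictive_fidelity([1, 0, 1, 1], [1, 0, 0, 1])
stability = structural_stability(runs)
```

With these inputs, `fidelity` is 0.75 and `stability` is about 0.67, since two of the three run pairs share only one of two edges.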


