LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Abstract
A framework for generating structural counterfactual pairs using LLMs and SCMs enables improved evaluation and analysis of concept-based explanations in high-stakes domains.
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve only as an imperfect proxy. To address this, we introduce LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets), a framework for constructing datasets of structural counterfactual pairs. LIBERTy is grounded in explicitly defined Structural Causal Models (SCMs) of the text-generation process: interventions on a concept propagate through the SCM until an LLM generates the counterfactual text. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
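The intervention-propagation idea above can be sketched with a toy SCM. This is a minimal, hypothetical illustration, not the paper's actual pipeline: the names (`sample_exogenous`, `scm_forward`) are made up for the example, and the LLM generation step is replaced by a deterministic template so the sketch runs standalone.

```python
def sample_exogenous():
    # Exogenous noise, fixed once so the factual and counterfactual
    # texts are generated from the same underlying situation.
    return {"severity": "mild", "age": 54}

def scm_forward(u, do=None):
    """Propagate concepts through a tiny causal graph
    (gender -> symptom -> note); a do-intervention overrides a node."""
    v = {}
    v["gender"] = (do or {}).get("gender", "female")
    # Downstream variable depends on the (possibly intervened) concept.
    v["symptom"] = "chest tightness" if v["gender"] == "female" else "chest pain"
    # Stand-in for the LLM step: render the note from its causal parents.
    v["note"] = (f"{u['age']}-year-old {v['gender']} patient reports "
                 f"{u['severity']} {v['symptom']}.")
    return v

u = sample_exogenous()
factual = scm_forward(u)                                 # observed text
counterfactual = scm_forward(u, do={"gender": "male"})   # do(gender = male)
print(factual["note"])         # 54-year-old female patient reports mild chest tightness.
print(counterfactual["note"])  # 54-year-old male patient reports mild chest pain.
```

The key property the sketch preserves is that only the intervened concept and its causal descendants change, while shared exogenous factors (age, severity) stay fixed across the pair; in LIBERTy the final rendering step is performed by an LLM rather than a template.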
Community
The paper addresses the lack of reliable ground-truth benchmarks for evaluating concept-based explainability in Large Language Models. The authors introduce LIBERTy, a framework that generates "structural counterfactuals" by explicitly defining Structural Causal Models (SCMs) in which an LLM acts as the text-generating component. By intervening on high-level concepts (e.g., gender, disease symptoms) within the SCM and propagating these changes to the LLM's output, the framework creates synthetic yet causally grounded datasets without relying on costly human annotation. The study introduces three such datasets (covering disease detection, CV screening, and workplace violence) and a new metric called "order-faithfulness." Experiments using LIBERTy reveal that while fine-tuned matching methods currently offer the best explanations, there is significant room for improvement, and some proprietary models like GPT-4o exhibit notably low sensitivity to demographic interventions due to safety alignment.
The following similar papers were recommended by the Semantic Scholar API (automated message from the Librarian Bot):
- Compressed Causal Reasoning: Quantization and GraphRAG Effects on Interventional and Counterfactual Accuracy (2025)
- Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations (2026)
- iFlip: Iterative Feedback-driven Counterfactual Example Refinement (2026)
- Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought (2026)
- Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations (2026)
- Explaining the Reasoning of Large Language Models Using Attribution Graphs (2025)
- TimeSAE: Sparse Decoding for Faithful Explanations of Black-Box Time Series Models (2026)