LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Abstract
A framework for generating structural counterfactual pairs using LLMs and SCMs enables improved evaluation and analysis of concept-based explanations in high-stakes domains.
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve only as an imperfect proxy. To address this, we introduce LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets), a framework for constructing datasets of structural counterfactual pairs. LIBERTy is grounded in explicitly defined Structural Causal Models (SCMs) of the text-generation process: interventions on a concept propagate through the SCM until an LLM generates the counterfactual text. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
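The intervention-propagation idea above can be sketched with a toy SCM. This is a minimal, hypothetical illustration, not the paper's actual pipeline: the names (`sample_exogenous`, `scm_forward`) are made up for the example, and the LLM generation step is replaced by a deterministic template so the sketch runs standalone.

```python
def sample_exogenous():
    # Exogenous noise, fixed once so the factual and counterfactual
    # texts are generated from the same underlying situation.
    return {"severity": "mild", "age": 54}

def scm_forward(u, do=None):
    """Propagate concepts through a tiny causal graph
    (gender -> symptom -> note); a do-intervention overrides a node."""
    v = {}
    v["gender"] = (do or {}).get("gender", "female")
    # Downstream variable depends on the (possibly intervened) concept.
    v["symptom"] = "chest tightness" if v["gender"] == "female" else "chest pain"
    # Stand-in for the LLM step: render the note from its causal parents.
    v["note"] = (f"{u['age']}-year-old {v['gender']} patient reports "
                 f"{u['severity']} {v['symptom']}.")
    return v

u = sample_exogenous()
factual = scm_forward(u)                                 # observed text
counterfactual = scm_forward(u, do={"gender": "male"})   # do(gender = male)
print(factual["note"])         # 54-year-old female patient reports mild chest tightness.
print(counterfactual["note"])  # 54-year-old male patient reports mild chest pain.
```

The key property the sketch preserves is that only the intervened concept and its causal descendants change, while shared exogenous factors (age, severity) stay fixed across the pair; in LIBERTy the final rendering step is performed by an LLM rather than a template.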
Community
The paper addresses the lack of reliable ground-truth benchmarks for evaluating concept-based explainability in Large Language Models. The authors introduce LIBERTy, a framework that generates "structural counterfactuals" by explicitly defining Structural Causal Models (SCMs) in which an LLM acts as the text-generating component. By intervening on high-level concepts (e.g., gender, disease symptoms) within the SCM and propagating these changes to the LLM's output, the framework creates synthetic yet causally grounded datasets without relying on costly human annotation. The study introduces three such datasets (covering disease detection, CV screening, and workplace violence) and a new metric called "order-faithfulness." Experiments using LIBERTy reveal that while fine-tuned matching methods currently offer the best explanations, there is significant room for improvement, and some proprietary models like GPT-4o exhibit notably low sensitivity to demographic interventions due to safety alignment.
The following similar papers were recommended by the Semantic Scholar API (automated message from the Librarian Bot):
- Compressed Causal Reasoning: Quantization and GraphRAG Effects on Interventional and Counterfactual Accuracy (2025)
- Do LLM Self-Explanations Help Users Predict Model Behavior? Evaluating Counterfactual Simulatability with Pragmatic Perturbations (2026)
- iFlip: Iterative Feedback-driven Counterfactual Example Refinement (2026)
- Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought (2026)
- Can Large Language Models Still Explain Themselves? Investigating the Impact of Quantization on Self-Explanations (2026)
- Explaining the Reasoning of Large Language Models Using Attribution Graphs (2025)
- TimeSAE: Sparse Decoding for Faithful Explanations of Black-Box Time Series Models (2026)