---
library_name: transformers
license: apache-2.0
pipeline_tag: text-classification
tags:
- text-generation
- interpretable-ai
- concept-bottleneck
- llm
---
# Concept Bottleneck Large Language Models
This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992), accepted at ICLR 2025.
- **Paper:** [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992)
- **Project Page:** [https://lilywenglab.github.io/CB-LLMs/](https://lilywenglab.github.io/CB-LLMs/)
- **Code:** [https://github.com/Trustworthy-ML-Lab/CB-LLMs](https://github.com/Trustworthy-ML-Lab/CB-LLMs)
## Abstract
We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models.
## Usage
For detailed installation instructions, training procedures, and various usage examples (including how to test concept detection, steerability, and generate sentences), please refer to the [official GitHub repository](https://github.com/Trustworthy-ML-Lab/CB-LLMs).
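As a minimal sketch only: the repo ID below is a placeholder, and the concept-bottleneck heads, concept sets, and generation/steering scripts are defined in the GitHub repository rather than in the standard `transformers` API. A classification checkpoint that is compatible with `AutoModelForSequenceClassification` could be queried roughly like this:

```python
# Hypothetical usage sketch -- adjust the repo ID and model class to the actual checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Trustworthy-ML-Lab/CB-LLM"  # placeholder; see the GitHub repo for the real checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Classify a single sentence (e.g., SST2-style sentiment).
inputs = tokenizer("The movie was a delightful surprise.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print("Predicted class id:", logits.argmax(dim=-1).item())
```

To inspect the interpretable concept activations behind each prediction, or to run controlled generation, use the scripts in the official repository, which define the concept bottleneck layer and the per-dataset concept sets.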
## Key Results
### Part I: CB-LLM (classification)
CB-LLMs are competitive with the black-box model after applying Automatic Concept Correction (ACC).
| Accuracy ↑ | SST2 | YelpP | AGnews | DBpedia |
|-----------------------|--------|---------|---------|----------|
| **Ours:** | | | | |
| CB-LLM | 0.9012 | 0.9312 | 0.9009 | 0.9831 |
| CB-LLM w/ ACC | **0.9407** | **0.9806** | **0.9453** | **0.9928** |
| **Baselines:** | | | | |
| TBM&C³M | 0.9270 | 0.9534 | 0.8972 | 0.9843 |
| RoBERTa-base fine-tuned (black-box) | 0.9462 | 0.9778 | 0.9508 | 0.9917 |
### Part II: CB-LLM (generation)
Accuracy, steerability, and perplexity of CB-LLM (generation). CB-LLM performs well on accuracy (↑) and perplexity (↓) while providing much higher steerability (↑) than the baselines.
| Method | Metric | SST2 | YelpP | AGnews | DBpedia |
|---------------------------------|------------------|---------|--------|---------|---------|
| **CB-LLM (Ours)** | Accuracy↑ | 0.9638 | **0.9855** | 0.9439 | 0.9924 |
| | Steerability↑ | **0.82** | **0.95** | **0.85** | **0.76** |
| | Perplexity↓ | 116.22 | 13.03 | 18.25 | 37.59 |
| **CB-LLM w/o ADV training** | Accuracy↑ | 0.9676 | 0.9830 | 0.9418 | **0.9934** |
| | Steerability↑ | 0.57 | 0.69 | 0.52 | 0.21 |
| | Perplexity↓ | **59.19** | 12.39 | 17.93 | **35.13** |
| **Llama3 fine-tuned (black-box)** | Accuracy↑ | **0.9692** | 0.9851 | **0.9493** | 0.9919 |
| | Steerability↑ | No | No | No | No |
| | Perplexity↓ | 84.70 | **6.62** | **12.52** | 41.50 |
## Citation
If you find this work useful, please cite the paper:
```bibtex
@inproceedings{cbllm,
  title={Concept Bottleneck Large Language Models},
  author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}
```