---
library_name: transformers
license: apache-2.0
pipeline_tag: text-classification
tags:
- text-generation
- interpretable-ai
- concept-bottleneck
- llm
---

# Concept Bottleneck Large Language Models

This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992), accepted at ICLR 2025.

- **Paper:** [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992)
- **Project Page:** [https://lilywenglab.github.io/CB-LLMs/](https://lilywenglab.github.io/CB-LLMs/)
- **Code:** [https://github.com/Trustworthy-ML-Lab/CB-LLMs](https://github.com/Trustworthy-ML-Lab/CB-LLMs)

## Abstract

We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models.

## Usage

For detailed installation instructions, training procedures, and usage examples (including how to test concept detection, steerability, and sentence generation), please refer to the [official GitHub repository](https://github.com/Trustworthy-ML-Lab/CB-LLMs).
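As a conceptual illustration of the classification architecture (not the official API -- all dimensions, concept names, and weights below are made up; see the repository for the real implementation), a concept bottleneck head maps a backbone embedding to sigmoid scores over named concepts, and the class logits are a linear function of only those scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 16-dim "backbone" embedding, 4 human-readable
# concepts, and 2 output classes (e.g. negative/positive sentiment).
d_model, n_concepts, n_classes = 16, 4, 2
concepts = ["good service", "tasty food", "long wait", "rude staff"]

W_cbl = rng.normal(size=(n_concepts, d_model))    # concept bottleneck layer
W_out = rng.normal(size=(n_classes, n_concepts))  # final linear predictor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(embedding):
    # Concept activations are the interpretable bottleneck: each entry
    # scores one named concept, and the class logits are a linear
    # combination of ONLY these concept scores.
    concept_scores = sigmoid(W_cbl @ embedding)
    logits = W_out @ concept_scores
    return concept_scores, logits

x = rng.normal(size=d_model)  # stand-in for a sentence embedding
scores, logits = predict(x)

# Every prediction comes with an explicit concept-level explanation.
for name, s in zip(concepts, scores):
    print(f"{name}: {s:.2f}")
print("predicted class:", int(np.argmax(logits)))
```

Because the final layer sees nothing but the concept scores, the explanation is faithful by construction rather than a post-hoc approximation.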

## Key Results

### Part I: CB-LLM (classification)

CB-LLMs are competitive with the black-box model after applying Automatic Concept Correction (ACC).

| Accuracy ↑ | SST2 | YelpP | AGnews | DBpedia |
|-----------------------|--------|---------|---------|----------|
| **Ours:** | | | | |
| CB-LLM | 0.9012 | 0.9312 | 0.9009 | 0.9831 |
| CB-LLM w/ ACC | **0.9407** | **<span style="color:blue">0.9806</span>** | **0.9453** | **<span style="color:blue">0.9928</span>** |
| **Baselines:** | | | | |
| TBM&C³M | 0.9270 | 0.9534 | 0.8972 | 0.9843 |
| RoBERTa-base fine-tuned (black-box) | 0.9462 | 0.9778 | 0.9508 | 0.9917 |

### Part II: CB-LLM (generation)

Accuracy, steerability, and perplexity of CB-LLMs for generation. CB-LLMs perform well on accuracy (↑) and perplexity (↓) while providing substantially higher steerability (↑).

| Method | Metric | SST2 | YelpP | AGnews | DBpedia |
|---------------------------------|------------------|---------|--------|---------|---------|
| **CB-LLM (Ours)** | Accuracy↑ | 0.9638 | **0.9855** | 0.9439 | 0.9924 |
| | Steerability↑ | **0.82** | **0.95** | **0.85** | **0.76** |
| | Perplexity↓ | 116.22 | 13.03 | 18.25 | 37.59 |
| **CB-LLM w/o ADV training** | Accuracy↑ | 0.9676 | 0.9830 | 0.9418 | **0.9934** |
| | Steerability↑ | 0.57 | 0.69 | 0.52 | 0.21 |
| | Perplexity↓ | **59.19** | 12.39 | 17.93 | **35.13** |
| **Llama3 finetuned (black-box)**| Accuracy↑ | **0.9692** | 0.9851 | **0.9493** | 0.9919 |
| | Steerability↑ | No | No | No | No |
| | Perplexity↓ | 84.70 | **6.62** | **12.52** | 41.50 |
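The steerability gap above comes from intervening on interpretable concept neurons at generation time. As a minimal sketch of that idea (purely illustrative -- the dimensions, concept names, and the `steer`/`value` parameters are invented here, not the paper's API), forcing one concept activation on before the unembedding shifts the next-token distribution toward that concept:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy concept-bottleneck generation head: hidden state -> concept
# activations -> vocabulary logits. All sizes/names are illustrative.
d_model, n_concepts, vocab = 8, 3, 5
concepts = ["positive", "negative", "neutral"]

W_cbl = rng.normal(size=(n_concepts, d_model))   # concept bottleneck layer
W_vocab = rng.normal(size=(vocab, n_concepts))   # unembedding over concepts

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def next_token_logits(h, steer=None, value=5.0):
    acts = sigmoid(W_cbl @ h)
    if steer is not None:
        # Intervene on one interpretable neuron: clamp the chosen
        # concept on before the unembedding, steering the output
        # toward text expressing that concept.
        acts[concepts.index(steer)] = sigmoid(value)
    return W_vocab @ acts

h = rng.normal(size=d_model)  # stand-in for the current hidden state
plain = next_token_logits(h)
steered = next_token_logits(h, steer="positive")
```

The same mechanism supports the safety uses described in the abstract: reading the concept activations flags harmful content, and clamping a neuron off approximates unlearning that concept.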

## Citation

If you find this work useful, please cite the paper:

```bibtex
@inproceedings{cbllm,
  title={Concept Bottleneck Large Language Models},
  author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}
```