---
library_name: transformers
license: apache-2.0
pipeline_tag: text-classification
tags:
  - text-generation
  - interpretable-ai
  - concept-bottleneck
  - llm
---

# Concept Bottleneck Large Language Models

This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992), accepted at ICLR 2025.

-   **Paper:** [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992)
-   **Project Page:** [https://lilywenglab.github.io/CB-LLMs/](https://lilywenglab.github.io/CB-LLMs/)
-   **Code:** [https://github.com/Trustworthy-ML-Lab/CB-LLMs](https://github.com/Trustworthy-ML-Lab/CB-LLMs)

## Abstract
We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models.

## Usage

For detailed installation instructions, training procedures, and various usage examples (including how to test concept detection, steerability, and generate sentences), please refer to the [official GitHub repository](https://github.com/Trustworthy-ML-Lab/CB-LLMs).
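The core idea of the classification variant -- class logits computed as a linear function of human-readable concept activations, so every prediction decomposes into per-concept contributions -- can be sketched in a toy numpy example. This is an illustrative sketch only, not the released model or the authors' code; the concept names, weights, and sizes below are all made up:

```python
import numpy as np

# Toy setup: a backbone (not shown here) maps each input text to scores
# over human-readable concepts; a single linear layer then maps concept
# scores to class logits. All names and values are illustrative.
concept_names = ["great acting", "boring plot", "funny", "bad pacing", "moving"]

# Final linear layer: one row per class, one column per concept.
W = np.array([
    [ 1.2, -0.8,  0.9, -1.1,  1.0],   # class 0: positive sentiment
    [-1.2,  0.8, -0.9,  1.1, -1.0],   # class 1: negative sentiment
])

def predict(concept_scores):
    """Class logits are linear in concept activations, so each
    prediction decomposes into per-concept contributions."""
    logits = W @ concept_scores
    contributions = W * concept_scores  # (class, concept) contribution matrix
    return logits.argmax(), contributions

# Simulated concept activations for one input (backbone output).
scores = np.array([0.9, 0.1, 0.7, 0.0, 0.4])
pred, contrib = predict(scores)
top = np.argsort(-contrib[pred])[:2]
print("prediction:", pred)                              # 0 (positive)
print("top concepts:", [concept_names[i] for i in top]) # ['great acting', 'funny']
```

Because the bottleneck layer is linear over named concepts, the same weight matrix that produces the prediction also yields the explanation, which is what distinguishes this design from post-hoc attribution methods.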

## Key Results

### Part I: CB-LLM (classification)
CB-LLMs are competitive with the black-box model after applying Automatic Concept Correction (ACC).

| Accuracy ↑           | SST2   | YelpP   | AGnews  | DBpedia  |
|-----------------------|--------|---------|---------|----------|
| **Ours:**            |        |         |         |          |
| CB-LLM               | 0.9012 | 0.9312  | 0.9009  | 0.9831   |
| CB-LLM w/ ACC        | **0.9407** | **<span style="color:blue">0.9806</span>** | **0.9453** | **<span style="color:blue">0.9928</span>** |
| **Baselines:**       |        |         |         |          |
| TBM&C³M              | 0.9270 | 0.9534  | 0.8972  | 0.9843   |
| RoBERTa-base fine-tuned (black-box) | 0.9462 | 0.9778  | 0.9508  | 0.9917   |

### Part II: CB-LLM (generation)
Accuracy, steerability, and perplexity of CB-LLM (generation). CB-LLMs remain competitive on accuracy (↑) and perplexity (↓) while providing substantially higher steerability (↑) than the black-box baseline.

| Method                         | Metric           | SST2    | YelpP  | AGnews  | DBpedia |
|---------------------------------|------------------|---------|--------|---------|---------|
| **CB-LLM (Ours)**               | Accuracy↑        | 0.9638  | **0.9855** | 0.9439  | 0.9924  |
|                                 | Steerability↑    | **0.82** | **0.95**  | **0.85**  | **0.76**  |
|                                 | Perplexity↓      | 116.22  | 13.03  | 18.25   | 37.59   |
| **CB-LLM w/o ADV training**     | Accuracy↑        | 0.9676  | 0.9830  | 0.9418  | **0.9934** |
|                                 | Steerability↑    | 0.57    | 0.69    | 0.52    | 0.21    |
|                                 | Perplexity↓      | **59.19** | 12.39   | 17.93   | **35.13** |
| **Llama3 finetuned (black-box)**| Accuracy↑        | **0.9692** | 0.9851  | **0.9493** | 0.9919  |
|                                 | Steerability↑    | No      | No      | No      | No      |
|                                 | Perplexity↓      | 84.70   | **6.62**  | **12.52** | 41.50   |

## Citation

If you find this work useful, please cite the paper:

```bibtex
@inproceedings{cbllm,
   title={Concept Bottleneck Large Language Models},
   author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
   booktitle={International Conference on Learning Representations (ICLR)},
   year={2025}
}
```