---
library_name: transformers
license: apache-2.0
pipeline_tag: text-classification
tags:
- text-generation
- interpretable-ai
- concept-bottleneck
- llm
---

# Concept Bottleneck Large Language Models

This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992), accepted at ICLR 2025.

- **Paper:** [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992)
- **Project Page:** [https://lilywenglab.github.io/CB-LLMs/](https://lilywenglab.github.io/CB-LLMs/)
- **Code:** [https://github.com/Trustworthy-ML-Lab/CB-LLMs](https://github.com/Trustworthy-ML-Lab/CB-LLMs)

## Abstract

We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models.

## Usage

For detailed installation instructions, training procedures, and usage examples (including how to test concept detection, steerability, and sentence generation), please refer to the [official GitHub repository](https://github.com/Trustworthy-ML-Lab/CB-LLMs).
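As a conceptual illustration of the classification architecture (not the official API -- all dimensions, concept names, and weights below are made up; see the repository for the real implementation), a concept bottleneck head maps a backbone embedding to sigmoid scores over named concepts, and the class logits are a linear function of only those scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 16-dim "backbone" embedding, 4 human-readable
# concepts, and 2 output classes (e.g. negative/positive sentiment).
d_model, n_concepts, n_classes = 16, 4, 2
concepts = ["good service", "tasty food", "long wait", "rude staff"]

W_cbl = rng.normal(size=(n_concepts, d_model))    # concept bottleneck layer
W_out = rng.normal(size=(n_classes, n_concepts))  # final linear predictor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(embedding):
    # Concept activations are the interpretable bottleneck: each entry
    # scores one named concept, and the class logits are a linear
    # combination of ONLY these concept scores.
    concept_scores = sigmoid(W_cbl @ embedding)
    logits = W_out @ concept_scores
    return concept_scores, logits

x = rng.normal(size=d_model)  # stand-in for a sentence embedding
scores, logits = predict(x)

# Every prediction comes with an explicit concept-level explanation.
for name, s in zip(concepts, scores):
    print(f"{name}: {s:.2f}")
print("predicted class:", int(np.argmax(logits)))
```

Because the final layer sees nothing but the concept scores, the explanation is faithful by construction rather than a post-hoc approximation.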

## Key Results

### Part I: CB-LLM (classification)

CB-LLMs are competitive with the black-box model after applying Automatic Concept Correction (ACC).

| Accuracy ↑ | SST2 | YelpP | AGnews | DBpedia |
|-----------------------|--------|---------|---------|----------|
| **Ours:** | | | | |
| CB-LLM | 0.9012 | 0.9312 | 0.9009 | 0.9831 |
| CB-LLM w/ ACC | **0.9407** | **<span style="color:blue">0.9806</span>** | **0.9453** | **<span style="color:blue">0.9928</span>** |
| **Baselines:** | | | | |
| TBM&C³M | 0.9270 | 0.9534 | 0.8972 | 0.9843 |
| RoBERTa-base fine-tuned (black-box) | 0.9462 | 0.9778 | 0.9508 | 0.9917 |

### Part II: CB-LLM (generation)

Accuracy, steerability, and perplexity of CB-LLMs for generation. CB-LLMs perform well on accuracy (↑) and perplexity (↓) while providing substantially higher steerability (↑).

| Method | Metric | SST2 | YelpP | AGnews | DBpedia |
|---------------------------------|------------------|---------|--------|---------|---------|
| **CB-LLM (Ours)** | Accuracy↑ | 0.9638 | **0.9855** | 0.9439 | 0.9924 |
| | Steerability↑ | **0.82** | **0.95** | **0.85** | **0.76** |
| | Perplexity↓ | 116.22 | 13.03 | 18.25 | 37.59 |
| **CB-LLM w/o ADV training** | Accuracy↑ | 0.9676 | 0.9830 | 0.9418 | **0.9934** |
| | Steerability↑ | 0.57 | 0.69 | 0.52 | 0.21 |
| | Perplexity↓ | **59.19** | 12.39 | 17.93 | **35.13** |
| **Llama3 finetuned (black-box)**| Accuracy↑ | **0.9692** | 0.9851 | **0.9493** | 0.9919 |
| | Steerability↑ | No | No | No | No |
| | Perplexity↓ | 84.70 | **6.62** | **12.52** | 41.50 |
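The steerability gap above comes from intervening on interpretable concept neurons at generation time. As a minimal sketch of that idea (purely illustrative -- the dimensions, concept names, and the `steer`/`value` parameters are invented here, not the paper's API), forcing one concept activation on before the unembedding shifts the next-token distribution toward that concept:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy concept-bottleneck generation head: hidden state -> concept
# activations -> vocabulary logits. All sizes/names are illustrative.
d_model, n_concepts, vocab = 8, 3, 5
concepts = ["positive", "negative", "neutral"]

W_cbl = rng.normal(size=(n_concepts, d_model))   # concept bottleneck layer
W_vocab = rng.normal(size=(vocab, n_concepts))   # unembedding over concepts

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def next_token_logits(h, steer=None, value=5.0):
    acts = sigmoid(W_cbl @ h)
    if steer is not None:
        # Intervene on one interpretable neuron: clamp the chosen
        # concept on before the unembedding, steering the output
        # toward text expressing that concept.
        acts[concepts.index(steer)] = sigmoid(value)
    return W_vocab @ acts

h = rng.normal(size=d_model)  # stand-in for the current hidden state
plain = next_token_logits(h)
steered = next_token_logits(h, steer="positive")
```

The same mechanism supports the safety uses described in the abstract: reading the concept activations flags harmful content, and clamping a neuron off approximates unlearning that concept.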

## Citation

If you find this work useful, please cite the paper:

```bibtex
@inproceedings{cbllm,
  title={Concept Bottleneck Large Language Models},
  author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}
```