---
library_name: transformers
license: apache-2.0
pipeline_tag: text-classification
tags:
- text-generation
- interpretable-ai
- concept-bottleneck
- llm
---

# Concept Bottleneck Large Language Models

This repository contains the model described in the paper [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992), accepted at ICLR 2025.

- **Paper:** [Concept Bottleneck Large Language Models](https://huggingface.co/papers/2412.07992)
- **Project Page:** [https://lilywenglab.github.io/CB-LLMs/](https://lilywenglab.github.io/CB-LLMs/)
- **Code:** [https://github.com/Trustworthy-ML-Lab/CB-LLMs](https://github.com/Trustworthy-ML-Lab/CB-LLMs)

## Abstract

We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, capabilities notably absent in existing models.
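The core idea above can be illustrated with a minimal, self-contained sketch. This is **not** the authors' implementation (see the official repository for that); all names here are hypothetical. It shows the defining property of a concept bottleneck: the final prediction is a linear function of human-readable concept scores only, so each concept's contribution to the decision can be read off directly.

```python
# Illustrative concept-bottleneck classifier head (hypothetical, stdlib-only).
# Pipeline: backbone embedding -> concept scores (one neuron per named
# concept) -> linear classifier over those scores alone.

def concept_scores(embedding, concept_weights):
    """Project the backbone embedding onto each concept direction (dot products)."""
    return [sum(e * w for e, w in zip(embedding, row)) for row in concept_weights]

def classify(scores, class_weights):
    """Predict from concept scores ONLY, so each concept's contribution
    (score * weight) to every class logit is directly interpretable."""
    logits = [sum(s * w for s, w in zip(scores, row)) for row in class_weights]
    return max(range(len(logits)), key=logits.__getitem__)

# Toy setup: two concepts ("positive tone", "negative tone"), two classes.
emb = [1.0, 0.0]                     # stand-in for a text embedding
W_concepts = [[2.0, 0.0],            # concept 0: "positive tone"
              [0.0, 2.0]]            # concept 1: "negative tone"
W_classes = [[1.0, -1.0],            # class 0 rewards concept 0
             [-1.0, 1.0]]            # class 1 rewards concept 1
scores = concept_scores(emb, W_concepts)
print(scores)                        # [2.0, 0.0]
print(classify(scores, W_classes))   # 0
```

Because the classifier sees nothing but the concept scores, steering or unlearning a concept (e.g. zeroing its score) changes the prediction in a fully transparent way.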
## Usage

For detailed installation instructions, training procedures, and usage examples (including how to test concept detection, steer generation, and generate sentences), please refer to the [official GitHub repository](https://github.com/Trustworthy-ML-Lab/CB-LLMs).

## Key Results

### Part I: CB-LLM (classification)

CB-LLMs are competitive with the black-box model after applying Automatic Concept Correction (ACC).

| Accuracy ↑ | SST2 | YelpP | AGnews | DBpedia |
|-----------------------|--------|---------|---------|----------|
| **Ours:** | | | | |
| CB-LLM | 0.9012 | 0.9312 | 0.9009 | 0.9831 |
| CB-LLM w/ ACC | **0.9407** | **0.9806** | **0.9453** | **0.9928** |
| **Baselines:** | | | | |
| TBM&C³M | 0.9270 | 0.9534 | 0.8972 | 0.9843 |
| RoBERTa-base fine-tuned (black-box) | 0.9462 | 0.9778 | 0.9508 | 0.9917 |

### Part II: CB-LLM (generation)

Accuracy, steerability, and perplexity of CB-LLMs (generation). CB-LLMs perform well on accuracy (↑) and perplexity (↓) while providing much higher steerability (↑).
| Method | Metric | SST2 | YelpP | AGnews | DBpedia |
|---------------------------------|------------------|---------|--------|---------|---------|
| **CB-LLM (Ours)** | Accuracy↑ | 0.9638 | **0.9855** | 0.9439 | 0.9924 |
| | Steerability↑ | **0.82** | **0.95** | **0.85** | **0.76** |
| | Perplexity↓ | 116.22 | 13.03 | 18.25 | 37.59 |
| **CB-LLM w/o ADV training** | Accuracy↑ | 0.9676 | 0.9830 | 0.9418 | **0.9934** |
| | Steerability↑ | 0.57 | 0.69 | 0.52 | 0.21 |
| | Perplexity↓ | **59.19** | 12.39 | 17.93 | **35.13** |
| **Llama3 finetuned (black-box)** | Accuracy↑ | **0.9692** | 0.9851 | **0.9493** | 0.9919 |
| | Steerability↑ | No | No | No | No |
| | Perplexity↓ | 84.70 | **6.62** | **12.52** | 41.50 |

## Citation

If you find this work useful, please cite the paper:

```bibtex
@article{cbllm,
  title={Concept Bottleneck Large Language Models},
  author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
  journal={ICLR},
  year={2025}
}
```