---
license: cc-by-4.0
language:
  - en
tags:
  - text-classification
  - activation-steering
  - mechanistic-interpretability
  - poolbench
base_model: bert-base-uncased
---

# PoolBench — BERT Scorers

Fine-tuned `bert-base-uncased` classifiers for automatic concept scoring of steered LLM outputs. One classifier per concept, trained on the [PoolBench corpus](https://huggingface.co/datasets/nips234678/poolbench).

These models serve as Classifier B in the PoolBench evaluation pipeline: each one scores whether a steered generation exhibits its target concept, which enables the D2 SCP metric.

## Concepts (17)

`academic_tone`, `bureaucratic`, `causation`, `code_docs`, `conditionality`, `contrast`, `deference`, `depression`, `frustration`, `hedging`, `imdb_sentiment`, `legal_formality`, `narrative`, `negation_density`, `numerical_precision`, `planning`, `toxicity`

## File structure

One subdirectory per concept, each a standard HuggingFace `AutoModelForSequenceClassification` checkpoint:

```
{concept}/config.json
{concept}/model.safetensors
{concept}/tokenizer files...
```
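Because each concept is a separate subdirectory, it should be possible to fetch a single checkpoint instead of all 17. The following is a minimal sketch, assuming the repo id is `nips234678/poolbench-bert-scorers` (inferred from the Loading snippet) and using `snapshot_download` from `huggingface_hub` with `allow_patterns`. The helpers `concept_patterns` and `download_concept` are hypothetical names introduced here for illustration and are not part of PoolBench.

```python
import os

# Assumed repo id, inferred from the Loading section of this card.
REPO_ID = "nips234678/poolbench-bert-scorers"

# The 17 concepts listed above; each one is a subdirectory of the repo.
CONCEPTS = (
    "academic_tone", "bureaucratic", "causation", "code_docs", "conditionality",
    "contrast", "deference", "depression", "frustration", "hedging",
    "imdb_sentiment", "legal_formality", "narrative", "negation_density",
    "numerical_precision", "planning", "toxicity",
)


def concept_patterns(concept):
    """Return glob patterns that match only one concept's files (hypothetical helper)."""
    if concept not in CONCEPTS:
        raise ValueError(f"unknown concept: {concept!r}")
    return [f"{concept}/*"]


def download_concept(concept):
    """Download one concept's checkpoint and return its local directory (hypothetical helper)."""
    # Imported lazily so the pure helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    snapshot_dir = snapshot_download(repo_id=REPO_ID, allow_patterns=concept_patterns(concept))
    return os.path.join(snapshot_dir, concept)
```

The directory returned by `download_concept("causation")` can then be passed to `from_pretrained` as a local path.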

## Loading

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "nips234678/poolbench-bert-scorers"
concept = "causation"

# Each concept lives in its own subdirectory of the repo, so select it with
# `subfolder` instead of appending it to the repo id.
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=concept)
model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder=concept)

inputs = tokenizer("The result was caused by the earlier event.", return_tensors="pt", truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()  # 1 = concept present, 0 = absent
```
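When a graded score is more useful than a hard 0/1 prediction, the two logits can be turned into a probability for label 1 with a softmax. The sketch below is dependency-free so it can be applied to `logits[0].tolist()` from the snippet above. The helper names and the 0.5 default threshold are assumptions for illustration, not values prescribed by PoolBench.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def concept_probability(logits):
    """Probability assigned to label 1 (concept present) given [logit_0, logit_1]."""
    return softmax(logits)[1]


def is_present(logits, threshold=0.5):
    """Hard decision from the probability; the threshold is a hypothetical default."""
    return concept_probability(logits) >= threshold


# Example usage with the tensor from the Loading snippet:
#   prob = concept_probability(logits[0].tolist())
```

With the default threshold of 0.5, `is_present` agrees with the `argmax` decision whenever the two logits differ.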

## Training details

- Base model: `bert-base-uncased`
- Training split: 700 passages per class per concept
- Evaluation split: 300 passages per class per concept
- Labels: 1 = concept present, 0 = concept absent
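The split sizes above imply the following example counts. This is a quick arithmetic sketch, assuming exactly two classes per concept (present and absent, as the labels indicate) and counting labeled examples rather than unique passages, since the card does not say whether passages are shared across concepts.

```python
# Split sizes stated in the training details; two classes per concept.
TRAIN_PER_CLASS = 700
EVAL_PER_CLASS = 300
NUM_CLASSES = 2      # label 1 (present) and label 0 (absent)
NUM_CONCEPTS = 17

train_per_concept = TRAIN_PER_CLASS * NUM_CLASSES   # 1400
eval_per_concept = EVAL_PER_CLASS * NUM_CLASSES     # 600
train_total = train_per_concept * NUM_CONCEPTS      # 23800
eval_total = eval_per_concept * NUM_CONCEPTS        # 10200

print(train_per_concept, eval_per_concept, train_total, eval_total)
```

Each classifier therefore sees a balanced 70/30 train/evaluation split.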

## Citation

```bibtex
@misc{poolbench2026,
  title={PoolBench: Evaluating Pooling Strategies for Activation Steering Vectors},
  author={Anonymous},
  year={2026},
}
```