---
license: cc-by-4.0
language:
- en
tags:
- text-classification
- activation-steering
- mechanistic-interpretability
- poolbench
base_model: bert-base-uncased
---

# PoolBench: BERT Scorers

This repository contains fine-tuned `bert-base-uncased` classifiers that automatically score steered LLM outputs for a target concept. There is one binary classifier per concept, each trained on the [PoolBench corpus](https://huggingface.co/datasets/nips234678/poolbench).

These models serve as Classifier B in the PoolBench evaluation pipeline: each one scores whether a steered generation exhibits its target concept, which enables the D2 SCP metric.

## Concepts (17)

`academic_tone`, `bureaucratic`, `causation`, `code_docs`, `conditionality`, `contrast`, `deference`, `depression`, `frustration`, `hedging`, `imdb_sentiment`, `legal_formality`, `narrative`, `negation_density`, `numerical_precision`, `planning`, `toxicity`

## File structure

One subdirectory per concept, each a standard Hugging Face `AutoModelForSequenceClassification` checkpoint:

```
{concept}/config.json
{concept}/model.safetensors
{concept}/tokenizer files...
```
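
To loop over all scorers, it can help to keep the concept names in one list and derive each checkpoint's file paths from it. The sketch below assumes the subdirectory names match the concept names listed in this card exactly; `REPO_ID`, `CONCEPTS`, and `checkpoint_files` are illustrative names, not part of the released code.

```python
REPO_ID = "nips234678/poolbench-bert-scorers"

# Concept names as listed in this card; assumed to equal the subdirectory names.
CONCEPTS = [
    "academic_tone", "bureaucratic", "causation", "code_docs",
    "conditionality", "contrast", "deference", "depression",
    "frustration", "hedging", "imdb_sentiment", "legal_formality",
    "narrative", "negation_density", "numerical_precision",
    "planning", "toxicity",
]


def checkpoint_files(concept: str) -> list[str]:
    """Return repo-relative paths of the two files named in the layout above.

    Tokenizer files are left out because the card does not list their names.
    """
    if concept not in CONCEPTS:
        raise ValueError(f"unknown concept: {concept!r}")
    return [f"{concept}/config.json", f"{concept}/model.safetensors"]


for name in CONCEPTS[:2]:
    print(name, checkpoint_files(name))
```

Validating the concept name up front turns a typo into an immediate `ValueError` rather than a failed download later on.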

## Loading

Each checkpoint sits in a subdirectory of the repository, so pass the concept name through the `subfolder` argument rather than appending it to the repo ID:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "nips234678/poolbench-bert-scorers"
concept = "causation"

# A Hub repo ID has the form "namespace/name"; the per-concept
# subdirectory is selected with `subfolder`.
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=concept)
model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder=concept)

inputs = tokenizer(
    "The result was caused by the earlier event.",
    return_tensors="pt",
    truncation=True,
)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()  # 1 = concept present, 0 = absent
```
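
The hard `argmax` label discards the classifier's confidence. If a graded score is preferred, the two logits can be converted into a probability for the positive class. The sketch below uses stand-in logits and only the standard library, so it runs without downloading a checkpoint; `concept_probability` is a hypothetical helper, and it assumes index 1 is the "concept present" class, following the label convention in this card. Whether PoolBench itself uses probabilities or hard labels is not stated here.

```python
import math


def concept_probability(logit_pair):
    """Return P(concept present) from an [absent, present] pair of logits."""
    absent, present = logit_pair
    # A two-class softmax reduces to a sigmoid of the logit difference.
    return 1.0 / (1.0 + math.exp(absent - present))


# Stand-in logits for three generations (not real model outputs);
# with a real model these rows would come from `logits.tolist()`.
fake_logits = [[2.0, -1.0], [0.0, 0.0], [-1.5, 3.0]]
probs = [concept_probability(pair) for pair in fake_logits]
preds = [int(p > 0.5) for p in probs]  # agrees with argmax away from ties
print(probs, preds)
```

A threshold other than 0.5 could be chosen on the evaluation split if precision or recall matters more for a given concept.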

## Training details

- Base model: `bert-base-uncased`
- Training split: 700 passages per class per concept
- Evaluation split: 300 passages per class per concept
- Labels: 1 = concept present, 0 = concept absent
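
The per-class figures above imply the following totals. This is a worked calculation under two assumptions: each concept has exactly two classes (labels 1 and 0), and there are 17 concepts as listed in this card. Whether passages are shared across concepts is not stated, so the grand totals count passage-concept pairs, not necessarily unique passages. The constant names are illustrative.

```python
NUM_CONCEPTS = 17
NUM_CLASSES = 2  # label 1 (present) and label 0 (absent)
TRAIN_PER_CLASS = 700
EVAL_PER_CLASS = 300

train_per_concept = TRAIN_PER_CLASS * NUM_CLASSES  # 1400
eval_per_concept = EVAL_PER_CLASS * NUM_CLASSES    # 600

# Share of each concept's data used for training.
train_fraction = train_per_concept / (train_per_concept + eval_per_concept)

total_train = train_per_concept * NUM_CONCEPTS  # 23800
total_eval = eval_per_concept * NUM_CONCEPTS    # 10200
print(train_per_concept, eval_per_concept, train_fraction, total_train, total_eval)
```

Because both splits are balanced across the two classes, a classifier that always predicts one label would score 50% accuracy on the evaluation split.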

## Citation

```
@misc{poolbench2026,
  title={PoolBench: Evaluating Pooling Strategies for Activation Steering Vectors},
  author={Anonymous},
  year={2026},
}
```