---
license: cc-by-4.0
language:
- en
tags:
- text-classification
- activation-steering
- mechanistic-interpretability
- poolbench
base_model: bert-base-uncased
---

# PoolBench — BERT Scorers

Fine-tuned `bert-base-uncased` classifiers for automatic concept scoring of steered LLM outputs. There is one binary classifier per concept, each trained on the [PoolBench corpus](https://huggingface.co/datasets/nips234678/poolbench).

These models serve as Classifier B in the PoolBench evaluation pipeline: they score whether a steered generation exhibits the target concept, which enables the D2 SCP metric.

## Concepts (17)

`academic_tone`, `bureaucratic`, `causation`, `code_docs`, `conditionality`, `contrast`, `deference`, `depression`, `frustration`, `hedging`, `imdb_sentiment`, `legal_formality`, `narrative`, `negation_density`, `numerical_precision`, `planning`, `toxicity`

## File structure

One subdirectory per concept, each a standard HuggingFace `AutoModelForSequenceClassification` checkpoint:

```
{concept}/config.json
{concept}/model.safetensors
{concept}/tokenizer files...
```

## Loading

Because each checkpoint sits in a subdirectory of the repository, pass the concept name through the `subfolder` argument rather than appending it to the repository ID (a three-part ID is not a valid Hub repository name).

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo_id = "nips234678/poolbench-bert-scorers"
concept = "causation"

# Each concept lives in its own subdirectory of the repo, so select it with `subfolder`.
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=concept)
model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder=concept)

inputs = tokenizer("The result was caused by the earlier event.", return_tensors="pt", truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()  # 1 = concept present, 0 = absent
```

## Training details

- Base model: `bert-base-uncased`
- Training split: 700 passages per class per concept
- Evaluation split: 300 passages per class per concept
- Labels: 1 = concept present, 0 = concept absent

## Citation

```
@misc{poolbench2026,
  title={PoolBench: Evaluating Pooling Strategies for Activation Steering Vectors},
  author={Anonymous},
  year={2026},
}
```
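
## Batch scoring sketch

The loading example scores a single sentence and returns a hard 0/1 prediction. When scoring many steered generations, a batched helper that returns the probability of label 1 (concept present) may be more convenient. The following is a minimal sketch, not the PoolBench pipeline itself: the helper names `concept_probabilities`, `score_texts`, and `load_scorer` are hypothetical, the batch size is arbitrary, and loading through `subfolder` assumes the per-concept directory layout described under File structure. How these scores are aggregated into D2 SCP is not specified here.

```python
from typing import List

import torch


def concept_probabilities(logits: torch.Tensor) -> torch.Tensor:
    """Convert (batch, 2) logits into P(label == 1), i.e. concept present."""
    return torch.softmax(logits, dim=-1)[:, 1]


def score_texts(texts: List[str], model, tokenizer, batch_size: int = 16) -> List[float]:
    """Score texts in batches; returns one concept-present probability per text."""
    scores: List[float] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest text in the batch and truncate to the model's max length.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        scores.extend(concept_probabilities(logits).tolist())
    return scores


def load_scorer(concept: str, repo_id: str = "nips234678/poolbench-bert-scorers"):
    """Hypothetical loader; assumes each concept is a subfolder of the repo."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=concept)
    model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder=concept)
    return model, tokenizer


# Example usage (downloads the checkpoint, so it is left commented out):
# model, tokenizer = load_scorer("causation")
# print(score_texts(["The result was caused by the earlier event."], model, tokenizer))
```

Returning probabilities rather than argmax labels keeps the choice of decision threshold with the caller; thresholding at 0.5 reproduces the `argmax` prediction from the loading example.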