	---
	license: cc-by-4.0
	language:
	- en
	tags:
	- text-classification
	- activation-steering
	- mechanistic-interpretability
	- poolbench
	base_model: bert-base-uncased
	---

	# PoolBench — BERT Scorers

	Fine-tuned `bert-base-uncased` classifiers for automatic concept scoring of steered LLM outputs. One classifier per concept, trained on the [PoolBench corpus](https://huggingface.co/datasets/nips234678/poolbench).

	These are Classifier B in the PoolBench evaluation pipeline: they score whether a steered generation exhibits the target concept, enabling the D2 SCP metric.

	## Concepts (17)

	`academic_tone`, `bureaucratic`, `causation`, `code_docs`, `conditionality`, `contrast`, `deference`, `depression`, `frustration`, `hedging`, `imdb_sentiment`, `legal_formality`, `narrative`, `negation_density`, `numerical_precision`, `planning`, `toxicity`

	## File structure

	One subdirectory per concept, each a standard Hugging Face checkpoint loadable with `AutoModelForSequenceClassification`:

	```
	{concept}/config.json
	{concept}/model.safetensors
	{concept}/tokenizer files...
	```
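Given this layout, the per-concept paths can be enumerated programmatically. The following is a minimal sketch that assumes only the two file names listed above; tokenizer file names are not enumerated in this README, so they are left out. `CONCEPTS` and `checkpoint_files` are illustrative names, not part of the repository.

```python
# Concept names copied from the "Concepts (17)" list in this README.
CONCEPTS = [
    "academic_tone", "bureaucratic", "causation", "code_docs",
    "conditionality", "contrast", "deference", "depression",
    "frustration", "hedging", "imdb_sentiment", "legal_formality",
    "narrative", "negation_density", "numerical_precision",
    "planning", "toxicity",
]


def checkpoint_files(concept: str) -> list[str]:
    """Return the repo-relative paths named in the file structure above.

    Hypothetical helper: tokenizer files are omitted because their
    exact names are not listed in this README.
    """
    if concept not in CONCEPTS:
        raise ValueError(f"unknown concept: {concept!r}")
    return [f"{concept}/config.json", f"{concept}/model.safetensors"]
```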

	## Loading

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	concept = "causation"
	repo_id = "nips234678/poolbench-bert-scorers"

	# Each concept lives in its own subdirectory of the repo, so select it with `subfolder`.
	tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=concept)
	model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder=concept)

	inputs = tokenizer("The result was caused by the earlier event.", return_tensors="pt", truncation=True)
	with torch.no_grad():
	    logits = model(**inputs).logits
	pred = logits.argmax(-1).item()  # 1 = concept present, 0 = absent
	```
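The `argmax` above yields a hard label. When a graded score is more useful, the two logits can be turned into a probability with a softmax. The following is a minimal sketch in plain Python, assuming the classifier head has exactly two outputs with index 1 meaning "concept present" (the label convention stated in this README). `concept_probability` is a hypothetical helper, not part of the released code.

```python
import math


def concept_probability(logits: list[float]) -> float:
    """Softmax probability of class 1 (concept present) from two logits.

    Assumes the order [absent, present], matching the 0/1 labels above.
    """
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [absent, present]")
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)
```

With the model output from the loading snippet, this could be called as `concept_probability(logits[0].tolist())`.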

	## Training details

	- Base model: `bert-base-uncased`
	- Training split: 700 passages per class per concept
	- Evaluation split: 300 passages per class per concept
	- Labels: 1 = concept present, 0 = concept absent
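
The split sizes above imply the following totals. This is a small arithmetic sketch, assuming exactly two classes per concept (labels 0 and 1) and that the per-class counts apply uniformly to all 17 concepts. Whether passages are shared between concepts is not stated here, so the cross-concept totals count labelled examples, not necessarily unique passages.

```python
N_CONCEPTS = 17
N_CLASSES = 2            # label 1 = concept present, label 0 = concept absent
TRAIN_PER_CLASS = 700
EVAL_PER_CLASS = 300

train_per_concept = TRAIN_PER_CLASS * N_CLASSES    # 1400 labelled examples
eval_per_concept = EVAL_PER_CLASS * N_CLASSES      # 600 labelled examples
train_total = train_per_concept * N_CONCEPTS       # 23800 across all concepts
eval_total = eval_per_concept * N_CONCEPTS         # 10200 across all concepts

# Fraction of each concept's data used for training (a 70/30 split).
train_fraction = train_per_concept / (train_per_concept + eval_per_concept)
```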

	## Citation

	```
	@misc{poolbench2026,
	  title={PoolBench: Evaluating Pooling Strategies for Activation Steering Vectors},
	  author={Anonymous},
	  year={2026},
	}
	```