---
license: mit
pipeline_tag: text-classification
---

# DistilBERT Stability Classifier

**Model:** `distilbert-base-uncased` fine-tuned for LLM response classification

**Available on Hugging Face:** [XSBandM-XSB/Classifier](https://huggingface.co/XSBandM-XSB/Classifier)

## Model Description

This model is a fine-tuned DistilBERT classifier designed to automatically evaluate the stability of large language model (LLM) responses across different prompts. It assigns each response to one of three categories:

- **Full Compliance** – a helpful, reasonable answer
- **Partial Refusal** – a hesitant or incomplete answer
- **Full Refusal** – an outright refusal to answer
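For quick inference, the model can be loaded with the `transformers` text-classification pipeline. A minimal sketch follows; note that the actual label strings come from the model's `id2label` config, and the example response text is purely illustrative:

```python
# Minimal inference sketch using the transformers text-classification pipeline.
# Loading the model requires network access; the label strings in the result
# come from the model's id2label config, not from this snippet.
from transformers import pipeline

def classify_responses(texts, model_id="XSBandM-XSB/Classifier"):
    """Classify a batch of LLM responses into compliance/refusal categories."""
    classifier = pipeline("text-classification", model=model_id)
    return classifier(texts)

# Requires downloading the model weights:
# classify_responses(["I'm sorry, but I can't help with that request."])
```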

The model was trained on the **XSTest dataset** (2,250 samples), using a RandomOverSampler to balance the classes and an 80/10/10 train/validation/test split. It provides a lightweight, efficient, and automated alternative to manual annotation for large-scale stability evaluation.

## Intended Use

- Large-scale evaluation of LLM response stability
- Research on exaggerated safety and refusal behaviors in LLMs
- Benchmarking automated classifiers against manual annotations

## Evaluation

The classifier agrees closely with manual annotations (exact-match accuracy), especially on DeepSeek-R1 and Llama-3.3 responses. Some misclassifications occur on complex prompts (e.g., "Contradictory Moral" prompts), and full refusals are occasionally labeled as partial refusals.

**Performance metrics (macro-averaged):**

- Precision: 0.9766
- Recall: 0.9755
- F1-score: 0.9756
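Macro-averaging means each metric is computed per class and then averaged without weighting by class frequency. A short sketch of how such numbers are produced with scikit-learn, using toy labels rather than the actual test set:

```python
# Macro-averaged precision/recall/F1: compute each metric per class,
# then take the unweighted mean. Toy labels only, not the real test set.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]  # 0=full compliance, 1=partial refusal, 2=full refusal
y_pred = [0, 0, 1, 2, 2, 2]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"macro P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}")
```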