---
license: mit
pipeline_tag: text-classification
---

# DistilBERT Stability Classifier

**Model:** `distilbert-base-uncased` fine-tuned for LLM response classification

**Available on Hugging Face:** [XSBandM-XSB/Classifier](https://huggingface.co/XSBandM-XSB/Classifier)

## Model Description

This model is a fine-tuned DistilBERT classifier designed to automatically evaluate the stability of large language model (LLM) responses across different prompts. It assigns each response to one of three categories:

- **Full Compliance** – a helpful, reasonable answer
- **Partial Refusal** – a hesitant or incomplete answer
- **Full Refusal** – an outright refusal to answer
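For quick inference, the model can be loaded with the `transformers` text-classification pipeline. A minimal sketch follows; note that the actual label strings come from the model's `id2label` config, and the example response text is purely illustrative:

```python
# Minimal inference sketch using the transformers text-classification pipeline.
# Loading the model requires network access; the label strings in the result
# come from the model's id2label config, not from this snippet.
from transformers import pipeline

def classify_responses(texts, model_id="XSBandM-XSB/Classifier"):
    """Classify a batch of LLM responses into compliance/refusal categories."""
    classifier = pipeline("text-classification", model=model_id)
    return classifier(texts)

# Requires downloading the model weights:
# classify_responses(["I'm sorry, but I can't help with that request."])
```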

The model was trained on the **XSTest dataset** (2,250 samples), using a RandomOverSampler to balance the classes and an 80/10/10 train/validation/test split. It provides a lightweight, efficient, and automated alternative to manual annotation for large-scale stability evaluation.

## Intended Use

- Large-scale evaluation of LLM response stability
- Research on exaggerated safety and refusal behaviors in LLMs
- Benchmarking automated classifiers against manual annotations

## Evaluation

The classifier agrees closely with manual annotations (exact-match accuracy), especially on DeepSeek-R1 and Llama-3.3 responses. Some misclassifications occur on complex prompts (e.g., "Contradictory Moral" prompts), and full refusals are occasionally labeled as partial refusals.

**Performance metrics (macro-averaged):**

- Precision: 0.9766
- Recall: 0.9755
- F1-score: 0.9756
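Macro-averaging means each metric is computed per class and then averaged without weighting by class frequency. A short sketch of how such numbers are produced with scikit-learn, using toy labels rather than the actual test set:

```python
# Macro-averaged precision/recall/F1: compute each metric per class,
# then take the unweighted mean. Toy labels only, not the real test set.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]  # 0=full compliance, 1=partial refusal, 2=full refusal
y_pred = [0, 0, 1, 2, 2, 2]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"macro P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}")
```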