---
license: mit
base_model:
- agentlans/multilingual-e5-small-aligned-v2
language:
- en
- zh
- fr
- pt
- es
- ja
- tr
- ru
- ar
- ko
- th
- it
- de
- vi
- ms
- id
- fil
- hi
- pl
- cs
- nl
- km
- my
- fa
- gu
- ur
- te
- mr
- he
- bn
- ta
- uk
- bo
- kk
- mn
- ug
- yue
datasets:
- agentlans/refusal-classifier-data
pipeline_tag: text-classification
tags:
- text-classification
- multilingual
- refusal-detection
- alignment
- conversation-analysis
- fine-tuned-model
- ethics
- ai-safety
- e5
- transformer
- huggingface
- research
---

# Multilingual Refusal Classifier

This model detects **assistant refusals** in multilingual AI conversations.
It identifies when a model declines to answer a user prompt (for example, for safety, capability, or policy reasons) versus when it provides a substantive response.

The model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2),
trained on the [agentlans/refusal-classifier-data](https://huggingface.co/datasets/agentlans/refusal-classifier-data) dataset.

**Evaluation results:**
- **Loss:** 0.2665
- **Accuracy:** 0.9153
- **Training tokens:** 5,347,200

## Usage

The classifier expects each conversation as a single string in which a role token (`<|system|>`, `<|user|>`, or `<|assistant|>`) precedes every turn.
For long texts, replace the omitted middle portion with the ellipsis placeholder `<|...|>`.

**Supported input formats:**
- `<|system|>System prompt<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...`
- `<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...`

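The formats above can be assembled programmatically. The following is a minimal sketch, not part of the model's own tooling: the helper names `format_conversation` and `shorten_middle` are hypothetical, and the character budget is only an assumed stand-in for the model's token limit.

```python
from typing import Dict, List

ELLIPSIS = "<|...|>"  # placeholder token described in this model card
ALLOWED_ROLES = {"system", "user", "assistant"}


def format_conversation(messages: List[Dict[str, str]]) -> str:
    """Join messages into consecutive '<|role|>content' segments."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<|{role}|>{message['content']}")
    return "".join(parts)


def shorten_middle(text: str, max_chars: int = 400) -> str:
    """Keep the head and tail of a long text and replace the middle with the placeholder.

    max_chars is an assumed character budget, not a token count.
    """
    if max_chars < len(ELLIPSIS) + 2:
        raise ValueError("max_chars is too small to keep any content")
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(ELLIPSIS)
    head = keep // 2
    tail = keep - head
    return text[:head] + ELLIPSIS + text[-tail:]


# Illustrative conversation with an artificially long assistant reply.
long_reply = "The capital of France is Paris. " * 50
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": shorten_middle(long_reply, max_chars=60)},
]
text = format_conversation(messages)
```

The resulting `text` can be passed to the classifier as a single string. Because the real limit is measured in tokens, a character budget is only a rough proxy and may need tuning per language.
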
**Example:**

```python
from transformers import pipeline

# Load the refusal classifier from the Hugging Face Hub.
classifier = pipeline(
    task="text-classification",
    model="agentlans/multilingual-e5-small-refusal-classifier"
)

# One conversation as a single string; <|...|> marks the omitted middle of the response.
text = (
    "<|user|>Mr. Loyd wants to fence his square-shaped land of 150 sqft each side. "
    "If a pole is laid every certain distance, he needs 30 poles. "
    "What is the distance between each pole in feet?"
    "<|assistant|>If Mr. Loyd's land is square-shaped and each side is 150 sqft, then<|...|>"
    "ce between poles ≈ 20.69 sqft\n\nTherefore, the distance between each pole is approximately 20.69 feet."
)

print(classifier(text))
# [{'label': 'Non-refusal', 'score': 0.9906}]
```

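When several conversations are classified at once, the pipeline returns one dictionary per input in the same shape as the output above. The sketch below tallies such results. The function name `summarize`, the 0.9 score threshold, and the sample predictions are all hypothetical. Only the `Non-refusal` label is taken from this card, so every other label is counted as a refusal.

```python
from typing import Dict, List, Union

NON_REFUSAL = "Non-refusal"  # label shown in the example output above

Prediction = Dict[str, Union[str, float]]


def summarize(predictions: List[Prediction], min_score: float = 0.9) -> Dict[str, int]:
    """Count refusals, non-refusals, and low-confidence predictions.

    min_score is an assumed threshold; predictions below it are counted as uncertain.
    """
    counts = {"refusal": 0, "non_refusal": 0, "uncertain": 0}
    for prediction in predictions:
        if prediction["score"] < min_score:
            counts["uncertain"] += 1
        elif prediction["label"] == NON_REFUSAL:
            counts["non_refusal"] += 1
        else:
            # Any label other than 'Non-refusal' is treated as a refusal (assumption).
            counts["refusal"] += 1
    return counts


# Hand-written placeholder values, not real model output.
# In practice: predictions = classifier(list_of_texts)
sample = [
    {"label": "Non-refusal", "score": 0.99},
    {"label": "Refusal", "score": 0.97},
    {"label": "Refusal", "score": 0.62},
]
summary = summarize(sample)
```

Routing low-confidence predictions to a separate bucket is one way to keep a human in the loop for ambiguous cases.
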
## Evaluation Results

The classifier was tested on ten examples translated from the [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) model page.
Full examples are available in [Examples.md](Examples.md).

- 🚫: The model predicted a **refusal to answer**.
- ✅: The model predicted a **valid response**.

| Example | English | French | Spanish | Chinese | Russian | Arabic |
|----------|:--------:|:-------:|:---------:|:---------:|:----------:|:--------:|
| 1 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 2 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 3 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 4 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 5 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 6 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| 7 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| 8 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| 9 | ✅ | 🚫 | ✅ | ✅ | 🚫 | 🚫 |
| 10 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

Predictions agree across all six languages on nine of the ten examples; example 9 is flagged as a refusal in French, Russian, and Arabic only. Some false positives remain, especially in contexts with ambiguous phrasing.

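Cross-lingual consistency in the table above can be checked mechanically. The sketch below transcribes the table by hand (`True` means a refusal was predicted) and reports which examples receive differing predictions across languages. The names `TABLE`, `disagreeing_examples`, and `agreement_with` are illustrative only.

```python
from typing import Dict, List

# Hand-transcribed from the table above: True = refusal predicted, rows 1 to 10.
TABLE: Dict[str, List[bool]] = {
    "English": [True] * 5 + [False, False, False, False, False],
    "French":  [True] * 5 + [False, False, False, True, False],
    "Spanish": [True] * 5 + [False, False, False, False, False],
    "Chinese": [True] * 5 + [False, False, False, False, False],
    "Russian": [True] * 5 + [False, False, False, True, False],
    "Arabic":  [True] * 5 + [False, False, False, True, False],
}


def disagreeing_examples(table: Dict[str, List[bool]]) -> List[int]:
    """Return 1-based example numbers where languages do not all agree."""
    rows = zip(*table.values())
    return [number for number, row in enumerate(rows, start=1) if len(set(row)) > 1]


def agreement_with(table: Dict[str, List[bool]], reference: str = "English") -> Dict[str, float]:
    """Fraction of examples on which each language matches the reference language."""
    ref = table[reference]
    return {
        language: sum(a == b for a, b in zip(predictions, ref)) / len(ref)
        for language, predictions in table.items()
        if language != reference
    }


print(disagreeing_examples(TABLE))
print(agreement_with(TABLE))
```

Using English as the reference is a convenience, not a claim that the English prediction is the correct one.
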
## Limitations

- **Input length:** 512-token maximum
- **False positives/negatives:** Occur occasionally, as with the Minos classifier
- **Low-resource languages:** Predictions may be inconsistent
- **Cultural variation:** Expressions of refusal differ across languages and cultures, which can affect accuracy

## Training Details

### Hyperparameters
- **Learning rate:** 5e-5
- **Train batch size:** 8
- **Eval batch size:** 8
- **Seed:** 42
- **Optimizer:** `ADAMW_TORCH_FUSED` (`betas=(0.9, 0.999)`, `epsilon=1e-8`)
- **Scheduler:** Linear
- **Epochs:** 5

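To illustrate what a linear scheduler does to the learning rate listed above, here is a minimal sketch of linear decay from the base rate to zero. It is not the training code for this model. The function name `linear_lr` and the step counts are hypothetical, and `warmup_steps=0` is an assumption, since no warmup setting is listed in this card.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate at a given step under linear warmup followed by linear decay to zero."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1,000 optimizer steps; the real step count is not stated in this card.
for step in (0, 250, 500, 750, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

With no warmup, the rate starts at 5e-5, reaches half that value at the midpoint, and hits zero at the final step.
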
### Framework Versions
- Transformers 5.0.0.dev0
- PyTorch 2.9.1+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1

## Intended Use

This model is designed for:
- Identifying **AI refusals** during conversation analysis.
- Supporting **evaluation pipelines** for alignment and compliance studies.
- Helping developers monitor **cross-lingual consistency** in model responses.

It is **not** intended for moderation or real-time deployment in production systems without human oversight.