---
license: mit
base_model:
- agentlans/multilingual-e5-small-aligned-v2
language:
- en
- zh
- fr
- pt
- es
- ja
- tr
- ru
- ar
- ko
- th
- it
- de
- vi
- ms
- id
- fil
- hi
- pl
- cs
- nl
- km
- my
- fa
- gu
- ur
- te
- mr
- he
- bn
- ta
- uk
- bo
- kk
- mn
- ug
- yue
datasets:
- agentlans/refusal-classifier-data
pipeline_tag: text-classification
tags:
- text-classification
- multilingual
- refusal-detection
- alignment
- conversation-analysis
- fine-tuned-model
- ethics
- ai-safety
- e5
- transformer
- huggingface
- research
---
# Multilingual Refusal Classifier
This model detects **assistant refusals** in multilingual AI conversations.
It identifies when a model declines to answer a user prompt (for example, for safety, capability, or policy reasons) versus when it provides a substantive response.
The model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2),
trained on the [agentlans/refusal-classifier-data](https://huggingface.co/datasets/agentlans/refusal-classifier-data) dataset.
**Training and evaluation metrics:**
- **Loss:** 0.2665
- **Accuracy:** 0.9153
- **Training tokens:** 5,347,200
## Usage
This classifier accepts input in conversation-like text formats using structured role tokens.
For long texts, replace the omitted middle portion with `<|...|>`, which serves as an ellipsis placeholder.
**Supported input formats:**
- `<|system|>System prompt<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...`
- `<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...`
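A conversation stored as a list of role/content messages can be converted to these formats with a small helper. This is a minimal sketch: `format_conversation` and the message-dict layout are illustrative assumptions, not part of the model's API. Only the three role tokens come from the formats listed above.

```python
# Role tokens taken from the supported input formats above.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def format_conversation(messages):
    """Join {"role", "content"} dicts into the classifier's input format.

    Hypothetical helper: the message layout is an assumption for illustration.
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_TOKENS:
            raise ValueError(f"Unsupported role: {role!r}")
        # Each turn is the role token immediately followed by its text.
        parts.append(ROLE_TOKENS[role] + message["content"])
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
print(format_conversation(conversation))
# <|user|>What is the capital of France?<|assistant|>The capital of France is Paris.
```
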
**Example:**
```python
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="agentlans/multilingual-e5-small-refusal-classifier",
)

# The assistant response is shortened with the <|...|> placeholder.
text = (
    "<|user|>Mr. Loyd wants to fence his square-shaped land of 150 sqft each side. "
    "If a pole is laid every certain distance, he needs 30 poles. "
    "What is the distance between each pole in feet?"
    "<|assistant|>If Mr. Loyd's land is square-shaped and each side is 150 sqft, then<|...|>"
    "ce between poles ≈ 20.69 sqft\n\nTherefore, the distance between each pole is approximately 20.69 feet."
)

print(classifier(text))
# [{'label': 'Non-refusal', 'score': 0.9906}]
```
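The example above elides the middle of a long assistant response with `<|...|>`. One possible way to produce such text is sketched below. The helper name `elide_middle` and the character-based budget are assumptions for illustration; the model's actual limit is counted in tokens, and the source does not say how the training data was shortened.

```python
ELLIPSIS_TOKEN = "<|...|>"


def elide_middle(text, max_chars=1000):
    """Keep the start and end of `text`, replacing the middle with <|...|>.

    Assumption: the budget is measured in characters for simplicity, which is
    only a rough proxy for the model's token limit.
    """
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(ELLIPSIS_TOKEN)
    if keep < 2:
        raise ValueError("max_chars is too small to hold the placeholder")
    head = keep - keep // 2  # the head gets the extra character when keep is odd
    tail = keep // 2
    return text[:head] + ELLIPSIS_TOKEN + text[-tail:]


print(elide_middle("abcdefghijklmnopqrstuvwxyz", max_chars=15))
# abcd<|...|>wxyz
```

Applying the helper to an individual message before adding role tokens keeps the `<|user|>` and `<|assistant|>` markers intact.
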
## Evaluation Results
The classifier was tested on ten examples translated from the [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) model page.
Full examples are available in [Examples.md](Examples.md).
- 🚫: The model predicted a **refusal to answer**.
- ◯: The model predicted a **valid response**.

| Example | English | French | Spanish | Chinese | Russian | Arabic |
|----------|:--------:|:-------:|:---------:|:---------:|:----------:|:--------:|
| 1 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 2 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 3 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 4 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 5 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 6 | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |
| 7 | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |
| 8 | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |
| 9 | ◯ | 🚫 | ◯ | ◯ | 🚫 | 🚫 |
| 10 | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |
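
Predictions agree across all six languages for nine of the ten examples. Example 9 is the exception: it is labeled a refusal in French, Russian, and Arabic but a valid response in English, Spanish, and Chinese. Some false positives remain, especially where phrasing is ambiguous.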
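
The cross-language consistency in the table above can be checked with a few lines of Python. This sketch transcribes the table by hand (the variable names are illustrative) and counts the examples on which all six languages receive the same prediction.

```python
# Predictions transcribed from the table above:
# True = refusal predicted, False = valid response predicted.
LANGUAGES = ["English", "French", "Spanish", "Chinese", "Russian", "Arabic"]
R, V = True, False
predictions = {
    1: [R, R, R, R, R, R],
    2: [R, R, R, R, R, R],
    3: [R, R, R, R, R, R],
    4: [R, R, R, R, R, R],
    5: [R, R, R, R, R, R],
    6: [V, V, V, V, V, V],
    7: [V, V, V, V, V, V],
    8: [V, V, V, V, V, V],
    9: [V, R, V, V, R, R],
    10: [V, V, V, V, V, V],
}

# An example is consistent when every language gets the same prediction.
consistent = [ex for ex, row in predictions.items() if len(set(row)) == 1]
print(f"{len(consistent)}/{len(predictions)} examples agree across all languages")
# 9/10 examples agree across all languages

# List the languages whose prediction differs from the English one.
for ex, row in predictions.items():
    differing = [lang for lang, p in zip(LANGUAGES, row) if p != row[0]]
    if differing:
        print(f"Example {ex}: differs from English in {', '.join(differing)}")
# Example 9: differs from English in French, Russian, Arabic
```
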
## Limitations
- **Input length:** 512-token maximum
- **False positives/negatives:** Occur occasionally, as they do with the Minos classifier
- **Low-resource languages:** May yield inconsistent predictions
- **Cultural variation:** Expressions of refusal differ linguistically, which can affect accuracy
## Training Details
### Hyperparameters
- **Learning rate:** 5e-5
- **Train batch size:** 8
- **Eval batch size:** 8
- **Seed:** 42
- **Optimizer:** `ADAMW_TORCH_FUSED` (`betas=(0.9, 0.999)`, `epsilon=1e-8`)
- **Scheduler:** Linear
- **Epochs:** 5
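For readers who want to approximate this setup, the values above could be expressed as keyword arguments for `transformers.TrainingArguments`. This is a sketch under assumptions, not the author's training script: `output_dir` is a hypothetical path, and the batch sizes are treated as per-device values, which the source does not confirm.

```python
# Keyword arguments mirroring the hyperparameters listed above.
# Assumptions: names follow transformers.TrainingArguments; `output_dir` is a
# hypothetical path; batch sizes are interpreted as per-device values.
training_kwargs = {
    "output_dir": "refusal-classifier-output",  # hypothetical
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 5,
}

# With transformers installed, these could be passed along as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```
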
### Framework Versions
- Transformers 5.0.0.dev0
- PyTorch 2.9.1+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1
## Intended Use
This model is designed for:
- Identifying **AI refusals** during conversation analysis.
- Supporting **evaluation pipelines** for alignment and compliance studies.
- Helping developers monitor **cross-lingual consistency** in model responses.
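As an illustration of the evaluation use case, the sketch below computes a refusal rate from classifier outputs. The helper `refusal_rate` and the mock results are hypothetical. Only the `Non-refusal` label appears in the usage example, so the code counts every other label as a refusal rather than assuming the exact name of the refusal label.

```python
def refusal_rate(results, non_refusal_label="Non-refusal"):
    """Return the fraction of outputs that are not labeled as a non-refusal.

    `results` is a list of dicts shaped like the pipeline output shown in the
    usage example, e.g. {"label": "Non-refusal", "score": 0.99}.
    """
    if not results:
        raise ValueError("results is empty")
    refusals = sum(1 for r in results if r["label"] != non_refusal_label)
    return refusals / len(results)


# Mock outputs for illustration only; scores are made up and the
# "Refusal" label name is an assumption.
mock_results = [
    {"label": "Non-refusal", "score": 0.99},
    {"label": "Refusal", "score": 0.97},
    {"label": "Non-refusal", "score": 0.88},
    {"label": "Non-refusal", "score": 0.93},
]
print(refusal_rate(mock_results))
# 0.25
```

In practice, `results` would come from calling the classifier on a list of formatted conversations, which returns one such dict per input.
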
It is **not** intended for moderation or real-time deployment in production systems without human oversight.