---
license: mit
base_model:
- agentlans/multilingual-e5-small-aligned-v2
language:
- en
- zh
- fr
- pt
- es
- ja
- tr
- ru
- ar
- ko
- th
- it
- de
- vi
- ms
- id
- fil
- hi
- pl
- cs
- nl
- km
- my
- fa
- gu
- ur
- te
- mr
- he
- bn
- ta
- uk
- bo
- kk
- mn
- ug
- yue
datasets:
- agentlans/refusal-classifier-data
pipeline_tag: text-classification
tags:
  - text-classification
  - multilingual
  - refusal-detection
  - alignment
  - conversation-analysis
  - fine-tuned-model
  - ethics
  - ai-safety
  - e5
  - transformer
  - huggingface
  - research
---

# Multilingual Refusal Classifier

This model detects **assistant refusals** in multilingual AI conversations.
It identifies when a model declines to answer a user prompt (for example, for safety, capability, or policy reasons) versus when it provides a substantive response.

The model is a fine-tuned version of [agentlans/multilingual-e5-small-aligned-v2](https://huggingface.co/agentlans/multilingual-e5-small-aligned-v2), 
trained on the [agentlans/refusal-classifier-data](https://huggingface.co/datasets/agentlans/refusal-classifier-data) dataset.

**Evaluation results:**
- **Loss:** 0.2665  
- **Accuracy:** 0.9153  
- **Training tokens:** 5,347,200  

## Usage

This classifier accepts input in conversation-like text formats using structured role tokens.  
For long texts, replace the omitted middle portion with the `<|...|>` placeholder.

**Supported input formats:**
- `<|system|>System prompt<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...`
- `<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...`
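
The formats above can be produced from a list of chat messages. The following is a minimal sketch, assuming messages are stored as dictionaries with `role` and `content` keys; the helper name `format_conversation` and that message layout are illustrative assumptions, not part of the model's API.

```python
# Role tokens taken from the supported input formats listed above.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def format_conversation(messages):
    """Concatenate chat messages into the role-token format (hypothetical helper)."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_TOKENS:
            raise ValueError(f"Unsupported role: {role!r}")
        # Each turn is the role token immediately followed by its text.
        parts.append(ROLE_TOKENS[role] + message["content"])
    return "".join(parts)


messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 equals 4."},
]
print(format_conversation(messages))
# <|user|>What is 2 + 2?<|assistant|>2 + 2 equals 4.
```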

**Example:**

```python
from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="agentlans/multilingual-e5-small-refusal-classifier"
)

text = (
    "<|user|>Mr. Loyd wants to fence his square-shaped land of 150 sqft each side. "
    "If a pole is laid every certain distance, he needs 30 poles. "
    "What is the distance between each pole in feet?"
    "<|assistant|>If Mr. Loyd's land is square-shaped and each side is 150 sqft, then<|...|>"
    "ce between poles ≈ 20.69 sqft\n\nTherefore, the distance between each pole is approximately 20.69 feet."
)

print(classifier(text))
# [{'label': 'Non-refusal', 'score': 0.9906}]
```
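
When many conversations are classified, the list of dictionaries returned by the pipeline can be aggregated into a refusal rate. The sketch below is a hypothetical helper, not part of the model's API. It treats any label other than `Non-refusal` as a refusal, because the exact name of the refusal label is not shown in the example output; the `Refusal` entry in the sample data is illustrative only.

```python
def refusal_rate(predictions, non_refusal_label="Non-refusal"):
    """Return the fraction of predictions whose label is not the non-refusal label."""
    if not predictions:
        return 0.0
    refusals = sum(1 for p in predictions if p["label"] != non_refusal_label)
    return refusals / len(predictions)


# Sample in the same shape as the pipeline output shown above.
sample = [
    {"label": "Non-refusal", "score": 0.9906},
    {"label": "Refusal", "score": 0.87},  # assumed label name and made-up score
]
print(refusal_rate(sample))
```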

## Evaluation Results

The classifier was tested on ten examples translated from the [NousResearch/Minos-v1](https://huggingface.co/NousResearch/Minos-v1) model page.
Full examples are available in [Examples.md](Examples.md).

- 🚫: The model predicted a **refusal to answer**.  
- ◯: The model predicted a **valid response**.

| Example | English | French | Spanish | Chinese | Russian | Arabic |
|----------|:--------:|:-------:|:---------:|:---------:|:----------:|:--------:|
| 1        | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 2        | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 3        | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 4        | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 5        | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 | 🚫 |
| 6        | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |
| 7        | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |
| 8        | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |
| 9        | ◯ | 🚫 | ◯ | ◯ | 🚫 | 🚫 |
| 10       | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |

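Predictions agree across all six languages for nine of the ten examples. Example 9 is the exception: it is flagged as a refusal in French, Russian, and Arabic but not in English, Spanish, or Chinese, which suggests that false positives remain in contexts with ambiguous phrasing.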
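
Cross-lingual agreement in a results table like this one can be checked programmatically. The sketch below transcribes the table by hand (`True` for a predicted refusal) and reports, for each example, the languages whose prediction differs from the English one. The function name and data layout are illustrative assumptions.

```python
LANGUAGES = ["English", "French", "Spanish", "Chinese", "Russian", "Arabic"]

R, N = True, False  # R = refusal predicted, N = valid response predicted

# One row per example, columns in the order of LANGUAGES.
PREDICTIONS = {
    1: [R] * 6,
    2: [R] * 6,
    3: [R] * 6,
    4: [R] * 6,
    5: [R] * 6,
    6: [N] * 6,
    7: [N] * 6,
    8: [N] * 6,
    9: [N, R, N, N, R, R],
    10: [N] * 6,
}


def disagreements_with_english(predictions):
    """Map each inconsistent example to the languages that differ from English."""
    result = {}
    for example, row in predictions.items():
        differing = [lang for lang, flag in zip(LANGUAGES, row) if flag != row[0]]
        if differing:
            result[example] = differing
    return result


print(disagreements_with_english(PREDICTIONS))
# {9: ['French', 'Russian', 'Arabic']}
```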

## Limitations

- **Input length:** 512-token maximum  
- **False positives/negatives:** Occur occasionally, as with the Minos classifier  
- **Low-resource languages:** May yield inconsistent predictions  
- **Cultural variation:** Expressions of refusal differ linguistically, which can affect accuracy  
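
To stay under the input length limit, the middle of a long message can be replaced with the `<|...|>` placeholder described in the Usage section. The following is a minimal sketch under stated assumptions: it budgets by characters rather than tokens, and the `max_chars` default is an arbitrary illustrative value, not a figure from this model card. A token-accurate version would count tokens with the model's tokenizer instead.

```python
ELLIPSIS_TOKEN = "<|...|>"


def truncate_middle(text, max_chars=1000, placeholder=ELLIPSIS_TOKEN):
    """Keep the start and end of `text`, replacing the middle with the placeholder."""
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(placeholder)
    if keep <= 0:
        raise ValueError("max_chars must be larger than the placeholder length")
    head = (keep + 1) // 2  # the head gets the extra character when keep is odd
    tail = keep // 2
    ending = text[-tail:] if tail else ""
    return text[:head] + placeholder + ending


long_reply = "a" * 50 + "b" * 50
print(truncate_middle(long_reply, max_chars=27))
# aaaaaaaaaa<|...|>bbbbbbbbbb
```

Applying the helper to each message's content before adding role tokens keeps the role tokens themselves from being cut.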

## Training Details

### Hyperparameters
- **Learning rate:** 5e-5  
- **Train batch size:** 8  
- **Eval batch size:** 8  
- **Seed:** 42  
- **Optimizer:** `ADAMW_TORCH_FUSED` (`betas=(0.9, 0.999)`, `epsilon=1e-8`)  
- **Scheduler:** Linear  
- **Epochs:** 5  
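
For anyone reproducing a similar run, the values above map onto parameter names of `transformers.TrainingArguments` roughly as follows. This is a sketch, not the author's training script: it assumes the listed batch sizes are per-device, and any setting not listed above (for example warmup or weight decay) is left out rather than guessed.

```python
# Keyword arguments named after transformers.TrainingArguments parameters.
training_kwargs = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,  # assumed to be per-device
    "per_device_eval_batch_size": 8,   # assumed to be per-device
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 5,
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```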

### Framework Versions
- Transformers 5.0.0.dev0  
- PyTorch 2.9.1+cu128  
- Datasets 4.4.1  
- Tokenizers 0.22.1  

## Intended Use

This model is designed for:
- Identifying **AI refusals** during conversation analysis.  
- Supporting **evaluation pipelines** for alignment and compliance studies.  
- Helping developers monitor **cross-lingual consistency** in model responses.  

It is **not** intended for moderation or real-time deployment in production systems without human oversight.