---
license: apache-2.0
base_model: microsoft/deberta-v3-base
library_name: peft
language:
- th
tags:
- text-classification
- safety
- content-moderation
- deberta
- lora
pipeline_tag: text-classification
---

# ThaiSafetyClassifier

A binary classifier that predicts whether an LLM response to a given prompt is **safe** or **harmful** for Thai language and culture. Built by fine-tuning [DeBERTaV3-base](https://huggingface.co/microsoft/deberta-v3-base) with LoRA for parameter-efficient training.

## Model Details

- **Model type:** Text classification (binary)
- **Base model:** `microsoft/deberta-v3-base`
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Language:** Thai
- **Labels:** `0` → safe, `1` → harmful

## Input Format

The model takes a prompt–response pair concatenated as:

```
input: <prompt> output: <llm_response>
```

Tokenized with the DeBERTa tokenizer at a maximum sequence length of **256**.

## Training Details

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| `lora_r` | 8 |
| `lora_alpha` | 16 |
| `lora_dropout` | 0.1 |
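
Under the standard PEFT API, the table above would correspond to a `LoraConfig` roughly like the following sketch. Only `r`, `lora_alpha`, and `lora_dropout` are stated in the card; the `task_type` (and any target-module choice) is an assumption, not taken from the authors' training script.

```python
from peft import LoraConfig

# Hypothetical reconstruction of the adapter configuration from the table
# above. task_type="SEQ_CLS" is an assumption for sequence classification;
# only r, lora_alpha, and lora_dropout are stated in the card.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,            # lora_r
    lora_alpha=16,
    lora_dropout=0.1,
)
```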

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Epochs | 4 |
| Batch size | 32 |
| Max sequence length | 256 |
| Early stopping patience | 3 |
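
As a sketch only, the reported hyperparameters map onto the Hugging Face `Trainer` API as below. This is a hypothetical reconstruction, not the authors' actual training script; the `output_dir` name and the epoch-level evaluation/save strategy are assumptions.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto the Trainer API.
# output_dir and the epoch-level strategies are assumptions.
training_args = TrainingArguments(
    output_dir="thai-safety-classifier",
    learning_rate=2e-4,              # AdamW is the Trainer's default optimizer
    num_train_epochs=4,
    per_device_train_batch_size=32,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
)
# Early stopping with patience 3 would then be added via
# EarlyStoppingCallback(early_stopping_patience=3).
```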

### Loss Function

Class-balanced loss with β = 0.9999 to address class imbalance.
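
Class-balanced loss weights each class by the inverse of its "effective number of samples," (1 − β^n) / (1 − β) (Cui et al., 2019). A minimal sketch, where the per-class counts are an assumption derived from the reported 79.5% / 20.5% split of the 37,514 training samples:

```python
# Class-balanced weighting (effective number of samples) with beta = 0.9999.
# Per-class counts below are assumptions derived from the reported
# 79.5% / 20.5% split of the 37,514 training samples.
beta = 0.9999
counts = {"safe": 29824, "harmful": 7690}

# weight_c is proportional to (1 - beta) / (1 - beta ** n_c)
raw = {c: (1 - beta) / (1 - beta ** n) for c, n in counts.items()}

# Normalize so the weights sum to the number of classes.
scale = len(raw) / sum(raw.values())
weights = {c: w * scale for c, w in raw.items()}
```

The minority ("harmful") class receives the larger weight; the resulting values could then be passed to e.g. `torch.nn.CrossEntropyLoss(weight=...)`.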

### Dataset

| Split | Samples |
|-------|---------|
| Train | 37,514 |
| Validation | 4,689 |
| Test | 4,690 |
| **Total** | **46,893** |

Class distribution: **79.5% safe**, **20.5% harmful**

## Evaluation Results

Evaluated on the held-out test set (4,690 samples):

| Metric | Score |
|--------|-------|
| Accuracy | 84.4% |
| Weighted F1 | 84.9% |
| Precision | 85.7% |
| Recall | 84.4% |

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

base_model_name = "microsoft/deberta-v3-base"
model_name = "trapoom555/ThaiSafetyClassifier"

# Load the tokenizer, then attach the LoRA adapter to the base
# DeBERTaV3 sequence-classification model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)
model = PeftModel.from_pretrained(base_model, model_name)
model.eval()

# Concatenate the prompt-response pair in the training input format.
prompt = "your prompt here"
response = "llm response here"
text = f"input: {prompt} output: {response}"

inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
    pred = logits.argmax(-1).item()

# Label 1 -> harmful, label 0 -> safe.
label = "harmful" if pred == 1 else "safe"
print(label)
```

## Citation

If you use this model, please cite:

```bibtex
@misc{ukarapol2026thaisafetybenchassessinglanguagemodel,
  title={ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts},
  author={Trapoom Ukarapol and Nut Chukamphaeng and Kunat Pipatanakul and Pakhapoom Sarapat},
  year={2026},
  eprint={2603.04992},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.04992}
}
```