---
license: apache-2.0
base_model: microsoft/deberta-v3-base
library_name: peft
language:
- th
tags:
- text-classification
- safety
- content-moderation
- deberta
- lora
pipeline_tag: text-classification
---
# ThaiSafetyClassifier
A binary classifier that predicts whether an LLM response to a given prompt is **safe** or **harmful** for Thai language and culture. Built by fine-tuning [DeBERTaV3-base](https://huggingface.co/microsoft/deberta-v3-base) with LoRA for parameter-efficient training.
## Model Details
- **Model type:** Text classification (binary)
- **Base model:** `microsoft/deberta-v3-base`
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Language:** Thai
- **Labels:** `0` → safe, `1` → harmful
## Input Format
The model takes a prompt–response pair concatenated as:
```
input: <prompt> output: <llm_response>
```
Tokenized with the DeBERTa tokenizer at a maximum sequence length of **256**.
## Training Details
### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| `lora_r` | 8 |
| `lora_alpha` | 16 |
| `lora_dropout` | 0.1 |
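The table above can be expressed as a `peft` configuration. A minimal sketch, assuming a sequence-classification task type; the target modules are not stated in this card, so they are left to `peft`'s defaults for the base model:

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters from the table above. task_type is an assumption
# matching binary sequence classification; target_modules is omitted so that
# peft falls back to its defaults for DeBERTa — not confirmed by this card.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)
```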
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Epochs | 4 |
| Batch size | 32 |
| Max sequence length | 256 |
| Early stopping patience | 3 |
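For orientation, a hedged sketch of how these hyperparameters map onto `transformers.TrainingArguments`; the card does not publish the training script, so everything beyond the tabled values (including the output path) is illustrative, and early stopping additionally requires an `EarlyStoppingCallback` passed to the `Trainer`:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Hyperparameters from the table above; all other arguments are
# Transformers defaults. This is an illustrative sketch, not the actual script.
training_args = TrainingArguments(
    output_dir="thai-safety-classifier",  # hypothetical output path
    learning_rate=2e-4,
    num_train_epochs=4,
    per_device_train_batch_size=32,
)

# Early stopping with patience 3, as listed in the table.
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```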
### Loss Function
Class-balanced loss with β = 0.9999 to address class imbalance.
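The β parameter suggests class-balanced weighting via the "effective number of samples" (Cui et al., 2019), where each class weight is proportional to (1 − β) / (1 − β^n�肙) for class count n. A sketch of the weight computation, using per-class counts approximated from the split size and the reported class ratio (both assumptions, not exact figures from this card):

```python
beta = 0.9999

# Approximate per-class training counts: 37,514 train samples split
# 79.5% safe / 20.5% harmful — derived estimates, not reported values.
counts = {"safe": 29824, "harmful": 7690}

# Class-balanced weight per class: w_c = (1 - beta) / (1 - beta ** n_c)
weights = {c: (1 - beta) / (1 - beta ** n) for c, n in counts.items()}

# Normalize so the weights sum to the number of classes.
total = sum(weights.values())
weights = {c: w * len(weights) / total for c, w in weights.items()}
# The minority ("harmful") class receives the larger weight.
```

These weights would then be passed to the loss (e.g. as per-class weights in a weighted cross-entropy).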
### Dataset
| Split | Samples |
|-------|---------|
| Train | 37,514 |
| Validation | 4,689 |
| Test | 4,690 |
| **Total** | **46,893** |
Class distribution: **79.5% safe**, **20.5% harmful**
## Evaluation Results
Evaluated on the held-out test set (4,690 samples):
| Metric | Score |
|--------|-------|
| Accuracy | 84.4% |
| Weighted F1 | 84.9% |
| Precision | 85.7% |
| Recall | 84.4% |
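The precision, recall, and F1 figures above appear to be weighted averages over both classes. A toy example showing how such metrics are computed with `scikit-learn`; the label arrays are illustrative, not the model's actual test outputs:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative predictions only (0 = safe, 1 = harmful).
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")
prec = precision_score(y_true, y_pred, average="weighted")
rec = recall_score(y_true, y_pred, average="weighted")
```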
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch
base_model_name = "microsoft/deberta-v3-base"
model_name = "trapoom555/ThaiSafetyClassifier"

# Load the tokenizer and base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)
model = PeftModel.from_pretrained(base_model, model_name)
model.eval()

prompt = "your prompt here"
response = "llm response here"

# Concatenate the pair in the expected input format and classify.
text = f"input: {prompt} output: {response}"
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()
label = "harmful" if pred == 1 else "safe"
print(label)
```
## Citation
If you use this model, please cite:
```bibtex
@misc{ukarapol2026thaisafetybenchassessinglanguagemodel,
title={ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts},
author={Trapoom Ukarapol and Nut Chukamphaeng and Kunat Pipatanakul and Pakhapoom Sarapat},
year={2026},
eprint={2603.04992},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.04992},
}
```