---
license: apache-2.0
base_model: microsoft/deberta-v3-base
library_name: peft
language:
- th
tags:
- text-classification
- safety
- content-moderation
- deberta
- lora
pipeline_tag: text-classification
---

# ThaiSafetyClassifier

A binary classifier that predicts whether an LLM response to a given prompt is **safe** or **harmful** in Thai language and cultural contexts. Built by fine-tuning [DeBERTaV3-base](https://huggingface.co/microsoft/deberta-v3-base) with LoRA for parameter-efficient training.

## Model Details

- **Model type:** Text classification (binary)
- **Base model:** `microsoft/deberta-v3-base`
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Language:** Thai
- **Labels:** `0` → safe, `1` → harmful

## Input Format

The model takes a prompt–response pair concatenated as:

```
input: {prompt} output: {response}
```

Tokenized with the DeBERTa tokenizer at a maximum sequence length of **256**.

## Training Details

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| `lora_r` | 8 |
| `lora_alpha` | 16 |
| `lora_dropout` | 0.1 |

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Epochs | 4 |
| Batch size | 32 |
| Max sequence length | 256 |
| Early stopping patience | 3 |

### Loss Function

Class-balanced loss with β = 0.9999 to address the class imbalance in the training data.
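As a rough illustration, a class-balanced loss in the style of Cui et al. (2019) weights each class by the inverse of its "effective number" of samples, `(1 - β^n_c) / (1 - β)`. The sketch below is not the training code for this model; the class counts are approximated from the reported 79.5% / 20.5% split of the 37,514-sample training set.

```python
def class_balanced_weights(counts, beta=0.9999):
    """Per-class weights from the effective number of samples:
    E_c = (1 - beta^n_c) / (1 - beta); weight_c = 1 / E_c."""
    effective_num = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]
    weights = [1.0 / e for e in effective_num]
    total = sum(weights)
    # Normalize so the weights sum to the number of classes.
    return [w / total * len(counts) for w in weights]

# Approximate counts for [safe, harmful] -- illustrative, not exact.
weights = class_balanced_weights([29824, 7690])
# The resulting weights can be passed to a weighted cross-entropy,
# e.g. torch.nn.CrossEntropyLoss(weight=torch.tensor(weights)).
```

With β = 0.9999 the minority (harmful) class receives a larger weight than the majority (safe) class, which counteracts the roughly 4:1 imbalance.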
### Dataset

| Split | Samples |
|-------|---------|
| Train | 37,514 |
| Validation | 4,689 |
| Test | 4,690 |
| **Total** | **46,893** |

Class distribution: **79.5% safe**, **20.5% harmful**.

## Evaluation Results

Evaluated on the held-out test set (4,690 samples):

| Metric | Score |
|--------|-------|
| Accuracy | 84.4% |
| Weighted F1 | 84.9% |
| Precision | 85.7% |
| Recall | 84.4% |

## How to Use

```python
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base_model_name = "microsoft/deberta-v3-base"
model_name = "trapoom555/ThaiSafetyClassifier"

tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name, num_labels=2
)
model = PeftModel.from_pretrained(base_model, model_name)
model.eval()

prompt = "your prompt here"
response = "llm response here"
text = f"input: {prompt} output: {response}"

inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(-1).item()
label = "harmful" if pred == 1 else "safe"
print(label)
```

## Citation

If you use this model, please cite:

```bibtex
@misc{ukarapol2026thaisafetybenchassessinglanguagemodel,
      title={ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts},
      author={Trapoom Ukarapol and Nut Chukamphaeng and Kunat Pipatanakul and Pakhapoom Sarapat},
      year={2026},
      eprint={2603.04992},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.04992},
}
```