---
license: apache-2.0
base_model: microsoft/deberta-v3-base
library_name: peft
language:
- th
tags:
- text-classification
- safety
- content-moderation
- deberta
- lora
pipeline_tag: text-classification
---
# ThaiSafetyClassifier
A binary classifier that predicts whether an LLM response to a given prompt is **safe** or **harmful** for Thai language and culture. Built by fine-tuning [DeBERTaV3-base](https://huggingface.co/microsoft/deberta-v3-base) with LoRA for parameter-efficient training.
## Model Details
- **Model type:** Text classification (binary)
- **Base model:** `microsoft/deberta-v3-base`
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Language:** Thai
- **Labels:** `0` → safe, `1` → harmful
## Input Format
The model takes a prompt–response pair concatenated as:
```
input: <prompt> output: <llm_response>
```
Tokenized with the DeBERTa tokenizer at a maximum sequence length of **256**.
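A minimal sketch of the concatenation step (the helper name `format_pair` is ours, for illustration only, not part of the model's API):

```python
def format_pair(prompt: str, response: str) -> str:
    # Join a prompt and an LLM response in the format the classifier expects:
    # "input: <prompt> output: <llm_response>"
    return f"input: {prompt} output: {response}"

text = format_pair("สวัสดี", "สวัสดีครับ")
# → "input: สวัสดี output: สวัสดีครับ"
```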
## Training Details
### LoRA Configuration
| Parameter | Value |
|-----------|-------|
| `lora_r` | 8 |
| `lora_alpha` | 16 |
| `lora_dropout` | 0.1 |
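The table above maps onto a `peft` `LoraConfig` roughly as follows; note the task type is an assumption (sequence classification, matching the model head), and the card does not list the LoRA target modules, so they are left at the library default here:

```python
from peft import LoraConfig, TaskType

# LoRA settings from the table above; target_modules left at the
# peft default for DeBERTa since the card does not specify them.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
)
```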
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Epochs | 4 |
| Batch size | 32 |
| Max sequence length | 256 |
| Early stopping patience | 3 |
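A sketch of how these hyperparameters might be passed to the `transformers` `Trainer` (the `output_dir` and evaluation settings are hypothetical; early stopping would be handled separately via `EarlyStoppingCallback(early_stopping_patience=3)`):

```python
from transformers import TrainingArguments

# Hyperparameters from the table above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="thai-safety-classifier",
    learning_rate=2e-4,          # AdamW is the Trainer default optimizer
    num_train_epochs=4,
    per_device_train_batch_size=32,
    load_best_model_at_end=True, # required for early stopping
)
```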
### Loss Function
Class-balanced loss with β = 0.9999 to address class imbalance.
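This weighting likely follows the effective-number-of-samples formulation of Cui et al. (2019), where each class weight is inversely proportional to (1 − β^n) / (1 − β) for class count n. A sketch of the per-class weights; the counts below are derived from the 79.5% / 20.5% split of the 37,514 training samples and are approximate:

```python
def class_balanced_weights(counts, beta=0.9999):
    # Effective number of samples per class: E_n = (1 - beta^n) / (1 - beta).
    effective = [(1.0 - beta**n) / (1.0 - beta) for n in counts]
    # Weight each class inversely to its effective number of samples.
    weights = [1.0 / e for e in effective]
    # Normalize so the weights sum to the number of classes.
    total = sum(weights)
    return [w * len(counts) / total for w in weights]

# Approximate train-split counts: ~79.5% safe, ~20.5% harmful.
weights = class_balanced_weights([29824, 7690])
```

The minority (harmful) class receives the larger weight, counteracting the 4:1 imbalance during training.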
### Dataset
| Split | Samples |
|-------|---------|
| Train | 37,514 |
| Validation | 4,689 |
| Test | 4,690 |
| **Total** | **46,893** |
Class distribution: **79.5% safe**, **20.5% harmful**
## Evaluation Results
Evaluated on the held-out test set (4,690 samples):
| Metric | Score |
|--------|-------|
| Accuracy | 84.4% |
| Weighted F1 | 84.9% |
| Precision | 85.7% |
| Recall | 84.4% |
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch
base_model_name = "microsoft/deberta-v3-base"
model_name = "trapoom555/ThaiSafetyClassifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)
model = PeftModel.from_pretrained(base_model, model_name)
model.eval()
prompt = "your prompt here"
response = "llm response here"
text = f"input: {prompt} output: {response}"
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()
label = "harmful" if pred == 1 else "safe"
print(label)
```
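If you need a confidence score rather than a hard label, apply a softmax to the logits. A small sketch with example logits (in practice, use the `logits` tensor produced by `model(**inputs).logits` in the snippet above):

```python
import torch

# Example logits for one input; replace with model(**inputs).logits.
logits = torch.tensor([[1.2, -0.8]])
probs = torch.softmax(logits, dim=-1)
p_harmful = probs[0, 1].item()
label = "harmful" if p_harmful >= 0.5 else "safe"
```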
## Citation
If you use this model, please cite the associated paper:
```bibtex
@misc{ukarapol2026thaisafetybenchassessinglanguagemodel,
title={ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts},
author={Trapoom Ukarapol and Nut Chukamphaeng and Kunat Pipatanakul and Pakhapoom Sarapat},
year={2026},
eprint={2603.04992},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.04992},
}
```