---
license: apache-2.0
base_model: microsoft/deberta-v3-base
library_name: peft
language:
- th
tags:
- text-classification
- safety
- content-moderation
- deberta
- lora
pipeline_tag: text-classification
---

# ThaiSafetyClassifier

A binary classifier that predicts whether an LLM response to a given prompt is **safe** or **harmful** in a Thai linguistic and cultural context. Built by fine-tuning [DeBERTaV3-base](https://huggingface.co/microsoft/deberta-v3-base) with LoRA for parameter-efficient training.

## Model Details

- **Model type:** Text classification (binary)
- **Base model:** `microsoft/deberta-v3-base`
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Language:** Thai
- **Labels:** `0` → safe, `1` → harmful

## Input Format

The model takes a prompt–response pair concatenated as:

```
input: <prompt> output: <llm_response>
```

Tokenized with the DeBERTa tokenizer at a maximum sequence length of **256**.
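
The concatenation step can be sketched as a small helper (the function name is illustrative, not part of the released code):

```python
def build_input(prompt: str, response: str) -> str:
    # Concatenate the prompt and the LLM response in the format the classifier expects.
    return f"input: {prompt} output: {response}"

text = build_input("ขอคำแนะนำหน่อย", "ลองพักผ่อนให้เพียงพอครับ")
print(text)  # input: ขอคำแนะนำหน่อย output: ลองพักผ่อนให้เพียงพอครับ
```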

## Training Details

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| `lora_r` | 8 |
| `lora_alpha` | 16 |
| `lora_dropout` | 0.1 |
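
For context, LoRA adds a scaled low-rank update to each adapted weight matrix, with scaling factor `lora_alpha / lora_r` (2.0 for the values above). A minimal pure-Python sketch of the forward pass, using tiny illustrative shapes rather than the model's real dimensions:

```python
# LoRA forward: h = W @ x + (alpha / r) * B @ (A @ x)
r, alpha = 8, 16
scaling = alpha / r  # 2.0 with the configuration above

def matvec(M, x):
    # Plain matrix-vector product over nested lists.
    return [sum(m * v for m, v in zip(row, x)) for row in M]

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (2x2 for illustration)
A = [[0.5, 0.5]]              # trainable down-projection (rank 1 here)
B = [[1.0], [1.0]]            # trainable up-projection
x = [2.0, 4.0]

h = [w + scaling * b for w, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]
```

Only `A` and `B` are trained; the base weights stay frozen, which is what makes the fine-tuning parameter-efficient.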

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Epochs | 4 |
| Batch size | 32 |
| Max sequence length | 256 |
| Early stopping patience | 3 |

### Loss Function

Class-balanced loss with β = 0.9999, which reweights each class by the inverse of its effective number of samples, to address the class imbalance in the training data.
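
In the standard class-balanced formulation, the effective number of samples for a class with n_c examples is (1 − β^n_c) / (1 − β), and the class weight is its inverse. A minimal sketch; the per-class counts below are assumptions derived from the reported 79.5% / 20.5% split of the 37,514 training samples, not published figures:

```python
beta = 0.9999

def cb_weight(n_c: int) -> float:
    # Inverse of the effective number of samples (1 - beta**n_c) / (1 - beta).
    effective_num = (1.0 - beta ** n_c) / (1.0 - beta)
    return 1.0 / effective_num

# Illustrative per-class training counts (~79.5% safe / ~20.5% harmful).
counts = {"safe": 29_824, "harmful": 7_690}
weights = {label: cb_weight(n) for label, n in counts.items()}

# Weights are typically normalized to sum to the number of classes.
total = sum(weights.values())
weights = {k: v * len(weights) / total for k, v in weights.items()}
```

The minority (harmful) class receives the larger weight, so its errors contribute more to the loss.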

### Dataset

| Split | Samples |
|-------|---------|
| Train | 37,514 |
| Validation | 4,689 |
| Test | 4,690 |
| **Total** | **46,893** |

Class distribution: **79.5% safe**, **20.5% harmful**.

## Evaluation Results

Evaluated on the held-out test set (4,690 samples):

| Metric | Score |
|--------|-------|
| Accuracy | 84.4% |
| Weighted F1 | 84.9% |
| Precision | 85.7% |
| Recall | 84.4% |

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel
import torch

base_model_name = "microsoft/deberta-v3-base"
model_name = "trapoom555/ThaiSafetyClassifier"

# Load the tokenizer and base model, then attach the LoRA adapter weights.
tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=2)
model = PeftModel.from_pretrained(base_model, model_name)
model.eval()

prompt = "your prompt here"
response = "llm response here"
text = f"input: {prompt} output: {response}"

# Tokenize with the same 256-token limit used during training.
inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()

label = "harmful" if pred == 1 else "safe"
print(label)
```

## Citation

If you use this model, please cite the relevant works:

```bibtex
@misc{ukarapol2026thaisafetybenchassessinglanguagemodel,
  title={ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts},
  author={Trapoom Ukarapol and Nut Chukamphaeng and Kunat Pipatanakul and Pakhapoom Sarapat},
  year={2026},
  eprint={2603.04992},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.04992},
}
```