---
license: apache-2.0
base_model: microsoft/deberta-v3-base
library_name: peft
language:
- th
tags:
- text-classification
- safety
- content-moderation
- deberta
- lora
pipeline_tag: text-classification
---

# ThaiSafetyClassifier

A binary classifier that predicts whether an LLM response to a given prompt is **safe** or **harmful** in Thai language and cultural contexts. Built by fine-tuning [DeBERTaV3-base](https://huggingface.co/microsoft/deberta-v3-base) with LoRA for parameter-efficient training.

## Model Details

- **Model type:** Text classification (binary)
- **Base model:** `microsoft/deberta-v3-base`
- **Fine-tuning method:** LoRA (Low-Rank Adaptation)
- **Language:** Thai
- **Labels:** `0` → safe, `1` → harmful

## Input Format

The model takes a prompt–response pair concatenated as:

```
input: {prompt} output: {response}
```

Tokenized with the DeBERTa tokenizer at a maximum sequence length of **256**.

## Training Details

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| `lora_r` | 8 |
| `lora_alpha` | 16 |
| `lora_dropout` | 0.1 |

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Epochs | 4 |
| Batch size | 32 |
| Max sequence length | 256 |
| Early stopping patience | 3 |

### Loss Function

Class-balanced loss with β = 0.9999 to address the class imbalance in the training data.
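As a rough illustration, a class-balanced loss in the style of Cui et al. (2019) weights each class by the inverse of its "effective number" of samples, `(1 - β^n_c) / (1 - β)`. The sketch below is not the training code for this model; the class counts are approximated from the reported 79.5% / 20.5% split of the 37,514-sample training set.

```python
def class_balanced_weights(counts, beta=0.9999):
    """Per-class weights from the effective number of samples:
    E_c = (1 - beta^n_c) / (1 - beta); weight_c = 1 / E_c."""
    effective_num = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]
    weights = [1.0 / e for e in effective_num]
    total = sum(weights)
    # Normalize so the weights sum to the number of classes.
    return [w / total * len(counts) for w in weights]

# Approximate counts for [safe, harmful] -- illustrative, not exact.
weights = class_balanced_weights([29824, 7690])
# The resulting weights can be passed to a weighted cross-entropy,
# e.g. torch.nn.CrossEntropyLoss(weight=torch.tensor(weights)).
```

With β = 0.9999 the minority (harmful) class receives a larger weight than the majority (safe) class, which counteracts the roughly 4:1 imbalance.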
### Dataset

| Split | Samples |
|-------|---------|
| Train | 37,514 |
| Validation | 4,689 |
| Test | 4,690 |
| **Total** | **46,893** |

Class distribution: **79.5% safe**, **20.5% harmful**.

## Evaluation Results

Evaluated on the held-out test set (4,690 samples):

| Metric | Score |
|--------|-------|
| Accuracy | 84.4% |
| Weighted F1 | 84.9% |
| Precision | 85.7% |
| Recall | 84.4% |

## How to Use

```python
import torch
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base_model_name = "microsoft/deberta-v3-base"
model_name = "trapoom555/ThaiSafetyClassifier"

tokenizer = AutoTokenizer.from_pretrained(model_name)
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_name, num_labels=2
)
model = PeftModel.from_pretrained(base_model, model_name)
model.eval()

prompt = "your prompt here"
response = "llm response here"
text = f"input: {prompt} output: {response}"

inputs = tokenizer(text, return_tensors="pt", max_length=256, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(-1).item()
label = "harmful" if pred == 1 else "safe"
print(label)
```

## Citation

If you use this model, please cite:

```bibtex
@misc{ukarapol2026thaisafetybenchassessinglanguagemodel,
      title={ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts},
      author={Trapoom Ukarapol and Nut Chukamphaeng and Kunat Pipatanakul and Pakhapoom Sarapat},
      year={2026},
      eprint={2603.04992},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.04992},
}
```