---
license: agpl-3.0
datasets:
- ntphuc149/ViLegalTexts
language:
- vi
pipeline_tag: fill-mask
library_name: transformers
tags:
- vilegallm
- legal
- vietnamese
- phobert
- continual-pretraining
- legal-nlp
---
# ViLegalBERT
ViLegalBERT is an encoder-only language model for Vietnamese legal text understanding, part of the ViLegalLM suite. It is continually pretrained from PhoBERT-base-v2 on a newly curated 16GB Vietnamese legal corpus. ViLegalBERT achieves state-of-the-art results across Vietnamese legal downstream tasks including Information Retrieval, Question Answering, and Natural Language Inference.
**Paper:** ViLegalLM: Language Models for Vietnamese Legal Text (ACL 2026)

**Resources:** GitHub | ViLegalQwen2.5-1.5B-Base | ViLegalQwen3-1.7B-Base
## How to Use
ViLegalBERT shares the same tokenizer as PhoBERT-base-v2. Load the model weights from this repository and the tokenizer from PhoBERT:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
```
**Note:** Input text must be word-segmented (e.g., with PyVi or VnCoreNLP) before tokenization. PhoBERT's tokenizer also has an installation process that differs from standard Hugging Face models; refer to the PhoBERT-base-v2 model page for installation and usage instructions.
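To make the expected input format concrete, here is a hedged sketch of the flow. The `to_segmented` helper is a toy stand-in that only illustrates the underscore-joined word convention; real preprocessing should use PyVi's `ViTokenizer.tokenize` or VnCoreNLP, and `embed` is a hypothetical helper name, not part of this repository:

```python
# Sketch of the expected input format and inference flow. The helper below is
# a toy stand-in: real word segmentation should be done with PyVi
# (ViTokenizer.tokenize) or VnCoreNLP, as noted above.

def to_segmented(words):
    """Join the syllables of each multi-syllable word with '_' and words with
    spaces: the surface form PhoBERT-style tokenizers expect."""
    return " ".join("_".join(syllables) for syllables in words)

def embed(segmented_text):
    """Return last-layer hidden states for an already-segmented sentence.
    (Downloads weights on first call; not executed in this sketch.)"""
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
    model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
    inputs = tokenizer(segmented_text, return_tensors="pt",
                       truncation=True, max_length=256)
    return model(**inputs).last_hidden_state

# "Bộ luật dân sự" (Civil Code), with "dân sự" segmented as one word:
print(to_segmented([["Bộ"], ["luật"], ["dân", "sự"]]))  # Bộ luật dân_sự
```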
## Model Summary
| Attribute | Value |
|---|---|
| Architecture | RoBERTa (encoder-only) |
| Parameters | 135M |
| Base model | PhoBERT-base-v2 |
| Max sequence length | 256 tokens |
| Tokenizer | PhoBERT tokenizer (PyVi word segmentation) |
| Training objective | Masked Language Modeling (MLM, 15% mask) |
| Training domain | Vietnamese legal text |
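The MLM objective in the table can be sketched as BERT-style token corruption: 15% of positions are selected per the card, and the standard 80/10/10 mask/random/keep split is assumed here (the card does not state the split). Names below are illustrative, not from the training code:

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, rng=None):
    """BERT-style MLM corruption: select ~mlm_prob of positions; of those,
    80% become mask_id, 10% a random token, 10% stay unchanged. Labels hold
    the original id at selected positions and -100 (ignore index) elsewhere."""
    rng = rng or random.Random(0)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: keep the original token (model must still predict it)
    return inputs, labels
```

The model is then trained to predict the original ids at the selected positions only (cross-entropy ignores the -100 labels).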
## Evaluation Results
ViLegalBERT obtains the strongest overall results among the evaluated encoder-only models on the Vietnamese legal benchmarks below, leading on most metrics. **Bold** = best score per column within the encoder group.
### Information Retrieval

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| **ALQAC-IR** | | | |
| PhoBERT-base | 0.5094 | 0.4613 | 0.4720 |
| PhoBERT-base-v2 | 0.6279 | 0.5710 | 0.5779 |
| VNLawBERT | 0.6581 | 0.6053 | 0.6081 |
| ViLegalBERT | **0.6786** | **0.6248** | **0.6304** |
| **ZALO-IR** | | | |
| PhoBERT-base | 0.5533 | 0.5165 | 0.5188 |
| PhoBERT-base-v2 | 0.5936 | 0.5541 | 0.5597 |
| VNLawBERT | 0.6020 | 0.5550 | 0.5609 |
| ViLegalBERT | **0.6300** | **0.5878** | **0.5912** |
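For reference, the two headline ranking metrics can be computed from a ranked list of relevance labels as follows. This is a generic sketch of the standard definitions, not the paper's exact evaluation script:

```python
import math

def dcg(rels):
    """Discounted cumulative gain of a ranked relevance list."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(rels, k=10):
    """DCG of the top-k ranking, normalized by the ideal (sorted) ranking."""
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

def mrr_at_k(rels, k=10):
    """Reciprocal rank of the first relevant result within the top k."""
    for i, r in enumerate(rels[:k]):
        if r > 0:
            return 1.0 / (i + 1)
    return 0.0
```

MAP@100 follows the same pattern, averaging precision at each relevant rank over the top 100 results and then over queries.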
### Question Answering

#### True/False (ALQAC-TF)

| Model | Pre | Rec | F1 |
|---|---|---|---|
| PhoBERT-base | 67.26 | 77.16 | 71.87 |
| PhoBERT-base-v2 | **76.88** | 67.51 | 71.89 |
| VNLawBERT | 72.91 | 75.13 | 74.00 |
| ViLegalBERT | 70.97 | **78.17** | **74.40** |
#### Multiple Choice (ALQAC-MCQ & VLSP-MCQ-LK)

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| **ALQAC-MCQ** | | | |
| PhoBERT-base | 62.03 | 62.77 | 62.05 |
| PhoBERT-base-v2 | **62.96** | 63.70 | 58.39 |
| VNLawBERT | 61.00 | 61.12 | 60.80 |
| ViLegalBERT | 62.76 | **63.71** | **62.83** |
| **VLSP-MCQ-LK** | | | |
| PhoBERT-base | 55.75 | 53.18 | 53.34 |
| PhoBERT-base-v2 | 57.67 | 54.46 | 55.70 |
| VNLawBERT | 56.87 | 52.35 | 54.19 |
| ViLegalBERT | **58.05** | **56.30** | **56.53** |
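The `_mac` columns are macro-averages: precision, recall, and F1 computed per answer class and averaged unweighted over classes. A minimal sketch of that computation (illustrative names, not the evaluation code):

```python
def macro_prf(y_true, y_pred, labels):
    """Per-class precision/recall/F1, averaged unweighted over classes."""
    ps, rs, fs = [], [], []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n
```

Because every class counts equally, macro scores penalize models that do well only on the majority class.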
#### Extractive QA (ALQAC-EQA & ViBidLQA-EQA)

| Model | EM | F1 |
|---|---|---|
| **ALQAC-EQA** | | |
| PhoBERT-base | 37.25 | 64.46 |
| PhoBERT-base-v2 | 40.58 | **68.01** |
| VNLawBERT | 38.82 | 65.85 |
| ViLegalBERT | **41.17** | 65.92 |
| **ViBidLQA-EQA** | | |
| PhoBERT-base | 49.34 | 77.45 |
| PhoBERT-base-v2 | **50.66** | 78.59 |
| VNLawBERT | 48.49 | 76.32 |
| ViLegalBERT | 50.19 | **78.63** |
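EM and F1 here are the usual extractive-QA metrics: exact string match, and token-overlap F1 between the predicted and gold answer spans (SQuAD-style). The sketch below omits the answer normalization that evaluation scripts typically apply, so treat it as an assumption-laden illustration:

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if the predicted span equals the gold span exactly, else 0."""
    return int(pred.strip() == gold.strip())

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall over the two spans."""
    p_toks, g_toks = pred.split(), gold.split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)
```

F1 rewards partially correct spans that EM scores as zero, which is why the F1 columns sit well above EM.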
### Natural Language Inference (VLSP-NLI)

| Model | Precision | Recall | F1 |
|---|---|---|---|
| PhoBERT-base | 59.26 | 21.33 | 31.37 |
| PhoBERT-base-v2 | 55.26 | 28.00 | 37.17 |
| VNLawBERT | 62.50 | 13.33 | 21.98 |
| ViLegalBERT | **72.28** | **97.33** | **82.95** |
## Also in ViLegalLM
| Model | Architecture | Params | Context (tokens) |
|---|---|---|---|
| ViLegalBERT (this model) | Encoder-only | 135M | 256 |
| ViLegalQwen2.5-1.5B-Base | Decoder-only | 1.54B | 2048 |
| ViLegalQwen3-1.7B-Base | Decoder-only | 1.72B | 4096 |
## Limitations and Biases
- **Domain scope:** Trained exclusively on Vietnamese legal texts; may not generalize to other legal systems or jurisdictions.
- **Context length:** Maximum 256 tokens, which may limit performance on long legal passages.
- **Temporal bias:** The legal corpus reflects Vietnamese law as of the collection date; model outputs may not reflect recent legislative changes.
- **Inherited biases:** May reflect biases present in the source legal corpora, including regional variations in legal practice and domain coverage imbalances.
- **Not a legal authority:** Model outputs should never be used as definitive legal interpretations without expert validation.
## Intended Use

**Intended for:**
- Vietnamese legal information retrieval and document ranking
- Legal question answering (True/False, Multiple Choice, Extractive)
- Natural language inference on Vietnamese legal text
- Feature extraction and semantic similarity for Vietnamese legal documents
- Research on Vietnamese legal NLP
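For the feature-extraction and semantic-similarity use cases, one common recipe (an assumption here, not a prescription from the paper) is attention-mask-aware mean pooling over `last_hidden_state`, followed by cosine similarity. A dependency-free sketch on plain Python lists:

```python
import math

def masked_mean_pool(hidden, mask):
    """Average token vectors where mask == 1.
    hidden: [seq_len][dim] token vectors; mask: [seq_len] 0/1 attention mask."""
    dim = len(hidden[0])
    total, n = [0.0] * dim, 0
    for vec, m in zip(hidden, mask):
        if m:
            n += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / n for t in total]

def cosine(a, b):
    """Cosine similarity between two pooled sentence vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Masking matters because padding tokens would otherwise drag the sentence vector toward the padding embedding.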
**Not intended for:**
- Replacing professional legal counsel or human judgment in legal decision-making
- Providing legal advice without expert validation
- Legal systems outside Vietnam without careful domain adaptation
## Citation

If you use ViLegalBERT, please cite both our paper and the original PhoBERT paper:
<!-- ViLegalLM citation — available soon -->

```bibtex
@inproceedings{phobert,
    title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
    author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
    year      = {2020},
    pages     = {1037--1042}
}
```
## License
This model is derived from PhoBERT-base-v2, which is licensed under AGPL-3.0. As a derivative work, ViLegalBERT is also released under AGPL-3.0. Any downstream use, modification, or deployment as a service must comply with the AGPL-3.0 terms, including making source code publicly available.