| --- |
| license: agpl-3.0 |
| datasets: |
| - ntphuc149/ViLegalTexts |
| language: |
| - vi |
| pipeline_tag: fill-mask |
| library_name: transformers |
| tags: |
| - vilegallm |
| - legal |
| - vietnamese |
| - phobert |
| - continual-pretraining |
| - legal-nlp |
| --- |
| |
| # ViLegalBERT |
|
|
| ViLegalBERT is an encoder-only language model for **Vietnamese legal text understanding**, part of the **ViLegalLM** suite. It is continually pretrained from [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2) on a newly curated 16GB Vietnamese legal corpus. ViLegalBERT achieves state-of-the-art results across Vietnamese legal downstream tasks including Information Retrieval, Question Answering, and Natural Language Inference. |
|
|
| **Paper**: _ViLegalLM: Language Models for Vietnamese Legal Text_ — [ACL 2026](https://aclanthology.org/) |
|
|
| **Resources**: [GitHub](https://github.com/ntphuc149/ViLegalLM) | [ViLegalQwen2.5-1.5B-Base](https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base) | [ViLegalQwen3-1.7B-Base](https://huggingface.co/ntphuc149/ViLegalQwen3-1.7B-Base) |
|
|
| --- |
|
|
| ## How to Use |
|
|
| ViLegalBERT shares the same tokenizer as [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2). Load the model weights from this repository and the tokenizer from PhoBERT: |
|
|
| ```python |
| from transformers import AutoModel, AutoTokenizer |
| |
| tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2") |
| model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT") |
| ``` |
|
|
> **Note:** Input text must be word-segmented before tokenization (e.g. with a Vietnamese word segmenter such as PyVi), since PhoBERT was pretrained on word-segmented text. PhoBERT's tokenizer also has a setup process that differs from standard Hugging Face models; see the [PhoBERT-base-v2 model page](https://huggingface.co/vinai/phobert-base-v2) for installation and usage instructions.
|
|
| --- |
|
|
| ## Model Summary |
|
|
| <details> |
| <summary><b>Summary for ViLegalBERT checkpoint</b> (click to expand)</summary> |
|
|
| | Attribute | Value | |
| | ------------------- | --------------------------------------------------------------- | |
| | Architecture | RoBERTa (encoder-only) | |
| | Parameters | 135M | |
| | Base model | [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2) | |
| | Max sequence length | 256 tokens | |
| | Tokenizer | PhoBERT tokenizer (PyVi word segmentation) | |
| | Training objective | Masked Language Modeling (MLM, 15% mask) | |
| | Training domain | Vietnamese legal text | |
|
|
| </details> |
|
|
| --- |
|
|
| ## Evaluation Results |
|
|
| ViLegalBERT achieves state-of-the-art results among encoder-only models across all evaluated Vietnamese legal benchmarks. **Bold** = best in encoder group. |
|
|
| <details> |
| <summary><b>Information Retrieval</b> (click to expand)</summary> |
|
|
| | Model | nDCG@10 | MRR@10 | MAP@100 | |
| | --------------- | ---------- | ---------- | ---------- | |
| | _**ALQAC-IR**_ | | | | |
| | PhoBERT-base | 0.5094 | 0.4613 | 0.4720 | |
| | PhoBERT-base-v2 | 0.6279 | 0.5710 | 0.5779 | |
| | VNLawBERT | 0.6581 | 0.6053 | 0.6081 | |
| | **ViLegalBERT** | **0.6786** | **0.6248** | **0.6304** | |
| | _**ZALO-IR**_ | | | | |
| | PhoBERT-base | 0.5533 | 0.5165 | 0.5188 | |
| | PhoBERT-base-v2 | 0.5936 | 0.5541 | 0.5597 | |
| | VNLawBERT | 0.6020 | 0.5550 | 0.5609 | |
| | **ViLegalBERT** | **0.6300** | **0.5878** | **0.5912** | |
|
|
| </details> |
|
|
| <details> |
| <summary><b>Question Answering</b> (click to expand)</summary> |
|
|
| **True/False — ALQAC-TF** |
|
|
| Model           | Precision | Recall    | F1        |
| --------------- | --------- | --------- | --------- |
| PhoBERT-base    | 67.26     | 77.16     | 71.87     |
| PhoBERT-base-v2 | **76.88** | 67.51     | 71.89     |
| VNLawBERT       | 72.91     | 75.13     | 74.00     |
| **ViLegalBERT** | 70.97     | **78.17** | **74.40** |
|
|
| **Multiple Choice — ALQAC-MCQ & VLSP-MCQ-LK** |
|
|
| Model             | Macro-Pre | Macro-Rec | Macro-F1  |
| ----------------- | --------- | --------- | --------- |
| | _**ALQAC-MCQ**_ | | | | |
| | PhoBERT-base | 62.03 | 62.77 | 62.05 | |
| | PhoBERT-base-v2 | **62.96** | 63.70 | 58.39 | |
| | VNLawBERT | 61.00 | 61.12 | 60.80 | |
| | **ViLegalBERT** | 62.76 | **63.71** | **62.83** | |
| | _**VLSP-MCQ-LK**_ | | | | |
| | PhoBERT-base | 55.75 | 53.18 | 53.34 | |
| | PhoBERT-base-v2 | 57.67 | 54.46 | 55.70 | |
| | VNLawBERT | 56.87 | 52.35 | 54.19 | |
| | **ViLegalBERT** | **58.05** | **56.30** | **56.53** | |
|
|
| **Extractive QA — ALQAC-EQA & ViBidLQA-EQA** |
|
|
| Model              | EM        | F1        |
| ------------------ | --------- | --------- |
| | _**ALQAC-EQA**_ | | | |
| | PhoBERT-base | 37.25 | 64.46 | |
| | PhoBERT-base-v2 | 40.58 | **68.01** | |
| | VNLawBERT | 38.82 | 65.85 | |
| | **ViLegalBERT** | **41.17** | 65.92 | |
| | _**ViBidLQA-EQA**_ | | | |
| | PhoBERT-base | 49.34 | 77.45 | |
| | PhoBERT-base-v2 | **50.66** | 78.59 | |
| | VNLawBERT | 48.49 | 76.32 | |
| | **ViLegalBERT** | 50.19 | **78.63** | |
|
|
| </details> |
|
|
| <details> |
| <summary><b>Natural Language Inference</b> (click to expand)</summary> |
|
|
| **NLI — VLSP-NLI** |
|
|
| | Model | Precision | Recall | F1 | |
| | --------------- | --------- | --------- | --------- | |
| | PhoBERT-base | 59.26 | 21.33 | 31.37 | |
| | PhoBERT-base-v2 | 55.26 | 28.00 | 37.17 | |
| | VNLawBERT | 62.50 | 13.33 | 21.98 | |
| | **ViLegalBERT** | **72.28** | **97.33** | **82.95** | |
|
|
| </details> |
|
|
| --- |
|
|
| ## Also in ViLegalLM |
|
|
| | Model | Architecture | Params | Context | |
| | ------------------------------------------------------ | ------------ | ------ | ------- | |
| | ViLegalBERT (this model) | Encoder-only | 135M | 256 | |
| | [ViLegalQwen2.5-1.5B-Base](https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base) | Decoder-only | 1.54B | 2048 | |
| | [ViLegalQwen3-1.7B-Base](https://huggingface.co/ntphuc149/ViLegalQwen3-1.7B-Base) | Decoder-only | 1.72B | 4096 | |
|
|
| --- |
|
|
| ## Limitations and Biases |
|
|
| - **Domain scope:** Trained exclusively on Vietnamese legal texts; may not generalize to other legal systems or jurisdictions. |
| - **Context length:** Maximum 256 tokens, which may limit performance on long legal passages. |
| - **Temporal bias:** Legal corpus reflects Vietnamese law as of the collection date; model outputs may not reflect recent legislative changes. |
| - **Inherited biases:** May reflect biases present in the source legal corpora, including regional variations in legal practice and domain coverage imbalances. |
| - **Not a legal authority:** Model outputs should never be used as definitive legal interpretations without expert validation. |
|
|
| --- |
|
|
| ## Intended Use |
|
|
| **Intended for:** |
|
|
| - Vietnamese legal information retrieval and document ranking |
| - Legal question answering (True/False, Multiple Choice, Extractive) |
| - Natural language inference on Vietnamese legal text |
| - Feature extraction and semantic similarity for Vietnamese legal documents |
| - Research on Vietnamese legal NLP |
|
|
| **Not intended for:** |
|
|
| - Replacing professional legal counsel or human judgment in legal decision-making |
| - Providing legal advice without expert validation |
| - Legal systems outside Vietnam without careful domain adaptation |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use ViLegalBERT, please cite both our paper and the original PhoBERT paper: |
|
|
```bibtex
% ViLegalLM citation: available soon
```
|
|
| ```bibtex |
| @inproceedings{phobert, |
| title = {{PhoBERT: Pre-trained language models for Vietnamese}}, |
| author = {Dat Quoc Nguyen and Anh Tuan Nguyen}, |
| booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020}, |
| year = {2020}, |
| pages = {1037--1042} |
| } |
| ``` |
|
|
| --- |
|
|
| ## License |
|
|
| [AGPL-3.0](https://www.gnu.org/licenses/agpl-3.0.html) |
|
|
| This model is derived from [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2) which is licensed under AGPL-3.0. As a derivative work, ViLegalBERT is also released under AGPL-3.0. Any downstream use, modification, or deployment as a service must comply with the AGPL-3.0 terms, including making source code publicly available. |