---
license: agpl-3.0
datasets:
- ntphuc149/ViLegalTexts
language:
- vi
pipeline_tag: fill-mask
library_name: transformers
tags:
- vilegallm
- legal
- vietnamese
- phobert
- continual-pretraining
- legal-nlp
---
# ViLegalBERT
ViLegalBERT is an encoder-only language model for **Vietnamese legal text understanding**, part of the **ViLegalLM** suite. It is continually pretrained from [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2) on a newly curated 16GB Vietnamese legal corpus. ViLegalBERT achieves state-of-the-art results across Vietnamese legal downstream tasks including Information Retrieval, Question Answering, and Natural Language Inference.
**Paper**: _ViLegalLM: Language Models for Vietnamese Legal Text_ — [ACL 2026](https://aclanthology.org/)

**Resources**: [GitHub](https://github.com/ntphuc149/ViLegalLM) | [ViLegalQwen2.5-1.5B-Base](https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base) | [ViLegalQwen3-1.7B-Base](https://huggingface.co/ntphuc149/ViLegalQwen3-1.7B-Base)
---
## How to Use
ViLegalBERT shares the same tokenizer as [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2). Load the model weights from this repository and the tokenizer from PhoBERT:
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
```
> **Note:** Input text should be word-segmented before tokenizing. Also, PhoBERT's tokenizer has a specific installation process that differs from standard Hugging Face models — please refer to the [PhoBERT-base-v2 model page](https://huggingface.co/vinai/phobert-base-v2) for installation and usage instructions.
---
## Model Summary
<details>
<summary><b>Summary for ViLegalBERT checkpoint</b> (click to expand)</summary>

| Attribute | Value |
| ------------------- | --------------------------------------------------------------- |
| Architecture | RoBERTa (encoder-only) |
| Parameters | 135M |
| Base model | [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2) |
| Max sequence length | 256 tokens |
| Tokenizer | PhoBERT tokenizer (PyVi word segmentation) |
| Training objective | Masked Language Modeling (MLM, 15% mask) |
| Training domain | Vietnamese legal text |
</details>
---
## Evaluation Results
ViLegalBERT achieves state-of-the-art results among encoder-only models across all evaluated Vietnamese legal benchmarks. **Bold** = best in encoder group.
<details>
<summary><b>Information Retrieval</b> (click to expand)</summary>

| Model | nDCG@10 | MRR@10 | MAP@100 |
| --------------- | ---------- | ---------- | ---------- |
| _**ALQAC-IR**_ | | | |
| PhoBERT-base | 0.5094 | 0.4613 | 0.4720 |
| PhoBERT-base-v2 | 0.6279 | 0.5710 | 0.5779 |
| VNLawBERT | 0.6581 | 0.6053 | 0.6081 |
| **ViLegalBERT** | **0.6786** | **0.6248** | **0.6304** |
| _**ZALO-IR**_ | | | |
| PhoBERT-base | 0.5533 | 0.5165 | 0.5188 |
| PhoBERT-base-v2 | 0.5936 | 0.5541 | 0.5597 |
| VNLawBERT | 0.6020 | 0.5550 | 0.5609 |
| **ViLegalBERT** | **0.6300** | **0.5878** | **0.5912** |
</details>
<details>
<summary><b>Question Answering</b> (click to expand)</summary>

**True/False — ALQAC-TF**

| Model | Pre | Rec | F1 |
| --------------------------- | --------- | --------- | --------- |
| PhoBERT-base | 67.26 | 77.16 | 71.87 |
| PhoBERT-base-v2 | **76.88** | 67.51 | 71.89 |
| VNLawBERT | 72.91 | 75.13 | 74.00 |
| **ViLegalBERT** | 70.97 | **78.17** | **74.40** |
**Multiple Choice — ALQAC-MCQ & VLSP-MCQ-LK**

| Model | Pre_mac | Rec_mac | F1_mac |
| ----------------- | --------- | --------- | --------- |
| _**ALQAC-MCQ**_ | | | |
| PhoBERT-base | 62.03 | 62.77 | 62.05 |
| PhoBERT-base-v2 | **62.96** | 63.70 | 58.39 |
| VNLawBERT | 61.00 | 61.12 | 60.80 |
| **ViLegalBERT** | 62.76 | **63.71** | **62.83** |
| _**VLSP-MCQ-LK**_ | | | |
| PhoBERT-base | 55.75 | 53.18 | 53.34 |
| PhoBERT-base-v2 | 57.67 | 54.46 | 55.70 |
| VNLawBERT | 56.87 | 52.35 | 54.19 |
| **ViLegalBERT** | **58.05** | **56.30** | **56.53** |
**Extractive QA — ALQAC-EQA & ViBidLQA-EQA**

| Model              | EM        | F1        |
| ------------------ | --------- | --------- |
| _**ALQAC-EQA**_ | | |
| PhoBERT-base | 37.25 | 64.46 |
| PhoBERT-base-v2 | 40.58 | **68.01** |
| VNLawBERT | 38.82 | 65.85 |
| **ViLegalBERT** | **41.17** | 65.92 |
| _**ViBidLQA-EQA**_ | | |
| PhoBERT-base | 49.34 | 77.45 |
| PhoBERT-base-v2 | **50.66** | 78.59 |
| VNLawBERT | 48.49 | 76.32 |
| **ViLegalBERT** | 50.19 | **78.63** |
</details>
<details>
<summary><b>Natural Language Inference</b> (click to expand)</summary>

**NLI — VLSP-NLI**

| Model | Precision | Recall | F1 |
| --------------- | --------- | --------- | --------- |
| PhoBERT-base | 59.26 | 21.33 | 31.37 |
| PhoBERT-base-v2 | 55.26 | 28.00 | 37.17 |
| VNLawBERT | 62.50 | 13.33 | 21.98 |
| **ViLegalBERT** | **72.28** | **97.33** | **82.95** |
</details>
---
## Also in ViLegalLM
| Model | Architecture | Params | Context |
| ------------------------------------------------------ | ------------ | ------ | ------- |
| ViLegalBERT (this model) | Encoder-only | 135M | 256 |
| [ViLegalQwen2.5-1.5B-Base](https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base) | Decoder-only | 1.54B | 2048 |
| [ViLegalQwen3-1.7B-Base](https://huggingface.co/ntphuc149/ViLegalQwen3-1.7B-Base) | Decoder-only | 1.72B | 4096 |
---
## Limitations and Biases
- **Domain scope:** Trained exclusively on Vietnamese legal texts; may not generalize to other legal systems or jurisdictions.
- **Context length:** Maximum 256 tokens, which may limit performance on long legal passages.
- **Temporal bias:** Legal corpus reflects Vietnamese law as of the collection date; model outputs may not reflect recent legislative changes.
- **Inherited biases:** May reflect biases present in the source legal corpora, including regional variations in legal practice and domain coverage imbalances.
- **Not a legal authority:** Model outputs should never be used as definitive legal interpretations without expert validation.
---
## Intended Use
**Intended for:**
- Vietnamese legal information retrieval and document ranking
- Legal question answering (True/False, Multiple Choice, Extractive)
- Natural language inference on Vietnamese legal text
- Feature extraction and semantic similarity for Vietnamese legal documents
- Research on Vietnamese legal NLP
**Not intended for:**
- Replacing professional legal counsel or human judgment in legal decision-making
- Providing legal advice without expert validation
- Legal systems outside Vietnam without careful domain adaptation
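For the feature-extraction and semantic-similarity use case above, one common recipe (a sketch of a standard approach, not necessarily the paper's setup) is masked mean pooling over the encoder's last hidden states, followed by cosine similarity. Inputs are assumed to be word-segmented already, as described in "How to Use":

```python
import torch
from transformers import AutoModel, AutoTokenizer

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1)
    summed = (last_hidden * mask).sum(dim=1)      # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # (batch, 1)
    return summed / counts

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")

# Already word-segmented example phrases (illustrative).
sentences = ["Hợp_đồng lao_động", "Bộ_luật dân_sự"]
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=256, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state
embs = mean_pool(hidden, batch["attention_mask"])
sim = torch.nn.functional.cosine_similarity(embs[0], embs[1], dim=0)
print(f"cosine similarity: {sim.item():.4f}")
```

Mean pooling with the attention mask avoids letting padding tokens dilute the sentence vector; using the `[CLS]`-position embedding instead is an equally common alternative.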
---
## Citation
If you use ViLegalBERT, please cite both our paper and the original PhoBERT paper:
```bibtex
<!-- ViLegalLM citation — available soon -->
```
```bibtex
@inproceedings{phobert,
title = {{PhoBERT: Pre-trained language models for Vietnamese}},
author = {Dat Quoc Nguyen and Anh Tuan Nguyen},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year = {2020},
pages = {1037--1042}
}
```
---
## License
[AGPL-3.0](https://www.gnu.org/licenses/agpl-3.0.html)
This model is derived from [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2), which is licensed under AGPL-3.0. As a derivative work, ViLegalBERT is likewise released under AGPL-3.0. Any downstream use, modification, or deployment as a network service must comply with the AGPL-3.0 terms, including making the corresponding source code available to users.