---
license: agpl-3.0
datasets:
- ntphuc149/ViLegalTexts
language:
- vi
pipeline_tag: fill-mask
library_name: transformers
tags:
- vilegallm
- legal
- vietnamese
- phobert
- continual-pretraining
- legal-nlp
---
# ViLegalBERT
ViLegalBERT is an encoder-only language model for **Vietnamese legal text understanding**, part of the **ViLegalLM** suite. It is continually pretrained from [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2) on a newly curated 16GB Vietnamese legal corpus. ViLegalBERT achieves state-of-the-art results across Vietnamese legal downstream tasks including Information Retrieval, Question Answering, and Natural Language Inference.
**Paper**: _ViLegalLM: Language Models for Vietnamese Legal Text_ — [ACL 2026](https://aclanthology.org/)

**Resources**: [GitHub](https://github.com/ntphuc149/ViLegalLM) | [ViLegalQwen2.5-1.5B-Base](https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base) | [ViLegalQwen3-1.7B-Base](https://huggingface.co/ntphuc149/ViLegalQwen3-1.7B-Base)
---
## How to Use
ViLegalBERT shares the same tokenizer as [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2). Load the model weights from this repository and the tokenizer from PhoBERT:
```python
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
```
> **Note:** Input text should be word-segmented before tokenizing. Also, PhoBERT's tokenizer has a specific installation process that differs from standard Hugging Face models — please refer to the [PhoBERT-base-v2 model page](https://huggingface.co/vinai/phobert-base-v2) for installation and usage instructions.
---
## Model Summary
<details>
<summary><b>Summary for ViLegalBERT checkpoint</b> (click to expand)</summary>

| Attribute | Value |
| ------------------- | --------------------------------------------------------------- |
| Architecture | RoBERTa (encoder-only) |
| Parameters | 135M |
| Base model | [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2) |
| Max sequence length | 256 tokens |
| Tokenizer | PhoBERT tokenizer (PyVi word segmentation) |
| Training objective | Masked Language Modeling (MLM, 15% mask) |
| Training domain | Vietnamese legal text |
</details>
---
## Evaluation Results
ViLegalBERT achieves state-of-the-art results among encoder-only models across all evaluated Vietnamese legal benchmarks. **Bold** = best in encoder group.
<details>
<summary><b>Information Retrieval</b> (click to expand)</summary>

| Model | nDCG@10 | MRR@10 | MAP@100 |
| --------------- | ---------- | ---------- | ---------- |
| _**ALQAC-IR**_ | | | |
| PhoBERT-base | 0.5094 | 0.4613 | 0.4720 |
| PhoBERT-base-v2 | 0.6279 | 0.5710 | 0.5779 |
| VNLawBERT | 0.6581 | 0.6053 | 0.6081 |
| **ViLegalBERT** | **0.6786** | **0.6248** | **0.6304** |
| _**ZALO-IR**_ | | | |
| PhoBERT-base | 0.5533 | 0.5165 | 0.5188 |
| PhoBERT-base-v2 | 0.5936 | 0.5541 | 0.5597 |
| VNLawBERT | 0.6020 | 0.5550 | 0.5609 |
| **ViLegalBERT** | **0.6300** | **0.5878** | **0.5912** |
</details>
<details>
<summary><b>Question Answering</b> (click to expand)</summary>

**True/False — ALQAC-TF**

| Model | Pre | Rec | F1 |
| --------------------------- | --------- | --------- | --------- |
| PhoBERT-base | 67.26 | 77.16 | 71.87 |
| PhoBERT-base-v2 | **76.88** | 67.51 | 71.89 |
| VNLawBERT | 72.91 | 75.13 | 74.00 |
| **ViLegalBERT** | 70.97 | **78.17** | **74.40** |
**Multiple Choice — ALQAC-MCQ & VLSP-MCQ-LK**

| Model | Pre_mac | Rec_mac | F1_mac |
| ----------------- | --------- | --------- | --------- |
| _**ALQAC-MCQ**_ | | | |
| PhoBERT-base | 62.03 | 62.77 | 62.05 |
| PhoBERT-base-v2 | **62.96** | 63.70 | 58.39 |
| VNLawBERT | 61.00 | 61.12 | 60.80 |
| **ViLegalBERT** | 62.76 | **63.71** | **62.83** |
| _**VLSP-MCQ-LK**_ | | | |
| PhoBERT-base | 55.75 | 53.18 | 53.34 |
| PhoBERT-base-v2 | 57.67 | 54.46 | 55.70 |
| VNLawBERT | 56.87 | 52.35 | 54.19 |
| **ViLegalBERT** | **58.05** | **56.30** | **56.53** |
**Extractive QA — ALQAC-EQA & ViBidLQA-EQA**

| Model              | EM        | F1        |
| ------------------ | --------- | --------- |
| _**ALQAC-EQA**_ | | |
| PhoBERT-base | 37.25 | 64.46 |
| PhoBERT-base-v2 | 40.58 | **68.01** |
| VNLawBERT | 38.82 | 65.85 |
| **ViLegalBERT** | **41.17** | 65.92 |
| _**ViBidLQA-EQA**_ | | |
| PhoBERT-base | 49.34 | 77.45 |
| PhoBERT-base-v2 | **50.66** | 78.59 |
| VNLawBERT | 48.49 | 76.32 |
| **ViLegalBERT** | 50.19 | **78.63** |
</details>
<details>
<summary><b>Natural Language Inference</b> (click to expand)</summary>

**NLI — VLSP-NLI**

| Model | Precision | Recall | F1 |
| --------------- | --------- | --------- | --------- |
| PhoBERT-base | 59.26 | 21.33 | 31.37 |
| PhoBERT-base-v2 | 55.26 | 28.00 | 37.17 |
| VNLawBERT | 62.50 | 13.33 | 21.98 |
| **ViLegalBERT** | **72.28** | **97.33** | **82.95** |
</details>
---
## Also in ViLegalLM
| Model | Architecture | Params | Context |
| ------------------------------------------------------ | ------------ | ------ | ------- |
| ViLegalBERT (this model) | Encoder-only | 135M | 256 |
| [ViLegalQwen2.5-1.5B-Base](https://huggingface.co/ntphuc149/ViLegalQwen2.5-1.5B-Base) | Decoder-only | 1.54B | 2048 |
| [ViLegalQwen3-1.7B-Base](https://huggingface.co/ntphuc149/ViLegalQwen3-1.7B-Base) | Decoder-only | 1.72B | 4096 |
---
## Limitations and Biases
- **Domain scope:** Trained exclusively on Vietnamese legal texts; may not generalize to other legal systems or jurisdictions.
- **Context length:** Maximum 256 tokens, which may limit performance on long legal passages.
- **Temporal bias:** Legal corpus reflects Vietnamese law as of the collection date; model outputs may not reflect recent legislative changes.
- **Inherited biases:** May reflect biases present in the source legal corpora, including regional variations in legal practice and domain coverage imbalances.
- **Not a legal authority:** Model outputs should never be used as definitive legal interpretations without expert validation.
---
## Intended Use
**Intended for:**
- Vietnamese legal information retrieval and document ranking
- Legal question answering (True/False, Multiple Choice, Extractive)
- Natural language inference on Vietnamese legal text
- Feature extraction and semantic similarity for Vietnamese legal documents
- Research on Vietnamese legal NLP
**Not intended for:**
- Replacing professional legal counsel or human judgment in legal decision-making
- Providing legal advice without expert validation
- Legal systems outside Vietnam without careful domain adaptation
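For the feature-extraction and semantic-similarity use case above, one common recipe (a sketch of a standard approach, not necessarily the paper's setup) is masked mean pooling over the encoder's last hidden states, followed by cosine similarity. Inputs are assumed to be word-segmented already, as described in "How to Use":

```python
import torch
from transformers import AutoModel, AutoTokenizer

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq, 1)
    summed = (last_hidden * mask).sum(dim=1)      # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)      # (batch, 1)
    return summed / counts

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")

# Already word-segmented example phrases (illustrative).
sentences = ["Hợp_đồng lao_động", "Bộ_luật dân_sự"]
batch = tokenizer(sentences, padding=True, truncation=True,
                  max_length=256, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state
embs = mean_pool(hidden, batch["attention_mask"])
sim = torch.nn.functional.cosine_similarity(embs[0], embs[1], dim=0)
print(f"cosine similarity: {sim.item():.4f}")
```

Mean pooling with the attention mask avoids letting padding tokens dilute the sentence vector; using the `[CLS]`-position embedding instead is an equally common alternative.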
---
## Citation
If you use ViLegalBERT, please cite both our paper and the original PhoBERT paper:
```bibtex
<!-- ViLegalLM citation — available soon -->
```
```bibtex
@inproceedings{phobert,
title = {{PhoBERT: Pre-trained language models for Vietnamese}},
author = {Dat Quoc Nguyen and Anh Tuan Nguyen},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year = {2020},
pages = {1037--1042}
}
```
---
## License
[AGPL-3.0](https://www.gnu.org/licenses/agpl-3.0.html)
This model is derived from [PhoBERT-base-v2](https://huggingface.co/vinai/phobert-base-v2), which is licensed under AGPL-3.0. As a derivative work, ViLegalBERT is likewise released under AGPL-3.0. Any downstream use, modification, or deployment as a network service must comply with the AGPL-3.0 terms, including making the corresponding source code available to users.