
ViLegalBERT

ViLegalBERT is an encoder-only language model for Vietnamese legal text understanding, part of the ViLegalLM suite. It is continually pretrained from PhoBERT-base-v2 on a newly curated 16GB Vietnamese legal corpus. ViLegalBERT achieves state-of-the-art results among encoder-only models on Vietnamese legal downstream tasks, including Information Retrieval, Question Answering, and Natural Language Inference.

Paper: ViLegalLM: Language Models for Vietnamese Legal Text (ACL 2026)

Resources: GitHub | ViLegalQwen2.5-1.5B-Base | ViLegalQwen3-1.7B-Base


How to Use

ViLegalBERT shares the same tokenizer as PhoBERT-base-v2. Load the model weights from this repository and the tokenizer from PhoBERT:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")

Note: Input text must be word-segmented (e.g. with PyVi) before tokenization. In addition, PhoBERT's tokenizer has installation requirements that differ from standard Hugging Face models; please refer to the PhoBERT-base-v2 model page for installation and usage instructions.


Model Summary

| Attribute | Value |
|---|---|
| Architecture | RoBERTa (encoder-only) |
| Parameters | 135M |
| Base model | PhoBERT-base-v2 |
| Max sequence length | 256 tokens |
| Tokenizer | PhoBERT tokenizer (PyVi word segmentation) |
| Training objective | Masked Language Modeling (MLM, 15% masking) |
| Training domain | Vietnamese legal text |
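The MLM training objective above can be probed directly with a fill-mask pipeline. A minimal sketch, assuming the published checkpoint still includes the pretraining MLM head (if it does not, the head is freshly initialized and predictions are meaningless); the sentence is illustrative and already word-segmented:

```python
# Minimal sketch: probe the masked-language-modeling objective with fill-mask.
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
fill = pipeline("fill-mask", model="ntphuc149/ViLegalBERT", tokenizer=tokenizer)

# Illustrative, pre-segmented sentence; the tokenizer supplies the mask token.
masked = f"Người lao_động có quyền chấm_dứt hợp_đồng {tokenizer.mask_token}"
preds = fill(masked)  # list of candidate fills, highest score first
for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))
```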

Evaluation Results

ViLegalBERT delivers the strongest overall results among encoder-only models on the evaluated Vietnamese legal benchmarks, leading on most metrics. Bold marks the best score in each column within the encoder group.

Information Retrieval

ALQAC-IR

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| PhoBERT-base | 0.5094 | 0.4613 | 0.4720 |
| PhoBERT-base-v2 | 0.6279 | 0.5710 | 0.5779 |
| VNLawBERT | 0.6581 | 0.6053 | 0.6081 |
| ViLegalBERT | **0.6786** | **0.6248** | **0.6304** |

ZALO-IR

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| PhoBERT-base | 0.5533 | 0.5165 | 0.5188 |
| PhoBERT-base-v2 | 0.5936 | 0.5541 | 0.5597 |
| VNLawBERT | 0.6020 | 0.5550 | 0.5609 |
| ViLegalBERT | **0.6300** | **0.5878** | **0.5912** |
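As a sketch of how the model can be applied to retrieval, the snippet below scores query-article similarity with mean-pooled embeddings and cosine similarity. The pooling and scoring choices are assumptions for illustration, not the paper's exact evaluation setup, and the article texts are illustrative pre-segmented placeholders:

```python
# Sketch: rank legal articles against a query with mean pooling + cosine similarity.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT").eval()

def encode(texts):
    """Mean-pool token embeddings over non-padding positions."""
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=256)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)     # (B, H)

# Illustrative, pre-segmented query and candidate article snippets.
query = encode(["quyền nghỉ hằng năm của người lao_động"])
docs = encode(["Người lao_động có quyền nghỉ hằng năm",
               "Phương_tiện giao_thông phải được đăng_ký"])
scores = F.cosine_similarity(query, docs)  # one score per article; higher = more relevant
```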
Question Answering

True/False (ALQAC-TF)

| Model | Pre | Rec | F1 |
|---|---|---|---|
| PhoBERT-base | 67.26 | 77.16 | 71.87 |
| PhoBERT-base-v2 | **76.88** | 67.51 | 71.89 |
| VNLawBERT | 72.91 | 75.13 | 74.00 |
| ViLegalBERT | 70.97 | **78.17** | **74.40** |

Multiple Choice (ALQAC-MCQ and VLSP-MCQ-LK)

ALQAC-MCQ

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| PhoBERT-base | 62.03 | 62.77 | 62.05 |
| PhoBERT-base-v2 | **62.96** | 63.70 | 58.39 |
| VNLawBERT | 61.00 | 61.12 | 60.80 |
| ViLegalBERT | 62.76 | **63.71** | **62.83** |

VLSP-MCQ-LK

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| PhoBERT-base | 55.75 | 53.18 | 53.34 |
| PhoBERT-base-v2 | 57.67 | 54.46 | 55.70 |
| VNLawBERT | 56.87 | 52.35 | 54.19 |
| ViLegalBERT | **58.05** | **56.30** | **56.53** |

Extractive QA (ALQAC-EQA and ViBidLQA-EQA)

ALQAC-EQA

| Model | EM | F1 |
|---|---|---|
| PhoBERT-base | 37.25 | 64.46 |
| PhoBERT-base-v2 | 40.58 | **68.01** |
| VNLawBERT | 38.82 | 65.85 |
| ViLegalBERT | **41.17** | 65.92 |

ViBidLQA-EQA

| Model | EM | F1 |
|---|---|---|
| PhoBERT-base | 49.34 | 77.45 |
| PhoBERT-base-v2 | **50.66** | 78.59 |
| VNLawBERT | 48.49 | 76.32 |
| ViLegalBERT | 50.19 | **78.63** |
Natural Language Inference

NLI (VLSP-NLI)

| Model | Precision | Recall | F1 |
|---|---|---|---|
| PhoBERT-base | 59.26 | 21.33 | 31.37 |
| PhoBERT-base-v2 | 55.26 | 28.00 | 37.17 |
| VNLawBERT | 62.50 | 13.33 | 21.98 |
| ViLegalBERT | **72.28** | **97.33** | **82.95** |
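For tasks like VLSP-NLI, a classification head can be attached to the encoder. A minimal sketch; the label count and the premise/hypothesis pair are illustrative assumptions, and the classification head is randomly initialized until fine-tuned:

```python
# Sketch: set up an NLI-style sentence-pair classifier on ViLegalBERT.
# The encoder weights load from the hub; the head must be fine-tuned before use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "ntphuc149/ViLegalBERT", num_labels=2  # e.g. entailed / not entailed
)

# Illustrative, pre-segmented premise and hypothesis.
premise = "Hợp_đồng lao_động phải được lập thành văn_bản"
hypothesis = "Hợp_đồng lao_động có_thể lập bằng văn_bản"
inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                   truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits  # untrained head: scores are not yet meaningful
```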

Also in ViLegalLM

| Model | Architecture | Params | Context |
|---|---|---|---|
| ViLegalBERT (this model) | Encoder-only | 135M | 256 |
| ViLegalQwen2.5-1.5B-Base | Decoder-only | 1.54B | 2048 |
| ViLegalQwen3-1.7B-Base | Decoder-only | 1.72B | 4096 |

Limitations and Biases

  • Domain scope: Trained exclusively on Vietnamese legal texts; may not generalize to other legal systems or jurisdictions.
  • Context length: Maximum 256 tokens, which may limit performance on long legal passages.
  • Temporal bias: Legal corpus reflects Vietnamese law as of the collection date; model outputs may not reflect recent legislative changes.
  • Inherited biases: May reflect biases present in the source legal corpora, including regional variations in legal practice and domain coverage imbalances.
  • Not a legal authority: Model outputs should never be used as definitive legal interpretations without expert validation.

Intended Use

Intended for:

  • Vietnamese legal information retrieval and document ranking
  • Legal question answering (True/False, Multiple Choice, Extractive)
  • Natural language inference on Vietnamese legal text
  • Feature extraction and semantic similarity for Vietnamese legal documents
  • Research on Vietnamese legal NLP

Not intended for:

  • Replacing professional legal counsel or human judgment in legal decision-making
  • Providing legal advice without expert validation
  • Legal systems outside Vietnam without careful domain adaptation

Citation

If you use ViLegalBERT, please cite both our paper and the original PhoBERT paper:

<!-- ViLegalLM citation: available soon -->
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}

License

AGPL-3.0

This model is derived from PhoBERT-base-v2 which is licensed under AGPL-3.0. As a derivative work, ViLegalBERT is also released under AGPL-3.0. Any downstream use, modification, or deployment as a service must comply with the AGPL-3.0 terms, including making source code publicly available.
