
ViLegalBERT

ViLegalBERT is an encoder-only language model for Vietnamese legal text understanding, part of the ViLegalLM suite. It is continually pretrained from PhoBERT-base-v2 on a newly curated 16GB Vietnamese legal corpus. ViLegalBERT achieves state-of-the-art results among encoder-only models on Vietnamese legal downstream tasks, including Information Retrieval, Question Answering, and Natural Language Inference.

Paper: ViLegalLM: Language Models for Vietnamese Legal Text (ACL 2026)

Resources: GitHub | ViLegalQwen2.5-1.5B-Base | ViLegalQwen3-1.7B-Base


How to Use

ViLegalBERT shares the same tokenizer as PhoBERT-base-v2. Load the model weights from this repository and the tokenizer from PhoBERT:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")

Note: Input text must be word-segmented (e.g. with PyVi) before tokenization. In addition, PhoBERT's tokenizer has installation requirements that differ from standard Hugging Face models; please refer to the PhoBERT-base-v2 model page for installation and usage instructions.


Model Summary

| Attribute | Value |
|---|---|
| Architecture | RoBERTa (encoder-only) |
| Parameters | 135M |
| Base model | PhoBERT-base-v2 |
| Max sequence length | 256 tokens |
| Tokenizer | PhoBERT tokenizer (PyVi word segmentation) |
| Training objective | Masked Language Modeling (MLM, 15% masking) |
| Training domain | Vietnamese legal text |
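The MLM training objective above can be probed directly with a fill-mask pipeline. A minimal sketch, assuming the published checkpoint still includes the pretraining MLM head (if it does not, the head is freshly initialized and predictions are meaningless); the sentence is illustrative and already word-segmented:

```python
# Minimal sketch: probe the masked-language-modeling objective with fill-mask.
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
fill = pipeline("fill-mask", model="ntphuc149/ViLegalBERT", tokenizer=tokenizer)

# Illustrative, pre-segmented sentence; the tokenizer supplies the mask token.
masked = f"Người lao_động có quyền chấm_dứt hợp_đồng {tokenizer.mask_token}"
preds = fill(masked)  # list of candidate fills, highest score first
for p in preds[:3]:
    print(p["token_str"], round(p["score"], 3))
```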

Evaluation Results

ViLegalBERT delivers the strongest overall results among encoder-only models on the evaluated Vietnamese legal benchmarks, leading on most metrics. Bold marks the best score in each column within the encoder group.

Information Retrieval

ALQAC-IR

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| PhoBERT-base | 0.5094 | 0.4613 | 0.4720 |
| PhoBERT-base-v2 | 0.6279 | 0.5710 | 0.5779 |
| VNLawBERT | 0.6581 | 0.6053 | 0.6081 |
| ViLegalBERT | **0.6786** | **0.6248** | **0.6304** |

ZALO-IR

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| PhoBERT-base | 0.5533 | 0.5165 | 0.5188 |
| PhoBERT-base-v2 | 0.5936 | 0.5541 | 0.5597 |
| VNLawBERT | 0.6020 | 0.5550 | 0.5609 |
| ViLegalBERT | **0.6300** | **0.5878** | **0.5912** |
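As a sketch of how the model can be applied to retrieval, the snippet below scores query-article similarity with mean-pooled embeddings and cosine similarity. The pooling and scoring choices are assumptions for illustration, not the paper's exact evaluation setup, and the article texts are illustrative pre-segmented placeholders:

```python
# Sketch: rank legal articles against a query with mean pooling + cosine similarity.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT").eval()

def encode(texts):
    """Mean-pool token embeddings over non-padding positions."""
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=256)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)     # (B, H)

# Illustrative, pre-segmented query and candidate article snippets.
query = encode(["quyền nghỉ hằng năm của người lao_động"])
docs = encode(["Người lao_động có quyền nghỉ hằng năm",
               "Phương_tiện giao_thông phải được đăng_ký"])
scores = F.cosine_similarity(query, docs)  # one score per article; higher = more relevant
```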
Question Answering

True/False (ALQAC-TF)

| Model | Pre | Rec | F1 |
|---|---|---|---|
| PhoBERT-base | 67.26 | 77.16 | 71.87 |
| PhoBERT-base-v2 | **76.88** | 67.51 | 71.89 |
| VNLawBERT | 72.91 | 75.13 | 74.00 |
| ViLegalBERT | 70.97 | **78.17** | **74.40** |

Multiple Choice (ALQAC-MCQ and VLSP-MCQ-LK)

ALQAC-MCQ

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| PhoBERT-base | 62.03 | 62.77 | 62.05 |
| PhoBERT-base-v2 | **62.96** | 63.70 | 58.39 |
| VNLawBERT | 61.00 | 61.12 | 60.80 |
| ViLegalBERT | 62.76 | **63.71** | **62.83** |

VLSP-MCQ-LK

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| PhoBERT-base | 55.75 | 53.18 | 53.34 |
| PhoBERT-base-v2 | 57.67 | 54.46 | 55.70 |
| VNLawBERT | 56.87 | 52.35 | 54.19 |
| ViLegalBERT | **58.05** | **56.30** | **56.53** |

Extractive QA (ALQAC-EQA and ViBidLQA-EQA)

ALQAC-EQA

| Model | EM | F1 |
|---|---|---|
| PhoBERT-base | 37.25 | 64.46 |
| PhoBERT-base-v2 | 40.58 | **68.01** |
| VNLawBERT | 38.82 | 65.85 |
| ViLegalBERT | **41.17** | 65.92 |

ViBidLQA-EQA

| Model | EM | F1 |
|---|---|---|
| PhoBERT-base | 49.34 | 77.45 |
| PhoBERT-base-v2 | **50.66** | 78.59 |
| VNLawBERT | 48.49 | 76.32 |
| ViLegalBERT | 50.19 | **78.63** |
Natural Language Inference

NLI (VLSP-NLI)

| Model | Precision | Recall | F1 |
|---|---|---|---|
| PhoBERT-base | 59.26 | 21.33 | 31.37 |
| PhoBERT-base-v2 | 55.26 | 28.00 | 37.17 |
| VNLawBERT | 62.50 | 13.33 | 21.98 |
| ViLegalBERT | **72.28** | **97.33** | **82.95** |
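For tasks like VLSP-NLI, a classification head can be attached to the encoder. A minimal sketch; the label count and the premise/hypothesis pair are illustrative assumptions, and the classification head is randomly initialized until fine-tuned:

```python
# Sketch: set up an NLI-style sentence-pair classifier on ViLegalBERT.
# The encoder weights load from the hub; the head must be fine-tuned before use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "ntphuc149/ViLegalBERT", num_labels=2  # e.g. entailed / not entailed
)

# Illustrative, pre-segmented premise and hypothesis.
premise = "Hợp_đồng lao_động phải được lập thành văn_bản"
hypothesis = "Hợp_đồng lao_động có_thể lập bằng văn_bản"
inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                   truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits  # untrained head: scores are not yet meaningful
```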

Also in ViLegalLM

| Model | Architecture | Params | Context |
|---|---|---|---|
| ViLegalBERT (this model) | Encoder-only | 135M | 256 |
| ViLegalQwen2.5-1.5B-Base | Decoder-only | 1.54B | 2048 |
| ViLegalQwen3-1.7B-Base | Decoder-only | 1.72B | 4096 |

Limitations and Biases

  • Domain scope: Trained exclusively on Vietnamese legal texts; may not generalize to other legal systems or jurisdictions.
  • Context length: Maximum 256 tokens, which may limit performance on long legal passages.
  • Temporal bias: Legal corpus reflects Vietnamese law as of the collection date; model outputs may not reflect recent legislative changes.
  • Inherited biases: May reflect biases present in the source legal corpora, including regional variations in legal practice and domain coverage imbalances.
  • Not a legal authority: Model outputs should never be used as definitive legal interpretations without expert validation.

Intended Use

Intended for:

  • Vietnamese legal information retrieval and document ranking
  • Legal question answering (True/False, Multiple Choice, Extractive)
  • Natural language inference on Vietnamese legal text
  • Feature extraction and semantic similarity for Vietnamese legal documents
  • Research on Vietnamese legal NLP

Not intended for:

  • Replacing professional legal counsel or human judgment in legal decision-making
  • Providing legal advice without expert validation
  • Legal systems outside Vietnam without careful domain adaptation

Citation

If you use ViLegalBERT, please cite both our paper and the original PhoBERT paper:

<!-- ViLegalLM citation: available soon -->
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}

License

AGPL-3.0

This model is derived from PhoBERT-base-v2 which is licensed under AGPL-3.0. As a derivative work, ViLegalBERT is also released under AGPL-3.0. Any downstream use, modification, or deployment as a service must comply with the AGPL-3.0 terms, including making source code publicly available.
