---
license: agpl-3.0
datasets:
  - ntphuc149/ViLegalTexts
language:
  - vi
pipeline_tag: fill-mask
library_name: transformers
tags:
  - vilegallm
  - legal
  - vietnamese
  - phobert
  - continual-pretraining
  - legal-nlp
---

# ViLegalBERT

ViLegalBERT is an encoder-only language model for Vietnamese legal text understanding, part of the ViLegalLM suite. It is continually pretrained from PhoBERT-base-v2 on a newly curated 16GB Vietnamese legal corpus. ViLegalBERT achieves state-of-the-art results across Vietnamese legal downstream tasks including Information Retrieval, Question Answering, and Natural Language Inference.

**Paper:** ViLegalLM: Language Models for Vietnamese Legal Text — ACL 2026

**Resources:** GitHub | ViLegalQwen2.5-1.5B-Base | ViLegalQwen3-1.7B-Base


## How to Use

ViLegalBERT shares the same tokenizer as PhoBERT-base-v2. Load the model weights from this repository and the tokenizer from PhoBERT:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
```

Note: Input text should be word-segmented before tokenizing. Also, PhoBERT's tokenizer has a specific installation process that differs from standard Hugging Face models — please refer to the PhoBERT-base-v2 model page for installation and usage instructions.


## Model Summary

| Attribute | Value |
|---|---|
| Architecture | RoBERTa (encoder-only) |
| Parameters | 135M |
| Base model | PhoBERT-base-v2 |
| Max sequence length | 256 tokens |
| Tokenizer | PhoBERT tokenizer (PyVi word segmentation) |
| Training objective | Masked Language Modeling (MLM, 15% mask) |
| Training domain | Vietnamese legal text |

## Evaluation Results

ViLegalBERT achieves state-of-the-art results among encoder-only models across all evaluated Vietnamese legal benchmarks. Bold = best in encoder group.

### Information Retrieval

**ALQAC-IR**

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| PhoBERT-base | 0.5094 | 0.4613 | 0.4720 |
| PhoBERT-base-v2 | 0.6279 | 0.5710 | 0.5779 |
| VNLawBERT | 0.6581 | 0.6053 | 0.6081 |
| ViLegalBERT | **0.6786** | **0.6248** | **0.6304** |

**ZALO-IR**

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| PhoBERT-base | 0.5533 | 0.5165 | 0.5188 |
| PhoBERT-base-v2 | 0.5936 | 0.5541 | 0.5597 |
| VNLawBERT | 0.6020 | 0.5550 | 0.5609 |
| ViLegalBERT | **0.6300** | **0.5878** | **0.5912** |
### Question Answering

**True/False — ALQAC-TF**

| Model | Pre | Rec | F1 |
|---|---|---|---|
| PhoBERT-base | 67.26 | 77.16 | 71.87 |
| PhoBERT-base-v2 | **76.88** | 67.51 | 71.89 |
| VNLawBERT | 72.91 | 75.13 | 74.00 |
| ViLegalBERT | 70.97 | **78.17** | **74.40** |

**Multiple Choice — ALQAC-MCQ**

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| PhoBERT-base | 62.03 | 62.77 | 62.05 |
| PhoBERT-base-v2 | **62.96** | 63.70 | 58.39 |
| VNLawBERT | 61.00 | 61.12 | 60.80 |
| ViLegalBERT | 62.76 | **63.71** | **62.83** |

**Multiple Choice — VLSP-MCQ-LK**

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| PhoBERT-base | 55.75 | 53.18 | 53.34 |
| PhoBERT-base-v2 | 57.67 | 54.46 | 55.70 |
| VNLawBERT | 56.87 | 52.35 | 54.19 |
| ViLegalBERT | **58.05** | **56.30** | **56.53** |

**Extractive QA — ALQAC-EQA**

| Model | EM | F1 |
|---|---|---|
| PhoBERT-base | 37.25 | 64.46 |
| PhoBERT-base-v2 | 40.58 | **68.01** |
| VNLawBERT | 38.82 | 65.85 |
| ViLegalBERT | **41.17** | 65.92 |

**Extractive QA — ViBidLQA-EQA**

| Model | EM | F1 |
|---|---|---|
| PhoBERT-base | 49.34 | 77.45 |
| PhoBERT-base-v2 | **50.66** | 78.59 |
| VNLawBERT | 48.49 | 76.32 |
| ViLegalBERT | 50.19 | **78.63** |
### Natural Language Inference

**VLSP-NLI**

| Model | Precision | Recall | F1 |
|---|---|---|---|
| PhoBERT-base | 59.26 | 21.33 | 31.37 |
| PhoBERT-base-v2 | 55.26 | 28.00 | 37.17 |
| VNLawBERT | 62.50 | 13.33 | 21.98 |
| ViLegalBERT | **72.28** | **97.33** | **82.95** |
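As a quick sanity check on the metric definitions, F1 is the harmonic mean of precision and recall; recomputing the ViLegalBERT NLI row from its reported precision and recall reproduces the table value up to rounding:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

# ViLegalBERT on VLSP-NLI: precision 72.28, recall 97.33 (from the table above)
print(round(f1_score(72.28, 97.33), 2))  # ~82.96 vs. 82.95 reported (rounding)
```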

## Also in ViLegalLM

| Model | Architecture | Params | Context (tokens) |
|---|---|---|---|
| ViLegalBERT (this model) | Encoder-only | 135M | 256 |
| ViLegalQwen2.5-1.5B-Base | Decoder-only | 1.54B | 2048 |
| ViLegalQwen3-1.7B-Base | Decoder-only | 1.72B | 4096 |

## Limitations and Biases

- **Domain scope:** Trained exclusively on Vietnamese legal texts; may not generalize to other legal systems or jurisdictions.
- **Context length:** Maximum 256 tokens, which may limit performance on long legal passages.
- **Temporal bias:** The legal corpus reflects Vietnamese law as of the collection date; model outputs may not reflect recent legislative changes.
- **Inherited biases:** May reflect biases present in the source legal corpora, including regional variations in legal practice and domain coverage imbalances.
- **Not a legal authority:** Model outputs should never be used as definitive legal interpretations without expert validation.

## Intended Use

Intended for:

- Vietnamese legal information retrieval and document ranking
- Legal question answering (True/False, Multiple Choice, Extractive)
- Natural language inference on Vietnamese legal text
- Feature extraction and semantic similarity for Vietnamese legal documents
- Research on Vietnamese legal NLP

Not intended for:

- Replacing professional legal counsel or human judgment in legal decision-making
- Providing legal advice without expert validation
- Legal systems outside Vietnam without careful domain adaptation
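For the feature-extraction and semantic-similarity use case, a minimal sketch of mean pooling plus cosine similarity (assuming `torch` is installed; `mean_pool` and `cosine` are illustrative helper names, the 256-token limit follows the Model Summary, and the model/tokenizer are loaded as in "How to Use"):

```python
import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two embedding vectors."""
    return torch.nn.functional.cosine_similarity(a, b, dim=-1)

# Usage with the tokenizer/model from "How to Use" (inputs word-segmented first):
# inputs = tokenizer([doc_a, doc_b], padding=True, truncation=True,
#                    max_length=256, return_tensors="pt")
# with torch.no_grad():
#     hidden = model(**inputs).last_hidden_state
# emb = mean_pool(hidden, inputs["attention_mask"])
# print(cosine(emb[0], emb[1]).item())
```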

## Citation

If you use ViLegalBERT, please cite both our paper and the original PhoBERT paper:

<!-- ViLegalLM citation — available soon -->

```bibtex
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}
```

## License

AGPL-3.0

This model is derived from PhoBERT-base-v2, which is licensed under AGPL-3.0. As a derivative work, ViLegalBERT is also released under AGPL-3.0. Any downstream use, modification, or deployment as a network service must comply with the AGPL-3.0 terms, including making the corresponding source code publicly available.