---
license: agpl-3.0
datasets:
  - ntphuc149/ViLegalTexts
language:
  - vi
pipeline_tag: fill-mask
library_name: transformers
tags:
  - vilegallm
  - legal
  - vietnamese
  - phobert
  - continual-pretraining
  - legal-nlp
---

# ViLegalBERT

ViLegalBERT is an encoder-only language model for Vietnamese legal text understanding, part of the ViLegalLM suite. It is continually pretrained from PhoBERT-base-v2 on a newly curated 16GB Vietnamese legal corpus. ViLegalBERT achieves state-of-the-art results across Vietnamese legal downstream tasks including Information Retrieval, Question Answering, and Natural Language Inference.

**Paper:** ViLegalLM: Language Models for Vietnamese Legal Text — ACL 2026

**Resources:** GitHub | ViLegalQwen2.5-1.5B-Base | ViLegalQwen3-1.7B-Base


## How to Use

ViLegalBERT shares the same tokenizer as PhoBERT-base-v2. Load the model weights from this repository and the tokenizer from PhoBERT:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")
model = AutoModel.from_pretrained("ntphuc149/ViLegalBERT")
```

Note: Input text should be word-segmented before tokenizing. Also, PhoBERT's tokenizer has a specific installation process that differs from standard Hugging Face models — please refer to the PhoBERT-base-v2 model page for installation and usage instructions.


## Model Summary

| Attribute | Value |
|---|---|
| Architecture | RoBERTa (encoder-only) |
| Parameters | 135M |
| Base model | PhoBERT-base-v2 |
| Max sequence length | 256 tokens |
| Tokenizer | PhoBERT tokenizer (PyVi word segmentation) |
| Training objective | Masked Language Modeling (MLM, 15% mask) |
| Training domain | Vietnamese legal text |

## Evaluation Results

ViLegalBERT achieves state-of-the-art results among encoder-only models across all evaluated Vietnamese legal benchmarks. Bold = best in encoder group.

### Information Retrieval

**ALQAC-IR**

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| PhoBERT-base | 0.5094 | 0.4613 | 0.4720 |
| PhoBERT-base-v2 | 0.6279 | 0.5710 | 0.5779 |
| VNLawBERT | 0.6581 | 0.6053 | 0.6081 |
| ViLegalBERT | **0.6786** | **0.6248** | **0.6304** |

**ZALO-IR**

| Model | nDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| PhoBERT-base | 0.5533 | 0.5165 | 0.5188 |
| PhoBERT-base-v2 | 0.5936 | 0.5541 | 0.5597 |
| VNLawBERT | 0.6020 | 0.5550 | 0.5609 |
| ViLegalBERT | **0.6300** | **0.5878** | **0.5912** |
### Question Answering

**True/False — ALQAC-TF**

| Model | Pre | Rec | F1 |
|---|---|---|---|
| PhoBERT-base | 67.26 | 77.16 | 71.87 |
| PhoBERT-base-v2 | **76.88** | 67.51 | 71.89 |
| VNLawBERT | 72.91 | 75.13 | 74.00 |
| ViLegalBERT | 70.97 | **78.17** | **74.40** |

**Multiple Choice — ALQAC-MCQ**

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| PhoBERT-base | 62.03 | 62.77 | 62.05 |
| PhoBERT-base-v2 | **62.96** | 63.70 | 58.39 |
| VNLawBERT | 61.00 | 61.12 | 60.80 |
| ViLegalBERT | 62.76 | **63.71** | **62.83** |

**Multiple Choice — VLSP-MCQ-LK**

| Model | Pre_mac | Rec_mac | F1_mac |
|---|---|---|---|
| PhoBERT-base | 55.75 | 53.18 | 53.34 |
| PhoBERT-base-v2 | 57.67 | 54.46 | 55.70 |
| VNLawBERT | 56.87 | 52.35 | 54.19 |
| ViLegalBERT | **58.05** | **56.30** | **56.53** |

**Extractive QA — ALQAC-EQA**

| Model | EM | F1 |
|---|---|---|
| PhoBERT-base | 37.25 | 64.46 |
| PhoBERT-base-v2 | 40.58 | **68.01** |
| VNLawBERT | 38.82 | 65.85 |
| ViLegalBERT | **41.17** | 65.92 |

**Extractive QA — ViBidLQA-EQA**

| Model | EM | F1 |
|---|---|---|
| PhoBERT-base | 49.34 | 77.45 |
| PhoBERT-base-v2 | **50.66** | 78.59 |
| VNLawBERT | 48.49 | 76.32 |
| ViLegalBERT | 50.19 | **78.63** |
### Natural Language Inference

**VLSP-NLI**

| Model | Precision | Recall | F1 |
|---|---|---|---|
| PhoBERT-base | 59.26 | 21.33 | 31.37 |
| PhoBERT-base-v2 | 55.26 | 28.00 | 37.17 |
| VNLawBERT | 62.50 | 13.33 | 21.98 |
| ViLegalBERT | **72.28** | **97.33** | **82.95** |
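As a quick sanity check on the metric definitions, F1 is the harmonic mean of precision and recall; recomputing the ViLegalBERT NLI row from its reported precision and recall reproduces the table value up to rounding:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

# ViLegalBERT on VLSP-NLI: precision 72.28, recall 97.33 (from the table above)
print(round(f1_score(72.28, 97.33), 2))  # ~82.96 vs. 82.95 reported (rounding)
```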

## Also in ViLegalLM

| Model | Architecture | Params | Context (tokens) |
|---|---|---|---|
| ViLegalBERT (this model) | Encoder-only | 135M | 256 |
| ViLegalQwen2.5-1.5B-Base | Decoder-only | 1.54B | 2048 |
| ViLegalQwen3-1.7B-Base | Decoder-only | 1.72B | 4096 |

## Limitations and Biases

- **Domain scope:** Trained exclusively on Vietnamese legal texts; may not generalize to other legal systems or jurisdictions.
- **Context length:** Maximum 256 tokens, which may limit performance on long legal passages.
- **Temporal bias:** The legal corpus reflects Vietnamese law as of the collection date; model outputs may not reflect recent legislative changes.
- **Inherited biases:** May reflect biases present in the source legal corpora, including regional variations in legal practice and domain coverage imbalances.
- **Not a legal authority:** Model outputs should never be used as definitive legal interpretations without expert validation.

## Intended Use

Intended for:

- Vietnamese legal information retrieval and document ranking
- Legal question answering (True/False, Multiple Choice, Extractive)
- Natural language inference on Vietnamese legal text
- Feature extraction and semantic similarity for Vietnamese legal documents
- Research on Vietnamese legal NLP

Not intended for:

- Replacing professional legal counsel or human judgment in legal decision-making
- Providing legal advice without expert validation
- Legal systems outside Vietnam without careful domain adaptation
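For the feature-extraction and semantic-similarity use case, a minimal sketch of mean pooling plus cosine similarity (assuming `torch` is installed; `mean_pool` and `cosine` are illustrative helper names, the 256-token limit follows the Model Summary, and the model/tokenizer are loaded as in "How to Use"):

```python
import torch

def mean_pool(last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) positions."""
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def cosine(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two embedding vectors."""
    return torch.nn.functional.cosine_similarity(a, b, dim=-1)

# Usage with the tokenizer/model from "How to Use" (inputs word-segmented first):
# inputs = tokenizer([doc_a, doc_b], padding=True, truncation=True,
#                    max_length=256, return_tensors="pt")
# with torch.no_grad():
#     hidden = model(**inputs).last_hidden_state
# emb = mean_pool(hidden, inputs["attention_mask"])
# print(cosine(emb[0], emb[1]).item())
```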

## Citation

If you use ViLegalBERT, please cite both our paper and the original PhoBERT paper:

<!-- ViLegalLM citation — available soon -->

```bibtex
@inproceedings{phobert,
  title     = {{PhoBERT: Pre-trained language models for Vietnamese}},
  author    = {Dat Quoc Nguyen and Anh Tuan Nguyen},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
  year      = {2020},
  pages     = {1037--1042}
}
```

## License

AGPL-3.0

This model is derived from PhoBERT-base-v2, which is licensed under AGPL-3.0. As a derivative work, ViLegalBERT is also released under AGPL-3.0. Any downstream use, modification, or deployment as a network service must comply with the AGPL-3.0 terms, including making the corresponding source code publicly available.