# bge-m3-law

A fine-tuned version of BAAI/bge-m3 specialized for Traditional Chinese legal document retrieval. Given a natural-language legal scenario query, this model retrieves the most relevant statutory articles from a corpus of Taiwan law.

## Model Details

| Field | Value |
|---|---|
| Base model | BAAI/bge-m3 |
| Fine-tuning framework | FlagEmbedding |
| Language | Traditional Chinese (zh-TW) |
| Domain | Taiwan statutory law |
| Task | Dense retrieval / semantic similarity |
| Max sequence length | 8192 tokens (inherited from bge-m3) |
| Embedding dimension | 1024 |

## Training Details

### Training Data

- **Source:** 390 legal scenario–article pairs
- **Generation:** scenarios (queries) were synthetically generated by GPT-5.1 from the text of the statutory articles
- **Corpus:** 5,318 articles spanning multiple Taiwan laws
- **Format:** each training sample contains one query (a legal scenario), one positive (the source article), and 9 hard negatives mined from the corpus
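Concretely, FlagEmbedding consumes this data as JSONL, one sample per line with `query`, `pos`, and `neg` fields. A sketch of one line, reusing the scenario/article pair from the usage examples below; the negative texts here are placeholders, not real mined negatives:

```python
import json

# One training sample: a scenario query, its source article as the
# positive, and 9 hard negatives mined from the corpus (placeholders here).
sample = {
    # "A borrowed money from B and failed to repay on schedule;
    #  what rights can B assert?"
    "query": "甲向乙借款後未依約還款,乙可以主張什麼權利?",
    # "When the debtor fails to perform, the creditor may petition
    #  for compulsory execution."
    "pos": ["債務人未履行債務時,債權人得請求強制執行。"],
    "neg": [f"(hard negative article text {i})" for i in range(9)],
}
line = json.dumps(sample, ensure_ascii=False)  # one JSONL line
print(line)
```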

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Epochs | 5 |
| Batch size per GPU | 2 |
| GPUs | NVIDIA L40S × 4 |
| train_group_size | 10 (1 positive + 9 hard negatives) |
| Temperature | 0.02 |
| unified_finetuning | True |
| use_self_distill | True |
| Final train loss | 0.2688 |
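For reference, a 4-GPU run with these hyperparameters would be launched roughly along these lines. The module path, data file, and output directory are assumptions (they vary across FlagEmbedding versions); verify flag names against the FlagEmbedding finetuning documentation before running:

```shell
# Hypothetical launch sketch; paths are placeholders.
torchrun --nproc_per_node 4 \
    -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
    --model_name_or_path BAAI/bge-m3 \
    --train_data ./train_data.jsonl \
    --output_dir ./bge-m3-law \
    --learning_rate 1e-5 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --train_group_size 10 \
    --temperature 0.02 \
    --unified_finetuning True \
    --use_self_distill True
```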

## Evaluation

Evaluation protocol: 3 rounds × 200 randomly sampled legal scenarios (600 total); top-k retrieval (k = 10) against the 5,318-article corpus.

### Results on GPT-5.1 Scenarios (in-domain)

| Model | Hit Rate | MRR | Median Rank | Worst Rank | Std Dev |
|---|---|---|---|---|---|
| BAAI/bge-m3 (baseline) | 67.17% | 0.5001 | 1 | 10 | 1.89 |
| bge-m3-law (this model) | 88.50% | 0.7058 | 1 | 10 | 1.69 |
| Δ improvement | +21.33% | +0.2057 | | | −0.20 |

### Results on GPT-4o Scenarios (out-of-domain)

| Model | Hit Rate | MRR | Median Rank | Worst Rank | Std Dev |
|---|---|---|---|---|---|
| BAAI/bge-m3 (baseline) | 72.67% | 0.5746 | 1 | 10 | 1.80 |
| bge-m3-law (this model) | 84.50% | | | | |

**Hit Rate:** proportion of queries where the ground-truth article appears within the top-k results. **MRR:** Mean Reciprocal Rank; non-hits contribute 0. **Median / Worst Rank / Std Dev:** computed over hits only.
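Under these conventions, both headline metrics follow directly from the per-query rank of the ground-truth article; a minimal sketch, where `None` marks a query whose ground-truth article fell outside the top-k:

```python
def hit_rate(ranks):
    """Fraction of queries whose ground-truth article was retrieved."""
    return sum(r is not None for r in ranks) / len(ranks)

def mrr(ranks):
    """Mean Reciprocal Rank; misses (None) contribute 0 to the mean."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 2, None, 1, 5]  # example: 4 hits out of 5 queries
print(hit_rate(ranks))      # 0.8
print(mrr(ranks))           # (1 + 0.5 + 0 + 1 + 0.2) / 5 = 0.54
```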

## Usage

### With FlagEmbedding

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("eLAND-Research/bge-m3-law", use_fp16=True)

# Query: "A borrowed money from B and failed to repay on schedule;
# what rights can B assert?"
queries = ["甲向乙借款後未依約還款,乙可以主張什麼權利?"]
# Article: "When the debtor fails to perform, the creditor may
# petition for compulsory execution."
corpus  = ["債務人未履行債務時,債權人得請求強制執行。"]

q_emb = model.encode(queries, batch_size=12, max_length=512)["dense_vecs"]
c_emb = model.encode(corpus,  batch_size=12, max_length=512)["dense_vecs"]

scores = q_emb @ c_emb.T  # inner-product similarity scores
print(scores)
```

### With Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("eLAND-Research/bge-m3-law")

# Query: "A borrowed money from B and failed to repay on schedule;
# what rights can B assert?"
query   = "甲向乙借款後未依約還款,乙可以主張什麼權利?"
# Article: "When the debtor fails to perform, the creditor may
# petition for compulsory execution."
article = "債務人未履行債務時,債權人得請求強制執行。"

embeddings = model.encode([query, article], normalize_embeddings=True)
score = embeddings[0] @ embeddings[1]  # cosine similarity
print(f"Similarity: {score:.4f}")
```

### With an OpenAI-Compatible API (LiteLLM / text-embeddings-inference)

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://your-deployment-endpoint/",
)

response = client.embeddings.create(
    model="bge-m3-law",
    input=["甲向乙借款後未依約還款,乙可以主張什麼權利?"],
)
embedding = response.data[0].embedding  # list of 1024 floats
```

## Intended Use

- **Retrieval-Augmented Generation (RAG):** retrieval backbone for Taiwan legal Q&A pipelines
- **Legal article search:** retrieving the most relevant statutory articles for a given case description or user query
- **Semantic similarity** between legal texts in Traditional Chinese
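For the article-search use case, retrieval over precomputed normalized embeddings reduces to a matrix product plus a top-k sort. A minimal NumPy sketch with random vectors standing in for real article embeddings (a production pipeline would typically use FAISS or a vector database instead):

```python
import numpy as np

def top_k_articles(query_emb, corpus_embs, k=10):
    """Return (indices, scores) of the k best-matching articles.

    Assumes both inputs are L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = corpus_embs @ query_emb  # shape: (num_articles,)
    idx = np.argsort(-scores)[:k]     # highest score first
    return idx, scores[idx]

# Toy corpus: 5,318 random unit vectors of dimension 1024.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5318, 1024))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of article 42.
query = corpus[42] + 0.01 * rng.normal(size=1024)
query /= np.linalg.norm(query)

idx, scores = top_k_articles(query, corpus, k=10)
print(idx[0])  # 42: the nearest article
```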

## Limitations

- Optimized for Taiwan statutory law; performance on other legal systems or languages is not guaranteed
- Very short articles (≤ 50 characters) remain hard to retrieve because they carry too little semantic content; this is a structural limitation of the corpus rather than of the model
- Training queries are synthetically generated; real user queries may follow a different distribution

## License

This model is released under the MIT License, consistent with the base model BAAI/bge-m3.

## Citation

If you use this model in your research or product, please cite the base model:

```bibtex
@article{chen2024bge,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2402.03216},
  year={2024}
}
```

## Developed By

eLAND Research, an AI research team focused on legal technology applications.
