# bge-m3-law
A fine-tuned version of BAAI/bge-m3 specialized for Traditional Chinese legal document retrieval. Given a natural-language legal scenario query, this model retrieves the most relevant statutory articles from a corpus of Taiwan law.
## Model Details
| Field | Value |
|---|---|
| Base model | BAAI/bge-m3 |
| Fine-tuning framework | FlagEmbedding |
| Language | Traditional Chinese (zh-TW) |
| Domain | Taiwan statutory law |
| Task | Dense retrieval / semantic similarity |
| Max sequence length | 8192 tokens (inherited from bge-m3) |
| Embedding dimension | 1024 |
## Training Details
### Training Data
- Source: 390 legal scenario–article pairs
- Generation: Scenarios (queries) were synthetically generated by GPT-5.1 from statutory article content
- Corpus: 5,318 articles spanning multiple Taiwan laws
- Format: Each training sample contains one query (legal scenario), one positive (the source article), and 9 hard negatives mined from the corpus
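The per-sample layout above matches the JSONL record format that FlagEmbedding's fine-tuning scripts consume (`query` / `pos` / `neg` keys). The sketch below builds one such record with placeholder English texts; the texts and file name are illustrative, not actual training data.

```python
import json

# One training record: a scenario query, the source article as the single
# positive, and 9 mined hard negatives (placeholders here).
sample = {
    "query": "A borrowed money from B and failed to repay as agreed; what rights can B assert?",
    "pos": ["When the debtor fails to perform, the creditor may seek compulsory enforcement."],
    "neg": [f"hard negative article {i}" for i in range(9)],
}

# train_group_size = 10 corresponds to 1 positive + 9 negatives per group.
assert len(sample["pos"]) + len(sample["neg"]) == 10

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```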
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Epochs | 5 |
| Batch size per GPU | 2 |
| GPUs | NVIDIA L40S × 4 |
| train_group_size | 10 (1 positive + 9 hard negatives) |
| Temperature | 0.02 |
| unified_finetuning | True |
| use_self_distill | True |
| Final train loss | 0.2688 |
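The hyperparameters above map roughly onto FlagEmbedding's bge-m3 fine-tuning entry point as in the sketch below. The module path, data path, and output path are assumptions; verify the exact script name and flags against the FlagEmbedding version you have installed.

```shell
torchrun --nproc_per_node 4 \
  -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
  --model_name_or_path BAAI/bge-m3 \
  --train_data ./train.jsonl \
  --output_dir ./bge-m3-law \
  --learning_rate 1e-5 \
  --num_train_epochs 5 \
  --per_device_train_batch_size 2 \
  --train_group_size 10 \
  --temperature 0.02 \
  --unified_finetuning True \
  --use_self_distill True
```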
## Evaluation
Evaluation protocol: 3 rounds × 200 randomly sampled legal scenarios (600 queries total), each evaluated by top-k retrieval (top_k = 10) against the full 5,318-article corpus.
### Results on GPT-5.1 Scenarios (in-domain)
| Model | Hit Rate | MRR | Median Rank | Worst Rank | Std Dev |
|---|---|---|---|---|---|
| BAAI/bge-m3 (baseline) | 67.17% | 0.5001 | 1 | 10 | 1.89 |
| bge-m3-law (this model) | 88.50% | 0.7058 | 1 | 10 | 1.69 |
| Δ improvement | +21.33 pp | +0.2057 | — | — | −0.20 |
### Results on GPT-4o Scenarios (out-of-domain)
| Model | Hit Rate | MRR | Median Rank | Worst Rank | Std Dev |
|---|---|---|---|---|---|
| BAAI/bge-m3 (baseline) | 72.67% | 0.5746 | 1 | 10 | 1.80 |
| bge-m3-law (this model) | 84.50% | — | — | — | — |
**Hit Rate**: proportion of queries where the ground-truth article appears within the top-k results. **MRR**: Mean Reciprocal Rank; non-hits contribute 0. **Median Rank / Worst Rank / Std Dev**: computed over hits only.
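The metrics defined above can be reproduced from a list of per-query ranks. A minimal sketch, where each rank is the 1-based position of the ground-truth article, or `None` when it falls outside the top-k results:

```python
import statistics

def retrieval_metrics(ranks, top_k=10):
    """ranks: 1-based rank of the ground-truth article per query,
    or None when it falls outside the top_k results."""
    hits = [r for r in ranks if r is not None and r <= top_k]
    hit_rate = len(hits) / len(ranks)
    # Non-hits contribute 0 to MRR, matching the protocol above.
    mrr = sum(1.0 / r for r in hits) / len(ranks)
    return {
        "hit_rate": hit_rate,
        "mrr": mrr,
        # Median / worst rank / std dev are computed over hits only.
        "median_rank": statistics.median(hits) if hits else None,
        "worst_rank": max(hits) if hits else None,
        "std_dev": statistics.stdev(hits) if len(hits) > 1 else 0.0,
    }

print(retrieval_metrics([1, 2, None, 1, 10]))
```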
## Usage
### With FlagEmbedding
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("eLAND-Research/bge-m3-law", use_fp16=True)

# Query: "A borrowed money from B and failed to repay as agreed; what rights can B assert?"
queries = ["甲向乙借款後未依約還款,乙可以主張什麼權利?"]
# Article: "When the debtor fails to perform, the creditor may seek compulsory enforcement."
corpus = ["債務人未履行債務時,債權人得請求強制執行。"]

q_emb = model.encode(queries, batch_size=12, max_length=512)["dense_vecs"]
c_emb = model.encode(corpus, batch_size=12, max_length=512)["dense_vecs"]

# Dense vectors are L2-normalized, so the dot product is cosine similarity.
scores = q_emb @ c_emb.T
print(scores)
```
### With Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("eLAND-Research/bge-m3-law")

# Query: "A borrowed money from B and failed to repay as agreed; what rights can B assert?"
query = "甲向乙借款後未依約還款,乙可以主張什麼權利?"
# Article: "When the debtor fails to perform, the creditor may seek compulsory enforcement."
article = "債務人未履行債務時,債權人得請求強制執行。"

embeddings = model.encode([query, article], normalize_embeddings=True)
score = embeddings[0] @ embeddings[1]  # cosine similarity of the normalized vectors
print(f"Similarity: {score:.4f}")
```
### With an OpenAI-Compatible API (LiteLLM / text-embeddings-inference)
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://your-deployment-endpoint/",
)

response = client.embeddings.create(
    model="bge-m3-law",
    input=["甲向乙借款後未依約還款,乙可以主張什麼權利?"],
)
embedding = response.data[0].embedding  # 1024-dimensional dense vector
```
## Intended Use
- Retrieval-Augmented Generation (RAG) pipelines for Taiwan legal Q&A systems
- Legal article search: retrieving the most relevant statutory articles for a given case description or user query
- Semantic similarity between legal texts in Traditional Chinese
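In a RAG pipeline, article retrieval reduces to a nearest-neighbour search over normalized embeddings. The sketch below demonstrates the top-k step with random placeholder vectors standing in for model outputs (embedding dimension 1024 and corpus size 5,318, as above); swap in real `encode(...)` outputs in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_articles = 1024, 5318

# Placeholder embeddings in place of model.encode(...) outputs;
# bge-m3 dense vectors are L2-normalized, so we normalize here too.
corpus_emb = rng.standard_normal((n_articles, dim))
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)
query_emb = rng.standard_normal(dim)
query_emb /= np.linalg.norm(query_emb)

# Cosine similarity via dot product, then the 10 best article indices.
scores = corpus_emb @ query_emb
top_k = np.argsort(scores)[::-1][:10]
print(top_k, scores[top_k])
```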
## Limitations
- Optimized for Taiwan statutory law; performance on other legal systems or languages is not guaranteed
- Very short articles (≤ 50 characters) remain hard to retrieve due to insufficient semantic content — a structural limitation independent of the model
- Training data is synthetically generated; real user queries may exhibit different distributions
## License
This model is released under the MIT License, consistent with the base model BAAI/bge-m3.
## Citation
If you use this model in your research or product, please cite the base model:
```bibtex
@article{chen2024bge,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2402.03216},
  year={2024}
}
```
## Developed By
eLAND Research — AI research team focused on legal technology applications.