# bge-m3-law
A fine-tuned version of BAAI/bge-m3 specialized for Traditional Chinese legal document retrieval. Given a natural-language legal scenario query, this model retrieves the most relevant statutory articles from a corpus of Taiwan law.
## Model Details
| Field | Value |
|---|---|
| Base model | BAAI/bge-m3 |
| Fine-tuning framework | FlagEmbedding |
| Language | Traditional Chinese (zh-TW) |
| Domain | Taiwan statutory law |
| Task | Dense retrieval / semantic similarity |
| Max sequence length | 8192 tokens (inherited from bge-m3) |
| Embedding dimension | 1024 |
## Training Details
### Training Data
- Source: 390 legal scenario–article pairs
- Generation: Scenarios (queries) were synthetically generated by GPT-5.1 from statutory article content
- Corpus: 5,318 articles spanning multiple Taiwan laws
- Format: Each training sample contains one query (legal scenario), one positive (the source article), and 9 hard negatives mined from the corpus
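The per-sample layout above matches the JSONL record format that FlagEmbedding's fine-tuning scripts consume (`query` / `pos` / `neg` keys). The sketch below builds one such record with placeholder English texts; the texts and file name are illustrative, not actual training data.

```python
import json

# One training record: a scenario query, the source article as the single
# positive, and 9 mined hard negatives (placeholders here).
sample = {
    "query": "A borrowed money from B and failed to repay as agreed; what rights can B assert?",
    "pos": ["When the debtor fails to perform, the creditor may seek compulsory enforcement."],
    "neg": [f"hard negative article {i}" for i in range(9)],
}

# train_group_size = 10 corresponds to 1 positive + 9 negatives per group.
assert len(sample["pos"]) + len(sample["neg"]) == 10

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```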
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Epochs | 5 |
| Batch size per GPU | 2 |
| GPUs | NVIDIA L40S × 4 |
| train_group_size | 10 (1 positive + 9 hard negatives) |
| Temperature | 0.02 |
| unified_finetuning | True |
| use_self_distill | True |
| Final train loss | 0.2688 |
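The hyperparameters above map roughly onto FlagEmbedding's bge-m3 fine-tuning entry point as in the sketch below. The module path, data path, and output path are assumptions; verify the exact script name and flags against the FlagEmbedding version you have installed.

```shell
torchrun --nproc_per_node 4 \
  -m FlagEmbedding.finetune.embedder.encoder_only.m3 \
  --model_name_or_path BAAI/bge-m3 \
  --train_data ./train.jsonl \
  --output_dir ./bge-m3-law \
  --learning_rate 1e-5 \
  --num_train_epochs 5 \
  --per_device_train_batch_size 2 \
  --train_group_size 10 \
  --temperature 0.02 \
  --unified_finetuning True \
  --use_self_distill True
```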
## Evaluation
Evaluation protocol: 3 rounds × 200 randomly sampled legal scenarios (600 queries total), each evaluated by top-k retrieval (top_k = 10) against the full 5,318-article corpus.
### Results on GPT-5.1 Scenarios (in-domain)
| Model | Hit Rate | MRR | Median Rank | Worst Rank | Std Dev |
|---|---|---|---|---|---|
| BAAI/bge-m3 (baseline) | 67.17% | 0.5001 | 1 | 10 | 1.89 |
| bge-m3-law (this model) | 88.50% | 0.7058 | 1 | 10 | 1.69 |
| Δ improvement | +21.33 pp | +0.2057 | — | — | −0.20 |
### Results on GPT-4o Scenarios (out-of-domain)
| Model | Hit Rate | MRR | Median Rank | Worst Rank | Std Dev |
|---|---|---|---|---|---|
| BAAI/bge-m3 (baseline) | 72.67% | 0.5746 | 1 | 10 | 1.80 |
| bge-m3-law (this model) | 84.50% | — | — | — | — |
**Hit Rate**: proportion of queries where the ground-truth article appears within the top-k results. **MRR**: Mean Reciprocal Rank; non-hits contribute 0. **Median Rank / Worst Rank / Std Dev**: computed over hits only.
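The metrics defined above can be reproduced from a list of per-query ranks. A minimal sketch, where each rank is the 1-based position of the ground-truth article, or `None` when it falls outside the top-k results:

```python
import statistics

def retrieval_metrics(ranks, top_k=10):
    """ranks: 1-based rank of the ground-truth article per query,
    or None when it falls outside the top_k results."""
    hits = [r for r in ranks if r is not None and r <= top_k]
    hit_rate = len(hits) / len(ranks)
    # Non-hits contribute 0 to MRR, matching the protocol above.
    mrr = sum(1.0 / r for r in hits) / len(ranks)
    return {
        "hit_rate": hit_rate,
        "mrr": mrr,
        # Median / worst rank / std dev are computed over hits only.
        "median_rank": statistics.median(hits) if hits else None,
        "worst_rank": max(hits) if hits else None,
        "std_dev": statistics.stdev(hits) if len(hits) > 1 else 0.0,
    }

print(retrieval_metrics([1, 2, None, 1, 10]))
```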
## Usage
### With FlagEmbedding
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("eLAND-Research/bge-m3-law", use_fp16=True)

# Query: "A borrowed money from B and failed to repay as agreed; what rights can B assert?"
queries = ["甲向乙借款後未依約還款,乙可以主張什麼權利?"]
# Article: "When the debtor fails to perform, the creditor may seek compulsory enforcement."
corpus = ["債務人未履行債務時,債權人得請求強制執行。"]

q_emb = model.encode(queries, batch_size=12, max_length=512)["dense_vecs"]
c_emb = model.encode(corpus, batch_size=12, max_length=512)["dense_vecs"]

# Dense vectors are L2-normalized, so the dot product is cosine similarity.
scores = q_emb @ c_emb.T
print(scores)
```
### With Sentence-Transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("eLAND-Research/bge-m3-law")

# Query: "A borrowed money from B and failed to repay as agreed; what rights can B assert?"
query = "甲向乙借款後未依約還款,乙可以主張什麼權利?"
# Article: "When the debtor fails to perform, the creditor may seek compulsory enforcement."
article = "債務人未履行債務時,債權人得請求強制執行。"

embeddings = model.encode([query, article], normalize_embeddings=True)
score = embeddings[0] @ embeddings[1]  # cosine similarity of the normalized vectors
print(f"Similarity: {score:.4f}")
```
### With an OpenAI-Compatible API (LiteLLM / text-embeddings-inference)
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://your-deployment-endpoint/",
)

response = client.embeddings.create(
    model="bge-m3-law",
    input=["甲向乙借款後未依約還款,乙可以主張什麼權利?"],
)
embedding = response.data[0].embedding  # 1024-dimensional dense vector
```
## Intended Use
- Retrieval-Augmented Generation (RAG) pipelines for Taiwan legal Q&A systems
- Legal article search: retrieving the most relevant statutory articles for a given case description or user query
- Semantic similarity between legal texts in Traditional Chinese
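In a RAG pipeline, article retrieval reduces to a nearest-neighbour search over normalized embeddings. The sketch below demonstrates the top-k step with random placeholder vectors standing in for model outputs (embedding dimension 1024 and corpus size 5,318, as above); swap in real `encode(...)` outputs in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_articles = 1024, 5318

# Placeholder embeddings in place of model.encode(...) outputs;
# bge-m3 dense vectors are L2-normalized, so we normalize here too.
corpus_emb = rng.standard_normal((n_articles, dim))
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)
query_emb = rng.standard_normal(dim)
query_emb /= np.linalg.norm(query_emb)

# Cosine similarity via dot product, then the 10 best article indices.
scores = corpus_emb @ query_emb
top_k = np.argsort(scores)[::-1][:10]
print(top_k, scores[top_k])
```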
## Limitations
- Optimized for Taiwan statutory law; performance on other legal systems or languages is not guaranteed
- Very short articles (≤ 50 characters) remain hard to retrieve due to insufficient semantic content — a structural limitation independent of the model
- Training data is synthetically generated; real user queries may exhibit different distributions
## License
This model is released under the MIT License, consistent with the base model BAAI/bge-m3.
## Citation
If you use this model in your research or product, please cite the base model:
```bibtex
@article{chen2024bge,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and Luo, Kun and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2402.03216},
  year={2024}
}
```
## Developed By
eLAND Research — AI research team focused on legal technology applications.