eland-embedding-zh-v2

A Chinese domain embedding model based on BAAI/bge-m3, fine-tuned with two-stage contrastive training and optimized for retrieval and clustering in Taiwan legal, finance, and news domains.

Highlights

  • Recall@5 = 100% on Taiwan law article retrieval (41 queries, 1,343 laws)
  • MRR = 0.903 (vs baseline bge-m3 0.761, +18.7%)
  • Multilingual (Chinese + English), up to 8,192 tokens
  • Suitable for RAG, semantic search, text clustering, topic modeling

Model Details

| Property | Value |
|---|---|
| Base Model | BAAI/bge-m3 (XLM-RoBERTa) |
| Parameters | 568M |
| Hidden Size | 1024 |
| Max Length | 8,192 tokens |
| Pooling | CLS token |
| Normalization | L2 |
| Training Method | Contrastive (InfoNCE) + Self-Distillation |

Training

Two-Stage Fine-tuning

Stage 1 - Contrastive v1:

  • InfoNCE loss with in-batch negatives + SimCSE dropout augmentation
  • Data: unsupervised corpus (finance, legal, news, opinion, laws, MOPS) + supervised labels + synthetic query-passage pairs
  • Temperature: 0.05
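The Stage 1 objective can be sketched as follows. This is illustrative only: it operates on precomputed cosine similarities rather than the actual encoder, and in SimCSE-style training the positive for each sentence is a second encoding of the same sentence under a different dropout mask.

```python
import math

def in_batch_info_nce(sim_matrix, tau=0.05):
    """InfoNCE with in-batch negatives.

    sim_matrix[i][j] = cos(query_i, passage_j); positives sit on the
    diagonal, every other passage in the batch acts as a negative.
    tau = 0.05 matches the Stage 1 temperature above.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / tau for s in row]
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability of the diagonal (positive) entry.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 3 query-passage pairs: strong diagonal = low loss.
sims = [[0.9, 0.2, 0.1],
        [0.3, 0.8, 0.2],
        [0.1, 0.2, 0.7]]
print(in_batch_info_nce(sims))
```

With a low temperature, even modest gaps between the positive and the in-batch negatives drive the loss toward zero, which is why harder negatives are needed in Stage 2.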

Stage 2 - Contrastive v2 (this model):

  • InfoNCE with ANCE-style hard negatives mined from v1 model
  • Self-knowledge distillation: KL divergence between student (v2) and teacher (original bge-m3) to prevent catastrophic forgetting
  • 7,999 hard negative triplets (avg 7 negatives per query)
  • Temperature: 0.02 (sharper than Stage 1's 0.05, forcing finer-grained distinctions among hard negatives)
  • Distillation alpha: 0.3 (30% distillation + 70% contrastive)

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 6 per GPU × 4 GPUs |
| Gradient Accumulation | 20 steps |
| Effective Batch Size | 480 (6 × 4 × 20) |
| Learning Rate | 3e-6 (cosine decay) |
| Warmup | 10% |
| Precision | FP16 |
| Hardware | 4× NVIDIA L40S (46GB) |
| Training Time | ~6 hours |

Data Mix

| Source | Ratio | Description |
|---|---|---|
| Unsupervised SimCSE | 40% | Corpus passages with dropout augmentation |
| Supervised | 10% | Sentiment and entity-sentiment labels |
| Query-Passage | 20% | 7,958 synthetic query-passage pairs |
| Hard Negatives | 30% | 7,999 ANCE-mined triplets |
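A per-example source sampler matching these ratios could look like the sketch below. The source names and the sampling scheme are assumptions for illustration, not the actual training pipeline.

```python
import random

# Hypothetical weighted sampler reflecting the data-mix ratios above.
MIX = [("unsupervised_simcse", 0.40),
       ("supervised", 0.10),
       ("query_passage", 0.20),
       ("hard_negatives", 0.30)]

def sample_source(rng):
    names, weights = zip(*MIX)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name, _ in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
shares = {name: n / 10_000 for name, n in counts.items()}
print(shares)  # each empirical share sits close to its configured ratio
```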

Results

Legal Article Retrieval (Taiwan)

1,343 Taiwan law articles, 41 natural language queries with ground truth.

| Metric | Baseline (bge-m3) | v2 (this model) | Improvement |
|---|---|---|---|
| Recall@5 | 87.8% | 100% | +12.2 pts |
| Recall@10 | 95.1% | 100% | +4.9 pts |
| MRR | 0.761 | 0.903 | +18.7% (relative) |
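For reference, Recall@k and MRR for this kind of single-relevant-article retrieval are typically computed as sketched below; the data here is illustrative, not from the actual eval set.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the ground-truth article appears in the top k results."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr(all_rankings):
    """Mean reciprocal rank over (ranked_ids, relevant_id) query pairs."""
    total = 0.0
    for ranked_ids, relevant_id in all_rankings:
        if relevant_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(relevant_id) + 1)
    return total / len(all_rankings)

rankings = [(["a24", "c8", "b3"], "a24"),   # rank 1 -> RR 1.0
            (["c8", "a24", "b3"], "a24")]   # rank 2 -> RR 0.5
print(mrr(rankings))  # 0.75
```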

Domain Clustering

600 documents from 4 domains (finance, legal, news, opinion).

| Metric | Baseline (bge-m3) | v2 (this model) | Improvement |
|---|---|---|---|
| ARI | 0.130 | 0.159 | +22.8% |
| V-measure | 0.149 | 0.171 | +15.1% |

Similarity Distribution

| Metric | Baseline | v2 |
|---|---|---|
| Mean similarity | 0.402 | 0.466 |
| Std deviation | 0.081 | 0.075 |

v2 produces a tighter similarity distribution (lower standard deviation at a higher mean), giving more consistent scores for ranking and thresholding.
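Mean/std figures of this kind can be reproduced as sketched below: pairwise cosine similarities over a document set, with embeddings L2-normalized so the dot product equals the cosine. The vectors here are toy examples.

```python
import math

def normalize(v):
    """L2-normalize a vector so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def pairwise_similarity_stats(embs):
    """Mean and std of cosine similarity over all unordered pairs."""
    sims = []
    for i in range(len(embs)):
        for j in range(i + 1, len(embs)):
            sims.append(sum(a * b for a, b in zip(embs[i], embs[j])))
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return mean, math.sqrt(var)

embs = [normalize(v) for v in ([1.0, 0.2], [0.9, 0.4], [0.1, 1.0])]
mean, std = pairwise_similarity_stats(embs)
print(mean, std)
```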

Usage

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "p988744/eland-embedding-zh-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # disable dropout for deterministic embeddings

def encode(texts, max_length=512):
    with torch.no_grad():
        enc = tokenizer(texts, max_length=max_length, padding=True,
                        truncation=True, return_tensors="pt")
        out = model(**enc)
        emb = out.last_hidden_state[:, 0, :]  # CLS pooling
        return F.normalize(emb, p=2, dim=-1)

# Encode
query = encode(["What are the overtime pay regulations?"])
docs = encode(["Labor Standards Act Article 24...", "Company Act Article 8..."])

# Cosine similarity (embeddings are L2-normalized)
similarities = (query @ docs.T).squeeze()
print(similarities)  # e.g. tensor([0.82, 0.31]); higher for the relevant article

LiteLLM / OpenAI API

from openai import OpenAI

client = OpenAI(
    base_url="https://your-litellm-proxy/v1",
    api_key="your-key"
)

response = client.embeddings.create(
    model="eland-embedding-zh-v2",
    input=["What are the overtime pay regulations?"]
)
embedding = response.data[0].embedding  # 1024-dim

Domains

Training data covers:

  • Finance: financial news, research reports, stock discussions
  • Legal: law articles, court judgments, legal news
  • News: general news, international news
  • Opinion: PTT, Dcard, forum discussions
  • Laws: Taiwan national law database
  • MOPS: listed company material information

Limitations

  • Optimized for Traditional/Simplified Chinese; may underperform on other languages
  • Best results with CLS pooling + L2 normalization
  • Recommended max length is 512 tokens for best quality (supports up to 8,192)
  • Clustering metrics (ARI, V-measure) still moderate as domain boundaries in Chinese text are inherently fuzzy

Citation

@misc{eland-embedding-zh-v2,
  title={eland-embedding-zh-v2: Chinese Domain Embedding with Hard Negative Mining and Self-Distillation},
  author={Eland AI},
  year={2026},
  url={https://huggingface.co/p988744/eland-embedding-zh-v2}
}

License

MIT
