# eland-embedding-zh-v2

A Chinese domain embedding model built on BAAI/bge-m3 with two-stage contrastive fine-tuning, optimized for retrieval and clustering in the Taiwan legal, finance, and news domains.
## Highlights
- Recall@5 = 100% on Taiwan law article retrieval (41 queries, 1,343 laws)
- MRR = 0.903 (vs baseline bge-m3 0.761, +18.7%)
- Multilingual (Chinese + English), up to 8,192 tokens
- Suitable for RAG, semantic search, text clustering, topic modeling
## Model Details
| Property | Value |
|---|---|
| Base Model | BAAI/bge-m3 (XLM-RoBERTa) |
| Parameters | 568M |
| Hidden Size | 1024 |
| Max Length | 8,192 tokens |
| Pooling | CLS token |
| Normalization | L2 |
| Training Method | Contrastive (InfoNCE) + Self-Distillation |
## Training

### Two-Stage Fine-tuning
**Stage 1 - Contrastive v1:**
- InfoNCE loss with in-batch negatives + SimCSE dropout augmentation
- Data: unsupervised corpus (finance, legal, news, opinion, laws, mops) + supervised labels + synthetic query-passage pairs
- Temperature: 0.05
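The Stage-1 objective can be sketched as follows. This is a minimal NumPy illustration of InfoNCE with in-batch negatives, not the actual training code; the batch size and embeddings are made up for the example:

```python
import numpy as np

def info_nce(query_emb, passage_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: passage i is the positive for
    query i; every other passage in the batch serves as a negative.
    (In the SimCSE variant, the "positive" is the same sentence encoded
    twice under different dropout masks.)"""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature             # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # -log P(positive) per query

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 1024))
aligned = q + 0.1 * rng.normal(size=(8, 1024))  # positives close to queries
random_p = rng.normal(size=(8, 1024))           # unrelated passages
# a well-aligned batch yields a much lower loss than a random one
print(info_nce(q, aligned) < info_nce(q, random_p))  # True
```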
**Stage 2 - Contrastive v2 (this model):**
- InfoNCE with ANCE-style hard negatives mined from v1 model
- Self-knowledge distillation: KL divergence between student (v2) and teacher (original bge-m3) to prevent catastrophic forgetting
- 7,999 hard negative triplets (avg 7 negatives per query)
- Temperature: 0.02 (sharper distinction)
- Distillation alpha: 0.3 (30% distillation + 70% contrastive)
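The combined Stage-2 objective can be sketched like this. It is a minimal NumPy illustration assuming score matrices of shape (batch, 1 + K hard negatives); the exact loss formulation, KL direction, and teacher temperature used in training may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stage2_loss(student_logits, teacher_logits, positive_idx,
                alpha=0.3, temperature=0.02):
    """student_logits / teacher_logits: (B, 1+K) similarity scores of each
    query against its positive and its K mined hard negatives."""
    # contrastive term: InfoNCE over the positive + hard negatives
    probs = softmax(student_logits / temperature, axis=1)
    contrastive = -np.mean(np.log(probs[np.arange(len(probs)), positive_idx]))
    # distillation term: KL between teacher and student similarity
    # distributions (direction and scaling here are assumptions)
    p_teacher = softmax(teacher_logits, axis=1)
    p_student = softmax(student_logits, axis=1)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=1).mean()
    # alpha = 0.3 -> 30% distillation + 70% contrastive, as in the card
    return (1 - alpha) * contrastive + alpha * kl

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 8))   # 4 queries, 1 positive + 7 hard negatives
teacher = rng.normal(size=(4, 8))
positives = np.zeros(4, dtype=int)  # positive sits at column 0 by convention
loss = stage2_loss(student, teacher, positives)
print(loss)
```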
### Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 6 x 4 GPUs |
| Gradient Accumulation | 20 steps |
| Effective Batch Size | 480 |
| Learning Rate | 3e-6 (cosine decay) |
| Warmup | 10% |
| Precision | FP16 |
| Hardware | 4x NVIDIA L40S (46GB) |
| Training Time | ~6 hours |
### Data Mix
| Source | Ratio | Description |
|---|---|---|
| Unsupervised SimCSE | 40% | Corpus passages with dropout augmentation |
| Supervised | 10% | Sentiment and entity sentiment labels |
| Query-Passage | 20% | 7,958 synthetic query-passage pairs |
| Hard Negatives | 30% | 7,999 ANCE-mined triplets |
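One way to realize this mix is weighted per-example sampling; a hypothetical sketch (source names are invented, ratios taken from the table):

```python
import random
from collections import Counter

# mix ratios from the Data Mix table
SOURCES = {
    "unsupervised_simcse": 0.40,
    "supervised": 0.10,
    "query_passage": 0.20,
    "hard_negatives": 0.30,
}

def sample_source():
    """Draw the source for one training example according to the mix."""
    names, weights = zip(*SOURCES.items())
    return random.choices(names, weights=weights, k=1)[0]

random.seed(0)
counts = Counter(sample_source() for _ in range(10_000))
print(counts.most_common())  # empirical frequencies track the 40/30/20/10 mix
```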
## Results

### Legal Article Retrieval (Taiwan)

1,343 Taiwan law articles, 41 natural-language queries with ground-truth answers.
| Metric | Baseline (bge-m3) | v2 (this model) | Improvement |
|---|---|---|---|
| Recall@5 | 87.8% | 100% | +12.2 pts |
| Recall@10 | 95.1% | 100% | +4.9 pts |
| MRR | 0.761 | 0.903 | +18.7% (relative) |
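For reference, Recall@k and MRR can be computed from the 1-based rank of the gold article in each query's result list; a minimal sketch, not the evaluation script:

```python
def recall_at_k(gold_ranks, k):
    """gold_ranks: 1-based rank of the correct article for each query."""
    return sum(r <= k for r in gold_ranks) / len(gold_ranks)

def mean_reciprocal_rank(gold_ranks):
    return sum(1.0 / r for r in gold_ranks) / len(gold_ranks)

ranks = [1, 3, 1, 2]                 # toy run with 4 queries
print(recall_at_k(ranks, 5))         # 1.0 -- every gold article in the top 5
print(mean_reciprocal_rank(ranks))   # (1 + 1/3 + 1 + 1/2) / 4 ≈ 0.7083
```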
### Domain Clustering

600 documents from 4 domains (finance, legal, news, opinion).
| Metric | Baseline (bge-m3) | v2 (this model) | Improvement |
|---|---|---|---|
| ARI | 0.130 | 0.159 | +22.8% |
| V-measure | 0.149 | 0.171 | +15.1% |
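These metrics can be reproduced for any embedding with scikit-learn; a toy sketch with synthetic 2-D points standing in for 1024-dim document embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, v_measure_score

# four well-separated synthetic "domains" in place of finance/legal/news/opinion
rng = np.random.default_rng(0)
centers = [(0, 0), (4, 0), (0, 4), (4, 4)]
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in centers])
true_labels = np.repeat(np.arange(4), 50)

# cluster, then compare predicted clusters against the true domain labels
pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(true_labels, pred)
vm = v_measure_score(true_labels, pred)
print(ari, vm)  # both near 1.0 for clusters this well separated
```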
### Similarity Distribution
| Metric | Baseline | v2 |
|---|---|---|
| Mean similarity | 0.402 | 0.466 |
| Std deviation | 0.081 | 0.075 |
v2 produces tighter, more discriminative similarity distributions.
## Usage

### Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "p988744/eland-embedding-zh-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def encode(texts, max_length=512):
    with torch.no_grad():
        enc = tokenizer(texts, max_length=max_length, padding=True,
                        truncation=True, return_tensors="pt")
        out = model(**enc)
        emb = out.last_hidden_state[:, 0, :]  # CLS pooling
        return F.normalize(emb, p=2, dim=-1)

# Encode
query = encode(["What are the overtime pay regulations?"])
docs = encode(["Labor Standards Act Article 24...", "Company Act Article 8..."])

# Cosine similarity (embeddings are L2-normalized)
similarities = (query @ docs.T).squeeze()
print(similarities)  # e.g. tensor([0.82, 0.31])
```
### LiteLLM / OpenAI API

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://your-litellm-proxy/v1",
    api_key="your-key",
)

response = client.embeddings.create(
    model="eland-embedding-zh-v2",
    input=["What are the overtime pay regulations?"],
)
embedding = response.data[0].embedding  # 1024-dim list of floats
```
## Domains
Training data covers:
- Finance: financial news, research reports, stock discussions
- Legal: law articles, court judgments, legal news
- News: general news, international news
- Opinion: PTT, Dcard, forum discussions
- Laws: Taiwan national law database
- MOPS: listed company material information
## Limitations
- Optimized for Traditional/Simplified Chinese; may underperform on other languages
- Best results with CLS pooling + L2 normalization
- Recommended max length is 512 tokens for best quality (supports up to 8,192)
- Clustering metrics (ARI, V-measure) remain moderate, since domain boundaries in Chinese text are inherently fuzzy
## Citation

```bibtex
@misc{eland-embedding-zh-v2,
  title={eland-embedding-zh-v2: Chinese Domain Embedding with Hard Negative Mining and Self-Distillation},
  author={Eland AI},
  year={2026},
  url={https://huggingface.co/p988744/eland-embedding-zh-v2}
}
```
## License
MIT