eland-embedding-zh-v2

A Chinese domain embedding model based on BAAI/bge-m3, fine-tuned with two-stage contrastive training and optimized for retrieval and clustering in Taiwan legal, finance, and news domains.

Highlights

  • Recall@5 = 100% on Taiwan law article retrieval (41 queries, 1,343 laws)
  • MRR = 0.903 (vs baseline bge-m3 0.761, +18.7%)
  • Multilingual (Chinese + English), up to 8,192 tokens
  • Suitable for RAG, semantic search, text clustering, topic modeling

Model Details

| Property | Value |
|---|---|
| Base Model | BAAI/bge-m3 (XLM-RoBERTa) |
| Parameters | 568M |
| Hidden Size | 1024 |
| Max Length | 8,192 tokens |
| Pooling | CLS token |
| Normalization | L2 |
| Training Method | Contrastive (InfoNCE) + Self-Distillation |

Training

Two-Stage Fine-tuning

Stage 1 - Contrastive v1:

  • InfoNCE loss with in-batch negatives + SimCSE dropout augmentation
  • Data: unsupervised corpus (finance, legal, news, opinion, laws, MOPS) + supervised labels + synthetic query-passage pairs
  • Temperature: 0.05
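The Stage 1 objective can be sketched as follows. This is illustrative only: it operates on precomputed cosine similarities rather than the actual encoder, and in SimCSE-style training the positive for each sentence is a second encoding of the same sentence under a different dropout mask.

```python
import math

def in_batch_info_nce(sim_matrix, tau=0.05):
    """InfoNCE with in-batch negatives.

    sim_matrix[i][j] = cos(query_i, passage_j); positives sit on the
    diagonal, every other passage in the batch acts as a negative.
    tau = 0.05 matches the Stage 1 temperature above.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / tau for s in row]
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # -log softmax probability of the diagonal (positive) entry.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 3 query-passage pairs: strong diagonal = low loss.
sims = [[0.9, 0.2, 0.1],
        [0.3, 0.8, 0.2],
        [0.1, 0.2, 0.7]]
print(in_batch_info_nce(sims))
```

With a low temperature, even modest gaps between the positive and the in-batch negatives drive the loss toward zero, which is why harder negatives are needed in Stage 2.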

Stage 2 - Contrastive v2 (this model):

  • InfoNCE with ANCE-style hard negatives mined from v1 model
  • Self-knowledge distillation: KL divergence between student (v2) and teacher (original bge-m3) to prevent catastrophic forgetting
  • 7,999 hard negative triplets (avg 7 negatives per query)
  • Temperature: 0.02 (sharper than Stage 1's 0.05, forcing finer-grained distinctions among hard negatives)
  • Distillation alpha: 0.3 (30% distillation + 70% contrastive)

Training Configuration

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Batch Size | 6 per GPU × 4 GPUs |
| Gradient Accumulation | 20 steps |
| Effective Batch Size | 480 (6 × 4 × 20) |
| Learning Rate | 3e-6 (cosine decay) |
| Warmup | 10% |
| Precision | FP16 |
| Hardware | 4× NVIDIA L40S (46GB) |
| Training Time | ~6 hours |

Data Mix

| Source | Ratio | Description |
|---|---|---|
| Unsupervised SimCSE | 40% | Corpus passages with dropout augmentation |
| Supervised | 10% | Sentiment and entity-sentiment labels |
| Query-Passage | 20% | 7,958 synthetic query-passage pairs |
| Hard Negatives | 30% | 7,999 ANCE-mined triplets |
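A per-example source sampler matching these ratios could look like the sketch below. The source names and the sampling scheme are assumptions for illustration, not the actual training pipeline.

```python
import random

# Hypothetical weighted sampler reflecting the data-mix ratios above.
MIX = [("unsupervised_simcse", 0.40),
       ("supervised", 0.10),
       ("query_passage", 0.20),
       ("hard_negatives", 0.30)]

def sample_source(rng):
    names, weights = zip(*MIX)
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name, _ in MIX}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
shares = {name: n / 10_000 for name, n in counts.items()}
print(shares)  # each empirical share sits close to its configured ratio
```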

Results

Legal Article Retrieval (Taiwan)

1,343 Taiwan law articles, 41 natural language queries with ground truth.

| Metric | Baseline (bge-m3) | v2 (this model) | Improvement |
|---|---|---|---|
| Recall@5 | 87.8% | 100% | +12.2 pts |
| Recall@10 | 95.1% | 100% | +4.9 pts |
| MRR | 0.761 | 0.903 | +18.7% (relative) |
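For reference, Recall@k and MRR for this kind of single-relevant-article retrieval are typically computed as sketched below; the data here is illustrative, not from the actual eval set.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the ground-truth article appears in the top k results."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr(all_rankings):
    """Mean reciprocal rank over (ranked_ids, relevant_id) query pairs."""
    total = 0.0
    for ranked_ids, relevant_id in all_rankings:
        if relevant_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(relevant_id) + 1)
    return total / len(all_rankings)

rankings = [(["a24", "c8", "b3"], "a24"),   # rank 1 -> RR 1.0
            (["c8", "a24", "b3"], "a24")]   # rank 2 -> RR 0.5
print(mrr(rankings))  # 0.75
```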

Domain Clustering

600 documents from 4 domains (finance, legal, news, opinion).

| Metric | Baseline (bge-m3) | v2 (this model) | Improvement |
|---|---|---|---|
| ARI | 0.130 | 0.159 | +22.8% |
| V-measure | 0.149 | 0.171 | +15.1% |

Similarity Distribution

| Metric | Baseline | v2 |
|---|---|---|
| Mean similarity | 0.402 | 0.466 |
| Std deviation | 0.081 | 0.075 |

v2 produces a tighter similarity distribution (lower standard deviation at a higher mean), giving more consistent scores for ranking and thresholding.
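Mean/std figures of this kind can be reproduced as sketched below: pairwise cosine similarities over a document set, with embeddings L2-normalized so the dot product equals the cosine. The vectors here are toy examples.

```python
import math

def normalize(v):
    """L2-normalize a vector so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def pairwise_similarity_stats(embs):
    """Mean and std of cosine similarity over all unordered pairs."""
    sims = []
    for i in range(len(embs)):
        for j in range(i + 1, len(embs)):
            sims.append(sum(a * b for a, b in zip(embs[i], embs[j])))
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return mean, math.sqrt(var)

embs = [normalize(v) for v in ([1.0, 0.2], [0.9, 0.4], [0.1, 1.0])]
mean, std = pairwise_similarity_stats(embs)
print(mean, std)
```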

Usage

Transformers

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "p988744/eland-embedding-zh-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()  # disable dropout for deterministic embeddings

def encode(texts, max_length=512):
    with torch.no_grad():
        enc = tokenizer(texts, max_length=max_length, padding=True,
                        truncation=True, return_tensors="pt")
        out = model(**enc)
        emb = out.last_hidden_state[:, 0, :]  # CLS pooling
        return F.normalize(emb, p=2, dim=-1)

# Encode
query = encode(["What are the overtime pay regulations?"])
docs = encode(["Labor Standards Act Article 24...", "Company Act Article 8..."])

# Cosine similarity (embeddings are L2-normalized)
similarities = (query @ docs.T).squeeze()
print(similarities)  # e.g. tensor([0.82, 0.31]); higher for the relevant article

LiteLLM / OpenAI API

from openai import OpenAI

client = OpenAI(
    base_url="https://your-litellm-proxy/v1",
    api_key="your-key"
)

response = client.embeddings.create(
    model="eland-embedding-zh-v2",
    input=["What are the overtime pay regulations?"]
)
embedding = response.data[0].embedding  # 1024-dim

Domains

Training data covers:

  • Finance: financial news, research reports, stock discussions
  • Legal: law articles, court judgments, legal news
  • News: general news, international news
  • Opinion: PTT, Dcard, forum discussions
  • Laws: Taiwan national law database
  • MOPS: listed company material information

Limitations

  • Optimized for Traditional/Simplified Chinese; may underperform on other languages
  • Best results with CLS pooling + L2 normalization
  • Recommended max length is 512 tokens for best quality (supports up to 8,192)
  • Clustering metrics (ARI, V-measure) still moderate as domain boundaries in Chinese text are inherently fuzzy

Citation

@misc{eland-embedding-zh-v2,
  title={eland-embedding-zh-v2: Chinese Domain Embedding with Hard Negative Mining and Self-Distillation},
  author={Eland AI},
  year={2026},
  url={https://huggingface.co/p988744/eland-embedding-zh-v2}
}

License

MIT
