Multilingual Finance Term Matching Model (MiniLM)

A multilingual sentence embedding model fine-tuned for cross-lingual financial terminology matching.


🧠 Model Description

This model is fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 to better capture semantic similarity between financial terms across languages.

It is designed to map bilingual terminology (e.g., Chinese–English term pairs) into a shared embedding space, enabling efficient cross-lingual retrieval and alignment.


🎯 Intended Use

This model is suitable for:

  • Cross-lingual terminology matching
  • Financial term alignment (ZH ↔ EN)
  • Translation memory enhancement
  • Semantic search / retrieval
  • Terminology normalization

🚫 Out-of-Scope Use

This model is not intended for:

  • Long document semantic comparison
  • General-purpose multilingual understanding
  • High-risk decision systems without human validation

📊 Training Data

  • Source: WMT 2025 terminology dataset
  • Format: bilingual term pairs (Chinese ↔ English)
  • Structure: CSV file with two columns
    • Column 1: Chinese term
    • Column 2: English term

No explicit labels or negative samples are required.
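The two-column CSV format above can be read directly into (anchor, positive) records. A minimal sketch, assuming a headerless CSV; the function name and file path are illustrative, not part of the actual pipeline:

```python
# Hypothetical helper: load a headerless two-column term-pair CSV into
# (anchor, positive) records, matching the data structure described above.
import csv

def load_term_pairs(path):
    """Read Chinese/English term pairs from a two-column CSV file."""
    pairs = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 2:  # skip malformed rows
                pairs.append({"anchor": row[0].strip(), "positive": row[1].strip()})
    return pairs
```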


🏗️ Training Pipeline

The model is fine-tuned using the Sentence-Transformers v3 training API.

Data Processing

  • Dataset loaded via Hugging Face datasets
  • Only first two columns used (anchor, positive)
  • Dataset shuffled with fixed seed
  • Optional train/dev split (dev_ratio=0.1)
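The shuffle and optional dev split can be sketched in a few lines. This is a simplified stand-in for the pipeline's logic, not the actual implementation; `shuffle_and_split` is a hypothetical helper that mirrors the `dev_ratio` parameter and fixed-seed behaviour described above:

```python
# Sketch of the deterministic shuffle + optional dev split described above.
import random

def shuffle_and_split(pairs, dev_ratio=0.1, seed=42):
    """Shuffle with a fixed seed, then carve off a dev set of dev_ratio."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)   # reproducible across runs
    n_dev = int(len(pairs) * dev_ratio)
    return pairs[n_dev:], pairs[:n_dev]  # (train, dev)
```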

⚙️ Training Details

  • Base model: paraphrase-multilingual-MiniLM-L12-v2
  • Framework: Sentence-Transformers
  • Loss: MultipleNegativesRankingLoss

Key Mechanism

  • Uses in-batch negatives
  • Each batch:
    • positive pair = (zh_term, en_term)
    • all other pairs = implicit negatives
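The in-batch negatives mechanism can be illustrated with a toy version of the loss (not the library implementation): for anchor i, the paired positive i is the correct class, and every other positive j ≠ i in the batch acts as a negative in a softmax cross-entropy.

```python
# Toy illustration of MultipleNegativesRankingLoss's in-batch negatives:
# each row of sim_matrix holds one anchor's similarities to ALL positives
# in the batch; the diagonal entry is the true pair.
import math

def in_batch_negatives_loss(sim_matrix):
    """Mean cross-entropy where sim_matrix[i][i] is the correct class."""
    total = 0.0
    for i, row in enumerate(sim_matrix):
        log_z = math.log(sum(math.exp(s) for s in row))
        total += log_z - row[i]  # -log softmax probability of the positive
    return total / len(sim_matrix)
```

A batch whose anchors score highest against their own positives yields a near-zero loss, while mismatched pairs are penalized heavily.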

🧪 Training Configuration

  • num_epochs: 5–10
  • batch_size: 32–64
  • learning_rate: 2e-5
  • warmup_ratio: 0.1

Training Strategy

  • evaluation: per epoch (if dev set provided)
  • checkpoint saving: per epoch
  • best model selection: enabled when validation exists
  • no-duplicate batch sampling
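Under the Sentence-Transformers v3 API, the configuration and strategy above roughly correspond to a trainer setup like the following sketch. The output path and the tiny inline dataset are placeholders; exact values within the stated ranges are assumptions:

```python
# Hedged sketch of a Sentence-Transformers v3 training setup matching the
# configuration described above. Paths and datasets are placeholders.
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer, SentenceTransformerTrainer, losses,
)
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments, BatchSamplers,
)

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
loss = losses.MultipleNegativesRankingLoss(model)

# Placeholder datasets; in practice these come from the CSV term pairs.
train_ds = Dataset.from_dict({"anchor": ["可转换债券"], "positive": ["convertible bond"]})
dev_ds = Dataset.from_dict({"anchor": ["对冲基金"], "positive": ["hedge fund"]})

args = SentenceTransformerTrainingArguments(
    output_dir="./finance-term-matcher",        # placeholder path
    num_train_epochs=5,                          # 5-10 per the table above
    per_device_train_batch_size=32,              # 32-64
    learning_rate=2e-5,
    warmup_ratio=0.1,
    eval_strategy="epoch",                       # evaluate per epoch
    save_strategy="epoch",                       # checkpoint per epoch
    load_best_model_at_end=True,                 # needs a dev set
    batch_sampler=BatchSamplers.NO_DUPLICATES,   # no-duplicate batches
    fp16=True,                                   # optional GPU acceleration
)

trainer = SentenceTransformerTrainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=dev_ds, loss=loss,
)
trainer.train()
```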

⚡ Hardware & HPC Setup

This training pipeline is designed for offline HPC environments:

  • Custom Hugging Face cache (HF_HOME)
  • Offline mode enabled:
    • HF_HUB_OFFLINE=1
    • TRANSFORMERS_OFFLINE=1
  • Optional GPU acceleration (fp16)
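The offline environment can be configured in a few lines; these variables should be set before any Hugging Face library is imported. The cache path is a placeholder:

```python
# Offline HPC setup: set these before importing any Hugging Face library.
import os

os.environ["HF_HOME"] = "/scratch/hf_cache"  # placeholder custom cache dir
os.environ["HF_HUB_OFFLINE"] = "1"           # disable Hub network calls
os.environ["TRANSFORMERS_OFFLINE"] = "1"     # keep transformers offline
```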

🚀 Usage

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("yourname/minilm-finance-term-matcher")

emb1 = model.encode("可转换债券")
emb2 = model.encode("convertible bond")

score = util.cos_sim(emb1, emb2)
print(score)
```

🧾 Example

Input:

  • "可转换债券"
  • "convertible bond"

Output:

  • A high cosine similarity score, indicating a close match
🔗 Integration with Term Extraction

This model is designed to work together with:

  • BERT-based term extractors (EN & ZH)

Pipeline:

  1. Extract terms from text
  2. Encode terms
  3. Perform cross-lingual matching

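The matching step of the pipeline above can be sketched on precomputed embeddings: for each Chinese term vector, pick the most similar English term vector by cosine similarity. In practice the vectors would come from model.encode(); the helper names here are illustrative:

```python
# Sketch of cross-lingual matching over precomputed embedding vectors.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def match_terms(zh_vecs, en_vecs):
    """For each zh embedding, return the index of the best en embedding."""
    return [max(range(len(en_vecs)), key=lambda j: cosine(v, en_vecs[j]))
            for v in zh_vecs]
```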
⚠️ Limitations

  • Domain-specific (finance)
  • Performance depends on term coverage
  • Not optimized for long sentences
  • Sensitive to domain shift

📜 License

This model is trained on data licensed under CC BY-NC 4.0:

  • ✅ Non-commercial use allowed
  • ❌ Commercial use restricted
  • ✅ Attribution required

The base model is licensed under Apache 2.0, but the fine-tuned weights inherit the dataset's restrictions.

🙏 Acknowledgements

  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Dataset: WMT 2025 terminology resources
  • Framework: Sentence-Transformers
  • Training API: SentenceTransformerTrainer