# Multilingual Finance Term Matching Model (MiniLM)

A multilingual sentence embedding model fine-tuned for cross-lingual financial terminology matching.
## 🧠 Model Description
This model is fine-tuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 to improve semantic similarity between financial terms across languages.
It is designed to map bilingual terminology (e.g., Chinese–English term pairs) into a shared embedding space, enabling efficient cross-lingual retrieval and alignment.
## 🎯 Intended Use
This model is suitable for:
- Cross-lingual terminology matching
- Financial term alignment (ZH ↔ EN)
- Translation memory enhancement
- Semantic search / retrieval
- Terminology normalization
## 🚫 Out-of-Scope Use
This model is not intended for:
- Long document semantic comparison
- General-purpose multilingual understanding
- High-risk decision systems without human validation
## 📊 Training Data
- Source: WMT 2025 terminology dataset
- Format: bilingual term pairs (Chinese ↔ English)
- Structure: CSV file with two columns
  - Column 1: Chinese term
  - Column 2: English term
No explicit labels or negative samples are required.
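Because only pairs are needed, preparing the data is straightforward. A minimal sketch of turning such a CSV into shuffled `(anchor, positive)` pairs with an optional dev split, using only the standard library and an in-memory stand-in for the real WMT file (the sample terms and the seed value are illustrative, not from the actual dataset):

```python
import csv
import io
import random

# In-memory stand-in for the terminology CSV
# (column 1 = Chinese term, column 2 = English term).
raw = (
    "可转换债券,convertible bond\n"
    "对冲基金,hedge fund\n"
    "市盈率,price-to-earnings ratio\n"
)

# Each row becomes one (anchor, positive) pair.
pairs = [(row[0], row[1]) for row in csv.reader(io.StringIO(raw))]

# Shuffle with a fixed seed so the split is reproducible.
random.Random(42).shuffle(pairs)

# Optional train/dev split, mirroring dev_ratio=0.1.
dev_ratio = 0.1
n_dev = max(1, int(len(pairs) * dev_ratio))
dev, train = pairs[:n_dev], pairs[n_dev:]
```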
## 🏗️ Training Pipeline
The model is fine-tuned using the Sentence-Transformers v3 training API.
### Data Processing

- Dataset loaded via Hugging Face `datasets`
- Only the first two columns are used (`anchor`, `positive`)
- Dataset shuffled with a fixed seed
- Optional train/dev split (`dev_ratio=0.1`)
## ⚙️ Training Details

- Base model: `paraphrase-multilingual-MiniLM-L12-v2`
- Framework: Sentence-Transformers
- Loss: `MultipleNegativesRankingLoss`
### Key Mechanism

- Uses in-batch negatives
- Each batch:
  - positive pair = (zh_term, en_term)
  - all other pairs = implicit negatives
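The in-batch-negative mechanism can be illustrated without any framework. Below is a toy pure-Python analogue of `MultipleNegativesRankingLoss` (not the library implementation): for anchor `i`, the `i`-th English embedding is the positive and every other English embedding in the batch is a negative; the `scale` temperature is an assumed default for illustration.

```python
import math

def cos(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mnrl(zh_embs, en_embs, scale=20.0):
    """Toy multiple-negatives ranking loss: mean cross-entropy over
    scaled similarity rows, with the diagonal entries as positives."""
    total = 0.0
    for i, anchor in enumerate(zh_embs):
        logits = [scale * cos(anchor, p) for p in en_embs]
        m = max(logits)  # stabilise log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # -log softmax of the positive
    return total / len(zh_embs)

# Toy 2-pair batch: aligned zh/en embeddings give a near-zero loss,
# a shuffled (misaligned) batch gives a large loss.
zh = [[1.0, 0.0], [0.0, 1.0]]
en_aligned = [[1.0, 0.0], [0.0, 1.0]]
en_shuffled = [[0.0, 1.0], [1.0, 0.0]]
```

This is why no explicit negative samples are needed in the training data: every other pair in the batch supplies them for free.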
## 🧪 Training Configuration

- `num_epochs`: 5–10
- `batch_size`: 32–64
- `learning_rate`: 2e-5
- `warmup_ratio`: 0.1
### Training Strategy

- Evaluation: per epoch (if a dev set is provided)
- Checkpoint saving: per epoch
- Best model selection: enabled when a validation set exists
- No-duplicate batch sampling
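No-duplicate batching matters for this loss: if the same term appears twice in a batch, one copy becomes a false negative for the other. A minimal pure-Python sketch of the idea (a simplified stand-in, not Sentence-Transformers' actual batch sampler):

```python
def no_duplicate_batches(pairs, batch_size):
    """Greedily fill batches, deferring any pair whose zh or en term
    already occurs in the current batch, so no batch contains a
    duplicate term that would act as a false negative."""
    remaining = list(pairs)
    batches = []
    while remaining:
        batch, seen, leftover = [], set(), []
        for zh, en in remaining:
            if len(batch) < batch_size and zh not in seen and en not in seen:
                batch.append((zh, en))
                seen.update((zh, en))
            else:
                leftover.append((zh, en))
        batches.append(batch)
        remaining = leftover
    return batches

# "a" appears twice, so its second pair is pushed to a later batch.
batches = no_duplicate_batches([("a", "A"), ("a", "B"), ("c", "C")], 2)
```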
## ⚡ Hardware & HPC Setup

This training pipeline is designed for offline HPC environments:

- Custom Hugging Face cache (`HF_HOME`)
- Offline mode enabled: `HF_HUB_OFFLINE=1`, `TRANSFORMERS_OFFLINE=1`
- Optional GPU acceleration (fp16)
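On a typical cluster these settings are exported before launching the job; the cache path below is an illustrative placeholder, not a requirement:

```shell
# Hypothetical cache location; point this at your cluster's scratch space.
export HF_HOME="$HOME/.cache/huggingface"

# Force fully offline operation: the Hub client and transformers read
# from the local cache only and make no network requests.
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```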
🚀 Usage
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("yourname/minilm-finance-term-matcher")
emb1 = model.encode("可转换债券")
emb2 = model.encode("convertible bond")
score = util.cos_sim(emb1, emb2)
print(score)
## 🧾 Example

Input:

- "可转换债券"
- "convertible bond"

Output: a high cosine similarity score, indicating a close match.
## 🔗 Integration with Term Extraction

This model is designed to work together with BERT-based term extractors (EN & ZH).

Pipeline:

1. Extract terms from text
2. Encode terms with this model
3. Perform cross-lingual matching
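The final matching step is just a nearest-neighbour search over the two embedding sets. A framework-free sketch (in practice the vectors would come from `model.encode`; the toy vectors here are illustrative):

```python
import math

def best_matches(zh_embs, en_embs):
    """For each Chinese-term embedding, return the index of the
    English-term embedding with the highest cosine similarity."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    return [
        max(range(len(en_embs)), key=lambda j: cos(anchor, en_embs[j]))
        for anchor in zh_embs
    ]

# Toy embeddings: zh term 0 matches en term 1, zh term 1 matches en term 0.
matches = best_matches([[1.0, 0.0], [0.0, 1.0]],
                       [[0.0, 1.0], [1.0, 0.0]])
```

For large term lists, the same idea is usually run with `util.cos_sim` on the full embedding matrices rather than a Python loop.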
## ⚠️ Limitations

- Domain-specific (finance)
- Performance depends on term coverage
- Not optimized for long sentences
- Sensitive to domain shift
## 📜 License
This model is trained on data licensed under CC BY-NC 4.0:
✅ Non-commercial use allowed
❌ Commercial use restricted
✅ Attribution required
The base model is Apache 2.0, but fine-tuned weights inherit dataset restrictions.
## 🙏 Acknowledgements

- Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Dataset: WMT 2025 terminology resources
- Framework: Sentence-Transformers
- Training API: `SentenceTransformerTrainer`