---
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: Qwen/Qwen3-Embedding-4B
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- text-embedding
- information-retrieval
- korean
- finance
- lora
- peft
datasets:
- BCCard/BCAI-Finance-Kor-Embedding-Triplet
- BCCard/BCAI-Finance-Kor-Embedding-Pair
metrics:
- ndcg
- mrr
- recall
---
# 1. Overview
A Korean text-embedding model for the **BC Card domain**, built by LoRA fine-tuning
[`Qwen/Qwen3-Embedding-4B`](https://huggingface.co/Qwen/Qwen3-Embedding-4B) on BC Card in-domain data (personal / merchant / corporate / VIP). It is intended as the **retriever (bi-encoder)** stage of a BC Card RAG pipeline.
This is the **4B-scale** sibling of [`BCCard/MoAI-Embedding-0.6B`](https://huggingface.co/BCCard/MoAI-Embedding-0.6B): a larger-capacity variant for higher retrieval quality at the cost of compute/latency.
On a held-out in-domain test set it improves **NDCG@10 by +6.1%** and **Accuracy@1 by +8.9%** over the base `Qwen3-Embedding-4B` (full metrics in §2.3).
## 1.1. TL;DR
* **Base model**: [`Qwen/Qwen3-Embedding-4B`](https://huggingface.co/Qwen/Qwen3-Embedding-4B) (36 layers, hidden size 2560, last-token pooling, instruction-aware)
* **Domain / Language**: Finance (BC Card: personal / merchant / corporate / VIP) / Korean
* **Task**: Query-document retrieval (QA search, document similarity), RAG retriever
* **Method**: PEFT (LoRA) + Multiple Negatives Ranking (contrastive)
* **Format**: merged standalone (LoRA fused into base; loads with `sentence-transformers`, no `peft`)
* **Embedding dimension**: 2560 · **Max sequence length**: 1024 · **Similarity**: cosine (outputs are L2-normalized)
* **Intended use**
- In-house **BC Card-domain RAG retriever** (Top-K candidate retrieval)
- QA search, document-similarity scoring
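The "merged standalone" format means the low-rank update has been folded into the dense weights, so no adapter is needed at load time. The sketch below illustrates the standard LoRA merge identity, W' = W + (alpha / r) · B·A, on toy matrices. The helper names (`matmul`, `merge_lora`) and the toy shapes are illustrative assumptions; the card does not describe the tooling actually used for the merge.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA update into a dense weight: W' = W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Toy shapes (assumption): W is 2x3 with rank r=1.
# The card reports r=64, alpha=128, i.e. the same alpha/r ratio of 2.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in  (down-projection)
B = [[0.5], [-1.0]]     # d_out x r (up-projection)
alpha, r = 2, 1

W_merged = merge_lora(W, A, B, alpha, r)
x = [[1.0], [1.0], [1.0]]  # column-vector input

# Adapter path: W x + scale * B (A x). Merged path: W' x.
Ax = matmul(A, x)
adapter_out = [[wx[0] + (alpha / r) * bax[0]]
               for wx, bax in zip(matmul(W, x), matmul(B, Ax))]
merged_out = matmul(W_merged, x)
print(adapter_out == merged_out)
```

Because both paths compute the same linear map, a merged checkpoint behaves like base plus adapter while loading as an ordinary dense model.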
## 1.2. Usage
The model was trained with an **instruction prefix on the query side only** (documents get no
instruction). Inject the same instruction at inference so query/document encoding matches training.
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BCCard/MoAI-Embedding-4B")
# Query-side instruction (identical to training) - prepend to every query at inference time
QUERY_INSTRUCTION = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
queries = ["BC์นด๋ ์ฐํ๋น๋ ์ด๋ป๊ฒ ๋๋์?"]
documents = [
"BC์นด๋ ์ฐํ๋น๋ ์นด๋ ์ข
๋ฅ์ ํํ ๊ตฌ์ฑ์ ๋ฐ๋ผ ๋ค๋ฅด๊ฒ ์ฑ
์ ๋ฉ๋๋ค ...",
"๋ฐ๋ก์นด๋ ์ฐํ๋น๋ ๊ตญ๋ด ์ ์ฉ๊ณผ ํด์ธ ๊ฒธ์ฉ ์ฌ๋ถ์ ๋ฐ๋ผ ์ฐจ๋ฑ ๋ถ๊ณผ๋ฉ๋๋ค ...",
"์ ์ ์ค์ ๋ฑ ์กฐ๊ฑด์ ์ถฉ์กฑํ๋ฉด ๋ค์ ํด ์ฐํ๋น๊ฐ ๋ฉด์ ๋๋ ์นด๋๋ ์์ต๋๋ค ...",
"์นด๋ ๋ถ์ค ์ ๊ณ ๋ ๊ณ ๊ฐ์ผํฐ ๋๋ ์ฑ์์ ์ฆ์ ๊ฐ๋ฅํฉ๋๋ค ...",
...
]
# Queries: inject the instruction ยท Documents: no instruction
q_emb = model.encode(queries, prompt=QUERY_INSTRUCTION)
d_emb = model.encode(documents)
scores = model.similarity(q_emb, d_emb) # cosine; rank documents by score
print(scores)
```
> The instruction is also stored in the model config, so `model.encode(queries, prompt_name="query")`
> is equivalent to passing `prompt=QUERY_INSTRUCTION` explicitly. Documents use no prompt
> (`prompt_name="document"` is an empty string).
* **Query prompt** (instruction): `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: `
* **Document prompt**: none
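Once queries and documents are embedded, Top-K retrieval reduces to sorting cosine scores. The sketch below shows that step in plain Python on toy 3-dimensional vectors standing in for the 2560-dimensional embeddings. The `l2_normalize` and `top_k` helpers are hypothetical names for illustration, not part of `sentence-transformers`.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query_vec, doc_vecs, k):
    """Return (doc_index, cosine_score) pairs for the k best-scoring documents."""
    q = l2_normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(d))) for d in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Toy vectors (assumption): real embeddings come from model.encode(...).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
hits = top_k(query, docs, k=2)
print(hits)
```

Since the model's outputs are already L2-normalized, the normalization step is redundant for real embeddings. It is kept here so the toy example works with arbitrary vectors.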
## 1.3. Training Data
| Dataset | Role | Size |
|---------|------|------|
| [BCAI-Finance-Kor-Embedding-Triplet](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Triplet) | Training (anchor / positive / negative) | 43,394 triplets (train) |
| [BCAI-Finance-Kor-Embedding-Pair](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Pair) | Corpus pool / evaluation | 36,281 unique chunks |
* Sources: BC Card financial QA (BCAI) + website crawl + synthetic data (chunking + multi-query generation)
* Triplets are constructed via **hard-negative mining** over the unified corpus.
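The card does not describe the mining procedure in detail. As a rough illustration of the general idea, the sketch below picks, for each query, the highest-scoring corpus documents that are *not* labeled positive and pairs them with the known positives. The function name, the score matrix, and the `n_neg` parameter are all assumptions made for this example.

```python
def mine_hard_negatives(scores, positives, n_neg=1):
    """For each query, pick the highest-scoring documents that are not positives.

    scores:    scores[q][d] = similarity of query q to document d
    positives: positives[q] = set of document indices relevant to query q
    Returns a list of (query_idx, positive_idx, negative_idx) triplets.
    """
    triplets = []
    for q, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda d: row[d], reverse=True)
        negatives = [d for d in ranked if d not in positives[q]][:n_neg]
        for pos in sorted(positives[q]):
            for neg in negatives:
                triplets.append((q, pos, neg))
    return triplets

# Toy similarity scores (assumption): in practice these would come from an embedding model.
scores = [
    [0.9, 0.8, 0.1],  # query 0: doc 1 scores high but is not a positive
    [0.2, 0.7, 0.6],  # query 1: doc 2 scores high but is not a positive
]
positives = [{0}, {1}]
triplets = mine_hard_negatives(scores, positives)
print(triplets)
```

Real pipelines usually add safeguards this sketch omits, such as skipping the very top ranks or applying a score margin to avoid mining unlabeled true positives.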
## 1.4. Training Procedure
| Item | Value |
|------|-------|
| Method | LoRA (PEFT) |
| LoRA | r=64, alpha=128, dropout=0.05, targets = q,k,v,o,gate,up,down_proj |
| Loss | CachedMultipleNegativesRankingLoss (in-batch negatives) |
| Batch | per-device 256 (DDP) → 511 in-batch negatives per rank |
| LR / scheduler | 5e-5 / cosine, warmup_ratio 0.1, weight_decay 0.01 |
| Epochs | 3, with early stopping; best checkpoint selected by validation NDCG@10 |
| Precision | bf16, gradient checkpointing |
| Hardware | 8× NVIDIA RTX PRO 6000 Blackwell (DDP) |
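The figure of 511 in-batch negatives is consistent with triplet inputs at a per-device batch of 256: each anchor is scored against 256 positives and 256 explicit negatives, and all but its own positive act as negatives (2 × 256 − 1 = 511). The sketch below is a minimal pure-Python version of a multiple-negatives ranking objective under that reading. The `scale=20.0` value is an assumption (the card does not state it), and the gradient-caching part of the cached variant is not reproduced.

```python
import math

def mnrl_loss(anchors, positives, negatives, scale=20.0):
    """Mean cross-entropy over in-batch candidates for (anchor, positive, negative) triplets."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    candidates = positives + negatives  # 2B candidates per anchor
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos(anchor, c) for c in candidates]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]  # candidate i is this anchor's own positive
    return total / len(anchors)

def in_batch_negatives(batch_size):
    """Negatives seen by each anchor: B-1 other positives plus B explicit negatives."""
    return 2 * batch_size - 1

print(in_batch_negatives(256))  # 511

# Toy batch of two triplets (assumption): aligned positives give a low loss.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]
good = mnrl_loss(anchors, positives, negatives)
bad = mnrl_loss(anchors, positives[::-1], negatives)  # positives swapped
print(good < bad)
```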
<br>
# 2. Evaluation
## 2.1. Setup
* **Queries**: 1,000 (held-out test split) · **Corpus**: 36,281 unique chunks
* **Protocol**: binary-relevance information retrieval; the same evaluator used during training
* **Metrics**: NDCG@10 (primary), MRR@10, Recall@{1,10}, Accuracy@1, MAP@10
* **Models compared**: base (`Qwen3-Embedding-4B`, no fine-tuning) vs. **v4 (r64 / lr5e-5 / 3ep, released)**
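For readers unfamiliar with these metrics, the sketch below computes NDCG@k, MRR@k, Recall@k, and Accuracy@1 for a single query under binary relevance, using common textbook definitions. It is an illustration only: the evaluator used for the reported numbers may differ in details, MAP@10 is omitted for brevity, and `ir_metrics` is a hypothetical helper name.

```python
import math

def ir_metrics(ranked_ids, relevant_ids, k=10):
    """Binary-relevance metrics for one query, given a ranked list of document IDs."""
    top = ranked_ids[:k]
    hits = [1 if d in relevant_ids else 0 for d in top]

    # DCG discounts each hit by log2(rank + 1), with ranks starting at 1.
    dcg = sum(h / math.log2(rank + 2) for rank, h in enumerate(hits))
    ideal = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), k)))
    first_hit = next((rank for rank, h in enumerate(hits) if h), None)

    return {
        "ndcg": dcg / ideal if ideal else 0.0,
        "mrr": 1 / (first_hit + 1) if first_hit is not None else 0.0,
        "recall": sum(hits) / len(relevant_ids) if relevant_ids else 0.0,
        "accuracy@1": float(bool(hits) and hits[0] == 1),
    }

# Toy ranking (assumption): two relevant documents, one retrieved at rank 2.
m = ir_metrics(["d3", "d1", "d7", "d2"], {"d1", "d2"}, k=3)
print(m)
```

Note that Recall@1 divides by the number of relevant documents while Accuracy@1 only asks whether rank 1 is relevant, so Recall@1 can sit below Accuracy@1 when queries have more than one relevant chunk.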
<br>
## 2.2. Training
<div align="center">
<img src="figures/evaluation-train-1-1.png" alt="Training curves - loss, learning rate, validation NDCG@10 (WandB)" >
</div>
Training was scheduled for 3 epochs with a cosine schedule and early stopping. Training loss decreases steadily, while validation NDCG@10 climbs early and plateaus (peak ≈ 0.695 around epoch 1.4); the best checkpoint is selected at that peak. Curves (loss / learning rate / validation NDCG@10) are logged to Weights & Biases.
<br>
## 2.3. In-domain Retrieval Benchmark
<div align="center">
<img src="figures/evaluation-test-1-1.png" alt="Test-set retrieval metrics - base vs v4" >
</div>
<div align="center">
<img src="figures/evaluation-test-1-2.png" alt="Test-set retrieval metrics comparison (per metric)" >
</div>
| Metric | base (Qwen3-4B) | v4 (r64/5e-5/3ep) | v4 Δ vs. base |
|--------|:---:|:---:|:---:|
| **NDCG@10** | **0.6508** | **0.6906** | **+0.040 (+6.1%)** |
| MRR@10 | 0.6805 | 0.7283 | +0.048 (+7.0%) |
| Recall@10 | 0.7244 | 0.7620 | +0.038 (+5.2%) |
| Recall@1 | 0.5081 | 0.5520 | +0.044 (+8.6%) |
| Accuracy@1 | 0.5950 | 0.6480 | +0.053 (+8.9%) |
| MAP@10 | 0.6013 | 0.6410 | +0.040 (+6.6%) |
**v4 is the released model.** Fine-tuning lifts in-domain retrieval by **roughly +7%** over the base `Qwen3-Embedding-4B`, with the largest gains on top-rank precision (Accuracy@1, Recall@1). It also surpasses the 0.6B sibling (test NDCG@10 0.6695) by **+0.021 (+3.2%)**, a modest scale gain at ~7× the parameters, so the 0.6B remains the better pick for latency-sensitive serving.
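The absolute and relative gains in the table can be re-derived directly from the reported scores. The short script below does that arithmetic and also averages the six relative gains, which is one plausible reading of the "roughly +7%" summary (an assumption, since the card does not say how that figure was obtained).

```python
# Scores copied from the table above.
base = {"NDCG@10": 0.6508, "MRR@10": 0.6805, "Recall@10": 0.7244,
        "Recall@1": 0.5081, "Accuracy@1": 0.5950, "MAP@10": 0.6013}
tuned = {"NDCG@10": 0.6906, "MRR@10": 0.7283, "Recall@10": 0.7620,
         "Recall@1": 0.5520, "Accuracy@1": 0.6480, "MAP@10": 0.6410}

def gains(base, tuned):
    """Return {metric: (absolute_gain, relative_gain_percent)}."""
    return {m: (tuned[m] - base[m], 100 * (tuned[m] - base[m]) / base[m]) for m in base}

g = gains(base, tuned)
for metric, (abs_gain, rel_gain) in g.items():
    print(f"{metric:<11} +{abs_gain:.3f} (+{rel_gain:.1f}%)")

# Unweighted mean of the six relative gains.
mean_rel = sum(rel for _, rel in g.values()) / len(g)
print(f"mean relative gain: +{mean_rel:.1f}%")
```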
### Comparison with other encoders
On the *same* in-domain test set, every untuned encoder falls **well below this model**. That includes our own `Qwen3-Embedding` bases (0.6B / 4B) and public multilingual SOTA models, each run with its own native prompt format. Domain fine-tuning beats general-purpose scale:
| Model | Params | NDCG@10 | MRR@10 | Recall@10 | Accuracy@1 | MAP@10 | Avg |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| LiquidAI/LFM2.5-Embedding-350M | 0.35B | 0.5983 | 0.6166 | 0.6799 | 0.5320 | 0.5519 | 0.5957 |
| Qwen3-Embedding-0.6B (base) | 0.6B | 0.6186 | 0.6449 | 0.7046 | 0.5560 | 0.5652 | 0.6179 |
| google/embeddinggemma-300m | 0.3B | 0.6373 | 0.6664 | 0.7082 | 0.5790 | 0.5906 | 0.6363 |
| BAAI/bge-m3 | 0.6B | 0.6426 | 0.6660 | 0.7261 | 0.5730 | 0.5913 | 0.6398 |
| intfloat/multilingual-e5-large | 0.6B | 0.6476 | 0.6722 | 0.7313 | 0.5790 | 0.5958 | 0.6452 |
| Qwen3-Embedding-4B (base) | 4B | 0.6508 | 0.6805 | 0.7244 | 0.5950 | 0.6013 | 0.6504 |
| MoAI-Embedding-0.6B (sibling) | 0.6B | 0.6695 | 0.7060 | 0.7508 | 0.6190 | 0.6171 | 0.6725 |
| **MoAI-Embedding-4B (this model)** | 4B | **0.6906** | **0.7283** | **0.7620** | **0.6480** | **0.6410** | **0.6940** |
This model improves over its own `Qwen3-Embedding-4B` base by **+0.040 NDCG@10 (+6.1%)** and leads the best public general-purpose baseline (`intfloat/multilingual-e5-large`) by **+0.043 NDCG@10**. Notably, the untuned **4B base (`0.6508`) trails the fine-tuned 0.6B sibling (`0.6695`)**: fine-tuning outweighs scale. _Caveat: these baselines are not tuned on BC Card data; the comparison illustrates the value of domain adaptation, not a defect in the baselines._
<br>
## 2.4. Limitations
* **Domain-specific**: tuned for BC Card Korean financial text; out-of-domain or non-Korean performance is not guaranteed.
* **Compute cost**: at 4B, this model is markedly heavier (memory / latency) than the [0.6B sibling](https://huggingface.co/BCCard/MoAI-Embedding-0.6B); for latency- or throughput-sensitive serving, consider the 0.6B variant.
* **Re-ranking recommended**: as a bi-encoder it favors recall over fine-grained precision.
- Recommended pipeline: **Bi-Encoder (this model) Top-K → Cross-Encoder re-ranking**
* **Sequence length**: inputs are truncated at 1,024 tokens; content past that limit is not encoded, so very long documents should be chunked before indexing.
* **Exact-value matching**: fine-grained numeric/tabular facts (fees, rates, dates, terms) are not reliably distinguished by dense similarity alone; pair with lexical (BM25) retrieval or a re-ranker when exactness matters.
* **Retrieval only**: this is an embedding model, not a generator; it ranks passages and does not produce answers.
* **Synthetic data influence**: part of the training set is LLM-synthesized (chunking + multi-query), which may carry the generator's stylistic/coverage biases.
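For the sequence-length limitation, one common workaround is to split long documents into overlapping token windows before indexing. The sketch below shows a minimal sliding-window chunker over an already-tokenized document. The `chunk_tokens` name and the overlap of 128 tokens are illustrative assumptions, not values taken from this model's data pipeline.

```python
def chunk_tokens(tokens, max_len=1024, overlap=128):
    """Split a token list into windows of at most max_len tokens with a fixed overlap."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

# Stand-in for tokenizer output (assumption): 2,500 integer token IDs.
tokens = list(range(2500))
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

In practice the tokens should come from the model's own tokenizer, and `max_len` may need some headroom for any special tokens added during encoding.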
<br>
# 3. Future Work
* **Data quality improvement & re-training**
- Human-annotation labeling
- More rigorous hard-negative mining (iterative, mined with this model)
- Broader/higher-quality data (incl. general financial corpora)
* **System-level**
- Cross-Encoder re-ranker for precision
- HyDE / dynamic instruction injection at query time
<br>
# 4. Meta Info
## 4.1. Citation
```bibtex
@misc{bccard2026moaiembedding4b,
title = {MoAI-Embedding-4B: A BC Card-Domain Korean Text Embedding Model},
author = {BC Card AX Team},
year = {2026},
howpublished = {https://huggingface.co/BCCard/MoAI-Embedding-4B},
note = {LoRA fine-tune of Qwen3-Embedding-4B for BC Card-domain Korean retrieval}
}
```
## 4.2. See Also
* **0.6B sibling model**: [`BCCard/MoAI-Embedding-0.6B`](https://huggingface.co/BCCard/MoAI-Embedding-0.6B)
* **Training dataset**: [`BCCard/BCAI-Finance-Kor-Embedding-Triplet`](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Triplet)
* **Corpus dataset**: [`BCCard/BCAI-Finance-Kor-Embedding-Pair`](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Pair)
<br>