---
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: Qwen/Qwen3-Embedding-4B
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- text-embedding
- information-retrieval
- korean
- finance
- lora
- peft
datasets:
- BCCard/BCAI-Finance-Kor-Embedding-Triplet
- BCCard/BCAI-Finance-Kor-Embedding-Pair
metrics:
- ndcg
- mrr
- recall
---

# 1. Overview
A Korean text-embedding model for the **BC Card domain**, built by LoRA fine-tuning
[`Qwen/Qwen3-Embedding-4B`](https://huggingface.co/Qwen/Qwen3-Embedding-4B) on BC Card in-domain data (personal / merchant / corporate / VIP). It is intended as the **retriever (bi-encoder)** stage of a BC Card RAG pipeline.

This is the **4B-scale** sibling of [`BCCard/MoAI-Embedding-0.6B`](https://huggingface.co/BCCard/MoAI-Embedding-0.6B), a larger-capacity variant that trades extra compute and latency for higher retrieval quality.

On a held-out in-domain test set it improves **NDCG@10 by +6.1%** and **Accuracy@1 by +8.9%** over the base `Qwen3-Embedding-4B` (full metrics in §2.3).

## 1.1. TL;DR
* **Base model**: [`Qwen/Qwen3-Embedding-4B`](https://huggingface.co/Qwen/Qwen3-Embedding-4B) (36 layers, hidden size 2560, last-token pooling, instruction-aware)
* **Domain / Language**: Finance (BC Card: personal / merchant / corporate / VIP) / Korean
* **Task**: Query-document retrieval (QA search, document similarity), RAG retriever
* **Method**: PEFT (LoRA) + Multiple Negatives Ranking (contrastive)
* **Format**: merged standalone (LoRA fused into base; loads with `sentence-transformers`, no `peft`)
* **Embedding dimension**: 2560 · **Max sequence length**: 1024 · **Similarity**: cosine (outputs are L2-normalized)
* **Intended use**
  - In-house **BC Card-domain RAG retriever** (Top-K candidate retrieval)
  - QA search, document-similarity scoring

## 1.2. Usage

The model was trained with an **instruction prefix on the query side only** (documents get no
instruction). Inject the same instruction at inference so query/document encoding matches training.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BCCard/MoAI-Embedding-4B")

# Query-side instruction (identical to training) - prepend to every query at inference time
QUERY_INSTRUCTION = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "

queries = ["BC์นด๋“œ ์—ฐํšŒ๋น„๋Š” ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?"]
documents = [
    "BC์นด๋“œ ์—ฐํšŒ๋น„๋Š” ์นด๋“œ ์ข…๋ฅ˜์™€ ํ˜œํƒ ๊ตฌ์„ฑ์— ๋”ฐ๋ผ ๋‹ค๋ฅด๊ฒŒ ์ฑ…์ •๋ฉ๋‹ˆ๋‹ค ...",
    "๋ฐ”๋กœ์นด๋“œ ์—ฐํšŒ๋น„๋Š” ๊ตญ๋‚ด ์ „์šฉ๊ณผ ํ•ด์™ธ ๊ฒธ์šฉ ์—ฌ๋ถ€์— ๋”ฐ๋ผ ์ฐจ๋“ฑ ๋ถ€๊ณผ๋ฉ๋‹ˆ๋‹ค ...",
    "์ „์›” ์‹ค์  ๋“ฑ ์กฐ๊ฑด์„ ์ถฉ์กฑํ•˜๋ฉด ๋‹ค์Œ ํ•ด ์—ฐํšŒ๋น„๊ฐ€ ๋ฉด์ œ๋˜๋Š” ์นด๋“œ๋„ ์žˆ์Šต๋‹ˆ๋‹ค ...",
    "์นด๋“œ ๋ถ„์‹ค ์‹ ๊ณ ๋Š” ๊ณ ๊ฐ์„ผํ„ฐ ๋˜๋Š” ์•ฑ์—์„œ ์ฆ‰์‹œ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค ...",
    ...
]

# Queries: inject the instruction ยท Documents: no instruction
q_emb = model.encode(queries, prompt=QUERY_INSTRUCTION)
d_emb = model.encode(documents)

scores = model.similarity(q_emb, d_emb)   # cosine; rank documents by score
print(scores)
```

> The instruction is also stored in the model config, so `model.encode(queries, prompt_name="query")`
> is equivalent to passing `prompt=QUERY_INSTRUCTION` explicitly. Documents use no prompt
> (`prompt_name="document"` is an empty string).

* **Query prompt** (instruction): `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: `
* **Document prompt**: none
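
The Top-K step that follows the snippet above can be sketched in plain Python. This is a minimal illustration, assuming the score matrix has been converted to nested lists (for example with `scores.tolist()`); the helper name `top_k` and the sample scores are made up for the example and are not part of the model or the library.

```python
from typing import List, Tuple


def top_k(score_row: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (document_index, score) pairs for the k highest scores."""
    ranked = sorted(enumerate(score_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up similarity scores for one query against four documents.
example_scores = [[0.71, 0.64, 0.58, 0.12]]

for query_idx, row in enumerate(example_scores):
    for doc_idx, score in top_k(row, k=3):
        print(f"query {query_idx} -> document {doc_idx} (score {score:.2f})")
```

The returned indices refer to positions in `documents`, so they can be used directly to fetch the candidate passages for a downstream re-ranker or generator.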

## 1.3. Training Data
| Dataset | Role | Size |
|---------|------|------|
| [BCAI-Finance-Kor-Embedding-Triplet](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Triplet) | Training (anchor / positive / negative) | 43,394 triplets (train) |
| [BCAI-Finance-Kor-Embedding-Pair](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Pair) | Corpus pool / evaluation | 36,281 unique chunks |

* Sources: BC Card financial QA (BCAI) + website crawl + synthetic data (chunking + multi-query generation)
* Triplets are constructed via **hard-negative mining** over the unified corpus.
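
The card does not describe the mining procedure itself. As a generic sketch of similarity-based hard-negative mining, the function below picks the corpus chunks most similar to the anchor that are not labeled positive. The function names, the toy 2-dimensional vectors, and the choice of cosine similarity are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(
    anchor: Sequence[float],
    corpus: List[Sequence[float]],
    positive_ids: Set[int],
    n_neg: int = 1,
) -> List[int]:
    """Return indices of the n_neg non-positive chunks most similar to the anchor."""
    scored = [
        (cosine(anchor, emb), idx)
        for idx, emb in enumerate(corpus)
        if idx not in positive_ids
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [idx for _, idx in scored[:n_neg]]


# Toy embeddings: chunk 0 is the labeled positive, chunk 1 is a near miss.
anchor = [1.0, 0.0]
corpus = [[0.9, 0.1], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
print(mine_hard_negatives(anchor, corpus, positive_ids={0}, n_neg=2))
```

In practice, mined negatives are often filtered further (for example by a similarity margin) to avoid unlabeled true positives; whether that was done here is not stated.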

## 1.4. Training Procedure
| Item | Value |
|------|-------|
| Method | LoRA (PEFT) |
| LoRA | r=64, alpha=128, dropout=0.05, targets = q,k,v,o,gate,up,down_proj |
| Loss | CachedMultipleNegativesRankingLoss (in-batch negatives) |
| Batch | per-device 256 (DDP) → 511 in-batch negatives per rank |
| LR / scheduler | 5e-5 / cosine, warmup_ratio 0.1, weight_decay 0.01 |
| Epochs | 3, with early stopping; best checkpoint selected by validation NDCG@10 |
| Precision | bf16, gradient checkpointing |
| Hardware | 8× NVIDIA RTX PRO 6000 Blackwell (DDP) |
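
To make the loss row concrete, here is a small pure-Python sketch of a Multiple Negatives Ranking objective. Two things are assumptions rather than facts from this card: the `scale` of 20.0, and the reading that each anchor is scored against all 256 positives plus all 256 mined negatives on its rank (512 candidates, minus its own positive), which is one interpretation consistent with the stated figure of 511. The real training used `CachedMultipleNegativesRankingLoss`; this block only illustrates the underlying cross-entropy.

```python
import math
from typing import List


def in_batch_negatives(batch_size: int, hard_negatives_per_anchor: int = 1) -> int:
    """Candidates per anchor minus its own positive (assumed candidate layout)."""
    candidates = batch_size * (1 + hard_negatives_per_anchor)
    return candidates - 1


def mnr_loss(sim: List[List[float]], scale: float = 20.0) -> float:
    """Mean cross-entropy where the correct candidate for row i is column i.

    sim[i][j] is the cosine similarity between anchor i and candidate j.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


print(in_batch_negatives(256))

# Two anchors, four candidates each: columns 0-1 are positives, 2-3 hard negatives.
toy_sim = [
    [0.9, 0.2, 0.6, 0.1],
    [0.3, 0.8, 0.2, 0.7],
]
print(mnr_loss(toy_sim))
```

The loss falls as each anchor's own positive pulls ahead of every other candidate in its row, which is why harder negatives give a stronger training signal.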

<br>

# 2. Evaluation
## 2.1. Setup
* **Queries**: 1,000 (held-out test split) · **Corpus**: 36,281 unique chunks
* **Protocol**: binary-relevance information retrieval; the same evaluator used during training
* **Metrics**: NDCG@10 (primary), MRR@10, Recall@{1,10}, Accuracy@1, MAP@10
* **Models compared**: base (`Qwen3-Embedding-4B`, no fine-tuning) vs. **v4 (r64 / lr5e-5 / 3ep, released)**
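
For readers unfamiliar with the metrics listed above, the sketch below implements the usual binary-relevance definitions for a single query. The evaluator's exact implementation is not shown in this card, so treat these as standard textbook forms; the document IDs are invented for the example.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0


def accuracy_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """DCG over the top k, normalized by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]   # system output, best first
relevant = {"d1", "d2"}             # ground-truth relevant chunks
for name, fn in [("Recall", recall_at_k), ("Accuracy", accuracy_at_k),
                 ("MRR", mrr_at_k), ("NDCG", ndcg_at_k)]:
    print(f"{name}@10 = {fn(ranked, relevant, 10):.4f}")
```

Under these definitions, Recall@1 and Accuracy@1 differ whenever a query has more than one relevant chunk: a single hit at rank 1 gives Accuracy@1 = 1.0 but Recall@1 below 1.0.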

<br>

## 2.2. Training
<div align="center">
  <img src="figures/evaluation-train-1-1.png" alt="Training curves - loss, learning rate, validation NDCG@10 (WandB)" >
</div>

Trained for 3 epochs with a cosine schedule and early stopping. Training loss decreases steadily, while validation NDCG@10 climbs early and then plateaus (peak ≈ 0.695 at around epoch 1.4); the best checkpoint is taken at that peak. The curves (loss, learning rate, validation NDCG@10) are logged to Weights & Biases.

<br>

## 2.3. In-domain Retrieval Benchmark
<div align="center">
  <img src="figures/evaluation-test-1-1.png" alt="Test-set retrieval metrics - base vs v4" >
</div>
<div align="center">
  <img src="figures/evaluation-test-1-2.png" alt="Test-set retrieval metrics comparison (per metric)" >
</div>

| Metric | base (Qwen3-4B) | v4 (r64/5e-5/3ep) | v4 Δ vs base |
|--------|:---:|:---:|:---:|
| **NDCG@10** | **0.6508** | **0.6906** | **+0.040 (+6.1%)** |
| MRR@10 | 0.6805 | 0.7283 | +0.048 (+7.0%) |
| Recall@10 | 0.7244 | 0.7620 | +0.038 (+5.2%) |
| Recall@1 | 0.5081 | 0.5520 | +0.044 (+8.6%) |
| Accuracy@1 | 0.5950 | 0.6480 | +0.053 (+8.9%) |
| MAP@10 | 0.6013 | 0.6410 | +0.040 (+6.6%) |

**v4 is the released model.** Fine-tuning lifts in-domain retrieval by **roughly +5% to +9%** over the base `Qwen3-Embedding-4B` (about +7% averaged over the six metrics), with the largest gains on top-rank precision (Accuracy@1, Recall@1). It also surpasses the 0.6B sibling (test NDCG@10 0.6695) by **+0.021 (+3.2%)**, a modest gain for ~7× the parameters, so the 0.6B remains the better pick for latency-sensitive serving.
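
The Δ column in the table is the absolute difference followed by the relative change against the base score. A short sketch that recomputes it from the reported values (the helper name `delta` is illustrative):

```python
from typing import Dict, Tuple


def delta(base: float, tuned: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent) of tuned over base."""
    absolute = tuned - base
    return absolute, 100.0 * absolute / base


# (base, v4) scores copied from the table above.
results: Dict[str, Tuple[float, float]] = {
    "NDCG@10": (0.6508, 0.6906),
    "MRR@10": (0.6805, 0.7283),
    "Recall@10": (0.7244, 0.7620),
    "Recall@1": (0.5081, 0.5520),
    "Accuracy@1": (0.5950, 0.6480),
    "MAP@10": (0.6013, 0.6410),
}

for name, (base, tuned) in results.items():
    absolute, relative = delta(base, tuned)
    print(f"{name}: {absolute:+.3f} ({relative:+.1f}%)")
```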

### Comparison with other encoders
On the *same* in-domain test set, untuned encoders all fall **well below this model**. This holds for our own `Qwen3-Embedding` bases (0.6B / 4B) and for public multilingual SOTA models, each run with its own native prompt format. In short, domain fine-tuning beats general-purpose scale:

| Model | Params | NDCG@10 | MRR@10 | Recall@10 | Accuracy@1 | MAP@10 | Avg |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| LiquidAI/LFM2.5-Embedding-350M | 0.35B | 0.5983 | 0.6166 | 0.6799 | 0.5320 | 0.5519 | 0.5957 |
| Qwen3-Embedding-0.6B (base) | 0.6B | 0.6186 | 0.6449 | 0.7046 | 0.5560 | 0.5652 | 0.6179 |
| google/embeddinggemma-300m | 0.3B | 0.6373 | 0.6664 | 0.7082 | 0.5790 | 0.5906 | 0.6363 |
| BAAI/bge-m3 | 0.6B | 0.6426 | 0.6660 | 0.7261 | 0.5730 | 0.5913 | 0.6398 |
| intfloat/multilingual-e5-large | 0.6B | 0.6476 | 0.6722 | 0.7313 | 0.5790 | 0.5958 | 0.6452 |
| Qwen3-Embedding-4B (base) | 4B | 0.6508 | 0.6805 | 0.7244 | 0.5950 | 0.6013 | 0.6504 |
| MoAI-Embedding-0.6B (sibling) | 0.6B | 0.6695 | 0.7060 | 0.7508 | 0.6190 | 0.6171 | 0.6725 |
| **MoAI-Embedding-4B (this model)** | 4B | **0.6906** | **0.7283** | **0.7620** | **0.6480** | **0.6410** | **0.6940** |

This model improves over its own `Qwen3-Embedding-4B` base by **+0.040 NDCG@10 (+6.1%)** and leads the strongest non-Qwen baseline (`intfloat/multilingual-e5-large`) by **+0.043 NDCG@10**. Notably, the untuned **4B base (`0.6508`) trails the fine-tuned 0.6B sibling (`0.6695`)**: fine-tuning outweighs scale. _Caveat: these baselines are not tuned on BC Card data, so the comparison illustrates the value of domain adaptation, not a defect in the baselines._

<br>

## 2.4. Limitations
* **Domain-specific**: tuned for BC Card Korean financial text; out-of-domain or non-Korean performance is not guaranteed.
* **Compute cost**: at 4B, this model is markedly heavier (memory / latency) than the [0.6B sibling](https://huggingface.co/BCCard/MoAI-Embedding-0.6B); for latency- or throughput-sensitive serving, consider the 0.6B variant.
* **Re-ranking recommended**: as a bi-encoder it favors recall over fine-grained precision.
    - Recommended pipeline: **Bi-Encoder (this model) Top-K → Cross-Encoder re-ranking**
* **Sequence length**: inputs are truncated at 1,024 tokens; content past that limit is not encoded, so very long documents should be chunked before indexing.
* **Exact-value matching**: fine-grained numeric/tabular facts (fees, rates, dates, terms) are not reliably distinguished by dense similarity alone; pair with lexical (BM25) retrieval or a re-ranker when exactness matters.
* **Retrieval only**: this is an embedding model, not a generator; it ranks passages and does not produce answers.
* **Synthetic data influence**: part of the training set is LLM-synthesized (chunking + multi-query), which may carry the generator's stylistic/coverage biases.
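
For the sequence-length limitation, one common workaround is to split long documents into overlapping windows before indexing. The sketch below is a minimal illustration under stated assumptions: it operates on an already-tokenized list, the whitespace-style tokens are only a stand-in for the model's real tokenizer, and the overlap of 128 is an arbitrary choice, not a value from this card.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 1024, overlap: int = 128) -> List[List[str]]:
    """Split tokens into windows of at most max_len, sharing `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    stride = max_len - overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks


# Stand-in tokens; real counts should come from the model's tokenizer.
document_tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(document_tokens)
print([len(chunk) for chunk in chunks])
```

Because special tokens also count toward the limit, a `max_len` somewhat below 1,024 leaves headroom when chunk sizes are measured with the actual tokenizer.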

<br>

# 3. Future Work
* **Data quality improvement & re-training**
	- Human-annotation labeling
	- More rigorous hard-negative mining (iterative, mined with this model)
	- Broader/higher-quality data (incl. general financial corpora)
* **System-level**
	- Cross-Encoder re-ranker for precision
	- HyDE / dynamic instruction injection at query time

<br>

# 4. Meta Info
## 4.1. Citation
```bibtex
@misc{bccard2026moaiembedding4b,
  title        = {MoAI-Embedding-4B: A BC Card-Domain Korean Text Embedding Model},
  author       = {BC Card AX Team},
  year         = {2026},
  howpublished = {https://huggingface.co/BCCard/MoAI-Embedding-4B},
  note         = {LoRA fine-tune of Qwen3-Embedding-4B for BC Card-domain Korean retrieval}
}
```

## 4.2. See Also
* **0.6B sibling model**: [`BCCard/MoAI-Embedding-0.6B`](https://huggingface.co/BCCard/MoAI-Embedding-0.6B)
* **Training dataset**: [`BCCard/BCAI-Finance-Kor-Embedding-Triplet`](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Triplet)
* **Corpus dataset**: [`BCCard/BCAI-Finance-Kor-Embedding-Pair`](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Pair)

<br>