---
language:
- ko
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
base_model: Qwen/Qwen3-Embedding-0.6B
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- text-embedding
- information-retrieval
- korean
- finance
- lora
- peft
datasets:
- BCCard/BCAI-Finance-Kor-Embedding-Triplet
- BCCard/BCAI-Finance-Kor-Embedding-Pair
metrics:
- ndcg
- mrr
- recall
---
# 1. Overview
A Korean text-embedding model for the **BC Card domain**, built by LoRA fine-tuning
[`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) on BC Card in-domain data (personal / merchant / corporate / VIP). It is intended as the **retriever (bi-encoder)** stage of a BC Card RAG pipeline.
On a held-out in-domain test set it improves **NDCG@10 by 8.2%** and **Accuracy@1 by 11.3%** (relative) over the base model.
## 1.1. TL;DR
* **Base model**: [`Qwen/Qwen3-Embedding-0.6B`](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B): 28 layers, hidden size 1024, last-token pooling, instruction-aware
* **Domain / Language**: Finance (BC Card: personal / merchant / corporate / VIP) / Korean
* **Task**: Query-document retrieval (QA search, document similarity), RAG retriever
* **Method**: PEFT (LoRA) + Multiple Negatives Ranking (contrastive)
* **Format**: merged standalone (LoRA fused into base; loads with `sentence-transformers`, no `peft` required)
* **Embedding dimension**: 1024 · **Max sequence length**: 1024 · **Similarity**: cosine (outputs are L2-normalized)
* **Intended use**
- In-house **BC Card-domain RAG retriever** (Top-K candidate retrieval)
- QA search, document-similarity scoring
## 1.2. Usage
The model was trained with an **instruction prefix on the query side only** (documents get no
instruction). Inject the same instruction at inference so query/document encoding matches training.
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BCCard/MoAI-Embedding-0.6B")
# Query-side instruction (identical to training) - prepend to every query at inference time
QUERY_INSTRUCTION = "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
queries = ["BC์นด๋ ์ฐํ๋น๋ ์ด๋ป๊ฒ ๋๋์?"]
documents = [
"BC์นด๋ ์ฐํ๋น๋ ์นด๋ ์ข
๋ฅ์ ํํ ๊ตฌ์ฑ์ ๋ฐ๋ผ ๋ค๋ฅด๊ฒ ์ฑ
์ ๋ฉ๋๋ค ...",
"๋ฐ๋ก์นด๋ ์ฐํ๋น๋ ๊ตญ๋ด ์ ์ฉ๊ณผ ํด์ธ ๊ฒธ์ฉ ์ฌ๋ถ์ ๋ฐ๋ผ ์ฐจ๋ฑ ๋ถ๊ณผ๋ฉ๋๋ค ...",
"์ ์ ์ค์ ๋ฑ ์กฐ๊ฑด์ ์ถฉ์กฑํ๋ฉด ๋ค์ ํด ์ฐํ๋น๊ฐ ๋ฉด์ ๋๋ ์นด๋๋ ์์ต๋๋ค ...",
"์นด๋ ๋ถ์ค ์ ๊ณ ๋ ๊ณ ๊ฐ์ผํฐ ๋๋ ์ฑ์์ ์ฆ์ ๊ฐ๋ฅํฉ๋๋ค ...",
...
]
# Queries: inject the instruction ยท Documents: no instruction
q_emb = model.encode(queries, prompt=QUERY_INSTRUCTION)
d_emb = model.encode(documents)
scores = model.similarity(q_emb, d_emb) # cosine; rank documents by score
print(scores)
```
> The instruction is also stored in the model config, so `model.encode(queries, prompt_name="query")`
> is equivalent to passing `prompt=QUERY_INSTRUCTION` explicitly. Documents use no prompt
> (`prompt_name="document"` is an empty string).
* **Query prompt** (instruction): `Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: `
* **Document prompt**: none
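
Because the outputs are L2-normalized, cosine similarity reduces to a dot product, and Top-K retrieval is a sort over those scores. The following is a minimal pure-Python sketch of that ranking step, using toy 3-dimensional vectors in place of real 1024-dimensional embeddings. The helper names `l2_normalize` and `top_k` are illustrative only and are not part of `sentence-transformers`.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, cosine_score) pairs for the k best-scoring documents."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, doc in enumerate(doc_vecs):
        d = l2_normalize(doc)
        # For unit-length vectors the dot product equals the cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for model.encode(...) outputs (assumption for illustration).
toy_query = [1.0, 0.0, 0.0]
toy_docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = top_k(toy_query, toy_docs, k=2)
print(ranking)
```

In practice `model.similarity(q_emb, d_emb)` already returns these scores; the sketch only makes the ranking logic explicit.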
## 1.3. Training Data
| Dataset | Role | Size |
|---------|------|------|
| [BCAI-Finance-Kor-Embedding-Triplet](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Triplet) | Training (anchor / positive / negative) | 43,394 triplets (train) |
| [BCAI-Finance-Kor-Embedding-Pair](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Pair) | Corpus pool / evaluation | 36,281 unique chunks |
* Sources: BC Card financial QA (BCAI) + website crawl + synthetic data (chunking + multi-query generation)
* Triplets are constructed via **hard-negative mining** over the unified corpus.
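
The card does not describe the mining procedure in detail. As a rough sketch of what hard-negative mining usually means, the snippet below picks the highest-scoring chunks that are *not* labeled as positives for a query. The function name `mine_hard_negatives`, the `skip_top` guard, and the toy scores are all assumptions for illustration, not the authors' pipeline.

```python
def mine_hard_negatives(scores, positive_ids, num_negatives=1, skip_top=0):
    """Pick the highest-scoring chunks that are not labeled positive.

    scores:        dict mapping chunk id -> similarity to the query
    positive_ids:  set of chunk ids known to be relevant to the query
    skip_top:      number of top-ranked non-positives to skip, a hypothetical
                   guard against unlabeled positives ending up as negatives
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    candidates = [cid for cid in ranked if cid not in positive_ids]
    return candidates[skip_top:skip_top + num_negatives]

# Toy similarity scores of one query against four corpus chunks.
toy_scores = {"c1": 0.91, "c2": 0.88, "c3": 0.35, "c4": 0.80}
hard_negatives = mine_hard_negatives(toy_scores, positive_ids={"c1"}, num_negatives=2)
print(hard_negatives)
```

Each (query, positive, hard negative) combination then forms one anchor / positive / negative triplet.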
## 1.4. Training Procedure
| Item | Value |
|------|-------|
| Method | LoRA (PEFT) |
| LoRA | r=64, alpha=128, dropout=0.05, targets = q,k,v,o,gate,up,down_proj |
| Loss | CachedMultipleNegativesRankingLoss (in-batch negatives) |
| Batch | per-device 256 (DDP) → 511 in-batch negatives per rank |
| LR / scheduler | 1e-4 / cosine, warmup_ratio 0.1, weight_decay 0.01 |
| Epochs | 3, early stopping; best checkpoint selected by validation NDCG@10 |
| Precision | bf16, gradient checkpointing |
| Hardware | 6× NVIDIA L40S (DDP) |
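
To make the loss and the "511 in-batch negatives" figure concrete, here is a minimal pure-Python sketch of the Multiple Negatives Ranking objective: for each anchor, a softmax cross-entropy over scaled cosine similarities in which the anchor's own positive is the target and every other candidate in the batch is a negative. The cached variant optimizes the same objective with lower memory use. The `scale=20.0` value and the reading of 511 as 255 other positives plus 256 hard negatives are assumptions, chosen because they are consistent with the table above.

```python
import math

def mnr_loss(sim, scale=20.0):
    """Mean cross-entropy over anchors.

    sim[i][j] is the cosine similarity between anchor i and candidate j;
    candidate i is the positive for anchor i, all other columns are negatives.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)

def negatives_per_anchor(batch_size, hard_negatives_per_anchor=1):
    """Count negatives seen by one anchor in a batch of triplets."""
    other_positives = batch_size - 1
    hard_negatives = batch_size * hard_negatives_per_anchor
    return other_positives + hard_negatives

# Batch of 2 triplets; columns are [pos_0, pos_1, neg_0, neg_1].
well_separated = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.8, 0.0, 0.3]]
poorly_separated = [[0.2, 0.9, 0.8, 0.0], [0.8, 0.1, 0.0, 0.7]]
print(mnr_loss(well_separated), mnr_loss(poorly_separated))
print(negatives_per_anchor(256))
```

The loss is lower when each anchor scores its own positive above every other candidate, which is what drives the retriever to separate hard negatives from true matches.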
<br>
# 2. Evaluation
## 2.1. Setup
* **Queries**: 1,000 (held-out test split) · **Corpus**: 36,281 unique chunks
* **Protocol**: binary-relevance information retrieval; the same evaluator used during training
* **Metrics**: NDCG@10 (primary), MRR@10, Recall@{1,10}, Accuracy@1, MAP@10
* **Models compared**: base (`Qwen3-Embedding-0.6B`, no fine-tuning) vs. v1 (r32 / lr2e-4 / 4ep) vs. **v2 (r64 / lr1e-4 / 3ep, released)**
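
For reference, the binary-relevance metrics listed above can be written out for a single query as follows. This is a sketch based on the standard textbook definitions, not the evaluator's source code; the function names and the toy ranking are illustrative. Reported scores would be the mean of these per-query values over all 1,000 queries.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant chunks that appear in the top k."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def accuracy_at_k(ranked, relevant, k):
    """1.0 if at least one relevant chunk appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant chunk within the top k."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """DCG with binary gains, divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

# Toy query with two relevant chunks, found at ranks 2 and 4.
toy_ranked = ["d3", "d1", "d7", "d2"]
toy_relevant = {"d1", "d2"}
print(recall_at_k(toy_ranked, toy_relevant, 1), recall_at_k(toy_ranked, toy_relevant, 10))
print(mrr_at_k(toy_ranked, toy_relevant, 10), ndcg_at_k(toy_ranked, toy_relevant, 10))
```

Under these definitions Accuracy@1 can exceed Recall@1 whenever a query has more than one relevant chunk, since a single hit at rank 1 counts fully for accuracy but only partially for recall.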
<br>
## 2.2. Training
<div align="center">
<img src="figures/evaluation-train-1-1.png" alt="Training curves - loss, learning rate, validation NDCG@10 (WandB)" >
</div>
Trained for 3 epochs (early-stopped) with a cosine schedule; training loss decreases steadily while validation NDCG@10 climbs early and plateaus, and the best checkpoint is selected at the peak. Curves (loss / learning rate / validation NDCG@10) are logged to Weights & Biases.
<br>
## 2.3. In-domain Retrieval Benchmark
<div align="center">
<img src="figures/evaluation-test-1-1.png" alt="Test-set retrieval metrics - base vs v1 vs v2" >
</div>
<div align="center">
<img src="figures/evaluation-test-1-2.png" alt="Test-set retrieval metrics comparison (per metric)" >
</div>
| Metric | base (Qwen3-0.6B) | v1 (r32/2e-4/4ep) | v2 (r64/1e-4/3ep) | v2 Δ vs base |
|--------|:---:|:---:|:---:|:---:|
| **NDCG@10** | **0.6186** | **0.6665** | **0.6695** | **+0.051 (+8.2%)** |
| MRR@10 | 0.6449 | 0.6993 | 0.7060 | +0.061 (+9.5%) |
| Recall@10 | 0.7046 | 0.7512 | 0.7508 | +0.046 (+6.6%) |
| Recall@1 | 0.4730 | 0.5221 | 0.5293 | +0.056 (+11.9%) |
| Accuracy@1 | 0.5560 | 0.6080 | 0.6190 | +0.063 (+11.3%) |
| MAP@10 | 0.5652 | 0.6131 | 0.6171 | +0.052 (+9.2%) |
**v2 is the released model**: it scores highest on every metric except Recall@10, where it is on par with v1 (0.7508 vs. 0.7512). Fine-tuning lifts in-domain retrieval by roughly **7-12%** (relative) over the base model, with the largest gains on top-rank precision (Accuracy@1, Recall@1).
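
The delta column combines an absolute difference with a relative change against the base score. The short sketch below recomputes both from the base and v2 columns of the table; the `deltas` helper is illustrative.

```python
base = {"NDCG@10": 0.6186, "MRR@10": 0.6449, "Recall@10": 0.7046,
        "Recall@1": 0.4730, "Accuracy@1": 0.5560, "MAP@10": 0.5652}
v2 = {"NDCG@10": 0.6695, "MRR@10": 0.7060, "Recall@10": 0.7508,
      "Recall@1": 0.5293, "Accuracy@1": 0.6190, "MAP@10": 0.6171}

def deltas(baseline, tuned):
    """Return {metric: 'absolute (relative%)'} formatted like the table column."""
    out = {}
    for metric, base_score in baseline.items():
        absolute = tuned[metric] - base_score
        relative = 100.0 * absolute / base_score  # percent change vs. base
        out[metric] = f"{absolute:+.3f} ({relative:+.1f}%)"
    return out

table_deltas = deltas(base, v2)
for metric, text in table_deltas.items():
    print(f"{metric}: {text}")
```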
### Comparison with other encoders
On the *same* in-domain test set, untuned encoders (our own `Qwen3-Embedding-0.6B` base and public multilingual SOTA models, each run with its own native prompt format) all score **below this model**; at this parameter scale, domain fine-tuning outweighs general-purpose training:
| Model | Params | NDCG@10 | MRR@10 | Recall@10 | Accuracy@1 | MAP@10 | Avg |
|-------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| LiquidAI/LFM2.5-Embedding-350M | 0.35B | 0.5983 | 0.6166 | 0.6799 | 0.5320 | 0.5519 | 0.5957 |
| Qwen3-Embedding-0.6B (base) | 0.6B | 0.6186 | 0.6449 | 0.7046 | 0.5560 | 0.5652 | 0.6179 |
| google/embeddinggemma-300m | 0.3B | 0.6373 | 0.6664 | 0.7082 | 0.5790 | 0.5906 | 0.6363 |
| BAAI/bge-m3 | 0.6B | 0.6426 | 0.6660 | 0.7261 | 0.5730 | 0.5913 | 0.6398 |
| intfloat/multilingual-e5-large | 0.6B | 0.6476 | 0.6722 | 0.7313 | 0.5790 | 0.5958 | 0.6452 |
| **MoAI-Embedding-0.6B (this model)** | 0.6B | **0.6695** | **0.7060** | **0.7508** | **0.6190** | **0.6171** | **0.6725** |
This model improves over its own `Qwen3-Embedding-0.6B` base by **+0.051 NDCG@10 (+8.2%)** and leads the best general-purpose baseline (e5-large) by **+0.022 NDCG@10**. _Caveat: these baselines are not tuned on BC Card data; the comparison illustrates the value of domain adaptation, not a defect in the baselines._
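
The table does not define the `Avg` column. The values are consistent with an unweighted mean of the five listed metrics, which the sketch below checks for three of the rows; treat that reading as an inference from the numbers rather than a documented definition.

```python
# Rows copied from the comparison table: NDCG@10, MRR@10, Recall@10, Accuracy@1, MAP@10.
rows = {
    "Qwen3-Embedding-0.6B (base)": [0.6186, 0.6449, 0.7046, 0.5560, 0.5652],
    "intfloat/multilingual-e5-large": [0.6476, 0.6722, 0.7313, 0.5790, 0.5958],
    "MoAI-Embedding-0.6B": [0.6695, 0.7060, 0.7508, 0.6190, 0.6171],
}

def mean_score(scores):
    """Unweighted mean of the metric values, rounded to 4 decimals like the table."""
    return round(sum(scores) / len(scores), 4)

averages = {name: mean_score(scores) for name, scores in rows.items()}
# Margin over the strongest general-purpose baseline on the primary metric.
ndcg_margin = round(rows["MoAI-Embedding-0.6B"][0] - rows["intfloat/multilingual-e5-large"][0], 3)
print(averages, ndcg_margin)
```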
<br>
## 2.4. Limitations
* **Domain-specific**: tuned for BC Card Korean financial text; out-of-domain or non-Korean performance is not guaranteed.
* **Re-ranking recommended**: as a 0.6B bi-encoder, it favors recall/throughput over fine-grained precision.
  - Recommended pipeline: **Bi-Encoder (this model) Top-K → Cross-Encoder re-ranking**
* **Sequence length**: inputs are truncated at 1,024 tokens; content past that limit is not encoded, so very long documents should be chunked before indexing.
* **Exact-value matching**: fine-grained numeric/tabular facts (fees, rates, dates, terms) are not reliably distinguished by dense similarity alone; pair with lexical (BM25) retrieval or a re-ranker when exactness matters.
* **Retrieval only**: this is an embedding model, not a generator; it ranks passages and does not produce answers.
* **Synthetic data influence**: part of the training set is LLM-synthesized (chunking + multi-query), which may carry the generator's stylistic/coverage biases.
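
For the sequence-length limit, one common way to chunk long documents before indexing is a sliding window with overlap, so that text near a chunk boundary appears in two chunks. The sketch below operates on a generic token list; the `chunk_tokens` helper and the 128-token overlap are assumptions for illustration, and in practice the token list would come from the model's own tokenizer so that the 1,024 limit is counted the same way the model counts it.

```python
def chunk_tokens(tokens, max_len=1024, overlap=128):
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens; the last window may be shorter.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Toy document of 2,500 "tokens" (integers stand in for token ids).
toy_tokens = list(range(2500))
toy_chunks = chunk_tokens(toy_tokens)
print([len(c) for c in toy_chunks])
```

Each chunk is then encoded as its own document, and hits can be mapped back to the parent document at query time.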
<br>
# 3. Future Work
* **Data quality improvement & re-training**
- Human-annotation labeling
- More rigorous hard-negative mining (iterative, mined with this model)
- Broader/higher-quality data (incl. general financial corpora)
* **System-level**
- Cross-Encoder re-ranker for precision
- HyDE / dynamic instruction injection at query time
<br>
# 4. Meta Info
## 4.1. Citation
```bibtex
@misc{bccard2026moaiembedding,
title = {MoAI-Embedding-0.6B: A BC Card-Domain Korean Text Embedding Model},
author = {BC Card AX Team},
year = {2026},
howpublished = {https://huggingface.co/BCCard/MoAI-Embedding-0.6B},
note = {LoRA fine-tune of Qwen3-Embedding-0.6B for BC Card-domain Korean retrieval}
}
```
## 4.2. See Also
* **Training dataset**: [`BCCard/BCAI-Finance-Kor-Embedding-Triplet`](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Triplet)
* **Corpus dataset**: [`BCCard/BCAI-Finance-Kor-Embedding-Pair`](https://huggingface.co/datasets/BCCard/BCAI-Finance-Kor-Embedding-Pair)
<br>