comsat-embed-ko-8b-preview
comsat-embed-ko-8b-preview is a decoder-based embedding model developed by Sionic AI, optimized for Korean semantic retrieval tasks. The model is designed to provide high-quality text representations for real-world information retrieval scenarios, including document search, question answering, knowledge base retrieval, and enterprise semantic search. By leveraging Korean retrieval-oriented training data, comsat-embed-ko-8b-preview delivers robust performance across Korean search environments where accurate semantic matching is essential.
Highlights
- Korean-specialized: trained on 1M+ Korean examples and tuned for Korean search; ranks #1 by average NDCG@10 on the 9-subset MTEB Korean retrieval benchmark among publicly available models.
- Long context: handles inputs up to 8,192 tokens, well suited to long-document retrieval.
- Instruction-aware queries: queries are encoded with a task-instruction prompt to improve retrieval quality; documents need no prefix.
- High-dimensional embeddings: 4096-dimensional, last-token pooled and L2-normalized, compared with cosine similarity.
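The last highlight says embeddings are L2-normalized and compared with cosine similarity. The sketch below, using toy 4-dimensional vectors in place of real 4096-dimensional embeddings, illustrates why a plain dot product is enough once vectors have unit length. The helper names (`l2_normalize`, `dot`, `cosine_similarity`) are illustrative and not part of the model's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Textbook cosine similarity on vectors of arbitrary length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for real embeddings (values are made up).
query_vec = [1.0, 2.0, 0.0, 2.0]
doc_vec = [2.0, 1.0, 1.0, 0.0]

cos = cosine_similarity(query_vec, doc_vec)
dot_of_unit = dot(l2_normalize(query_vec), l2_normalize(doc_vec))

# The two values agree up to floating-point rounding.
print(round(cos, 6), round(dot_of_unit, 6))
```

This is why the usage snippets can score queries against documents with a single matrix product after normalization.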
Usage
First install the Sentence Transformers library:
pip install -U sentence-transformers
Sentence Transformers Usage
⚠️ Queries must be encoded with the query prompt; documents are encoded without any prefix. (Skipping the query prompt slightly degrades retrieval quality.)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sionic-ai/comsat-embed-ko-8b-preview")
queries = ["한국의 수도는 어디인가?"]  # "What is the capital of Korea?"
passages = ["대한민국의 수도는 서울특별시이다."]  # "The capital of the Republic of Korea is Seoul."
# Option 1) pass the query prompt explicitly (query only; documents get no prefix)
q_emb = model.encode(queries, prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(passages, normalize_embeddings=True)
# Option 2) sentence-transformers 5.x helper API (equivalent result)
# q_emb = model.encode_query(queries)
# d_emb = model.encode_document(passages)
scores = q_emb @ d_emb.T # cosine similarity
print(scores)
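The `scores` matrix above has one row per query and one column per passage. A minimal sketch of turning a row into a ranked list follows. `top_k` is a hypothetical helper and the score values are made up for illustration; they are not outputs of the model.

```python
def top_k(scores_row, k):
    """Return (passage_index, score) pairs for the k highest scores, best first."""
    ranked = sorted(enumerate(scores_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical similarity scores: one row per query, one column per passage.
scores = [
    [0.82, 0.31, 0.55],
    [0.12, 0.77, 0.40],
]

for q_idx, row in enumerate(scores):
    print(q_idx, top_k(row, k=2))
```

With the NumPy array returned by the snippet above, `scores.tolist()` gives a nested list of this shape.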
Transformers Usage
# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, every sequence ends at the final position
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, pick the last non-padding position per sequence
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'
# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'
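queries = [
    get_detailed_instruct(task, '한국의 수도는 어디인가?'),  # "What is the capital of Korea?"
    get_detailed_instruct(task, '광합성은 어떻게 일어나는가?'),  # "How does photosynthesis occur?"
]
# No need to add instruction for retrieval documents
documents = [
    # "The capital of the Republic of Korea is Seoul."
    "대한민국의 수도는 서울특별시이다.",
    # "Photosynthesis is the process by which plants use light energy to synthesize glucose from carbon dioxide and water."
    "광합성은 식물이 빛 에너지를 이용해 이산화탄소와 물로 포도당을 합성하는 과정이다.",
]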
input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', padding_side='left')
model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview')
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16).cuda()
max_length = 8192
# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
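The pooling function above depends on where the padding sits. As a toy illustration in plain Python lists (not the tensor code itself), the sketch below finds the index of the last real token per sequence. With left padding that index is always the final position, which is why the tokenizer is loaded with `padding_side='left'`. `last_token_index` is a hypothetical helper introduced only for this example.

```python
def last_token_index(mask_row):
    """Index of the last real (non-padding) token in one attention-mask row."""
    # Walk backwards until a position marked 1 is found.
    for pos in range(len(mask_row) - 1, -1, -1):
        if mask_row[pos] == 1:
            return pos
    raise ValueError("attention mask has no real tokens")

# Two sequences with 3 and 5 real tokens, padded to length 5.
right_padded = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
left_padded = [
    [0, 0, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

print([last_token_index(row) for row in right_padded])  # [2, 4]
print([last_token_index(row) for row in left_padded])   # [4, 4]

# For right padding, the index equals (number of real tokens - 1),
# which matches the `attention_mask.sum(dim=1) - 1` branch above.
print([sum(row) - 1 for row in right_padded])  # [2, 4]
```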
Korean Retrieval Benchmark
- LawIRKo: A Korean legal-domain retrieval dataset for finding statutes and precedents relevant to legal queries.
- SQuADKorV1Retrieval: A Korean Wikipedia passage retrieval dataset based on Korean SQuAD v1.
- AutoRAGRetrieval: A Korean document retrieval dataset constructed by parsing PDFs from five domains: finance, public, medical, legal, and commerce.
- Ko-StrategyQA: A Korean ODQA multi-hop retrieval dataset, translated from StrategyQA.
- PublicHealthQA: A retrieval dataset focused on medical and public health domains in Korean.
- BelebeleRetrieval: A Korean document retrieval dataset based on FLORES-200.
- MultiLongDocRetrieval: A long-document retrieval dataset covering various domains in Korean.
- MIRACLRetrieval: A Korean document retrieval dataset based on Wikipedia.
- MrTidyRetrieval: A Wikipedia-based Korean document retrieval dataset.
Performance (MTEB Korean Retrieval, NDCG@10)
All scores are NDCG@10 on the full corpus, measured with the standard MTEB evaluation pipeline. For multilingual tasks the Korean subset is used (MLDR=ko, MIRACL/MrTidy=ko, Belebele=kor-kor).
| Model | Avg | MIRACL | MrTidy | MLDR | AutoRAG | Ko-StrategyQA | PublicHealthQA | Belebele | SQuADKorV1 | LawIRKo |
|---|---|---|---|---|---|---|---|---|---|---|
| comsat-embed-ko-8b-preview | 0.7930 | 0.6964 | 0.6253 | 0.5183 | 0.8518 | 0.8394 | 0.8871 | 0.9853 | 0.9168 | 0.8164 |
| Qwen3-Embedding-8B | 0.7825 | 0.6783 | 0.6187 | 0.5036 | 0.8276 | 0.8363 | 0.8721 | 0.9828 | 0.9063 | 0.8171 |
| Qwen3-Embedding-4B | 0.7718 | 0.6803 | 0.6076 | 0.4895 | 0.8431 | 0.8270 | 0.8693 | 0.9479 | 0.9044 | 0.7769 |
| upstage/solar-embedding-1-large | 0.7674 | 0.6703 | 0.5766 | 0.3850 | 0.8833 | 0.8366 | 0.8787 | 0.9684 | 0.9521 | 0.7557 |
| microsoft/harrier-oss-v1-27b | 0.7669 | 0.6653 | 0.5306 | 0.4073 | 0.8176 | 0.8361 | 0.8971 | 0.9538 | 0.9204 | 0.8737 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 0.7636 | 0.6685 | 0.5712 | 0.4150 | 0.9093 | 0.8050 | 0.8337 | 0.9518 | 0.9447 | 0.7735 |
| codefuse-ai/F2LLM-v2-8B | 0.7621 | 0.6311 | 0.6162 | 0.3950 | 0.7678 | 0.8371 | 0.9332 | 0.9509 | 0.8874 | 0.8405 |
| nlpai-lab/KURE-v1 | 0.7603 | 0.6816 | 0.5909 | 0.4521 | 0.8708 | 0.7999 | 0.8193 | 0.9502 | 0.9357 | 0.7426 |
| telepix/PIXIE-Rune-v1.5 | 0.7602 | 0.6393 | 0.5492 | 0.4340 | 0.8927 | 0.8064 | 0.8426 | 0.9617 | 0.9457 | 0.7705 |
| nvidia/llama-nemotron-embed-vl-1b-v2 | 0.7579 | 0.6975 | 0.5998 | 0.3704 | 0.8773 | 0.8084 | 0.8223 | 0.9584 | 0.9360 | 0.7513 |
| dragonkue/BGE-m3-ko | 0.7534 | 0.6833 | 0.6099 | 0.3784 | 0.8738 | 0.7959 | 0.8155 | 0.9503 | 0.9414 | 0.7322 |
| BAAI/bge-m3 | 0.7508 | 0.7015 | 0.6471 | 0.4273 | 0.8301 | 0.7941 | 0.8041 | 0.9316 | 0.9038 | 0.7174 |
| intfloat/multilingual-e5-large | 0.7333 | 0.6649 | 0.6421 | 0.2708 | 0.8134 | 0.8035 | 0.8253 | 0.9450 | 0.9056 | 0.7293 |
| nlpai-lab/KoE5 | 0.7329 | 0.6235 | 0.5841 | 0.2942 | 0.8434 | 0.8001 | 0.8351 | 0.9425 | 0.8980 | 0.7756 |
Avg is the mean over the 9 subsets (higher is better). Reproduction: evaluated with the MTEB retrieval pipeline (NDCG@10, full corpus); the query prompt is applied to queries only (documents get no prefix).
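For readers unfamiliar with the metric, here is a minimal sketch of NDCG@k for a single query, using the common formulation with linear gain and a log2 rank discount. It is an illustration under those assumptions, not the MTEB implementation, which may differ in details such as tie handling. The document ids and relevance labels are hypothetical. The last lines recompute the Avg column for the first table row from its nine subset scores.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for relevance grades listed in rank order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """NDCG@k for one query.

    ranked_doc_ids: doc ids in the order the retriever returned them.
    relevance_by_doc: dict mapping doc id -> graded relevance (missing = 0).
    """
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance_by_doc.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical query with two relevant documents, "d1" and "d4".
qrels = {"d1": 1, "d4": 1}
ranking = ["d1", "d2", "d4", "d3"]
ndcg = ndcg_at_k(ranking, qrels)
print(round(ndcg, 4))

# Avg column check for the first table row: mean of the nine subset scores.
first_row = [0.6964, 0.6253, 0.5183, 0.8518, 0.8394, 0.8871, 0.9853, 0.9168, 0.8164]
avg = sum(first_row) / len(first_row)
print(round(avg, 4))
```

The recomputed mean rounds to 0.7930, matching the Avg value reported for comsat-embed-ko-8b-preview.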
License
- Model weights: cc-by-nc-4.0 (non-commercial use).