Sionic AI

comsat-embed-ko-8b-preview

comsat-embed-ko-8b-preview is a decoder-based embedding model developed by Sionic AI and optimized for Korean semantic retrieval. It produces text representations for real-world information retrieval scenarios, including document search, question answering, knowledge base retrieval, and enterprise semantic search. The model is trained on Korean retrieval-oriented data and targets Korean search environments where accurate semantic matching is essential.

Highlights

Korean-specialized: trained on 1M+ Korean examples and tuned for Korean search; ranks #1 by average NDCG@10 on the 9-subset MTEB Korean retrieval benchmark among the publicly available models compared in the Performance table.
Long context: handles inputs up to 8,192 tokens, which suits long-document retrieval.
Instruction-aware queries: queries are encoded with a task-instruction prompt to improve retrieval quality; documents need no prefix.
High-dimensional embeddings: embeddings are 4096-dimensional, produced by last-token pooling, L2-normalized, and compared with cosine similarity.
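
Because the embeddings are L2-normalized, a plain dot product between two of them is the same as their cosine similarity. The following is a minimal sketch of that identity in pure Python; the 4-dimensional vectors are made-up stand-ins for real 4096-dimensional embeddings, and the helper names are illustrative rather than part of any library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (assumed values, not real model outputs).
query_vec = [1.0, 2.0, 0.0, 2.0]
doc_vec = [2.0, 1.0, 1.0, 2.0]

# Dot product of the normalized vectors matches cosine similarity.
normalized_dot = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
print(math.isclose(normalized_dot, cosine_similarity(query_vec, doc_vec)))
```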

Usage

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Sentence Transformers Usage

⚠️ Queries must be encoded with the query prompt; documents are encoded without any prefix. (Skipping the query prompt slightly degrades retrieval quality.)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sionic-ai/comsat-embed-ko-8b-preview")

queries  = ["한국의 수도는 어디인가?"]
passages = ["대한민국의 수도는 서울특별시이다."]

# Option 1) pass the query prompt explicitly (query only; documents get no prefix)
q_emb = model.encode(queries,  prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(passages,                      normalize_embeddings=True)

# Option 2) sentence-transformers 5.x helper API (equivalent result)
# q_emb = model.encode_query(queries)
# d_emb = model.encode_document(passages)

scores = q_emb @ d_emb.T   # cosine similarity
print(scores)
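
In a retrieval setting the score matrix is usually turned into a ranking per query. The snippet below is a small sketch of that step, assuming one row of scores has already been converted to a plain list (for example with `.tolist()`); `rank_passages` and the sample values are hypothetical and not part of Sentence Transformers.

```python
def rank_passages(score_row, passages, top_k=3):
    """Return up to top_k (passage, score) pairs sorted by descending score."""
    ranked = sorted(zip(passages, score_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Hypothetical similarity scores for one query against three passages.
example_passages = ["passage A", "passage B", "passage C"]
example_scores = [0.12, 0.87, 0.45]

top = rank_passages(example_scores, example_passages, top_k=2)
print(top)  # [('passage B', 0.87), ('passage C', 0.45)]
```

For larger corpora, a vector index is typically a better fit than sorting every score in Python.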

Transformers Usage

# Requires transformers>=4.51.0

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


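def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    """Return the hidden state of the last non-padding token of each sequence."""
    # With left padding, every sequence ends at the final position.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, locate each sequence's last real token.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]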


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'

# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, '한국의 수도는 어디인가?'),
    get_detailed_instruct(task, '광합성은 어떻게 일어나는가?')
]
# No need to add instruction for retrieval documents
documents = [
    "대한민국의 수도는 서울특별시이다.",
    "광합성은 식물이 빛 에너지를 이용해 이산화탄소와 물로 포도당을 합성하는 과정이다."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', padding_side='left')
model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview')

# We recommend enabling flash_attention_2 for faster inference and lower memory use.
# model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)
# Inference only: disable gradient tracking to reduce memory use
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
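
The tokenizer above is loaded with `padding_side='left'`, which is what lets `last_token_pool` take the final position for every sequence. The following pure-Python sketch illustrates the difference between the two padding sides on toy attention masks; the masks and the `last_token_index` helper are illustrative assumptions, not part of the model code.

```python
def last_token_index(mask_row):
    """Index of the last position whose attention-mask value is 1."""
    return max(i for i, m in enumerate(mask_row) if m == 1)


# Toy attention masks for a batch of two sequences padded to length 4
# (1 = real token, 0 = padding). These values are illustrative only.
right_padded = [[1, 1, 1, 0], [1, 1, 1, 1]]
left_padded = [[0, 1, 1, 1], [1, 1, 1, 1]]

right_indices = [last_token_index(row) for row in right_padded]
left_indices = [last_token_index(row) for row in left_padded]

print(right_indices)  # [2, 3]: the last real token sits at a different index per sequence
print(left_indices)   # [3, 3]: the last real token is always the final position
```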

Korean Retrieval Benchmark

LawIRKo: A Korean legal-domain retrieval dataset for finding statutes and precedents relevant to legal queries.
SQuADKorV1Retrieval: A Korean Wikipedia passage retrieval dataset based on Korean SQuAD v1.
AutoRAGRetrieval: A Korean document retrieval dataset constructed by parsing PDFs from five domains: finance, public, medical, legal, and commerce.
Ko-StrategyQA: A Korean ODQA multi-hop retrieval dataset, translated from StrategyQA.
PublicHealthQA: A retrieval dataset focused on medical and public health domains in Korean.
BelebeleRetrieval: A Korean document retrieval dataset based on FLORES-200.
MultiLongDocRetrieval: A long-document retrieval dataset covering various domains in Korean.
MIRACLRetrieval: A Korean document retrieval dataset based on Wikipedia.
MrTidyRetrieval: A Wikipedia-based Korean document retrieval dataset.

Performance (MTEB Korean Retrieval, NDCG@10)

All scores are NDCG@10 on the full corpus, measured with the standard MTEB evaluation pipeline. For multilingual tasks the Korean subset is used (MLDR=ko, MIRACL/MrTidy=ko, Belebele=kor-kor).
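
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query, assuming the common formulation with linear gain and a log2 rank discount. The relevance labels are hypothetical, and the MTEB pipeline's own implementation may differ in details such as tie handling.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query with two relevant documents in the corpus.
# The system returned them at ranks 1 and 3 (1 = relevant, 0 = not relevant).
ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
all_rels = [1, 1]

score = ndcg_at_k(ranked, all_rels)
print(f"{score:.4f}")  # 0.9197
```

A benchmark score is then the mean of this per-query value over all queries in a subset.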

Model	Avg	MIRACL	MrTidy	MLDR	AutoRAG	Ko-StrategyQA	PublicHealthQA	Belebele	SQuADKorV1	LawIRKo
comsat-embed-ko-8b-preview	0.7930	0.6964	0.6253	0.5183	0.8518	0.8394	0.8871	0.9853	0.9168	0.8164
Qwen3-Embedding-8B	0.7825	0.6783	0.6187	0.5036	0.8276	0.8363	0.8721	0.9828	0.9063	0.8171
Qwen3-Embedding-4B	0.7718	0.6803	0.6076	0.4895	0.8431	0.8270	0.8693	0.9479	0.9044	0.7769
upstage/solar-embedding-1-large	0.7674	0.6703	0.5766	0.3850	0.8833	0.8366	0.8787	0.9684	0.9521	0.7557
microsoft/harrier-oss-v1-27b	0.7669	0.6653	0.5306	0.4073	0.8176	0.8361	0.8971	0.9538	0.9204	0.8737
dragonkue/snowflake-arctic-embed-l-v2.0-ko	0.7636	0.6685	0.5712	0.4150	0.9093	0.8050	0.8337	0.9518	0.9447	0.7735
codefuse-ai/F2LLM-v2-8B	0.7621	0.6311	0.6162	0.3950	0.7678	0.8371	0.9332	0.9509	0.8874	0.8405
nlpai-lab/KURE-v1	0.7603	0.6816	0.5909	0.4521	0.8708	0.7999	0.8193	0.9502	0.9357	0.7426
telepix/PIXIE-Rune-v1.5	0.7602	0.6393	0.5492	0.4340	0.8927	0.8064	0.8426	0.9617	0.9457	0.7705
nvidia/llama-nemotron-embed-vl-1b-v2	0.7579	0.6975	0.5998	0.3704	0.8773	0.8084	0.8223	0.9584	0.9360	0.7513
dragonkue/BGE-m3-ko	0.7534	0.6833	0.6099	0.3784	0.8738	0.7959	0.8155	0.9503	0.9414	0.7322
BAAI/bge-m3	0.7508	0.7015	0.6471	0.4273	0.8301	0.7941	0.8041	0.9316	0.9038	0.7174
intfloat/multilingual-e5-large	0.7333	0.6649	0.6421	0.2708	0.8134	0.8035	0.8253	0.9450	0.9056	0.7293
nlpai-lab/KoE5	0.7329	0.6235	0.5841	0.2942	0.8434	0.8001	0.8351	0.9425	0.8980	0.7756

Avg is the mean over the 9 subsets (higher is better). Reproduction: evaluated with the MTEB retrieval pipeline (NDCG@10, full corpus); the query prompt is applied to queries only (documents get no prefix).
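
As a quick check of how Avg is derived, the sketch below recomputes it for comsat-embed-ko-8b-preview from the nine per-subset scores in the table above; the variable names are illustrative.

```python
# Per-subset NDCG@10 scores for comsat-embed-ko-8b-preview, copied from the table.
subset_scores = {
    "MIRACL": 0.6964,
    "MrTidy": 0.6253,
    "MLDR": 0.5183,
    "AutoRAG": 0.8518,
    "Ko-StrategyQA": 0.8394,
    "PublicHealthQA": 0.8871,
    "Belebele": 0.9853,
    "SQuADKorV1": 0.9168,
    "LawIRKo": 0.8164,
}

# Unweighted mean over the 9 subsets.
average = sum(subset_scores.values()) / len(subset_scores)
print(f"{average:.4f}")  # 0.7930
```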

License

Model weights: cc-by-nc-4.0 (non-commercial use).

Model size: 8B params
Tensor type: BF16 (Safetensors)
Model tree: Qwen/Qwen3-8B-Base (base model) → Qwen/Qwen3-Embedding-8B (finetuned) → sionic-ai/comsat-embed-ko-8b-preview (finetuned, this model)