---
license: cc-by-nc-4.0
base_model: Qwen/Qwen3-Embedding-8B
base_model_relation: finetune
language:
- ko
- en
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- mteb
- korean
- retrieval
---

Sionic AI

# comsat-embed-ko-8b-preview

**comsat-embed-ko-8b-preview** is a decoder-based embedding model developed by **Sionic AI** and optimized for Korean semantic retrieval. Trained on **over 1M Korean examples**, it encodes queries and documents into vectors so that the most relevant documents can be found by similarity.

The model targets real-world information retrieval scenarios such as document search, question answering, knowledge base retrieval, and enterprise semantic search. Because it is trained on Korean retrieval-oriented data, it is intended for Korean search environments where accurate semantic matching is essential.

## Highlights

- **Korean-specialized**: trained on 1M+ Korean examples and tuned for Korean search; achieves the **highest average NDCG@10 (0.7930)** among the compared models on the 9-subset MTEB Korean retrieval benchmark.
- **Long context**: handles inputs up to 8,192 tokens, which suits long-document retrieval.
- **Instruction-aware queries**: queries are encoded with a task-instruction prompt to improve retrieval quality; documents need no prefix.
- **High-dimensional embeddings**: 4096-dimensional vectors, last-token pooled and L2-normalized, compared with cosine similarity.

## Usage

First, install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

### Sentence Transformers Usage

> ⚠️ Queries **must** be encoded with the query prompt; documents are encoded **without** any prefix. (Skipping the query prompt slightly degrades retrieval quality.)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sionic-ai/comsat-embed-ko-8b-preview")

queries = ["한국의 수도는 어디인가?"]
passages = ["대한민국의 수도는 서울특별시이다."]

# Option 1) pass the query prompt explicitly (query only; documents get no prefix)
q_emb = model.encode(queries, prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(passages, normalize_embeddings=True)

# Option 2) sentence-transformers 5.x helper API (equivalent result)
# q_emb = model.encode_query(queries)
# d_emb = model.encode_document(passages)

# Embeddings are L2-normalized, so the dot product equals cosine similarity
scores = q_emb @ d_emb.T
print(scores)
```

### Transformers Usage

```python
# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, the last position of every sequence is a real token
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, index the last non-padding token of each sequence
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery:{query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = [
    get_detailed_instruct(task, '한국의 수도는 어디인가?'),
    get_detailed_instruct(task, '광합성은 어떻게 일어나는가?')
]
# No need to add instruction for retrieval documents
documents = [
    "대한민국의 수도는 서울특별시이다.",
    "광합성은 식물이 빛 에너지를 이용해 이산화탄소와 물로 포도당을 합성하는 과정이다."
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', padding_side='left')
model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview')

# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModel.from_pretrained('sionic-ai/comsat-embed-ko-8b-preview', attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16).cuda()

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=max_length,
    return_tensors="pt",
)
batch_dict.to(model.device)

# Inference only: disable gradient tracking to save memory
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Rows are queries, columns are documents
scores = (embeddings[:2] @ embeddings[2:].T)
print(scores.tolist())
```

### Korean Retrieval Benchmark

- [LawIRKo](https://huggingface.co/datasets/on-and-on/lawgov_ir-ko): A **Korean legal-domain retrieval dataset** for finding statutes and precedents relevant to legal queries.
- [SQuADKorV1Retrieval](https://huggingface.co/datasets/yjoonjang/squad_kor_v1): A **Korean Wikipedia passage retrieval dataset** based on Korean SQuAD v1.
- [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm): A **Korean document retrieval dataset** constructed by parsing PDFs from five domains: **finance, public, medical, legal, and commerce**.
- [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA): A Korean **ODQA multi-hop retrieval dataset**, translated from StrategyQA.
- [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa): A **retrieval dataset** focused on **medical and public health domains** in Korean.
- [BelebeleRetrieval](https://huggingface.co/datasets/mteb/belebele): A **Korean document retrieval dataset** based on FLORES-200.
- [MultiLongDocRetrieval](https://huggingface.co/datasets/mteb/MultiLongDocRetrieval): A **long-document retrieval dataset** covering various domains in Korean.
- [MIRACLRetrieval](https://huggingface.co/datasets/mteb/MIRACLRetrieval): A **Korean document retrieval dataset** based on Wikipedia.
- [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy): A **Wikipedia-based Korean document retrieval dataset**.

## Performance (MTEB Korean Retrieval, NDCG@10)

All scores are NDCG@10 on the **full corpus**, measured with the standard MTEB evaluation pipeline. For multilingual tasks the Korean subset is used (MLDR=ko, MIRACL/MrTidy=ko, Belebele=kor-kor).

| Model | Avg | MIRACL | MrTidy | MLDR | AutoRAG | Ko-StrategyQA | PublicHealthQA | Belebele | SQuADKorV1 | LawIRKo |
|---|---|---|---|---|---|---|---|---|---|---|
| **comsat-embed-ko-8b-preview** | **0.7930** | 0.6964 | 0.6253 | 0.5183 | 0.8518 | 0.8394 | 0.8871 | 0.9853 | 0.9168 | 0.8164 |
| Qwen/Qwen3-Embedding-8B | 0.7825 | 0.6783 | 0.6187 | 0.5036 | 0.8276 | 0.8363 | 0.8721 | 0.9828 | 0.9063 | 0.8171 |
| Qwen/Qwen3-Embedding-4B | 0.7718 | 0.6803 | 0.6076 | 0.4895 | 0.8431 | 0.8270 | 0.8693 | 0.9479 | 0.9044 | 0.7769 |
| upstage/solar-embedding-1-large | 0.7674 | 0.6703 | 0.5766 | 0.3850 | 0.8833 | 0.8366 | 0.8787 | 0.9684 | 0.9521 | 0.7557 |
| microsoft/harrier-oss-v1-27b | 0.7669 | 0.6653 | 0.5306 | 0.4073 | 0.8176 | 0.8361 | 0.8971 | 0.9538 | 0.9204 | 0.8737 |
| dragonkue/snowflake-arctic-embed-l-v2.0-ko | 0.7636 | 0.6685 | 0.5712 | 0.4150 | 0.9093 | 0.8050 | 0.8337 | 0.9518 | 0.9447 | 0.7735 |
| codefuse-ai/F2LLM-v2-8B | 0.7621 | 0.6311 | 0.6162 | 0.3950 | 0.7678 | 0.8371 | 0.9332 | 0.9509 | 0.8874 | 0.8405 |
| nlpai-lab/KURE-v1 | 0.7603 | 0.6816 | 0.5909 | 0.4521 | 0.8708 | 0.7999 | 0.8193 | 0.9502 | 0.9357 | 0.7426 |
| telepix/PIXIE-Rune-v1.5 | 0.7602 | 0.6393 | 0.5492 | 0.4340 | 0.8927 | 0.8064 | 0.8426 | 0.9617 | 0.9457 | 0.7705 |
| nvidia/llama-nemotron-embed-vl-1b-v2 | 0.7579 | 0.6975 | 0.5998 | 0.3704 | 0.8773 | 0.8084 | 0.8223 | 0.9584 | 0.9360 | 0.7513 |
| dragonkue/BGE-m3-ko | 0.7534 | 0.6833 | 0.6099 | 0.3784 | 0.8738 | 0.7959 | 0.8155 | 0.9503 | 0.9414 | 0.7322 |
| BAAI/bge-m3 | 0.7508 | 0.7015 | 0.6471 | 0.4273 | 0.8301 | 0.7941 | 0.8041 | 0.9316 | 0.9038 | 0.7174 |
| intfloat/multilingual-e5-large | 0.7333 | 0.6649 | 0.6421 | 0.2708 | 0.8134 | 0.8035 | 0.8253 | 0.9450 | 0.9056 | 0.7293 |
| nlpai-lab/KoE5 | 0.7329 | 0.6235 | 0.5841 | 0.2942 | 0.8434 | 0.8001 | 0.8351 | 0.9425 | 0.8980 | 0.7756 |

> Avg is the mean over the 9 subsets (higher is better).
>
> Reproduction: evaluated with the MTEB retrieval pipeline (NDCG@10, full corpus); the query prompt is applied to queries only (documents get no prefix).

## License

- Model weights: **cc-by-nc-4.0** (non-commercial use).
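
For readers unfamiliar with the metric reported in the Performance section, the following is a minimal sketch of how NDCG@10 is computed for a single query. It is an illustration of the standard formula with linear gain, not the MTEB pipeline's own code; the function names `dcg_at_k` and `ndcg_at_k`, the document ids, and the relevance judgments are all hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: each gain is divided by log2(position + 1),
    # where position is 1-based (rank 0 in the loop is position 1).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    # qrels maps doc id -> graded relevance; unjudged documents count as 0.
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant judged documents first.
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical example: two relevant documents, retrieved at positions 1 and 3
qrels = {"d1": 1, "d3": 1}
ranking = ["d1", "d2", "d3", "d4"]
print(round(ndcg_at_k(ranking, qrels), 4))
```

A benchmark score such as those in the table is the mean of this per-query value over all queries in a dataset; a ranking that places every relevant document first scores 1.0.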