SamilPwC-AXNode-GenAI/PwC-Embedding_expr

Sentence Similarity

sentence-transformers

feature-extraction

text-embeddings-inference

elplaguister committed on Aug 16, 2025

Commit 4c73846 · verified · 1 Parent(s): 3335897

Add Model card

Files changed (1)

README.md +47 -5

README.md CHANGED

@@ -1,8 +1,50 @@
 ---
-license: apache-2.0
 language:
 - ko
-base_model:
-- intfloat/multilingual-e5-large-instruct
-- FacebookAI/xlm-roberta-large
----

 ---
 language:
 - ko
+license: apache-2.0
+tags:
+- sentence-transformers
+- sentence-similarity
+- transformers
+---
+## PwC-Embedding_expr
+We trained the **PwC-Embedding_expr** model by fine-tuning the [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) embedding model.
+To improve performance on Korean, we applied our curated data augmentation to semantic textual similarity (STS) datasets and fine-tuned the E5 model using a carefully balanced sampling ratio across the datasets.
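STS tasks score a sentence pair by comparing the two sentence embeddings. The model card does not say which similarity function was used, so the sketch below assumes cosine similarity, the usual choice for E5-style models. The function name `cosine_similarity` and the toy vectors are illustrative and do not come from the repository.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors
```

Parallel vectors score close to 1 and orthogonal vectors score 0, so a higher value means the two sentences are judged more similar.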
+### To-do
+- [ ] MTEB Leaderboard
+- [ ] Technical Report
+## MTEB
+PwC-Embedding_expr was evaluated on the Korean subset of MTEB.
+A leaderboard link will be added once it is published.
+| Task             | PwC-Embedding_expr | multilingual-e5-large | Max Result |
+|------------------|--------------------|-----------------------|------------|
+| KLUE-STS         | 0.88               | 0.83                  | 0.90       |
+| KLUE-TC          | 0.73               | 0.61                  | 0.73       |
+| Ko-StrategyQA    | 0.80               | 0.80                  | 0.83       |
+| KorSTS           | 0.84               | 0.81                  | 0.98       |
+| MIRACL-Reranking | 0.72               | 0.65                  | 0.72       |
+| MIRACL-Retrieval | 0.65               | 0.59                  | 0.72       |
+| **Average**      | **0.77**           | 0.71                  | 0.81       |
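The Average row can be checked by taking the unweighted mean of each column. The sketch below recomputes it from the rounded per-task values in the table; that the authors averaged this way is an assumption, and the names `scores` and `column_means` are illustrative.

```python
# Per-task scores copied from the table above, in column order:
# (PwC-Embedding_expr, multilingual-e5-large, Max Result).
scores = {
    "KLUE-STS": (0.88, 0.83, 0.90),
    "KLUE-TC": (0.73, 0.61, 0.73),
    "Ko-StrategyQA": (0.80, 0.80, 0.83),
    "KorSTS": (0.84, 0.81, 0.98),
    "MIRACL-Reranking": (0.72, 0.65, 0.72),
    "MIRACL-Retrieval": (0.65, 0.59, 0.72),
}


def column_means(rows):
    """Return the unweighted mean of each column across all tasks."""
    columns = list(zip(*rows.values()))
    return [sum(column) / len(column) for column in columns]


means = column_means(scores)
labels = ("PwC-Embedding_expr", "multilingual-e5-large", "Max Result")
for label, mean in zip(labels, means):
    print(f"{label}: {mean:.3f}")
```

The recomputed means are roughly 0.770, 0.715, and 0.813. The baseline value sits on a rounding boundary, so the reported 0.71 is consistent with an average taken over unrounded scores.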
+## Model
+- Base Model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
+- Model Size: 0.56B
+- Embedding Dimension: 1024
+- Max Input Tokens: 514
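A minimal loading sketch follows, assuming the checkpoint loads through the `sentence-transformers` library (the repository carries that tag, but the card does not show usage). The helper `embed` is hypothetical. It checks the output width against the embedding dimension listed above. Whether queries need the E5 instruction prefix is not stated in the card, so none is applied here.

```python
MODEL_ID = "SamilPwC-AXNode-GenAI/PwC-Embedding_expr"  # repository name from the page header
EXPECTED_DIM = 1024  # "Embedding Dimension" from the model card


def embed(sentences):
    """Encode sentences and verify the embedding width (hypothetical helper)."""
    # Imported lazily so that defining this function does not require the library.
    from sentence_transformers import SentenceTransformer

    # Assumption: the checkpoint is loadable as a SentenceTransformer model.
    model = SentenceTransformer(MODEL_ID)
    # normalize_embeddings=True returns unit-length vectors, so a dot product
    # between two rows equals their cosine similarity.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    if embeddings.shape[1] != EXPECTED_DIM:
        raise ValueError(
            f"expected {EXPECTED_DIM}-dimensional embeddings, got {embeddings.shape[1]}"
        )
    return embeddings
```

Calling `embed(["안녕하세요", "반갑습니다"])` downloads the weights on first use and should return an array with one row per sentence.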
+## Requirements
+The model works with the dependencies installed by the latest version of MTEB.
+## Citation
+TBD (technical report expected September 2025)