---
language:
- ko
license: apache-2.0
tags:
- sentence-transformers
- sentence-similarity
- transformers
---

## PwC-Embedding-expr

We trained the **PwC-Embedding-expr** model on top of the [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) embedding model.
To enhance performance in Korean, we applied our curated augmentation to STS datasets and fine-tuned the E5 model with a carefully balanced mixing ratio across the datasets.

> ⚠️ This is an experimental model and is under continuous development.

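Since the base model is an E5-instruct variant, queries are expected to carry a task-instruction prefix. Below is a minimal usage sketch, assuming PwC-Embedding-expr keeps the prompt format of multilingual-e5-large-instruct; the repo id is a placeholder, and the instruction and example sentences are illustrative only.

```python
from sentence_transformers import SentenceTransformer

# Placeholder repo id -- replace with the actual Hugging Face id of this model.
model = SentenceTransformer("PwC-Embedding-expr")

# E5-instruct models prefix each query with a task instruction;
# documents are encoded as-is (format inherited from the base model).
def detailed_instruct(task: str, query: str) -> str:
    return f"Instruct: {task}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [detailed_instruct(task, "대한민국의 수도는 어디인가?")]
documents = [
    "서울은 대한민국의 수도이다.",
    "바나나는 길쭉한 노란색 과일이다.",
]

# With normalized embeddings, cosine similarity is a plain dot product.
embeddings = model.encode(queries + documents, normalize_embeddings=True)
scores = embeddings[:1] @ embeddings[1:].T
print(scores)  # the higher score should point to the Seoul sentence
```
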
### To-do
- [x] MTEB Leaderboard
- [ ] Technical Report

## MTEB
PwC-Embedding_expr was evaluated on the Korean subset of MTEB.
A leaderboard link will be added once it is published.

| Task | PwC-Embedding_expr |
|------------------|--------------------|
| KLUE-STS | 0.88 |
| KLUE-TC | 0.73 |
| Ko-StrategyQA | 0.80 |
| KorSTS | 0.84 |
| MIRACL-Reranking | 0.72 |
| MIRACL-Retrieval | 0.65 |
| **Average** | **0.77** |

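Scores like these can be reproduced with the `mteb` package. A sketch, assuming a recent `mteb` release; the repo id is again a placeholder, and the exact task identifiers exposed by your installed version may differ from the table's labels.

```python
# pip install -U mteb
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PwC-Embedding-expr")  # placeholder repo id

# Task names follow the table above (a subset shown); confirm the identifiers
# your mteb version exposes with mteb.get_tasks(languages=["kor"]).
tasks = mteb.get_tasks(tasks=["KLUE-STS", "KorSTS", "Ko-StrategyQA"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/PwC-Embedding-expr")
```
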
## Model
- Base Model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Model Size: 0.56B
- Embedding Dimension: 1024
- Max Input Tokens: 514

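These figures are easy to sanity-check once the model is loaded (repo id is again a placeholder):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("PwC-Embedding-expr")  # placeholder repo id

print(model.get_sentence_embedding_dimension())  # expected: 1024
print(model.max_seq_length)                      # tokenizer-side input cap

emb = model.encode("임베딩 차원 확인용 문장입니다.")  # "a sentence for checking the embedding dimension"
print(emb.shape)                                 # (1024,)
```
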
## Requirements
The model works with the dependencies included in the latest version of [MTEB](https://github.com/embeddings-benchmark/mteb).

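In practice, something like `pip install -U mteb` should be enough for a fresh environment, since `mteb` pulls in `sentence-transformers` and `torch` as dependencies; exact version pins are not specified here.
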
## Citation

TBD (technical report expected September 2025)