---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- ko
license: mit
widget:
- source_sentence: 대한민국의 수도는 서울입니다.
  sentences:
  - 미국의 수도는 뉴욕이 아닙니다.
  - 대한민국의 수도 요금은 저렴한 편입니다.
  - 서울은 대한민국의 수도입니다.
---
# smartmind/roberta-ko-small-tsdae
This is a [sentence-transformers](https://www.sbert.net) model: it maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for tasks like clustering or semantic search.

It is a Korean RoBERTa small model pretrained with TSDAE. Its architecture is identical to [lassl/roberta-ko-small](https://huggingface.co/lassl/roberta-ko-small), but it uses a different tokenizer. You can use the model as-is to compute sentence similarity, or fine-tune it for your own downstream task.
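For the latter, a minimal fine-tuning sketch with sentence-transformers' `CosineSimilarityLoss` might look like this (the training pairs and hyperparameters below are illustrative placeholders, not the recipe used for this model):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")

# Hypothetical labeled pairs: (sentence A, sentence B, similarity in [0, 1]).
train_examples = [
    InputExample(texts=["대한민국의 수도는 서울입니다.", "서울은 대한민국의 수도입니다."], label=0.9),
    InputExample(texts=["대한민국의 수도는 서울입니다.", "오늘 서울은 하루종일 맑음"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One illustrative epoch; tune epochs/warmup for a real task.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```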
## Usage (Sentence-Transformers)
After installing sentence-transformers, you can load the model directly:

```bash
pip install -U sentence-transformers
```
Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
embeddings = model.encode(sentences)
print(embeddings)
```
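To compare the resulting embeddings directly, you can use the cosine-similarity helper `util.cos_sim`:

```python
from sentence_transformers import util

# Pairwise cosine-similarity matrix between the embeddings computed above.
print(util.cos_sim(embeddings, embeddings))
```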
The following example uses sentence-transformers utilities to score the similarity of several sentences against each other:

```python
from sentence_transformers import util

# Reuses the `model` loaded in the previous snippet.
sentences = [
    "대한민국의 수도는 서울입니다.",          # The capital of South Korea is Seoul.
    "미국의 수도는 뉴욕이 아닙니다.",         # The capital of the US is not New York.
    "대한민국의 수도 요금은 저렴한 편입니다.",  # Water rates in South Korea are fairly cheap. ("수도" also means "water supply")
    "서울은 대한민국의 수도입니다.",          # Seoul is the capital of South Korea.
    "오늘 서울은 하루종일 맑음",             # Sunny all day in Seoul today.
]

paraphrase = util.paraphrase_mining(model, sentences)

for score, i, j in paraphrase:
    print(f"{sentences[i]}\t\t{sentences[j]}\t\t{score:.4f}")
```
```
대한민국의 수도는 서울입니다.		서울은 대한민국의 수도입니다.		0.7616
대한민국의 수도는 서울입니다.		미국의 수도는 뉴욕이 아닙니다.		0.7031
대한민국의 수도는 서울입니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.6594
미국의 수도는 뉴욕이 아닙니다.		서울은 대한민국의 수도입니다.		0.6445
대한민국의 수도 요금은 저렴한 편입니다.		서울은 대한민국의 수도입니다.		0.4915
미국의 수도는 뉴욕이 아닙니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.4785
서울은 대한민국의 수도입니다.		오늘 서울은 하루종일 맑음		0.4119
대한민국의 수도는 서울입니다.		오늘 서울은 하루종일 맑음		0.3520
미국의 수도는 뉴욕이 아닙니다.		오늘 서울은 하루종일 맑음		0.2550
대한민국의 수도 요금은 저렴한 편입니다.		오늘 서울은 하루종일 맑음		0.1896
```
## Usage (HuggingFace Transformers)
Without sentence-transformers installed, you can use the model like this:

```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # Use the embedding of the first ([CLS]) token as the sentence embedding.
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('smartmind/roberta-ko-small-tsdae')
model = AutoModel.from_pretrained('smartmind/roberta-ko-small-tsdae')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
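To compare these raw CLS embeddings, one option (our addition, not part of the original snippet) is to L2-normalize them so that the dot product equals cosine similarity:

```python
import torch.nn.functional as F

# After L2 normalization, row-wise dot products are cosine similarities.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)
```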
## Evaluation Results
The model achieves the following scores on the KLUE STS dataset. They were measured without fine-tuning on that data.
| split | cosine_pearson | cosine_spearman | euclidean_pearson | euclidean_spearman | manhattan_pearson | manhattan_spearman | dot_pearson | dot_spearman |
|---|---|---|---|---|---|---|---|---|
| train | 0.8735 | 0.8676 | 0.8268 | 0.8357 | 0.8248 | 0.8336 | 0.8449 | 0.8383 |
| validation | 0.5409 | 0.5349 | 0.4786 | 0.4657 | 0.4775 | 0.4625 | 0.5284 | 0.5252 |
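Scores like these can be reproduced along the following lines with sentence-transformers' `EmbeddingSimilarityEvaluator`, assuming the KLUE STS split is loaded through the `datasets` library (the exact evaluation script behind the table above is not included in this card):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")
klue_sts = load_dataset("klue", "sts", split="validation")

# KLUE STS labels range over 0-5; rescale them to 0-1 for the evaluator.
evaluator = EmbeddingSimilarityEvaluator(
    [ex["sentence1"] for ex in klue_sts],
    [ex["sentence2"] for ex in klue_sts],
    [ex["labels"]["label"] / 5.0 for ex in klue_sts],
)
print(evaluator(model))
```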
## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 508, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```