---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- ko
license: mit
widget:
- source_sentence: "대한민국의 수도는 서울입니다."
  sentences:
    - "미국의 수도는 뉴욕이 아닙니다."
    - "대한민국의 수도 요금은 저렴한 편입니다."
    - "서울은 대한민국의 수도입니다."
---
|
|
|
|
|
# smartmind/roberta-ko-small-tsdae |
|
|
|
|
|
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for tasks like clustering or semantic search.
|
|
|
|
|
This Korean RoBERTa small model was pretrained with [TSDAE](https://arxiv.org/abs/2104.06979). Its architecture is identical to [lassl/roberta-ko-small](https://huggingface.co/lassl/roberta-ko-small), but it uses a different tokenizer.

You can use it for sentence similarity as-is, or fine-tune it for your own task, as in the sketch below.
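For fine-tuning, here is a minimal sketch using the standard sentence-transformers training loop; the training pairs and labels below are purely illustrative, and `CosineSimilarityLoss` is just one reasonable choice of objective:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")

# Hypothetical pairs with similarity labels in [0, 1]
train_examples = [
    InputExample(texts=["대한민국의 수도는 서울입니다.", "서울은 대한민국의 수도입니다."], label=0.9),
    InputExample(texts=["대한민국의 수도는 서울입니다.", "오늘 서울은 하루종일 맑음"], label=0.2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

# One short epoch, just to illustrate the API
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```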
|
|
|
|
|
## Usage (Sentence-Transformers) |
|
|
|
|
|
Once you have [sentence-transformers](https://www.SBERT.net) installed, you can load the model directly:
|
|
|
|
|
```
pip install -U sentence-transformers
```
|
|
|
|
|
Then you can use the model as follows:
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
embeddings = model.encode(sentences)
print(embeddings)
```
|
|
|
|
|
The following example uses the sentence-transformers utilities to score the similarity of several sentences against each other:
|
|
|
|
|
```python
from sentence_transformers import util

sentences = [
    "대한민국의 수도는 서울입니다.",
    "미국의 수도는 뉴욕이 아닙니다.",
    "대한민국의 수도 요금은 저렴한 편입니다.",
    "서울은 대한민국의 수도입니다.",
    "오늘 서울은 하루종일 맑음",
]

# `model` is the SentenceTransformer loaded above
paraphrase = util.paraphrase_mining(model, sentences)
for score, i, j in paraphrase:
    print(f"{sentences[i]}\t\t{sentences[j]}\t\t{score:.4f}")
```
|
|
|
|
|
```
대한민국의 수도는 서울입니다.		서울은 대한민국의 수도입니다.		0.7616
대한민국의 수도는 서울입니다.		미국의 수도는 뉴욕이 아닙니다.		0.7031
대한민국의 수도는 서울입니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.6594
미국의 수도는 뉴욕이 아닙니다.		서울은 대한민국의 수도입니다.		0.6445
대한민국의 수도 요금은 저렴한 편입니다.		서울은 대한민국의 수도입니다.		0.4915
미국의 수도는 뉴욕이 아닙니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.4785
서울은 대한민국의 수도입니다.		오늘 서울은 하루종일 맑음		0.4119
대한민국의 수도는 서울입니다.		오늘 서울은 하루종일 맑음		0.3520
미국의 수도는 뉴욕이 아닙니다.		오늘 서울은 하루종일 맑음		0.2550
대한민국의 수도 요금은 저렴한 편입니다.		오늘 서울은 하루종일 맑음		0.1896
```
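If you only need a score for one pair of sentences, a minimal sketch that embeds the pair and compares the vectors with `util.cos_sim` (reusing `model` from above):

```python
from sentence_transformers import util

# Embed a pair and score it directly with cosine similarity
embeddings = model.encode(
    ["대한민국의 수도는 서울입니다.", "서울은 대한민국의 수도입니다."],
    convert_to_tensor=True,
)
print(util.cos_sim(embeddings[0], embeddings[1]))  # ~0.76, matching the first row above
```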
|
|
|
|
|
|
|
|
## Usage (HuggingFace Transformers) |
|
|
|
|
|
If you don't have [sentence-transformers](https://www.SBERT.net) installed, you can use the model as follows: pass your input through the transformer model, then apply CLS pooling on top of the contextualized token embeddings.
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # Use the [CLS] (first) token embedding as the sentence embedding
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('smartmind/roberta-ko-small-tsdae')
model = AutoModel.from_pretrained('smartmind/roberta-ko-small-tsdae')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
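To compare these raw embeddings the way the Sentence-Transformers example does, one option is to L2-normalize them and take the dot product, which equals cosine similarity; a short sketch continuing from the snippet above:

```python
import torch.nn.functional as F

# Pairwise cosine-similarity matrix of the sentence embeddings computed above
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)
```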
|
|
|
|
|
|
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
The model achieves the following scores on the [klue](https://huggingface.co/datasets/klue) STS data, **without** any fine-tuning on it:
|
|
|
|
|
|split|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
|-----|--------------|---------------|-----------------|------------------|-----------------|------------------|-----------|------------|
|train|0.8735|0.8676|0.8268|0.8357|0.8248|0.8336|0.8449|0.8383|
|validation|0.5409|0.5349|0.4786|0.4657|0.4775|0.4625|0.5284|0.5252|
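A sketch of how scores like these can be computed with the `EmbeddingSimilarityEvaluator` shipped with sentence-transformers, assuming the `klue`/`sts` layout on the Hub where `labels["real-label"]` holds the 0-5 gold score:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")
sts = load_dataset("klue", "sts", split="validation")

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=sts["sentence1"],
    sentences2=sts["sentence2"],
    scores=[label["real-label"] / 5.0 for label in sts["labels"]],  # rescale 0-5 to 0-1
    name="klue-sts-validation",
)
print(evaluator(model))  # main similarity score for the split
```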
|
|
|
|
|
|
|
|
## Full Model Architecture |
|
|
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 508, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
|
|
|
|
|
## Citing & Authors |
|
|
|
|
|
<!--- Describe where people can find more information --> |
|
|
|