---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- ko
license: mit
widget:
  - source_sentence: "대한민국의 수도는 서울입니다."
    sentences:
      - "미국의 수도는 뉴욕이 아닙니다."
      - "대한민국의 수도 요금은 저렴한 편입니다."
      - "서울은 대한민국의 수도입니다."
---
# smartmind/roberta-ko-small-tsdae
This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for tasks like clustering or semantic search.

It is a Korean RoBERTa small model pretrained with [TSDAE](https://arxiv.org/abs/2104.06979). Its architecture is identical to [lassl/roberta-ko-small](https://huggingface.co/lassl/roberta-ko-small), but the tokenizer is different.

You can use the model as-is to compute sentence similarity, or fine-tune it for your own task, as sketched below.
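If you do fine-tune, a minimal sketch with the classic sentence-transformers training loop might look like the following (the sentence pairs and similarity labels are made-up placeholders, not data from this card):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

# Hypothetical labeled pairs: similarity labels in [0, 1]
train_examples = [
    InputExample(texts=["첫 번째 문장", "두 번째 문장"], label=0.9),
    InputExample(texts=["세 번째 문장", "네 번째 문장"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```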
## Usage (Sentence-Transformers)
After installing [sentence-transformers](https://www.SBERT.net), you can load the model right away:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
embeddings = model.encode(sentences)
print(embeddings)
```
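To turn these embeddings into a similarity score, one option (a small sketch, not part of the original card) is `util.cos_sim`:

```python
from sentence_transformers import util

# Cosine similarity between the two embeddings computed above
print(util.cos_sim(embeddings[0], embeddings[1]))
```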
The following example uses sentence-transformers utilities to score the similarity of several sentence pairs:
```python
from sentence_transformers import util

sentences = [
    "대한민국의 수도는 서울입니다.",
    "미국의 수도는 뉴욕이 아닙니다.",
    "대한민국의 수도 요금은 저렴한 편입니다.",
    "서울은 대한민국의 수도입니다.",
    "오늘 서울은 하루종일 맑음",
]

# Score every sentence pair with the `model` loaded above
paraphrase = util.paraphrase_mining(model, sentences)

for score, i, j in paraphrase:
    print(f"{sentences[i]}\t\t{sentences[j]}\t\t{score:.4f}")
```
```
대한민국의 수도는 서울입니다.		서울은 대한민국의 수도입니다.		0.7616
대한민국의 수도는 서울입니다.		미국의 수도는 뉴욕이 아닙니다.		0.7031
대한민국의 수도는 서울입니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.6594
미국의 수도는 뉴욕이 아닙니다.		서울은 대한민국의 수도입니다.		0.6445
대한민국의 수도 요금은 저렴한 편입니다.		서울은 대한민국의 수도입니다.		0.4915
미국의 수도는 뉴욕이 아닙니다.		대한민국의 수도 요금은 저렴한 편입니다.		0.4785
서울은 대한민국의 수도입니다.		오늘 서울은 하루종일 맑음		0.4119
대한민국의 수도는 서울입니다.		오늘 서울은 하루종일 맑음		0.3520
미국의 수도는 뉴욕이 아닙니다.		오늘 서울은 하루종일 맑음		0.2550
대한민국의 수도 요금은 저렴한 편입니다.		오늘 서울은 하루종일 맑음		0.1896
```
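`util.paraphrase_mining` returns `(score, i, j)` triples sorted by decreasing cosine similarity. For retrieval-style use, `util.semantic_search` works similarly; here is a small sketch reusing the sentences above (the query string is a made-up placeholder):

```python
from sentence_transformers import util

corpus_embeddings = model.encode(sentences, convert_to_tensor=True)
query_embedding = model.encode("서울의 오늘 날씨는?", convert_to_tensor=True)

# Top-3 most similar corpus sentences for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)
for hit in hits[0]:
    print(sentences[hit['corpus_id']], f"{hit['score']:.4f}")
```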
## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net) installed, you can use the model as follows:
```python
from transformers import AutoTokenizer, AutoModel
import torch

# CLS pooling: take the embedding of the first ([CLS]) token
def cls_pooling(model_output, attention_mask):
    return model_output[0][:, 0]

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('smartmind/roberta-ko-small-tsdae')
model = AutoModel.from_pretrained('smartmind/roberta-ko-small-tsdae')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
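To get similarity scores from these embeddings without sentence-transformers, a short continuation (an illustration, not from the original card) normalizes them and takes dot products:

```python
import torch.nn.functional as F

# After L2 normalization, the dot product equals cosine similarity
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)
```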
## Evaluation Results
The model achieves the following scores on the [klue](https://huggingface.co/datasets/klue) STS data, **without** any fine-tuning on it:
|split|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
|-----|--------------|---------------|-----------------|------------------|-----------------|------------------|-----------|------------|
|train|0.8735|0.8676|0.8268|0.8357|0.8248|0.8336|0.8449|0.8383|
|validation|0.5409|0.5349|0.4786|0.4657|0.4775|0.4625|0.5284|0.5252|
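A sketch of how such scores can be computed with sentence-transformers' `EmbeddingSimilarityEvaluator` (this assumes KLUE STS stores its 0-5 gold score under `labels.label`; it is an illustration, not the authors' evaluation script):

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
sts = load_dataset('klue', 'sts', split='validation')

# Rescale the 0-5 STS labels to [0, 1] as the evaluator expects
examples = [
    InputExample(texts=[row['sentence1'], row['sentence2']],
                 label=row['labels']['label'] / 5.0)
    for row in sts
]

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples, name='klue-sts')
print(evaluator(model))  # main score: Spearman correlation of cosine similarity
```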
## Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 508, 'do_lower_case': False}) with Transformer model: RobertaModel
(1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
## Citing & Authors
<!--- Describe where people can find more information -->