---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language:
- ko
license:
- mit
widget:
  source_sentence: "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€."
  sentences:
    - "미ꡭ의 μˆ˜λ„λŠ” λ‰΄μš•μ΄ μ•„λ‹™λ‹ˆλ‹€."
    - "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„ μš”κΈˆμ€ μ €λ ΄ν•œ νŽΈμž…λ‹ˆλ‹€."
    - "μ„œμšΈμ€ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„μž…λ‹ˆλ‹€."
---

# smartmind/roberta-ko-small-tsdae

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for tasks like clustering or semantic search.

This is a Korean RoBERTa small model pretrained with [TSDAE](https://arxiv.org/abs/2104.06979). Its architecture is identical to [lassl/roberta-ko-small](https://huggingface.co/lassl/roberta-ko-small), but the tokenizer is different.

It can be used directly to compute sentence similarity, or fine-tuned for your own task.

## Usage (Sentence-Transformers)

After installing [sentence-transformers](https://www.SBERT.net), you can load the model directly.

```
pip install -U sentence-transformers
```

Then you can use the model as follows:

```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')
embeddings = model.encode(sentences)
print(embeddings)
```
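
`encode` returns a NumPy array with one 256-dimensional row per input sentence; a quick sanity check (the shape comment assumes the two sentences above):

```python
print(embeddings.shape)  # (2, 256): one 256-dimensional vector per sentence
```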

λ‹€μŒμ€ sentence-transformers의 κΈ°λŠ₯을 μ‚¬μš©ν•˜μ—¬ μ—¬λŸ¬ λ¬Έμž₯의 μœ μ‚¬λ„λ₯Ό κ΅¬ν•˜λŠ” μ˜ˆμ‹œμž…λ‹ˆλ‹€.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

sentences = [
    "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.",
    "미ꡭ의 μˆ˜λ„λŠ” λ‰΄μš•μ΄ μ•„λ‹™λ‹ˆλ‹€.",
    "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„ μš”κΈˆμ€ μ €λ ΄ν•œ νŽΈμž…λ‹ˆλ‹€.",
    "μ„œμšΈμ€ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„μž…λ‹ˆλ‹€.",
    "였늘 μ„œμšΈμ€ ν•˜λ£¨μ’…μΌ λ§‘μŒ",
]

paraphrase = util.paraphrase_mining(model, sentences)
for score, i, j in paraphrase:
    print(f"{sentences[i]}\t\t{sentences[j]}\t\t{score:.4f}")
```

```
λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.		μ„œμšΈμ€ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„μž…λ‹ˆλ‹€.		0.7616
λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.		미ꡭ의 μˆ˜λ„λŠ” λ‰΄μš•μ΄ μ•„λ‹™λ‹ˆλ‹€.		0.7031
λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.		λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„ μš”κΈˆμ€ μ €λ ΄ν•œ νŽΈμž…λ‹ˆλ‹€.		0.6594
미ꡭ의 μˆ˜λ„λŠ” λ‰΄μš•μ΄ μ•„λ‹™λ‹ˆλ‹€.		μ„œμšΈμ€ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„μž…λ‹ˆλ‹€.		0.6445
λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„ μš”κΈˆμ€ μ €λ ΄ν•œ νŽΈμž…λ‹ˆλ‹€.		μ„œμšΈμ€ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„μž…λ‹ˆλ‹€.		0.4915
미ꡭ의 μˆ˜λ„λŠ” λ‰΄μš•μ΄ μ•„λ‹™λ‹ˆλ‹€.		λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„ μš”κΈˆμ€ μ €λ ΄ν•œ νŽΈμž…λ‹ˆλ‹€.		0.4785
μ„œμšΈμ€ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„μž…λ‹ˆλ‹€.		였늘 μ„œμšΈμ€ ν•˜λ£¨μ’…μΌ λ§‘μŒ		0.4119
λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.		였늘 μ„œμšΈμ€ ν•˜λ£¨μ’…μΌ λ§‘μŒ		0.3520
미ꡭ의 μˆ˜λ„λŠ” λ‰΄μš•μ΄ μ•„λ‹™λ‹ˆλ‹€.		였늘 μ„œμšΈμ€ ν•˜λ£¨μ’…μΌ λ§‘μŒ		0.2550
λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„ μš”κΈˆμ€ μ €λ ΄ν•œ νŽΈμž…λ‹ˆλ‹€.		였늘 μ„œμšΈμ€ ν•˜λ£¨μ’…μΌ λ§‘μŒ		0.1896
```
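
Beyond paraphrase mining, the same embeddings can drive semantic search. The sketch below uses `util.semantic_search`; the corpus and query sentences are illustrative, and the printed scores will differ from the values above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('smartmind/roberta-ko-small-tsdae')

corpus = [
    "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.",
    "미ꡭ의 μˆ˜λ„λŠ” λ‰΄μš•μ΄ μ•„λ‹™λ‹ˆλ‹€.",
    "였늘 μ„œμšΈμ€ ν•˜λ£¨μ’…μΌ λ§‘μŒ",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Hypothetical query sentence, used here only for illustration
query_embedding = model.encode("ν•œκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”?", convert_to_tensor=True)

# Retrieve the top-2 most similar corpus sentences by cosine similarity
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(corpus[hit["corpus_id"]], f"{hit['score']:.4f}")
```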


## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net) installed, you can use the model as follows:

```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    return model_output[0][:,0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('smartmind/roberta-ko-small-tsdae')
model = AutoModel.from_pretrained('smartmind/roberta-ko-small-tsdae')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
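
From these embeddings, cosine similarity can be computed directly in PyTorch; a minimal sketch continuing from the block above:

```python
import torch.nn.functional as F

# L2-normalize, then a matrix product yields pairwise cosine similarities
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)
```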



## Evaluation Results

The model achieves the following scores on the [klue](https://huggingface.co/datasets/klue) STS data. These scores were obtained **without** fine-tuning on this data.

|split|cosine_pearson|cosine_spearman|euclidean_pearson|euclidean_spearman|manhattan_pearson|manhattan_spearman|dot_pearson|dot_spearman|
|-----|--------------|---------------|-----------------|------------------|-----------------|------------------|-----------|------------|
|train|0.8735|0.8676|0.8268|0.8357|0.8248|0.8336|0.8449|0.8383|
|validation|0.5409|0.5349|0.4786|0.4657|0.4775|0.4625|0.5284|0.5252|
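
Since the scores above were obtained without fine-tuning, training on labeled sentence pairs should be expected to improve them. Below is a minimal fine-tuning sketch using sentence-transformers' `CosineSimilarityLoss`; the pairs, labels, and hyperparameters are hypothetical placeholders, not the card's training setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("smartmind/roberta-ko-small-tsdae")

# Hypothetical (sentence_a, sentence_b, similarity in [0, 1]) pairs
train_examples = [
    InputExample(texts=["λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.", "μ„œμšΈμ€ λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„μž…λ‹ˆλ‹€."], label=0.9),
    InputExample(texts=["λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€.", "였늘 μ„œμšΈμ€ ν•˜λ£¨μ’…μΌ λ§‘μŒ"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```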


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 508, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```

## Citing & Authors

<!--- Describe where people can find more information -->