---
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
  - ko
  - en
widget:
  source_sentence: ๋Œ€ํ•œ๋ฏผ๊ตญ์˜ ์ˆ˜๋„๋Š”?
  sentences:
    - ์„œ์šธํŠน๋ณ„์‹œ๋Š” ํ•œ๊ตญ์ด ์ •์น˜,๊ฒฝ์ œ,๋ฌธํ™” ์ค‘์‹ฌ ๋„์‹œ์ด๋‹ค.
    - ๋ถ€์‚ฐ์€ ๋Œ€ํ•œ๋ฏผ๊ตญ์˜ ์ œ2์˜ ๋„์‹œ์ด์ž ์ตœ๋Œ€์˜ ํ•ด์–‘ ๋ฌผ๋ฅ˜ ๋„์‹œ์ด๋‹ค.
    - ์ œ์ฃผ๋„๋Š” ๋Œ€ํ•œ๋ฏผ๊ตญ์—์„œ ์œ ๋ช…ํ•œ ๊ด€๊ด‘์ง€์ด๋‹ค
    - Seoul is the capital of Korea
    - ์šธ์‚ฐ๊ด‘์—ญ์‹œ๋Š” ๋Œ€ํ•œ๋ฏผ๊ตญ ๋‚จ๋™๋ถ€ ํ•ด์•ˆ์— ์žˆ๋Š” ๊ด‘์—ญ์‹œ์ด๋‹ค

moco-sentencebertV2.0

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

  • ์ด ๋ชจ๋ธ์€ bongsoo/mbertV2.0 MLM ๋ชจ๋ธ์„
    sentencebert๋กœ ๋งŒ๋“  ํ›„,์ถ”๊ฐ€์ ์œผ๋กœ STS Tearch-student ์ฆ๋ฅ˜ ํ•™์Šต ์‹œ์ผœ ๋งŒ๋“  ๋ชจ๋ธ ์ž…๋‹ˆ๋‹ค.
  • vocab: 152,537 ๊ฐœ(๊ธฐ์กด 119,548 vocab ์— 32,989 ์‹ ๊ทœ vocab ์ถ”๊ฐ€)
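
A minimal sketch (assuming the standard sentence-transformers modules, not the author's exact code) of how an MLM checkpoint such as bongsoo/mbertV2.0 can be wrapped into a SentenceBERT model with mean pooling, matching the architecture listed at the bottom of this card:

from sentence_transformers import SentenceTransformer, models

# Load the MLM checkpoint as a word-embedding transformer
word_embedding_model = models.Transformer('bongsoo/mbertV2.0', max_seq_length=128)
# Mean pooling over token embeddings, as in the Full Model Architecture section
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])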

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('bongsoo/moco-sentencebertV2.0')
embeddings = model.encode(sentences)
print(embeddings)

# Compute the cosine score with sklearn
# => the embeddings passed in must be 2D, e.g. shape (1, 768)
from sklearn.metrics.pairwise import paired_cosine_distances
cosine_scores = 1 - (paired_cosine_distances(embeddings[0].reshape(1,-1), embeddings[1].reshape(1,-1)))

print(f'*cosine_score:{cosine_scores[0]}')

์ถœ๋ ฅ(Outputs)

[[ 0.16649279 -0.2933038  -0.00391259 ...  0.00720964  0.18175027  -0.21052675]
 [ 0.10106096 -0.11454111 -0.00378215 ... -0.009032   -0.2111504   -0.15030429]]
*cosine_score:0.3352515697479248
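
The same score can also be computed without sklearn, using the util helpers bundled with sentence-transformers:

from sentence_transformers import util

# util.cos_sim accepts 1-D vectors and returns a (1, 1) tensor
cosine_score = util.cos_sim(embeddings[0], embeddings[1])
print(f'*cosine_score:{cosine_score.item()}')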

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/moco-sentencebertV2.0')
model = AutoModel.from_pretrained('bongsoo/moco-sentencebertV2.0')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

# Compute the cosine score with sklearn
# => the embeddings passed in must be 2D, e.g. shape (1, 768)
from sklearn.metrics.pairwise import paired_cosine_distances
cosine_scores = 1 - (paired_cosine_distances(sentence_embeddings[0].reshape(1,-1), sentence_embeddings[1].reshape(1,-1)))

print(f'*cosine_score:{cosine_scores[0]}')

์ถœ๋ ฅ(Outputs)

Sentence embeddings:
tensor([[ 0.1665, -0.2933, -0.0039,  ...,  0.0072,  0.1818, -0.2105],
        [ 0.1011, -0.1145, -0.0038,  ..., -0.0090, -0.2112, -0.1503]])
*cosine_score:0.3352515697479248

Evaluation Results

  • ์„ฑ๋Šฅ ์ธก์ •์„ ์œ„ํ•œ ๋ง๋ญ‰์น˜๋Š”, ์•„๋ž˜ ํ•œ๊ตญ์–ด (kor), ์˜์–ด(en) ํ‰๊ฐ€ ๋ง๋ญ‰์น˜๋ฅผ ์ด์šฉํ•จ
    ํ•œ๊ตญ์–ด : korsts(1,379์Œ๋ฌธ์žฅ) ์™€ klue-sts(519์Œ๋ฌธ์žฅ)
    ์˜์–ด : stsb_multi_mt(1,376์Œ๋ฌธ์žฅ) ์™€ glue:stsb (1,500์Œ๋ฌธ์žฅ)
  • ์„ฑ๋Šฅ ์ง€ํ‘œ๋Š” cosin.spearman ์ธก์ •ํ•˜์—ฌ ๋น„๊ตํ•จ.
  • ํ‰๊ฐ€ ์ธก์ • ์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ ์ฐธ์กฐ
๋ชจ๋ธ korsts klue-sts korsts+klue-sts stsb_multi_mt glue(stsb)
distiluse-base-multilingual-cased-v2 0.747 0.785 0.577 0.807 0.819
paraphrase-multilingual-mpnet-base-v2 0.820 0.799 0.711 0.868 0.890
bongsoo/sentencedistilbertV1.2 0.819 0.858 0.630 0.837 0.873
bongsoo/moco-sentencedistilbertV2.0 0.812 0.847 0.627 0.837 0.877
bongsoo/moco-sentencebertV2.0 0.824 0.841 0.635 0.843 0.879

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net
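
As a rough illustration of the cosine Spearman metric, the built-in sentence-transformers evaluator can be used as follows; the inline pairs are placeholders, not the actual evaluation corpora:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('bongsoo/moco-sentencebertV2.0')

# Placeholder pairs; a real run loads korsts/klue-sts/stsb_multi_mt/glue:stsb instead
sentences1 = ['한 남자가 음식을 먹는다.', 'A man is eating food.']
sentences2 = ['한 남자가 빵을 먹는다.', 'A man is eating bread.']
gold_scores = [0.7, 0.7]  # gold similarity labels scaled to [0, 1]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)
print(evaluator(model))  # main score: the cosine Spearman correlation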

Training(ํ›ˆ๋ จ ๊ณผ์ •)

The model was trained in the following four stages:

1. MLM ํ›ˆ๋ จ

  • ์ž…๋ ฅ ๋ชจ๋ธ : bert-base-multilingual-cased
  • ๋ง๋ญ‰์น˜ : ํ›ˆ๋ จ : bongsoo/moco-corpus-kowiki2022(7.6M) , ํ‰๊ฐ€: bongsoo/bongevalsmall
  • HyperParameter : LearningRate : 5e-5, epochs: 8, batchsize: 32, max_token_len : 128
  • vocab : 152,537๊ฐœ (๊ธฐ์กด 119,548 ์— 32,989 ์‹ ๊ทœ vocab ์ถ”๊ฐ€)
  • ์ถœ๋ ฅ ๋ชจ๋ธ : mbertV2.0 (size: 813MB)
  • ํ›ˆ๋ จ์‹œ๊ฐ„ : 90h/1GPU (24GB/19.6GB use)
  • loss : ํ›ˆ๋ จloss: 2.258400, ํ‰๊ฐ€loss: 3.102096, perplexity: 19.78158(bong_eval:1,500)
  • ํ›ˆ๋ จ์ฝ”๋“œ ์—ฌ๊ธฐ ์ฐธ์กฐ

2. STS ํ›ˆ๋ จ =>bert๋ฅผ sentencebert๋กœ ๋งŒ๋“ฌ.

  • ์ž…๋ ฅ ๋ชจ๋ธ : mbertV2.0
  • ๋ง๋ญ‰์น˜ : korsts + kluestsV1.1 + stsb_multi_mt + mteb/sickr-sts (์ด:33,093)
  • HyperParameter : LearningRate : 3e-5, epochs: 200, batchsize: 32, max_token_len : 128
  • ์ถœ๋ ฅ ๋ชจ๋ธ : sbert-mbertV2.0 (size: 813MB)
  • ํ›ˆ๋ จ์‹œ๊ฐ„ : 9h20m/1GPU (24GB/9.0GB use)
  • loss(cosin_spearman) : 0.799(๋ง๋ญ‰์น˜:korsts(tune_test.tsv))
  • ํ›ˆ๋ จ์ฝ”๋“œ ์—ฌ๊ธฐ ์ฐธ์กฐ

3.์ฆ๋ฅ˜(distilation) ํ›ˆ๋ จ

  • ํ•™์ƒ ๋ชจ๋ธ : sbert-mbertV2.0
  • ๊ต์‚ฌ ๋ชจ๋ธ : paraphrase-multilingual-mpnet-base-v2
  • ๋ง๋ญ‰์น˜ : en_ko_train.tsv(ํ•œ๊ตญ์–ด-์˜์–ด ์‚ฌํšŒ๊ณผํ•™๋ถ„์•ผ ๋ณ‘๋ ฌ ๋ง๋ญ‰์น˜ : 1.1M)
  • HyperParameter : LearningRate : 5e-5, epochs: 40, batchsize: 128, max_token_len : 128
  • ์ถœ๋ ฅ ๋ชจ๋ธ : sbert-mlbertV2.0-distil
  • ํ›ˆ๋ จ์‹œ๊ฐ„ : 17h/1GPU (24GB/18.6GB use)
  • ํ›ˆ๋ จ์ฝ”๋“œ ์—ฌ๊ธฐ ์ฐธ์กฐ

4.STS ํ›ˆ๋ จ => sentencebert ๋ชจ๋ธ์„ sts ํ›ˆ๋ จ์‹œํ‚ด

  • ์ž…๋ ฅ ๋ชจ๋ธ : sbert-mlbertV2.0-distil
  • ๋ง๋ญ‰์น˜ : korsts(5,749) + kluestsV1.1(11,668) + stsb_multi_mt(5,749) + mteb/sickr-sts(9,927) + glue stsb(5,749) (์ด:38,842)
  • HyperParameter : LearningRate : 3e-5, epochs: 800, batchsize: 64, max_token_len : 128
  • ์ถœ๋ ฅ ๋ชจ๋ธ : moco-sentencebertV2.0
  • ํ›ˆ๋ จ์‹œ๊ฐ„ : 25h/1GPU (24GB/13GB use)
  • ํ›ˆ๋ จ์ฝ”๋“œ ์—ฌ๊ธฐ ์ฐธ์กฐ


๋ชจ๋ธ ์ œ์ž‘ ๊ณผ์ •์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ ์—ฌ๊ธฐ๋ฅผ ์ฐธ์กฐ ํ•˜์„ธ์š”.

DataLoader:

torch.utils.data.dataloader.DataLoader of length 1035 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Config:

{
  "_name_or_path": "../../data11/model/sbert/sbert-mbertV2.0-distil",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.21.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 152537
}

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citing & Authors

bongsoo