Bingsu/bigbird_ko_base-tsdae-specialty_corpus

A BigBird model trained with sentence-transformers: it maps input sentences to 256-dimensional vectors.

It was trained with TSDAE on the AIHub specialized-domain corpus.

Usage (Sentence-Transformers)

Install sentence-transformers before use:

pip install -U sentence-transformers

or

conda install -c conda-forge sentence-transformers

Usage example:

from sentence_transformers import SentenceTransformer, util

# Load the model from the Hugging Face Hub
model = SentenceTransformer('Bingsu/bigbird_ko_base-tsdae-specialty_corpus')

sent = [
    "๋ณธ ๋…ผ๋ฌธ์€ ๋””์ง€ํ„ธ ์‹ ํ˜ธ์ฒ˜๋ฆฌ์šฉ VLSI์˜ ์ž๋™์„ค๊ณ„๋ฅผ ์œ„ํ•œ SODAS-DSP(SOgang Design Automation System-DSP) ์‹œ์Šคํ…œ์˜ ์„ค๊ณ„์™€ ๊ฐœ๋ฐœ ๊ฒฐ๊ณผ์— ๋Œ€ํ•˜์—ฌ ๊ธฐ์ˆ ํ•œ๋‹ค",
    "๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” DD-Gardner๋ฐฉ์‹์˜ ํƒ€์ด๋ฐ ๊ฒ€์ถœ๊ธฐ ์„ฑ๋Šฅ์„ ๊ณ ์ฐฐํ•œ๋‹ค.",
    "์ด๋Ÿฌํ•œ ํ•ด์„๋ฐฉ๋ฒ•์€ ๋งค์šฐ ๋ณต์žกํ•œ ๊ฒƒ์ด์–ด์„œ ์ˆ˜์น˜ ํ•ด์„ ํ”„๋กœ๊ทธ๋žจ์ด ํ•„์ˆ˜์  ์ด๋‹ค.",
    "์ˆ˜์น˜ ํ•ด์„ ํ”„๋กœ๊ทธ๋žจ์€ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ํ™˜๊ฒฝ ๋ณ€์ˆ˜๋ฅผ ์ž…๋ ฅํ•ด์•ผ ํ•˜๋ฏ€๋กœ ์ผ๋ฐ˜์ธ์ด ์‚ฌ์šฉํ•˜๊ธฐ์—๋Š” ๋งŽ์€ ์–ด๋ ค์›€์ด ์žˆ๋‹ค.",
    "๋˜ ์‚ฐ๋ž€๊ณผ ํˆฌ๊ณผ์— ๋Œ€ํ•œ ๊ณ ์ฃผํŒŒ ๊ทผ์‚ฌ์‹๋„ ์–ป์–ด์ง„๋‹ค.",
    "๊ทธ๋ฆฌ๊ณ  ์Šฌ๋ฆฟ๊ฐ„์˜ ๊ฐ„๊ฒฉ์˜ ๋ณ€ํ™”์— ์˜ํ•ด์„œ ๋น”ํญ(beamwidth)์„ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค€๋‹ค.",
    "์˜ค๋Š˜ ์ ์‹ฌ์€ ์งœ์žฅ๋ฉด์ด๋‹ค.",
    "์˜ค๋Š˜ ์ €๋…์€ ๊น€๋ฐฅ์ฒœ๊ตญ์ด๋‹ค."
]


paraphrases = util.paraphrase_mining(model, sent)

for paraphrase in paraphrases[:5]:
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sent[i], sent[j], score))
Output:

이러한 해석방법은 매우 복잡한 것이어서 수치 해석 프로그램이 필수적이다. 		 수치 해석 프로그램은 여러 가지 환경 변수를 입력해야 하므로 일반인이 사용하기에는 많은 어려움이 있다. 		 Score: 0.8990
오늘 점심은 짜장면이다. 		 오늘 저녁은 김밥천국이다. 		 Score: 0.8945
수치 해석 프로그램은 여러 가지 환경 변수를 입력해야 하므로 일반인이 사용하기에는 많은 어려움이 있다. 		 오늘 저녁은 김밥천국이다. 		 Score: 0.8901
본 논문은 디지털 신호처리용 VLSI의 자동설계를 위한 SODAS-DSP(SOgang Design Automation System-DSP) 시스템의 설계와 개발 결과에 대하여 기술한다 		 본 논문에서는 DD-Gardner방식의 타이밍 검출기 성능을 고찰한다. 		 Score: 0.8894
본 논문은 디지털 신호처리용 VLSI의 자동설계를 위한 SODAS-DSP(SOgang Design Automation System-DSP) 시스템의 설계와 개발 결과에 대하여 기술한다 		 그리고 슬릿간의 간격의 변화에 의해서 빔폭(beamwidth)을 조절할 수 있음을 보여준다. 		 Score: 0.8889
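
You can also get the raw sentence embeddings directly with model.encode. This short sketch (not part of the original card) reuses model and sent from above; per the pooling configuration listed under Full Model Architecture, each sentence maps to a 256-dimensional vector:

embeddings = model.encode(sent)
print(embeddings.shape)  # expected: (8, 256), one 256-dimensional vector per sentence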

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # CLS pooling: use the embedding of the first ([CLS]) token as the sentence embedding.
    # attention_mask is unused here; it is kept for signature compatibility with other pooling functions.
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('Bingsu/bigbird_ko_base-tsdae-specialty_corpus')
model = AutoModel.from_pretrained('Bingsu/bigbird_ko_base-tsdae-specialty_corpus')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
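
To compare two of these embeddings, one option (an illustration, not from the original card) is cosine similarity:

import torch.nn.functional as F

# Cosine similarity between the embeddings of the two example sentences
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")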

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 183287 with parameters:

{'batch_size': 8, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.DenoisingAutoEncoderLoss.DenoisingAutoEncoderLoss

Parameters of the fit() method:

{
    "epochs": 2,
    "evaluation_steps": 10000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'bitsandbytes.optim.adamw.AdamW8bit'>",
    "optimizer_params": {
        "lr": 3e-05
    },
    "scheduler": "warmupcosinewithhardrestarts",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.005
}
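
Put together, a run with these parameters could look roughly like the sketch below; it is an assumption-laden reconstruction, not the exact training script. The base checkpoint (monologg/kobigbird-bert-base) and the corpus loading are placeholders, while the loss, optimizer, scheduler, and hyperparameters follow the values listed above.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses
import bitsandbytes as bnb

# Placeholder: replace with sentences from the AIHub specialized-domain corpus
sentences = ["예시 문장 1", "예시 문장 2"]

# The base checkpoint is an assumption; the card only implies a Korean BigBird base model
base = "monologg/kobigbird-bert-base"
word_embedding_model = models.Transformer(base, max_seq_length=1024)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# TSDAE: the model learns to denoise corrupted sentences back to the originals
train_dataset = datasets.DenoisingAutoEncoderDataset(sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=base, tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    weight_decay=0.005,
    warmup_steps=10000,
    scheduler="warmupcosinewithhardrestarts",
    optimizer_class=bnb.optim.AdamW8bit,
    optimizer_params={"lr": 3e-5},
    max_grad_norm=1,
)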

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: BigBirdModel 
  (1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
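
These values can be checked directly after loading the model (a quick verification snippet, not from the original card):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Bingsu/bigbird_ko_base-tsdae-specialty_corpus')
print(model.get_sentence_embedding_dimension())  # expected: 256
print(model.max_seq_length)                      # expected: 1024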
