BERT[[BERT]]

๊ฐœ์š”[[Overview]]

BERT ๋ชจ๋ธ์€ Jacob Devlin. Ming-Wei Chang, Kenton Lee, Kristina Touranova๊ฐ€ ์ œ์•ˆํ•œ ๋…ผ๋ฌธ BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding์—์„œ ์†Œ๊ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. BERT๋Š” ์‚ฌ์ „ ํ•™์Šต๋œ ์–‘๋ฐฉํ–ฅ ํŠธ๋žœ์Šคํฌ๋จธ๋กœ, Toronto Book Corpus์™€ Wikipedia๋กœ ๊ตฌ์„ฑ๋œ ๋Œ€๊ทœ๋ชจ ์ฝ”ํผ์Šค์—์„œ ๋งˆ์Šคํ‚น๋œ ์–ธ์–ด ๋ชจ๋ธ๋ง๊ณผ ๋‹ค์Œ ๋ฌธ์žฅ ์˜ˆ์ธก(Next Sentence Prediction) ๋ชฉํ‘œ๋ฅผ ๊ฒฐํ•ฉํ•ด ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

ํ•ด๋‹น ๋…ผ๋ฌธ์˜ ์ดˆ๋ก์ž…๋‹ˆ๋‹ค:

We introduce a new language representation model called BERT (Bidirectional Encoder Representations from Transformers). Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer for a wide range of tasks, such as question answering and language inference, without task-specific architecture modifications.

BERT๋Š” ๊ฐœ๋…์ ์œผ๋กœ ๋‹จ์ˆœํ•˜๋ฉด์„œ๋„ ์‹ค์ฆ์ ์œผ๋กœ ๊ฐ•๋ ฅํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. BERT๋Š” 11๊ฐœ์˜ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๊ณผ์ œ์—์„œ ์ƒˆ๋กœ์šด ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ์œผ๋ฉฐ, GLUE ์ ์ˆ˜๋ฅผ 80.5% (7.7% ํฌ์ธํŠธ ์ ˆ๋Œ€ ๊ฐœ์„ )๋กœ, MultiNLI ์ •ํ™•๋„๋ฅผ 86.7% (4.6% ํฌ์ธํŠธ ์ ˆ๋Œ€ ๊ฐœ์„ ), SQuAD v1.1 ์งˆ๋ฌธ ์‘๋‹ต ํ…Œ์ŠคํŠธ์—์„œ F1 ์ ์ˆ˜๋ฅผ 93.2 (1.5% ํฌ์ธํŠธ ์ ˆ๋Œ€ ๊ฐœ์„ )๋กœ, SQuAD v2.0์—์„œ F1 ์ ์ˆ˜๋ฅผ 83.1 (5.1% ํฌ์ธํŠธ ์ ˆ๋Œ€ ๊ฐœ์„ )๋กœ ํ–ฅ์ƒ์‹œ์ผฐ์Šต๋‹ˆ๋‹ค.

์ด ๋ชจ๋ธ์€ thomwolf๊ฐ€ ๊ธฐ์—ฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์›๋ณธ ์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์‚ฌ์šฉ ํŒ[[Usage tips]]

  • BERT is a model with absolute position embeddings, so it is usually advised to pad the inputs on the right rather than the left.

  • BERT๋Š” ๋งˆ์Šคํ‚น๋œ ์–ธ์–ด ๋ชจ๋ธ(MLM)๊ณผ Next Sentence Prediction(NSP) ๋ชฉํ‘œ๋กœ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๋งˆ์Šคํ‚น๋œ ํ† ํฐ ์˜ˆ์ธก๊ณผ ์ „๋ฐ˜์ ์ธ ์ž์—ฐ์–ด ์ดํ•ด(NLU)์— ๋›ฐ์–ด๋‚˜์ง€๋งŒ, ํ…์ŠคํŠธ ์ƒ์„ฑ์—๋Š” ์ตœ์ ํ™”๋˜์–ด์žˆ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

  • BERT์˜ ์‚ฌ์ „ ํ•™์Šต ๊ณผ์ •์—์„œ๋Š” ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฌด์ž‘์œ„๋กœ ๋งˆ์Šคํ‚นํ•˜์—ฌ ์ผ๋ถ€ ํ† ํฐ์„ ๋งˆ์Šคํ‚นํ•ฉ๋‹ˆ๋‹ค. ์ „์ฒด ํ† ํฐ ์ค‘ ์•ฝ 15%๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ๋งˆ์Šคํ‚น๋ฉ๋‹ˆ๋‹ค:

    • replaced by the mask token with probability 0.8
    • replaced by a random token (different from the original) with probability 0.1
    • left unchanged with probability 0.1
  • ๋ชจ๋ธ์˜ ์ฃผ์š” ๋ชฉํ‘œ๋Š” ์›๋ณธ ๋ฌธ์žฅ์„ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ด์ง€๋งŒ, ๋‘ ๋ฒˆ์งธ ๋ชฉํ‘œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค: ์ž…๋ ฅ์œผ๋กœ ๋ฌธ์žฅ A์™€ B (์‚ฌ์ด์—๋Š” ๊ตฌ๋ถ„ ํ† ํฐ์ด ์žˆ์Œ)๊ฐ€ ์ฃผ์–ด์ง‘๋‹ˆ๋‹ค. ์ด ๋ฌธ์žฅ ์Œ์ด ์—ฐ์†๋  ํ™•๋ฅ ์€ 50%์ด๋ฉฐ, ๋‚˜๋จธ์ง€ 50%๋Š” ์„œ๋กœ ๋ฌด๊ด€ํ•œ ๋ฌธ์žฅ๋“ค์ž…๋‹ˆ๋‹ค. ๋ชจ๋ธ์€ ์ด ๋‘ ๋ฌธ์žฅ์ด ์•„๋‹Œ์ง€๋ฅผ ์˜ˆ์ธกํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

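The second (NSP) objective can be probed directly with the library's BertForNextSentencePrediction head; the sentence pair below is illustrative. Logit index 0 corresponds to "sentence B follows sentence A" and index 1 to "sentence B is random".

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

prompt = "In Italy, pizza served in formal settings is presented unsliced."
next_sentence = "The sky is blue due to the shorter wavelength of blue light."
encoding = tokenizer(prompt, next_sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**encoding).logits
print(logits.argmax(dim=-1))  # tensor([1]): the second sentence is unrelated
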
Using Scaled Dot Product Attention (SDPA)[[Using Scaled Dot Product Attention (SDPA)]]

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you can also set attn_implementation="sdpa" in the from_pretrained() function to explicitly request that SDPA be used.

import torch
from transformers import BertModel

# load in half precision and explicitly request the SDPA attention implementation
model = BertModel.from_pretrained("bert-base-uncased", torch_dtype=torch.float16, attn_implementation="sdpa")
...

์ตœ์  ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด ๋ชจ๋ธ์„ ๋ฐ˜์ •๋ฐ€๋„(์˜ˆ: torch.float16 ๋˜๋Š” torch.bfloat16)๋กœ ๋ถˆ๋Ÿฌ์˜ค๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค.

๋กœ์ปฌ ๋ฒค์น˜๋งˆํฌ (A100-80GB, CPUx12, RAM 96.6GB, PyTorch 2.2.0, OS Ubuntu 22.04)์—์„œ float16์„ ์‚ฌ์šฉํ•ด ํ•™์Šต ๋ฐ ์ถ”๋ก ์„ ์ˆ˜ํ–‰ํ•œ ๊ฒฐ๊ณผ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์†๋„ ํ–ฅ์ƒ์ด ๊ด€์ฐฐ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Training[[Training]]

| batch_size | seq_len | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | sdpa peak mem (MB) | Mem saving (%) |
|---|---|---|---|---|---|---|---|
| 4 | 256 | 0.023 | 0.017 | 35.472 | 939.213 | 764.834 | 22.800 |
| 4 | 512 | 0.023 | 0.018 | 23.687 | 1970.447 | 1227.162 | 60.569 |
| 8 | 256 | 0.023 | 0.018 | 23.491 | 1594.295 | 1226.114 | 30.028 |
| 8 | 512 | 0.035 | 0.025 | 43.058 | 3629.401 | 2134.262 | 70.054 |
| 16 | 256 | 0.030 | 0.024 | 25.583 | 2874.426 | 2134.262 | 34.680 |
| 16 | 512 | 0.064 | 0.044 | 46.223 | 6964.659 | 3961.013 | 75.830 |

Inference[[Inference]]

| batch_size | seq_len | Per token latency eager (ms) | Per token latency SDPA (ms) | Speedup (%) | Mem eager (MB) | Mem BT (MB) | Mem saved (%) |
|---|---|---|---|---|---|---|---|
| 1 | 128 | 5.736 | 4.987 | 15.022 | 282.661 | 282.924 | -0.093 |
| 1 | 256 | 5.689 | 4.945 | 15.055 | 298.686 | 298.948 | -0.088 |
| 2 | 128 | 6.154 | 4.982 | 23.521 | 314.523 | 314.785 | -0.083 |
| 2 | 256 | 6.201 | 4.949 | 25.303 | 347.546 | 347.033 | 0.148 |
| 4 | 128 | 6.049 | 4.987 | 21.305 | 378.895 | 379.301 | -0.107 |
| 4 | 256 | 6.285 | 5.364 | 17.166 | 443.209 | 444.382 | -0.264 |
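
As a rough illustration of how such a comparison can be reproduced (this is not the script that produced the tables above; the checkpoint and batch shape are arbitrary assumptions), a forward pass under each attention implementation can be timed with CUDA events:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["some input text"] * 8, padding="max_length", max_length=256, return_tensors="pt").to("cuda")

for impl in ("eager", "sdpa"):
    model = BertModel.from_pretrained("bert-base-uncased", torch_dtype=torch.float16, attn_implementation=impl).to("cuda")
    with torch.no_grad():
        for _ in range(3):  # warmup so one-time setup costs do not skew the timing
            model(**batch)
        start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
        start.record()
        model(**batch)
        end.record()
        torch.cuda.synchronize()
    print(f"{impl}: {start.elapsed_time(end):.2f} ms per batch")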

์ž๋ฃŒ[[Resources]]

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with BERT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Multiple choice

⚡️ Inference

⚙️ Pretraining

🚀 Deploy

BertConfig

[[autodoc]] BertConfig - all

BertTokenizer

[[autodoc]] BertTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

BertTokenizerFast

[[autodoc]] BertTokenizerFast

TFBertTokenizer

[[autodoc]] TFBertTokenizer

Bert specific outputs

[[autodoc]] models.bert.modeling_bert.BertForPreTrainingOutput

[[autodoc]] models.bert.modeling_tf_bert.TFBertForPreTrainingOutput

[[autodoc]] models.bert.modeling_flax_bert.FlaxBertForPreTrainingOutput

BertModel

[[autodoc]] BertModel - forward

BertForPreTraining

[[autodoc]] BertForPreTraining - forward

BertLMHeadModel

[[autodoc]] BertLMHeadModel - forward

BertForMaskedLM

[[autodoc]] BertForMaskedLM - forward

BertForNextSentencePrediction

[[autodoc]] BertForNextSentencePrediction - forward

BertForSequenceClassification

[[autodoc]] BertForSequenceClassification - forward

BertForMultipleChoice

[[autodoc]] BertForMultipleChoice - forward

BertForTokenClassification

[[autodoc]] BertForTokenClassification - forward

BertForQuestionAnswering

[[autodoc]] BertForQuestionAnswering - forward

TFBertModel

[[autodoc]] TFBertModel - call

TFBertForPreTraining

[[autodoc]] TFBertForPreTraining - call

TFBertLMHeadModel

[[autodoc]] TFBertLMHeadModel - call

TFBertForMaskedLM

[[autodoc]] TFBertForMaskedLM - call

TFBertForNextSentencePrediction

[[autodoc]] TFBertForNextSentencePrediction - call

TFBertForSequenceClassification

[[autodoc]] TFBertForSequenceClassification - call

TFBertForMultipleChoice

[[autodoc]] TFBertForMultipleChoice - call

TFBertForTokenClassification

[[autodoc]] TFBertForTokenClassification - call

TFBertForQuestionAnswering

[[autodoc]] TFBertForQuestionAnswering - call

FlaxBertModel

[[autodoc]] FlaxBertModel - __call__

FlaxBertForPreTraining

[[autodoc]] FlaxBertForPreTraining - __call__

FlaxBertForCausalLM

[[autodoc]] FlaxBertForCausalLM - __call__

FlaxBertForMaskedLM

[[autodoc]] FlaxBertForMaskedLM - __call__

FlaxBertForNextSentencePrediction

[[autodoc]] FlaxBertForNextSentencePrediction - __call__

FlaxBertForSequenceClassification

[[autodoc]] FlaxBertForSequenceClassification - __call__

FlaxBertForMultipleChoice

[[autodoc]] FlaxBertForMultipleChoice - __call__

FlaxBertForTokenClassification

[[autodoc]] FlaxBertForTokenClassification - __call__

FlaxBertForQuestionAnswering

[[autodoc]] FlaxBertForQuestionAnswering - __call__