---
language:
- ko
license: gpl-3.0
tags:
- bert
- masked-language-model
- korean
- pretrained
metrics:
- perplexity
pipeline_tag: fill-mask
model-index:
- name: bert-ko-pretrained
  results:
  - task:
      type: fill-mask
      name: Masked Language Modeling
    metrics:
    - name: Eval Loss
      type: loss
      value: 3.6679
    - name: Eval Perplexity
      type: perplexity
      value: 39.17
---

# bert-ko-pretrained

A BERT masked language model pretrained on Korean text.

## Model Information

| Item | Value |
|------|-------|
| Architecture | BertForMaskedLM |
| Hidden Size | 256 |
| Layers | 4 |
| Attention Heads | 4 |
| Intermediate Size | 1024 |
| Vocab Size | 32,000 |
| Max Length | 256 tokens |
| Parameters | 11,515,904 |
| Total Steps | 50,000 |

## Pretraining Performance (MLM)

| Split | Loss | Perplexity |
|-------|-----:|-----------:|
| Eval | 3.6679 | 39.17 |

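Perplexity here is the exponential of the mean cross-entropy loss, so the two reported numbers are consistent with each other. A quick sanity check:

```python
import math

# Perplexity is exp(mean cross-entropy loss), assuming natural-log loss,
# which is what Hugging Face Trainer reports for MLM.
eval_loss = 3.6679
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # ≈ 39.17, matching the table above
```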
## Training Corpus

| Corpus | Size | Description |
|--------|------|-------------|
| injection_corpus.txt | 65MB | Prompt-injection data |
| external_all.txt | 9.6MB | KoSBi v2 + K-MHaS + BEEP! |
| all_combined.txt | 15MB | Full combined corpus |

**~90MB total** of Korean text

## Usage

### Fill-Mask

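The card does not state the published repository id, so the sketch below instead builds the exact architecture from the hyperparameters listed above and runs one fill-mask forward pass; for real use, load the released checkpoint with `from_pretrained` (the path shown in the comment is a placeholder).

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Architecture matching the hyperparameters in the table above.
# In practice, load the released weights instead, e.g.:
#   model = BertForMaskedLM.from_pretrained("<repo-or-local-path>")  # placeholder path
config = BertConfig(
    vocab_size=32000,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=256,
)
model = BertForMaskedLM(config)
model.eval()

# Toy input ids with one [MASK] position; the actual special-token ids are
# tokenizer-dependent and assumed here only to illustrate shapes.
input_ids = torch.tensor([[2, 10, 4, 11, 3]])  # [CLS] ... [MASK] ... [SEP]
with torch.no_grad():
    logits = model(input_ids).logits  # (batch, seq_len, vocab_size)

top5 = logits[0, 2].topk(5).indices  # candidate token ids for the [MASK] slot
print(logits.shape, top5.shape)
```

With the real checkpoint and tokenizer, the same thing is one line: `pipeline("fill-mask", model=..., tokenizer=...)`.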
### Using as a Classification Backbone

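A minimal sketch of reusing the pretrained encoder for sequence classification. The config mirrors the table above; `num_labels=2` and the checkpoint path are assumptions for illustration, not values from this card.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Same backbone hyperparameters as the card; in practice load the released
# weights, e.g.:
#   BertForSequenceClassification.from_pretrained("<repo-or-local-path>", num_labels=2)
config = BertConfig(
    vocab_size=32000,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=256,
    num_labels=2,  # task-dependent; binary classification assumed here
)
model = BertForSequenceClassification(config)
model.eval()

input_ids = torch.tensor([[2, 15, 27, 3]])  # toy token ids
with torch.no_grad():
    logits = model(input_ids).logits  # (batch, num_labels)
print(logits.shape)
```

The classification head is freshly initialized either way, so it needs fine-tuning on the downstream task.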
## Training Configuration

- **Tokenizer**: WordPiece (vocab_size=32,000)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **MLM Probability**: 15%

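The 15% MLM probability can be sketched in plain PyTorch. The 80/10/10 mask/random/keep split is the standard BERT recipe and is assumed here; the token ids (including `[MASK]`=4) are likewise illustrative, not read from this repo's tokenizer.

```python
import torch

torch.manual_seed(0)

MASK_ID, VOCAB = 4, 32000  # assumed [MASK] id; tokenizer-dependent
input_ids = torch.randint(5, VOCAB, (8, 32))  # toy batch of token ids
labels = input_ids.clone()

# Select 15% of positions for prediction (the card's MLM probability).
masked = torch.bernoulli(torch.full(input_ids.shape, 0.15)).bool()
labels[~masked] = -100  # loss is computed only on masked positions

# Standard BERT corruption: 80% -> [MASK], 10% -> random token, 10% -> unchanged.
replace = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
input_ids[replace] = MASK_ID
random_tok = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replace
input_ids[random_tok] = torch.randint(5, VOCAB, input_ids.shape)[random_tok]

frac = masked.float().mean().item()
print(f"masked fraction: {frac:.2f}")
```

In training code this is what `transformers.DataCollatorForLanguageModeling(mlm_probability=0.15)` does per batch.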
## License

GPL-3.0