---
language:
- ko
license: gpl-3.0
tags:
- bert
- masked-language-model
- korean
- pretrained
metrics:
- perplexity
pipeline_tag: fill-mask
model-index:
- name: bert-ko-pretrained
results:
- task:
type: fill-mask
name: Masked Language Modeling
metrics:
- name: Eval Loss
type: loss
value: 3.6679
- name: Eval Perplexity
type: perplexity
value: 39.17
---
# bert-ko-pretrained
ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ๋กœ ์‚ฌ์ „ํ•™์Šต๋œ BERT (Masked Language Model) ์ž…๋‹ˆ๋‹ค.
## ๋ชจ๋ธ ์ •๋ณด
| ํ•ญ๋ชฉ | ๊ฐ’ |
|------|-----|
| Architecture | BertForMaskedLM |
| Hidden Size | 256 |
| Layers | 4 |
| Attention Heads | 4 |
| Intermediate Size | 1024 |
| Vocab Size | 32,000 |
| Max Length | 256 tokens |
| Parameters | 11,515,904 |
| Total Steps | 50,000 |
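The parameter count in the table can be reproduced from the architecture hyperparameters above, assuming the standard `BertForMaskedLM` layout with the decoder weights tied to the word embeddings:

```python
hidden, layers = 256, 4
inter, vocab, max_len = 1024, 32000, 256

# Embeddings: word + position + token-type tables, plus LayerNorm (weight + bias)
emb = vocab * hidden + max_len * hidden + 2 * hidden + 2 * hidden

# One encoder layer: Q/K/V/output projections (weights + biases),
# two LayerNorms, and the feed-forward block
attn = 4 * (hidden * hidden + hidden) + 2 * hidden
ffn = (hidden * inter + inter) + (inter * hidden + hidden) + 2 * hidden
layer = attn + ffn

# MLM head: transform dense + LayerNorm + decoder bias (decoder weight is tied)
head = (hidden * hidden + hidden) + 2 * hidden + vocab

total = emb + layers * layer + head
print(total)  # 11515904
```

This matches the reported 11,515,904 parameters exactly.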
## Pretraining Performance (MLM)
| Split | Loss | Perplexity |
|-------|-----:|-----------:|
| Eval | 3.6679 | 39.17 |
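The perplexity above is simply the exponential of the evaluation cross-entropy loss:

```python
import math

eval_loss = 3.6679
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # 39.17
```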
## ํ•™์Šต ์ฝ”ํผ์Šค
| ์ฝ”ํผ์Šค | ํฌ๊ธฐ | ์„ค๋ช… |
|--------|------|------|
| injection_corpus.txt | 65MB | ํ”„๋กฌํ”„ํŠธ ์ธ์ ์…˜ ๋ฐ์ดํ„ฐ |
| external_all.txt | 9.6MB | KoSBi v2 + K-MHaS + BEEP\! |
| all_combined.txt | 15MB | ์ „์ฒด ํ†ตํ•ฉ ์ฝ”ํผ์Šค |
**์ด ~90MB** ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ
## Usage
### Fill-Mask
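A minimal sketch using the `transformers` fill-mask pipeline. The repo id `prismdata/bert-ko-pretrained` is assumed from this card's location; note the model accepts at most 256 tokens per input.

```python
from transformers import pipeline

# Assumed repo id; replace with the actual path if it differs.
fill = pipeline("fill-mask", model="prismdata/bert-ko-pretrained")

# "Korean is a [MASK] language."
for pred in fill("ํ•œ๊ตญ์–ด๋Š” [MASK] ์–ธ์–ด์ž…๋‹ˆ๋‹ค."):
    print(pred["token_str"], round(pred["score"], 4))
```

Each prediction is a dict with the filled token, its score, and the completed sequence.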
### ๋ถ„๋ฅ˜ ๋ชจ๋ธ ๋ฐฑ๋ณธ์œผ๋กœ ์‚ฌ์šฉ
## Training Configuration
- **Tokenizer**: WordPiece (vocab_size=32,000)
- **Optimizer**: AdamW
- **Scheduler**: Cosine with warmup
- **MLM Probability**: 15%
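The optimizer and schedule above can be sketched as follows. The learning rate and warmup step count are assumptions; the card states only AdamW, cosine-with-warmup, and 50,000 total steps.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(256, 256)  # stand-in for the BERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lr assumed
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,     # assumed; not stated in the card
    num_training_steps=50_000,  # from the card
)

# LR rises linearly for the warmup steps, then decays along a cosine curve.
for _ in range(1_000):
    optimizer.step()
    scheduler.step()
```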
## ๋ผ์ด์„ ์Šค
GPL-3.0 License