| "description": "--- language: - ko tags: - pytorch - causal-lm license: apache-2.0 --- # Polyglot-Ko-1.3B ## Model Description Polyglot-Ko is a series of large-scale Korean autoregressive language models made by the EleutherAI polyglot team. | Hyperparameter | Value | |----------------------|----------------------------------------------------------------------------------------------------------------------------------------| | \\\\(n_{parameters}\\\\) | 1,331,810,304 | | \\\\(n_{layers}\\\\) | 24 | | \\\\(d_{model}\\\\) | 2,048 | | \\\\(d_{ff}\\\\) | 8,192 | | \\\\(n_{heads}\\\\) | 16 | | \\\\(d_{head}\\\\) | 128 | | \\\\(n_{ctx}\\\\) | 2,048 | | \\\\(n_{vocab}\\\\) | 30,003 / 30,080 | | Positional Encoding | Rotary Position Embedding (RoPE) | | RoPE Dimensions | 64 | The model consists of 24 transformer layers with a model dimension of 2048, and a feedforward dimension of 8192. The model dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 30003. ## Training data Polyglot-Ko-1.3B was trained on 863 GB of Korean language data (1.2TB before processing), a large-scale dataset curated by TUNiB. The data collection process has abided by South Korean laws. This dataset was collected for the purpose of training Polyglot-Ko models, so it will not be released for public use. | Source |Size (GB) | Link | |-------------------------------------|---------|------------------------------------------| | Korean blog posts | 682.3 | - | | Korean news dataset | 87.0 | - | | Modu corpus | 26.4 |corpus.korean.go.kr | | Korean patent dataset | 19.0 | - | | Korean Q & A dataset | 18.1 | - | | KcBert dataset | 12.7 | github.com/Beomi/KcBERT | | Korean fiction dataset | 6.1 | - | | Korean online comments | 4.2 | - | | Korean wikipedia | 1.4 | ko.wikipedia.org | | Clova call | < 1.0 | github.com/clovaai/ClovaCall | | Naver sentiment movie corpus | < 1.0 | github.com/e9t/nsmc | | Korean hate speech dataset | < 1.0 | - | | Open subtitles | < 1.0 | opus.nlpl.eu/OpenSubtitles.php | | AIHub various tasks datasets | < 1.0 |aihub.or.kr | | Standard Korean language dictionary | < 1.0 | stdict.korean.go.kr/main/main.do | Furthermore, in order to avoid the model memorizing and generating personally identifiable information (PII) in the training data, we masked out the following sensitive information in the pre-processing stage: * : bank account number * : resident registration number * : phone number ## Training procedure Polyglot-Ko-1.3B was trained on 213 billion tokens over 102,000 steps on 256 A100 GPUs with the GPT-NeoX framework. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token. ## How to use This model can be easily loaded using the class: ## Evaluation results We evaluate Polyglot-Ko-1.3B on KOBEST dataset, a benchmark with 5 downstream tasks, against comparable models such as skt/ko-gpt-trinity-1.2B-v0.5, kakaobrain/kogpt and facebook/xglm-7.5B, using the prompts provided in the paper. The following tables show the results when the number of few-shot examples differ. You can reproduce these results using the polyglot branch of lm-evaluation-harness and the following scripts. For a fair comparison, all models were run under the same conditions and using the same prompts. In the tables, refers to the number of few-shot examples. In case of WiC dataset, all models show random performance. 
## Evaluation results

We evaluate Polyglot-Ko-1.3B on the KOBEST dataset, a benchmark with five downstream tasks, against comparable models such as skt/ko-gpt-trinity-1.2B-v0.5, kakaobrain/kogpt, and facebook/xglm-7.5B, using the prompts provided in the KOBEST paper. The tables below show the results as the number of few-shot examples varies; *n*-shot denotes the number of few-shot examples. You can reproduce these results with the polyglot branch of lm-evaluation-harness and a script along the lines of the one below. For a fair comparison, all models were run under the same conditions and with the same prompts. On the WiC dataset, all models show performance at the level of random chance.
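The card's original reproduction script was also lost in extraction. The sketch below assumes the `main.py` CLI that lm-evaluation-harness exposed at the time and KOBEST task names of the form `kobest_*`; consult the polyglot branch for the exact loader name, flags, and task names:

```bash
# Hedged reconstruction: the model loader ("gpt2" was the harness's generic
# Hugging Face causal-LM wrapper at the time), task names, and flag spellings
# should all be verified against the polyglot branch of lm-evaluation-harness.
python main.py \
  --model gpt2 \
  --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
  --tasks kobest_copa,kobest_hellaswag,kobest_boolq,kobest_sentineg,kobest_wic \
  --num_fewshot 5 \
  --batch_size 8 \
  --device cuda:0
```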
### COPA (F1)

| Model                                  | params   | 0-shot     | 5-shot     | 10-shot    | 50-shot    |
|----------------------------------------|----------|------------|------------|------------|------------|
| skt/ko-gpt-trinity-1.2B-v0.5           | 1.2B     | 0.6696     | 0.6477     | 0.6419     | 0.6514     |
| kakaobrain/kogpt                       | 6.0B     | 0.7345     | 0.7287     | 0.7277     | 0.7479     |
| facebook/xglm-7.5B                     | 7.5B     | 0.6723     | 0.6731     | 0.6769     | 0.7119     |
| **EleutherAI/polyglot-ko-1.3b (this)** | **1.3B** | **0.7196** | **0.7193** | **0.7204** | **0.7206** |
| EleutherAI/polyglot-ko-3.8b            | 3.8B     | 0.7595     | 0.7608     | 0.7638     | 0.7788     |
| EleutherAI/polyglot-ko-5.8b            | 5.8B     | 0.7745     | 0.7676     | 0.7775     | 0.7887     |
| EleutherAI/polyglot-ko-12.8b           | 12.8B    | 0.7937     | 0.8108     | 0.8037     | 0.8369     |

### HellaSwag (F1)

| Model                                  | params   | 0-shot     | 5-shot     | 10-shot    | 50-shot    |
|----------------------------------------|----------|------------|------------|------------|------------|
| skt/ko-gpt-trinity-1.2B-v0.5           | 1.2B     | 0.5243     | 0.5272     | 0.5166     | 0.5352     |
| kakaobrain/kogpt                       | 6.0B     | 0.5590     | 0.5833     | 0.5828     | 0.5907     |
| facebook/xglm-7.5B                     | 7.5B     | 0.5665     | 0.5689     | 0.5565     | 0.5622     |
| **EleutherAI/polyglot-ko-1.3b (this)** | **1.3B** | **0.5247** | **0.5260** | **0.5278** | **0.5427** |
| EleutherAI/polyglot-ko-3.8b            | 3.8B     | 0.5707     | 0.5830     | 0.5670     | 0.5787     |
| EleutherAI/polyglot-ko-5.8b            | 5.8B     | 0.5976     | 0.5998     | 0.5979     | 0.6208     |
| EleutherAI/polyglot-ko-12.8b           | 12.8B    | 0.5954     | 0.6306     | 0.6098     | 0.6118     |

### BoolQ (F1)

| Model                                  | params   | 0-shot     | 5-shot     | 10-shot    | 50-shot    |
|----------------------------------------|----------|------------|------------|------------|------------|
| skt/ko-gpt-trinity-1.2B-v0.5           | 1.2B     | 0.3356     | 0.4014     | 0.3640     | 0.3560     |
| kakaobrain/kogpt                       | 6.0B     | 0.4514     | 0.5981     | 0.5499     | 0.5202     |
| facebook/xglm-7.5B                     | 7.5B     | 0.4464     | 0.3324     | 0.3324     | 0.3324     |
| **EleutherAI/polyglot-ko-1.3b (this)** | **1.3B** | **0.3552** | **0.4751** | **0.4109** | **0.4038** |
| EleutherAI/polyglot-ko-3.8b            | 3.8B     | 0.4320     | 0.5263     | 0.4930     | 0.4038     |
| EleutherAI/polyglot-ko-5.8b            | 5.8B     | 0.4356     | 0.5698     | 0.5187     | 0.5236     |
| EleutherAI/polyglot-ko-12.8b           | 12.8B    | 0.4818     | 0.6041     | 0.6289     | 0.6448     |

### SentiNeg (F1)

| Model                                  | params   | 0-shot     | 5-shot     | 10-shot    | 50-shot    |
|----------------------------------------|----------|------------|------------|------------|------------|
| skt/ko-gpt-trinity-1.2B-v0.5           | 1.2B     | 0.6065     | 0.6878     | 0.7280     | 0.8413     |
| kakaobrain/kogpt                       | 6.0B     | 0.3747     | 0.8942     | 0.9294     | 0.9698     |
| facebook/xglm-7.5B                     | 7.5B     | 0.3578     | 0.4471     | 0.3964     | 0.5271     |
| **EleutherAI/polyglot-ko-1.3b (this)** | **1.3B** | **0.6790** | **0.6257** | **0.5514** | **0.7851** |
| EleutherAI/polyglot-ko-3.8b            | 3.8B     | 0.4858     | 0.7950     | 0.7320     | 0.7851     |
| EleutherAI/polyglot-ko-5.8b            | 5.8B     | 0.3394     | 0.8841     | 0.8808     | 0.9521     |
| EleutherAI/polyglot-ko-12.8b           | 12.8B    | 0.9117     | 0.9015     | 0.9345     | 0.9723     |

### WiC (F1)

| Model                                  | params   | 0-shot     | 5-shot     | 10-shot    | 50-shot    |
|----------------------------------------|----------|------------|------------|------------|------------|
| skt/ko-gpt-trinity-1.2B-v0.5           | 1.2B     | 0.3290     | 0.4313     | 0.4001     | 0.3621     |
| kakaobrain/kogpt                       | 6.0B     | 0.3526     | 0.4775     | 0.4358     | 0.4061     |
| facebook/xglm-7.5B                     | 7.5B     | 0.3280     | 0.4903     | 0.4945     | 0.3656     |
| **EleutherAI/polyglot-ko-1.3b (this)** | **1.3B** | **0.3297** | **0.4850** | **0.4650** | **0.3290** |
| EleutherAI/polyglot-ko-3.8b            | 3.8B     | 0.3390     | 0.4944     | 0.4203     | 0.3835     |
| EleutherAI/polyglot-ko-5.8b            | 5.8B     | 0.3913     | 0.4688     | 0.4189     | 0.3910     |
| EleutherAI/polyglot-ko-12.8b           | 12.8B    | 0.3985     | 0.3683     | 0.3307     | 0.3273     |

## Limitations and Biases

Polyglot-Ko has been trained to optimize next-token prediction. Language models such as this one are often used for a wide variety of tasks, and it is important to be aware of possible unexpected outcomes. For instance, Polyglot-Ko will not always return the most factual or accurate response, but rather the most statistically likely one. In addition, Polyglot-Ko may produce socially unacceptable or offensive content. We recommend having a human curator or other filtering mechanism to censor sensitive content.

## Citation and Related Information

### BibTeX entry

If you find our work useful, please consider citing:

### Licensing

All our models are licensed under the terms of the Apache License 2.0.

### Acknowledgement

This project was made possible thanks to the computing resources from Stability.ai, and thanks to TUNiB for providing a large-scale Korean dataset for this work.