---
language: ko
license: apache-2.0
base_model: unsloth/Llama-3.2-1B-unsloth-bnb-4bit
tags:
- continued-pretraining
- wikipedia
- korean
---

# backtesting-v8

## Evaluation Results

| Metric | baseline_v8 |
| --- | --- |
| Evaluation Loss | 2.3498 |
| Perplexity (PPL) | 10.48 |
| BPC | 3.3900 |
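
The three metrics above appear to be derived from one number. As a minimal sketch, assuming the evaluation loss is a mean cross-entropy in nats per token, the reported PPL matches `exp(loss)` and the reported BPC matches `loss / ln 2`. Note that `loss / ln 2` is bits per token; a strict bits-per-character figure would also need the token-to-character ratio of the evaluation set, which is not given here. The function names and the `n_tokens` / `n_chars` parameters are hypothetical.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity from a mean cross-entropy loss measured in nats."""
    return math.exp(loss_nats)


def bits_per_token(loss_nats: float) -> float:
    """Convert a loss in nats per token to bits per token."""
    return loss_nats / math.log(2)


def bits_per_char(loss_nats: float, n_tokens: int, n_chars: int) -> float:
    """Bits per character, given token and character counts (hypothetical inputs)."""
    return bits_per_token(loss_nats) * n_tokens / n_chars


loss = 2.3498  # Evaluation Loss from the table
print(f"PPL  = {perplexity(loss):.2f}")      # PPL  = 10.48
print(f"bits = {bits_per_token(loss):.4f}")  # bits = 3.3900
```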

## v8 Improvements over v7

| Strategy | Description |
| --- | --- |
| Data Filtering | Hangul ratio threshold of 60%, repeated-line detection, limit on special-character ratio |
| Text Cleaning | Cleans wiki markup, URLs, and HTML |
| Tokenizer | ByteLevel BPE (6k), Hangul-only tokens extracted, resize before applying PEFT |
| Curriculum Learning | Trains on documents in short-to-long order to improve early-stage stability |
| Prompt Format | Explicit BOS token, natural Korean wiki format |
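
The Data Filtering, Text Cleaning, and Curriculum Learning rows can be sketched as one small preprocessing pass. This is an illustration under stated assumptions, not the pipeline used for this model: only the 60% Hangul ratio comes from the table, and it is treated here as a minimum over non-whitespace characters. The special-character limit (`max_special=0.30`), the repeated-line threshold (`max_repeats=3`), the regular expressions, and every function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical cleaning patterns; the real rules are not given in this card.
WIKI_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")                # {{template}}
WIKI_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")    # [[target|label]] -> label
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")


def clean(text: str) -> str:
    """Strip wiki markup, HTML tags, and URLs, then collapse spaces."""
    text = WIKI_TEMPLATE.sub("", text)
    text = WIKI_LINK.sub(r"\1", text)
    text = HTML_TAG.sub("", text)
    text = URL.sub("", text)
    return re.sub(r"[ \t]+", " ", text).strip()


def _visible(text: str) -> list:
    return [c for c in text if not c.isspace()]


def hangul_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Hangul syllables."""
    chars = _visible(text)
    if not chars:
        return 0.0
    return sum("\uac00" <= c <= "\ud7a3" for c in chars) / len(chars)


def special_ratio(text: str) -> float:
    """Share of non-whitespace characters that are neither letters nor digits."""
    chars = _visible(text)
    if not chars:
        return 1.0
    return sum(not c.isalnum() for c in chars) / len(chars)


def has_repeated_lines(text: str, max_repeats: int = 3) -> bool:
    """True if any non-empty line occurs more than max_repeats times."""
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return any(n > max_repeats for n in counts.values())


def keep(text: str, min_hangul: float = 0.60, max_special: float = 0.30,
         max_repeats: int = 3) -> bool:
    """Apply the three filters; only min_hangul is taken from the table."""
    return (hangul_ratio(text) >= min_hangul
            and special_ratio(text) <= max_special
            and not has_repeated_lines(text, max_repeats))


def prepare(docs: list) -> list:
    """Clean, filter, then order documents from short to long (curriculum)."""
    cleaned = (clean(d) for d in docs)
    return sorted((d for d in cleaned if keep(d)), key=len)
```

Sorting by character length is the simplest reading of "short to long"; sorting by token count after tokenization would be an equally plausible choice.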

## Experiment Details

* **Version**: baseline_v8
* **Base Model**: Llama-3.2-1B (Unsloth 4-bit)
* **Korean BPE Tokens Added**: 0