---
language: ko
license: apache-2.0
base_model: unsloth/Llama-3.2-1B-unsloth-bnb-4bit
tags:
- continued-pretraining
- wikipedia
- korean
---
# backtesting-v8
## Evaluation Results
| Metric | baseline_v8 |
| --- | --- |
| Evaluation Loss | 2.3498 |
| Perplexity (PPL) | 10.48 |
| BPC | 3.3900 |
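
The three metrics above appear to be derived from a single quantity. A minimal sketch, assuming the evaluation loss is the mean cross-entropy in nats: perplexity is `exp(loss)`, and the reported BPC value matches `loss / ln 2`. That conversion gives bits per token; a strictly per-character figure would also need the token-to-character ratio, which is not reported here.

```python
import math

# Reported evaluation loss from the table (assumed to be mean cross-entropy in nats).
eval_loss = 2.3498

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)

# Converting nats to bits: divide by ln 2.
bits = eval_loss / math.log(2)

print(f"PPL  = {perplexity:.2f}")
print(f"bits = {bits:.4f}")
```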
## v8 Improvements over v7
| Strategy | Description |
| --- | --- |
| Data Filtering | Hangul ratio 60%, repeated-line detection, special-character ratio limit |
| Text Cleaning | Wiki markup/URL/HTML cleanup |
| Tokenizer | ByteLevel BPE (6k), Hangul tokens only extracted, PEFT-safe resize |
| Curriculum Learning | Training in short-to-long document order to improve early-stage stability |
| Prompt Format | Explicit BOS, natural Korean wiki format |
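
The cleaning, filtering, and curriculum rows can be illustrated with a rough sketch. This is not the actual v8 pipeline: every function name here is hypothetical, the 60% Hangul ratio is assumed to be a minimum, and the special-character and repeated-line thresholds (0.30 each) are placeholder values chosen only for illustration.

```python
import re


def clean_text(text):
    """Strip URLs, HTML tags, and simple wiki markup (illustrative rules only)."""
    text = re.sub(r"https?://\S+", " ", text)                       # URLs
    text = re.sub(r"<[^>]+>", " ", text)                            # HTML tags
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]", r"\1", text)  # [[target|label]] -> label
    text = re.sub(r"'{2,}", "", text)                               # ''italic'' / '''bold'''
    return re.sub(r"[ \t]+", " ", text).strip()


def hangul_ratio(text):
    """Share of non-whitespace characters that are Hangul syllables (U+AC00..U+D7A3)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if "가" <= c <= "힣") / len(chars)


def special_ratio(text):
    """Share of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if not c.isalnum()) / len(chars)


def repeated_line_ratio(text):
    """Share of non-empty lines that duplicate another line in the document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


def passes_filter(text, min_hangul=0.60, max_special=0.30, max_repeat=0.30):
    """Keep a document only if it meets all three (assumed) thresholds."""
    return (
        hangul_ratio(text) >= min_hangul
        and special_ratio(text) <= max_special
        and repeated_line_ratio(text) <= max_repeat
    )


def order_by_length(docs):
    """Curriculum ordering: shortest documents first."""
    return sorted(docs, key=len)


raw_docs = [
    "[[대한민국|한국]]은 동아시아에 있는 나라이다. 수도는 서울이다.",
    "hello world",
    "서울은 <b>수도</b>이다.",
]
cleaned = [clean_text(d) for d in raw_docs]
train_docs = order_by_length([d for d in cleaned if passes_filter(d)])
```

Sorting by character length is the simplest possible curriculum; ordering by token count after tokenization would be an equally reasonable choice.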
## Experiment Details
* **Version**: baseline_v8
* **Base Model**: Llama-3.2-1B (Unsloth 4-bit)
* **Korean BPE Tokens Added**: 0