---
language: ko
license: apache-2.0
base_model: unsloth/Llama-3.2-1B-unsloth-bnb-4bit
tags:
- continued-pretraining
- wikipedia
- korean
---

# backtesting-v8

## Evaluation Results
| Metric | baseline_v8 |
| --- | --- |
| Evaluation Loss | 2.3498 |
| Perplexity (PPL) | 10.48 |
| BPC | 3.3900 |
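
The three figures above appear to be related by the usual conversions from a mean cross-entropy loss. The sketch below assumes the evaluation loss is in nats; the helper name `metrics_from_loss` is hypothetical and not taken from the training code. Note that dividing the loss by ln 2 yields bits per predicted unit, so the BPC value matches a per-token normalization unless the evaluation script rescales by character count.

```python
import math


def metrics_from_loss(loss_nats: float) -> dict:
    """Convert a mean cross-entropy loss (assumed to be in nats) to PPL and bits."""
    return {
        "ppl": math.exp(loss_nats),        # perplexity = e^loss
        "bits": loss_nats / math.log(2),   # nats -> bits
    }


m = metrics_from_loss(2.3498)
print(f"PPL={m['ppl']:.2f}  bits={m['bits']:.4f}")  # PPL=10.48  bits=3.3900
```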

## v8 Improvements over v7
| Strategy | Description |
| --- | --- |
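| Data Filtering | Hangul-ratio threshold (60%), repeated-line detection, special-character ratio limit |
| Text Cleaning | Cleanup of wiki markup, URLs, and HTML |
| Tokenizer | ByteLevel BPE (6k), Hangul-only token extraction, PEFT-safe resize |
| Curriculum Learning | Training on documents ordered from short to long to improve early-training stability |
| Prompt Format | Explicit BOS token, natural Korean wiki format |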
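
The cleaning, filtering, and curriculum steps in the table can be sketched roughly as follows. This is an illustration only, not the actual v8 pipeline: the function names, the regular expressions, and the `max_special` and `max_repeat` thresholds are assumptions, and the 60% Hangul ratio is treated here as a minimum.

```python
import re

HANGUL = re.compile(r"[가-힣]")  # precomposed Hangul syllables only


def clean_wiki(text: str) -> str:
    """Strip simple wiki links, HTML tags, and URLs (assumed patterns)."""
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)  # [[a|b]] -> b
    text = re.sub(r"<[^>]+>", "", text)                            # HTML tags
    text = re.sub(r"https?://\S+", "", text)                       # URLs
    return re.sub(r"[ \t]+", " ", text).strip()


def keep_document(text: str, min_hangul: float = 0.6,
                  max_special: float = 0.3, max_repeat: float = 0.3) -> bool:
    """Return True if the document passes all three filters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    hangul = sum(1 for c in chars if HANGUL.match(c)) / len(chars)
    special = sum(1 for c in chars if not c.isalnum()) / len(chars)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    repeat = 1 - len(set(lines)) / len(lines)  # share of duplicated lines
    return hangul >= min_hangul and special <= max_special and repeat <= max_repeat


def curriculum_order(docs: list) -> list:
    """Order documents from short to long by character count."""
    return sorted(docs, key=len)


raw = [
    "[[대한민국|대한민국]]은 동아시아에 있는 나라이다. 수도는 서울이다.",
    "Hello world, this is English.",
    "서울",
    "가나다\n가나다\n가나다\n가나다",
]
cleaned = [clean_wiki(d) for d in raw]
train_docs = curriculum_order([d for d in cleaned if keep_document(d)])
print(train_docs)
```

Sorting by character count is the simplest proxy for document length; sorting by token count after tokenization would be an equally plausible reading of the table.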

## Experiment Details
* **Version**: baseline_v8
* **Base Model**: Llama-3.2-1B (Unsloth 4-bit)
* **Korean BPE Tokens Added**: 0