
language: ko
license: apache-2.0
base_model: unsloth/Llama-3.2-1B-unsloth-bnb-4bit
tags:
  - continued-pretraining
  - wikipedia
  - korean

backtesting-v8

Evaluation Results

Strategy	Description
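Data Filtering	Hangul ratio threshold of 60%, repeated-line detection, special-character ratio limit
Text Cleaning	Cleanup of wiki markup, URLs, and HTML
Tokenizer	ByteLevel BPE (6k), only Hangul tokens extracted, PEFT-safe resize
Curriculum Learning	Training on documents ordered from short to long to improve early-training stability
Prompt Format	Explicit BOS token, natural Korean wiki-style format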
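
The data filtering, text cleaning, and curriculum learning rows above could look roughly like the following sketch. It is not the code used to build this model. All function names are hypothetical, and it assumes the 60% figure is a minimum Hangul ratio over non-whitespace characters. The special-character and repeated-line limits (`max_special`, `max_repeated`) are placeholder values, since the table does not state them, and the cleaning regexes cover only a few common wiki patterns.

```python
import re

HANGUL_RE = re.compile(r"[가-힣]")                          # precomposed Hangul syllables
URL_RE = re.compile(r"https?://\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
WIKI_LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")  # [[target|label]] -> label


def clean_text(text):
    """Strip a few common wiki-markup, HTML, and URL patterns (illustrative only)."""
    text = WIKI_LINK_RE.sub(r"\1", text)
    text = HTML_TAG_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = text.replace("'''", "").replace("''", "")  # wiki bold / italic quotes
    return re.sub(r"[ \t]+", " ", text).strip()


def _non_space_chars(text):
    return [c for c in text if not c.isspace()]


def hangul_ratio(text):
    """Share of non-whitespace characters that are Hangul syllables."""
    chars = _non_space_chars(text)
    if not chars:
        return 0.0
    return sum(1 for c in chars if HANGUL_RE.match(c)) / len(chars)


def special_ratio(text):
    """Share of non-whitespace characters that are neither letters nor digits."""
    chars = _non_space_chars(text)
    if not chars:
        return 0.0
    return sum(1 for c in chars if not c.isalnum()) / len(chars)


def repeated_line_ratio(text):
    """Share of non-empty lines that duplicate another line in the document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1 - len(set(lines)) / len(lines)


def keep_document(text, min_hangul=0.6, max_special=0.2, max_repeated=0.3):
    """Apply the three filters; only min_hangul=0.6 comes from the table."""
    return (
        hangul_ratio(text) >= min_hangul
        and special_ratio(text) <= max_special
        and repeated_line_ratio(text) <= max_repeated
    )


def curriculum_order(docs):
    """Order documents from short to long by character count."""
    return sorted(docs, key=len)
```

A typical use would be to clean each raw document, drop those that fail `keep_document`, and pass the survivors through `curriculum_order` before building training batches. Sorting by character count is the simplest proxy for document length; sorting by token count would be an equally plausible reading of the table.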