language: ko
license: apache-2.0
base_model: unsloth/Llama-3.2-1B-unsloth-bnb-4bit
tags:
  - continued-pretraining
  - wikipedia
  - korean

backtesting-v8

Evaluation Results

| Metric | baseline_v8 |
|---|---|
| Evaluation Loss | 2.3498 |
| Perplexity (PPL) | 10.48 |
| BPC | 3.3900 |
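
The three metrics above are related by simple arithmetic, which the sketch below reproduces. It assumes the evaluation loss is a mean cross-entropy in nats per token (the usual convention, but not stated in this card). Under that assumption the reported BPC equals loss / ln 2, which suggests it is a bits-per-token figure. A per-character value would also need the average number of characters per token, which the card does not report.

```python
import math

# Reported evaluation loss; assumed to be mean cross-entropy in nats per token.
eval_loss = 2.3498

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(eval_loss)

# Converting nats to bits divides by ln(2).
bits = eval_loss / math.log(2)

print(f"Perplexity: {perplexity:.2f}")
print(f"Bits:       {bits:.4f}")
```

Both printed values can be compared directly against the PPL and BPC entries reported above.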

v8 Improvements over v7

| Strategy | Description |
|---|---|
| Data Filtering | Hangul ratio 60%, repeated-line detection, special-character ratio limit |
| Text Cleaning | Cleanup of wiki markup, URLs, and HTML |
| Tokenizer | ByteLevel BPE (6k), Hangul-only token extraction, PEFT-safe resize |
| Curriculum Learning | Training in short-to-long document order to improve early-training stability |
| Prompt Format | Explicit BOS, natural Korean wiki format |
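
As a rough illustration of the data filtering and curriculum ordering strategies listed above, here is a minimal sketch. It is not the code used for this model. It treats the 60% Hangul ratio as a minimum threshold, which is an assumption. The function names, the 0.30 limits for special characters and repeated lines, and the use of character count as document length are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Precomposed Hangul syllables only (U+AC00 to U+D7A3); jamo are ignored here.
HANGUL = re.compile(r"[\uac00-\ud7a3]")


def _visible_chars(text: str) -> list:
    """Return all non-whitespace characters of the text."""
    return [c for c in text if not c.isspace()]


def hangul_ratio(text: str) -> float:
    """Share of non-whitespace characters that are Hangul syllables."""
    chars = _visible_chars(text)
    if not chars:
        return 0.0
    return sum(1 for c in chars if HANGUL.match(c)) / len(chars)


def special_ratio(text: str) -> float:
    """Share of non-whitespace characters that are neither letters nor digits."""
    chars = _visible_chars(text)
    if not chars:
        return 1.0
    return sum(1 for c in chars if not c.isalnum()) / len(chars)


def repeated_line_fraction(text: str) -> float:
    """Share of non-empty lines whose content occurs more than once."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    return sum(n for n in counts.values() if n > 1) / len(lines)


def keep(text: str,
         min_hangul: float = 0.60,     # from the table; assumed to be a minimum
         max_special: float = 0.30,    # hypothetical limit
         max_repeated: float = 0.30) -> bool:  # hypothetical limit
    """Apply the three filtering heuristics to one document."""
    return (hangul_ratio(text) >= min_hangul
            and special_ratio(text) <= max_special
            and repeated_line_fraction(text) <= max_repeated)


def curriculum_order(docs: list) -> list:
    """Order documents from short to long by character count."""
    return sorted(docs, key=len)


if __name__ == "__main__":
    corpus = ["서울은 대한민국의 수도이다.", "Hello world", "한국"]
    for doc in curriculum_order([d for d in corpus if keep(d)]):
        print(doc)
```

Sorting by token count instead of character count would be an equally plausible reading of "short-to-long"; the card does not say which was used.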

Experiment Details

  • Version: baseline_v8
  • Base Model: Llama-3.2-1B (Unsloth 4-bit)
  • Korean BPE Tokens Added: 0