
🗺️ MASTER PLAN: Korean LLM 1B retraining → 3B → deployment

Date: 2026-02-27
Project: /PROJECT/0325120031_A/ghong/taketimes/llm-bang/
Decision: Restart (clean retraining from the base checkpoint)
Total estimated duration: ~35 hours (1B: 3 h → 3B pretrain: 26 h → 3B SFT + eval: 6 h)


📊 Full timeline at a glance

Phase 0  ██░░░░░░░░░░░░░░░░░░░░░░  30 min   data/code prep
Phase 1  ████░░░░░░░░░░░░░░░░░░░░  40 min   1B SFT retraining
Phase 2  ██████░░░░░░░░░░░░░░░░░░  2 h      1B evaluation
         ────── decision point ──────
Phase 3A ████████░░░░░░░░░░░░░░░░  3-5 h    (conditional) further 1B improvement
Phase 3B ████████████████████████  26 h     3B pretraining
Phase 4  ████░░░░░░░░░░░░░░░░░░░░  2 h      3B SFT
Phase 5  ██████░░░░░░░░░░░░░░░░░░  4 h      evaluation & deployment

Phase 0: Final prep before retraining (today, ~30 min)

Checklist

☐ 0-1. Regenerate the data (~20 min)

cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang

# Re-run prepare_sft_data.py (strengthened filters + corrected weights)
python data/prepare_sft_data.py \
    --output_dir data/sft_v2/ \
    --val_split 0.1

ํ™•์ธ ์‚ฌํ•ญ:

  • ํ•„ํ„ฐ๋ง ํ›„ 120K-135K ์ƒ˜ํ”Œ ๋‚จ์•„์•ผ ํ•จ (๊ธฐ์กด 159K์—์„œ ์ €ํ’ˆ์งˆ ์ œ๊ฑฐ)
  • </s> ๋ฆฌํ„ฐ๋Ÿด 113๊ฑด, Q/A ๋งˆ์ปค ~550๊ฑด, ์ž์ฒด๋ฐ˜๋ณต 57๊ฑด ์ œ๊ฑฐ ํ™•์ธ
  • OpenOrca ๊ฐ€์ค‘์น˜: 5.0 โ†’ 2.0์œผ๋กœ ๊ฐ์†Œ ํ™•์ธ
  • Val split: ~12-13K ์ƒ˜ํ”Œ (10%)
  • ์งง์€ output (<80์ž) ์ œ๊ฑฐ ํ™•์ธ
# ๊ฒฐ๊ณผ ํ™•์ธ
wc -l data/sft_v2/train.jsonl data/sft_v2/val.jsonl
# ์˜ˆ์ƒ: train ~108K-120K, val ~12K-13K

Completion criteria: train 100K+ samples, val 10K+ samples; spot-checked removed samples are genuinely low quality.
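The filter rules above can be summarized as a predicate; the sketch below is purely illustrative (the field name `output` and the exact rules are assumptions), not the actual prepare_sft_data.py logic:

```python
import re

def keep_sample(sample: dict, min_output_chars: int = 80) -> bool:
    """Hypothetical sketch of the Phase 0 quality filter (not the real script)."""
    out = sample.get("output", "")
    # Rule 1: drop leaked special-token literals such as </s>
    if "</s>" in out:
        return False
    # Rule 2: drop stray Q:/A: marker artifacts at a line start
    if re.search(r"^\s*[QA]\s*[:.]", out, flags=re.MULTILINE):
        return False
    # Rule 3: drop outputs shorter than the minimum length
    if len(out) < min_output_chars:
        return False
    # Rule 4: drop self-repetition (any sentence appearing twice verbatim)
    sentences = [s.strip() for s in re.split(r"[.!?\n]", out) if len(s.strip()) > 10]
    if len(sentences) != len(set(sentences)):
        return False
    return True
```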

โ˜ 0-2. sft_dataset.py ์ˆ˜์ • ํ™•์ธ (~5๋ถ„)

์ด๋ฏธ ์ˆ˜์ •๋œ ํ•ญ๋ชฉ ํ™•์ธ:

์ˆ˜์ • ์‚ฌํ•ญ ํŒŒ์ผ ํ™•์ธ
Dynamic padding ์‹ค์ œ ์ž‘๋™ data/sft_dataset.py __getitem__ โ˜ ํŒจ๋”ฉ ์—†์ด ์‹ค์ œ ๊ธธ์ด ํ…์„œ ๋ฐ˜ํ™˜
EOS ๋ณด์กด data/sft_dataset.py L130-134 โ˜ response_ids[:allowed-1] + [eos_id]
Collate fn data/sft_dataset.py dynamic_collate_fn โ˜ ๋ฐฐ์น˜๋ณ„ ๊ฐ€๋ณ€ ํŒจ๋”ฉ
# Check the key code paths
grep -n "allowed_response" data/sft_dataset.py
grep -n "eos_token_id" data/sft_dataset.py
grep -n "torch.full" data/sft_dataset.py  # there must be no fixed-4096 padding
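The grep targets above correspond to two behaviors: truncate the response while keeping EOS as the final token, and pad only to the longest sequence in each batch. A pure-Python sketch of that logic (list-based for illustration; the real sft_dataset.py operates on tensors):

```python
def truncate_with_eos(response_ids, allowed, eos_id):
    """Cap the response at `allowed` tokens while guaranteeing a trailing EOS."""
    if len(response_ids) >= allowed:
        # The L130-134 fix: cut to allowed-1 tokens, then re-append EOS
        return response_ids[:allowed - 1] + [eos_id]
    if not response_ids or response_ids[-1] != eos_id:
        return response_ids + [eos_id]
    return response_ids


def dynamic_collate_fn(batch, pad_id):
    """Pad each batch only to its own longest sequence (no fixed 4096)."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
```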

โ˜ 0-3. launch_sft.sh ์ˆ˜์ • (~5๋ถ„)

# ๋ณ€๊ฒฝํ•  ๊ฐ’๋“ค:
# RUN_NAME=korean_1b_sft_v2
# SFT_DATA=data/sft_v2/train.jsonl
# VAL_DATA=data/sft_v2/val.jsonl
# MAX_STEPS=10000  (3-4 epoch, ๊ธฐ์กด 5000์—์„œ ์ฆ๊ฐ€)
# WARMUP_STEPS=300  (3%)

cp scripts/launch_sft.sh scripts/launch_sft_v2.sh
# ํŽธ์ง‘ ํ›„ diff ํ™•์ธ
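For reference, MAX_STEPS follows from steps = epochs × ceil(samples / global batch); a quick helper for the calculation (the global batch size of 36 used below is an assumption, not a value read from launch_sft.sh):

```python
import math

def steps_for_epochs(n_samples: int, global_batch: int, epochs: float) -> int:
    """Optimizer steps needed for `epochs` passes over the training set."""
    return math.ceil(epochs * math.ceil(n_samples / global_batch))

# With ~110K filtered samples and an assumed global batch of 36,
# ~3.3 epochs lands near the plan's 10,000 steps.
```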

โ˜ 0-4. Sanity Check (~5๋ถ„)

# 100 steps๋งŒ ๋น ๋ฅด๊ฒŒ ๋Œ๋ ค์„œ ํŒŒ์ดํ”„๋ผ์ธ ์ •์ƒ ํ™•์ธ
bash scripts/launch_sft_v2.sh --max_steps 100

# ํ™•์ธ:
# - Loss๊ฐ€ 2.0-2.5 ๋ฒ”์œ„์—์„œ ์‹œ์ž‘ํ•˜๋Š”๊ฐ€? โœ…
# - ๋ฐฐ์น˜ ๋‚ด ์‹œํ€€์Šค ๊ธธ์ด๊ฐ€ ๊ฐ€๋ณ€์ ์ธ๊ฐ€? (๋กœ๊ทธ์—์„œ ํ™•์ธ) โœ…
# - Val loss๊ฐ€ ์ถœ๋ ฅ๋˜๋Š”๊ฐ€? โœ…
# - OOM ์—†๋Š”๊ฐ€? โœ…

์™„๋ฃŒ ๊ธฐ์ค€: 100 steps ์—๋Ÿฌ ์—†์ด ์™„๋ฃŒ, loss ํ•ฉ๋ฆฌ์  ๋ฒ”์œ„, val loss ์ถœ๋ ฅ ํ™•์ธ.


Phase 1: 1B SFT retraining (today, ~40 min)

Run command

cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang

RUN_NAME=korean_1b_sft_v2 \
BASE_CHECKPOINT=checkpoints/korean_1b_fp8_run1/checkpoint-0034000 \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
WARMUP_STEPS=300 \
LR=2.0e-5 \
bash scripts/launch_sft.sh

Monitoring

Live log:

tail -f checkpoints/korean_1b_sft_v2/train.log

TensorBoard:

tensorboard --logdir checkpoints/korean_1b_sft_v2/tensorboard --port 6007

Key metrics:

| Metric | Normal range | Warning | Abort immediately |
|---|---|---|---|
| Train loss | starts at 2.0-2.5, final <1.90 | >2.5 at step 500+ | >3.0 (divergence) |
| Val loss | 1.0-1.1× train | 1.2× train | keeps rising relative to train (overfitting) |
| GNorm | 0.8-1.5 | >2.0 | >5.0 (gradient explosion) |
| Training speed | 2×+ vs. previous run (dynamic padding effect) | similar to before | slower than before |
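The abort/warning thresholds in this table can be folded into a small triage helper for log watching; illustrative only, with the cut-offs copied from the table rows:

```python
def triage(step: int, train_loss: float, gnorm: float) -> str:
    """Classify one training step as 'ok', 'warn', or 'abort' per the table."""
    if train_loss > 3.0 or gnorm > 5.0:
        return "abort"   # divergence or gradient explosion
    if (step >= 500 and train_loss > 2.5) or gnorm > 2.0:
        return "warn"
    return "ok"
```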

์ฒดํฌํฌ์ธํŠธ ๊ด€์ฐฐ:

  • Step 500: ํŒŒ์ดํ”„๋ผ์ธ ์•ˆ์ •์„ฑ ํ™•์ธ
  • Step 2500: ์ค‘๊ฐ„ ์ง€์ , loss ์ถ”์„ธ ํ™•์ธ
  • Step 5000: ๊ธฐ์กด ํ•™์Šต๊ณผ ๋น„๊ต (loss < 1.97์ด์–ด์•ผ ํ•จ)
  • Step 7500: ์ˆ˜๋ ด ์—ฌ๋ถ€ ํ™•์ธ
  • Step 10000: ์ตœ์ข…

Success criteria

| Metric | Target | Failure threshold |
|---|---|---|
| Final train loss | < 1.90 | > 2.00 |
| Final val loss | < 2.00 | exceeds 1.2× train |
| Val loss trend | falling or stable | 3 consecutive rises (overfitting) |
| Training time | ~40-60 min | >2 h (dynamic padding not working) |
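The "3 consecutive rises" overfitting signal is easy to check mechanically; a sketch, assuming val loss is logged at a fixed interval:

```python
def overfitting(val_losses, patience: int = 3) -> bool:
    """True if val loss rose `patience` consecutive times (the overfit signal)."""
    rises = 0
    for prev, cur in zip(val_losses, val_losses[1:]):
        rises = rises + 1 if cur > prev else 0
        if rises >= patience:
            return True
    return False
```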

Contingencies

| Symptom | Suspected cause | Response |
|---|---|---|
| Loss diverges (>3.0) | LR too high or data bug | retry with LR=1e-5 |
| OOM | batch size too large | reduce to BATCH_SIZE=2 |
| Loss plateaus (no change past step 2000) | LR too low or data problem | inspect the data, try LR=3e-5 |
| Val loss diverges (overfitting) | too many epochs | early-stop at the best val checkpoint |
| Training speed same as before | dynamic padding not working | re-inspect sft_dataset.py |

Phase 2: 1B SFT evaluation (~2 h)

Evaluation order

2-1. Measure the repetition rate (30 min)

# Generation test with the correct format (<|user|>/<|assistant|>)
python eval/test_generation_params.py \
    --checkpoint checkpoints/korean_1b_sft_v2/checkpoint-0010000

# Sweep rep_penalty values
# rep_penalty=1.0 (none): target <10%
# rep_penalty=1.1:        target <3%
# rep_penalty=1.2:        target <1%
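This plan does not define how the repetition rate is computed; one common proxy, offered as a hypothetical stand-in for whatever eval/test_generation_params.py actually measures, counts a generation as repetitive when some word n-gram recurs verbatim:

```python
def is_repetitive(text: str, n: int = 4, max_repeats: int = 3) -> bool:
    """True if any word n-gram occurs more than `max_repeats` times."""
    words = text.split()
    counts = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False

def repetition_rate(generations) -> float:
    """Share of generations flagged as repetitive (the number to push <5%)."""
    if not generations:
        return 0.0
    return sum(is_repetitive(g) for g in generations) / len(generations)
```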

2-2. Subjective generation-quality review (30 min)

python eval/generate.py \
    --checkpoint checkpoints/korean_1b_sft_v2/checkpoint-0010000 \
    --prompts_file eval/test_prompts.txt \
    --temperature 0.8 --top_p 0.9

Check: natural Korean, instruction following, clean EOS termination

2-3. Official benchmarks (1 h)

# ko_ifeval
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_1b_sft_v2/checkpoint-0010000,dtype=bfloat16 \
    --tasks ko_ifeval \
    --device cuda:0 \
    --output_path eval/results/sft_v2_ko_ifeval.json

# ko_winogrande (optional)
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_1b_sft_v2/checkpoint-0010000,dtype=bfloat16 \
    --tasks ko_winogrande \
    --device cuda:0 \
    --output_path eval/results/sft_v2_ko_winogrande.json

ํŒ๋‹จ ๊ธฐ์ค€ & ๋ถ„๊ธฐ

                    [Phase 2 ํ‰๊ฐ€ ๊ฒฐ๊ณผ]
                          โ”‚
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚                     โ”‚                     โ”‚
  โœ… PASS              โš ๏ธ PARTIAL            โŒ FAIL
 ๋ฐ˜๋ณต๋ฅ <5%            ๋ฐ˜๋ณต๋ฅ  5-15%          ๋ฐ˜๋ณต๋ฅ >15%
 ko_ifeval>25%       ko_ifeval 15-25%      ko_ifeval<15%
    โ”‚                     โ”‚                     โ”‚
    โ–ผ                     โ–ผ                     โ–ผ
 Phase 3B             Phase 3A              ์›์ธ ๋ถ„์„
 (3B ์ „ํ™˜)          (์ถ”๊ฐ€ ๊ฐœ์„ )           (๋ฐ์ดํ„ฐ/์ฝ”๋“œ ์žฌ๊ฒ€ํ† )

์ƒ์„ธ ๊ธฐ์ค€:

์ง€ํ‘œ โœ… Pass โš ๏ธ ์ถ”๊ฐ€ ์กฐ์ • โŒ ์žฌํ•™์Šต
๋ฐ˜๋ณต๋ฅ  (rep_penalty ์—†์ด) <10% 10-20% >20%
๋ฐ˜๋ณต๋ฅ  (rep_penalty=1.1) <5% 5-15% >15%
ko_ifeval >25% 15-25% <15%
EOS ์ •์ƒ ์ข…๋ฃŒ์œจ >85% 60-85% <60%
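The PASS/PARTIAL/FAIL routing above, folded into a single function (rates as fractions; thresholds copied from the branching diagram):

```python
def phase2_gate(rep_rate: float, ko_ifeval: float) -> str:
    """Route to the next phase from the Phase 2 numbers."""
    if rep_rate > 0.15 or ko_ifeval < 0.15:
        return "FAIL: root-cause analysis"   # re-examine data/code
    if rep_rate < 0.05 and ko_ifeval > 0.25:
        return "PASS: Phase 3B"              # move to 3B pretraining
    return "PARTIAL: Phase 3A"               # try ORPO / extra SFT
```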

Phase 3A: Further 1B improvement (conditional, ~3-5 h)

Enter only if the Phase 2 result is ⚠️ PARTIAL

Option A: ORPO training (~3 h)

Prepare preference data (1 h)

# Download Korean preference data
python -c "
from datasets import load_dataset
# Option 1: ko_Ultrafeedback (60K, general domain)
ds = load_dataset('maywell/ko_Ultrafeedback')
# Option 2: self-generate (use the current model to produce rejected)
"

์ž์ฒด ์ƒ์„ฑ ๋ฐฉ๋ฒ•:

  1. ํ˜„์žฌ SFT ๋ชจ๋ธ๋กœ ๋™์ผ ํ”„๋กฌํ”„ํŠธ์— ์—ฌ๋Ÿฌ ๋ฒˆ ์ƒ์„ฑ
  2. ๋ฐ˜๋ณต/์ €ํ’ˆ์งˆ ์ถœ๋ ฅ โ†’ rejected
  3. ๊นจ๋—ํ•œ ๋ฐ์ดํ„ฐ์˜ ์ •๋‹ต โ†’ chosen
  4. ~10K-20K ์Œ ์ƒ์„ฑ
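Steps 1-4 can be sketched as below; `score_badness` is a made-up repetition heuristic standing in for whatever quality check is actually used, and `chosen` comes from the clean reference answer per step 3:

```python
def score_badness(text: str) -> float:
    """Crude badness score: higher when the text repeats itself (assumption)."""
    words = text.split()
    if not words:
        return 1.0
    return 1.0 - len(set(words)) / len(words)  # 0 = all unique, → 1 = loops

def build_pair(prompt: str, reference: str, samples: list) -> dict:
    """Keep the clean answer as `chosen`, the worst model sample as `rejected`."""
    rejected = max(samples, key=score_badness)
    return {"prompt": prompt, "chosen": reference, "rejected": rejected}
```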

ORPO training (1.5 h)

from trl import ORPOConfig, ORPOTrainer

config = ORPOConfig(
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    beta=0.1,  # ORPO coefficient
)
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_data,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()

Evaluation (30 min)

  • Re-measure the repetition rate: target <5% (rep_penalty=1.1)
  • Re-measure ko_ifeval: target >20%

Option B: Additional SFT (data augmentation, ~5 h)

Collect extra data (2 h)

from datasets import load_dataset

# Add high-quality Korean data
datasets = {
    "hPark/orca-ko": 200_000,          # high-quality synthetic
    "nayohan/llama3-instruct-ko-dataset": 58_000,  # Llama3 Korean
    "FreedomIntelligence/evol-instruct-korean": 70_000,  # GPT-4 generated
}
# existing 120K + ~300K added → ~350K after filtering

Retraining (2 h)

# Retrain on the enlarged dataset
RUN_NAME=korean_1b_sft_v3 \
SFT_DATA=data/sft_v3/train.jsonl \
MAX_STEPS=15000 \
bash scripts/launch_sft.sh

Phase 3A success criteria

| Metric | Target |
|---|---|
| Repetition rate (rep_penalty=1.1) | <5% |
| ko_ifeval | >20% |

On failure: accept the 1B's limits and move straight to Phase 3B (the 3B model).


Phase 3B: 3B pretraining (after passing Phase 2, ~26 h)

3B model architecture

| Parameter | 1B (current) | 3B (target) | Note |
|---|---|---|---|
| d_model | 2048 | 2560 | ~1.25× |
| n_layers | 24 | 32 | ~1.33× |
| n_heads | 16 | 32 | 2× |
| n_kv_heads (GQA) | 4 | 8 | 2× |
| d_ffn | 5472 | 6912 | ~1.26× |
| vocab_size | 64000 | 64000 | same |
| max_seq_len | 4096 | 4096 | same |
| Total parameters | 1.19B | ~3.0B | ~2.5× |

Write the config file

# Write configs/korean_3b_fp8.yaml
cat > configs/korean_3b_fp8.yaml << 'EOF'
model:
  d_model: 2560
  n_layers: 32
  n_heads: 32
  n_kv_heads: 8
  d_ffn: 6912
  vocab_size: 64000
  max_seq_len: 4096
  rope_theta: 500000

training:
  lr: 3.0e-4
  min_lr: 3.0e-5
  warmup_steps: 2000
  max_steps: 100000
  batch_size: 4
  grad_accum: 4
  weight_decay: 0.1
  use_fp8: true

data:
  sources:
    - cc100_ko
    - culturax_ko
    - existing_pretrain
EOF

Pretraining data

| Source | Tokens | Status |
|---|---|---|
| CulturaX ko | 24.8B | ✅ on hand |
| cc100 ko (recollect) | ~65-100B | ⚠️ needs recollection (noise filtering) |
| Existing pretrain data | ~8.9B | ✅ on hand |
| Extra collection (Namuwiki, news, etc.) | ~20-50B | optional |
| Total | ~120-180B | meets the 60B Chinchilla minimum |

Data preparation commands:

# Recollect cc100 with quality filtering
python scripts/download_cc100_ko.py --quality_filter --dedup
# MinHash dedup + perplexity filter
python scripts/quality_filter.py --input data/pretrain/ --max_ppl 1000
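The MinHash dedup pass inside scripts/quality_filter.py is not shown here; below is a self-contained toy version of the idea (word-shingle MinHash signatures compared by estimated Jaccard similarity). A real pipeline would use a library such as datasketch with banded LSH rather than this all-pairs scan:

```python
import hashlib

def minhash_signature(text: str, shingle: int = 3, n_hashes: int = 64):
    """MinHash signature over word shingles: per seed, keep the minimum hash."""
    words = text.split()
    shingles = {" ".join(words[i:i + shingle])
                for i in range(max(1, len(words) - shingle + 1))}
    sig = []
    for seed in range(n_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles))
    return sig

def est_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(docs, threshold: float = 0.8):
    """Keep a doc only if it is not a near-duplicate of an already kept one."""
    kept, sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(est_jaccard(sig, s) < threshold for s in sigs):
            kept.append(doc)
            sigs.append(sig)
    return kept
```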

Run training

# Start the 3B pretrain (8× B200, ~26 h)
bash scripts/run_pretrain.sh --config configs/korean_3b_fp8.yaml

# Expected throughput: ~1.6M tok/s (8× B200)
# 150B tokens / 1.6M tok/s ≈ 26 h

Monitoring

# Watch the log
tail -f checkpoints/korean_3b_fp8/train.log

# Check base quality at intermediate checkpoints (every 10000 steps)
python eval/perplexity.py --checkpoint checkpoints/korean_3b_fp8/checkpoint-0010000

Success criteria: PPL < 10 (Korean text), loss still falling
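The PPL being thresholded here is just the exponential of the mean per-token negative log-likelihood; eval/perplexity.py presumably computes something equivalent to:

```python
import math

def perplexity(token_nlls) -> float:
    """PPL = exp(mean per-token NLL); <10 is the Phase 3B success bar."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```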


Phase 4: 3B SFT (~2 h)

Apply every lesson learned on the 1B

| Lesson | Application |
|---|---|
| Verify dynamic padding works | ✅ sft_dataset.py already fixed, reuse as-is |
| Preserve EOS | ✅ same code |
| Val split is mandatory | ✅ 10% split |
| 3-4 epochs | ✅ compute MAX_STEPS accordingly |
| No OpenOrca over-weighting | ✅ 2.0× or less |
| Data quality filtering | ✅ use the clean data built in Phase 0 |
| Correct prompt format | ✅ <|user|>/<|assistant|> |

Run

RUN_NAME=korean_3b_sft \
BASE_CHECKPOINT=checkpoints/korean_3b_fp8/checkpoint-BEST \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
LR=2.0e-5 \
WARMUP_STEPS=300 \
bash scripts/launch_sft.sh

Expected time: ~2 h (3B is ~2.5× slower than 1B)

Success criteria

| Metric | Target |
|---|---|
| Train loss | < 1.85 |
| Val loss | within 1.1× train |
| Repetition rate (no rep_penalty) | < 10% |
| Repetition rate (rep_penalty=1.1) | < 3% |

Phase 5: Evaluation & deployment (~4 h)

5-1. Full benchmark run (~2 h)

# ko_ifeval
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
    --tasks ko_ifeval --device cuda:0

# ko_winogrande
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
    --tasks ko_winogrande --device cuda:0

# KoBEST (optional)
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
    --tasks kobest_boolq,kobest_copa,kobest_wic,kobest_hellaswag,kobest_sentineg \
    --device cuda:0

3B target numbers:

| Benchmark | 1B expected | 3B target |
|---|---|---|
| ko_ifeval | 20-30% | 35-45% |
| ko_winogrande | 53-58% | 60-68% |
| KoBEST (avg) | 55-60% | 65-75% |
| Repetition rate | <5% | <3% |

5-2. Upload to the HuggingFace Hub (~1 h)

# Convert to HF format
python scripts/convert_to_hf.py \
    --checkpoint checkpoints/korean_3b_sft/checkpoint-BEST \
    --output_dir hf_models/korean-3b-instruct

# Write the model card
cat > hf_models/korean-3b-instruct/README.md << 'EOF'
---
language: ko
license: apache-2.0
tags:
  - korean
  - llm
  - instruction-tuning
---
# Korean 3B Instruct
...benchmark results, usage, etc....
EOF

# Upload
huggingface-cli upload ghong/korean-3b-instruct hf_models/korean-3b-instruct

5-3. vLLM serving setup (~1 h)

# Start the vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model hf_models/korean-3b-instruct \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --port 8000

# Test
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "korean-3b-instruct",
        "messages": [{"role": "user", "content": "한국의 수도는?"}],
        "temperature": 0.7
    }'

FP8 serving (B200-optimized):

python -m vllm.entrypoints.openai.api_server \
    --model hf_models/korean-3b-instruct \
    --quantization fp8 \
    --tensor-parallel-size 1 \
    --max-model-len 4096

GGUF conversion (Ollama/local deployment):

bash scripts/convert_to_gguf.sh checkpoints/korean_3b_sft/checkpoint-BEST
# Then write an Ollama Modelfile and run:
ollama create korean-3b -f Modelfile

📋 Per-phase summary table

| Phase | Duration | Requirements | Success criteria | On failure |
|---|---|---|---|---|
| 0: Prep | 30 min | prepare_sft_data.py, sft_dataset.py fixes | 120K+ clean samples, 100-step sanity pass | debug the code |
| 1: 1B SFT | 40 min | 8×B200, clean data, fixed code | loss<1.90, stable val loss | adjust LR or re-inspect the data |
| 2: 1B eval | 2 h | lm-eval-harness, eval scripts | rep rate<5%, ko_ifeval>25% | Phase 3A |
| 3A: Tuning | 3-5 h | preference data, ORPO/extra SFT | rep rate<5% achieved | accept 1B limits → 3B |
| 3B: 3B PT | 26 h | 150B+ tokens, configs/korean_3b_fp8.yaml | PPL<10, loss falling | add data or adjust the architecture |
| 4: 3B SFT | 2 h | reuse the Phase 0 clean data | loss<1.85, rep rate<3% | adjust LR/epochs |
| 5: Deploy | 4 h | HF account, vLLM | ko_ifeval>35%, serving healthy | improve the model, redeploy |

🔥 The first command to run today

cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python data/prepare_sft_data.py --output_dir data/sft_v2/ --val_split 0.1

This one command kicks off Phase 0's most important task: generating the clean data.


⚡ The key decision points

1st decision: after Phase 1 completes (step 10000)

  • Val loss at 1.2× train loss or more? → Overfitting. Use the best checkpoint.
  • Train loss > 2.0? → Something is wrong. Re-inspect code/data.

2nd decision: after the Phase 2 evaluation (most important!)

  • Rep rate <5% AND ko_ifeval >25%? → ✅ move to 3B (Phase 3B)
  • Rep rate 5-15%? → ⚠️ try ORPO (Phase 3A)
  • Rep rate >15%? → ❌ root-cause analysis; re-examine data/code.

3rd decision: mid-Phase 3B (3B pretrain step 50000)

  • Loss stopped falling? → Data quality problem; tighten filtering.
  • PPL > 15? → Not enough data; collect more.

🛡️ Risk matrix

| Risk | Probability | Impact | Prevention/response |
|---|---|---|---|
| Dynamic padding still broken | 10% | high (3-8× wasted speed) | check batch lengths during the sanity check |
| Over-aggressive filtering (under 100K) | 15% | medium | relax the filter threshold (80 chars → 50 chars) |
| 1B still repeats >15% after retraining | 15% | medium | ORPO, or move to 3B |
| OOM during 3B pretrain | 10% | high | reduce batch_size, gradient checkpointing |
| cc100 recollection runs over time | 20% | low | start with CulturaX only (24.8B) |
| Out of disk space | 5% | high | 19TB currently free; sufficient |

"Don't take on technical debt to save 40 minutes. Invest 3 hours and build a clean foundation."

Update this document with results as each phase completes.