# ๐Ÿ—บ๏ธ MASTER PLAN: ํ•œ๊ตญ์–ด LLM 1B ์žฌํ•™์Šต โ†’ 3B โ†’ ๋ฐฐํฌ **์ž‘์„ฑ์ผ**: 2026-02-27 **ํ”„๋กœ์ ํŠธ**: `/PROJECT/0325120031_A/ghong/taketimes/llm-bang/` **๊ฒฐ์ •**: Restart (base checkpoint์—์„œ ํด๋ฆฐ ์žฌํ•™์Šต) **์ด ์˜ˆ์ƒ ๊ธฐ๊ฐ„**: ~35์‹œ๊ฐ„ (1B: 3์‹œ๊ฐ„ โ†’ 3B pretrain: 26์‹œ๊ฐ„ โ†’ 3B SFT+ํ‰๊ฐ€: 6์‹œ๊ฐ„) --- ## ๐Ÿ“Š ์ „์ฒด ํƒ€์ž„๋ผ์ธ ํ•œ๋ˆˆ์— ๋ณด๊ธฐ ``` Phase 0 โ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 30๋ถ„ ๋ฐ์ดํ„ฐ/์ฝ”๋“œ ์ค€๋น„ Phase 1 โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 40๋ถ„ 1B SFT ์žฌํ•™์Šต Phase 2 โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 2์‹œ๊ฐ„ 1B ํ‰๊ฐ€ โ”€โ”€โ”€โ”€โ”€โ”€ ์—ฌ๊ธฐ์„œ ํŒ๋‹จ โ”€โ”€โ”€โ”€โ”€โ”€ Phase 3A โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 3-5์‹œ๊ฐ„ (์กฐ๊ฑด๋ถ€) 1B ์ถ”๊ฐ€ ๊ฐœ์„  Phase 3B โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 26์‹œ๊ฐ„ 3B ์‚ฌ์ „ํ•™์Šต Phase 4 โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 2์‹œ๊ฐ„ 3B SFT Phase 5 โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ 4์‹œ๊ฐ„ ํ‰๊ฐ€ & ๋ฐฐํฌ ``` --- ## Phase 0: ์žฌํ•™์Šต ์ง์ „ ์ค€๋น„ (์˜ค๋Š˜, ~30๋ถ„) ### ์ฒดํฌ๋ฆฌ์ŠคํŠธ #### โ˜ 0-1. ๋ฐ์ดํ„ฐ ์žฌ์ƒ์„ฑ (~20๋ถ„) ```bash cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang # prepare_sft_data.py ์žฌ์‹คํ–‰ (๊ฐ•ํ™” ํ•„ํ„ฐ + ์ˆ˜์ •๋œ ๊ฐ€์ค‘์น˜) python data/prepare_sft_data.py \ --output_dir data/sft_v2/ \ --val_split 0.1 ``` **ํ™•์ธ ์‚ฌํ•ญ**: - ํ•„ํ„ฐ๋ง ํ›„ **120K-135K ์ƒ˜ํ”Œ** ๋‚จ์•„์•ผ ํ•จ (๊ธฐ์กด 159K์—์„œ ์ €ํ’ˆ์งˆ ์ œ๊ฑฐ) - `` ๋ฆฌํ„ฐ๋Ÿด 113๊ฑด, Q/A ๋งˆ์ปค ~550๊ฑด, ์ž์ฒด๋ฐ˜๋ณต 57๊ฑด ์ œ๊ฑฐ ํ™•์ธ - OpenOrca ๊ฐ€์ค‘์น˜: 5.0 โ†’ 2.0์œผ๋กœ ๊ฐ์†Œ ํ™•์ธ - Val split: ~12-13K ์ƒ˜ํ”Œ (10%) - ์งง์€ output (<80์ž) ์ œ๊ฑฐ ํ™•์ธ ```bash # ๊ฒฐ๊ณผ ํ™•์ธ wc -l data/sft_v2/train.jsonl data/sft_v2/val.jsonl # ์˜ˆ์ƒ: train ~108K-120K, val ~12K-13K ``` **์™„๋ฃŒ ๊ธฐ์ค€**: train 100K+ ์ƒ˜ํ”Œ, val 10K+ ์ƒ˜ํ”Œ. 
์ œ๊ฑฐ๋œ ์ƒ˜ํ”Œ spot check ์‹œ ์‹ค์ œ ์ €ํ’ˆ์งˆ. #### โ˜ 0-2. sft_dataset.py ์ˆ˜์ • ํ™•์ธ (~5๋ถ„) ์ด๋ฏธ ์ˆ˜์ •๋œ ํ•ญ๋ชฉ ํ™•์ธ: | ์ˆ˜์ • ์‚ฌํ•ญ | ํŒŒ์ผ | ํ™•์ธ | |-----------|------|------| | Dynamic padding ์‹ค์ œ ์ž‘๋™ | `data/sft_dataset.py` `__getitem__` | โ˜ ํŒจ๋”ฉ ์—†์ด ์‹ค์ œ ๊ธธ์ด ํ…์„œ ๋ฐ˜ํ™˜ | | EOS ๋ณด์กด | `data/sft_dataset.py` L130-134 | โ˜ `response_ids[:allowed-1] + [eos_id]` | | Collate fn | `data/sft_dataset.py` `dynamic_collate_fn` | โ˜ ๋ฐฐ์น˜๋ณ„ ๊ฐ€๋ณ€ ํŒจ๋”ฉ | ```bash # ํ•ต์‹ฌ ์ฝ”๋“œ ํ™•์ธ grep -n "allowed_response" data/sft_dataset.py grep -n "eos_token_id" data/sft_dataset.py grep -n "torch.full" data/sft_dataset.py # 4096 ๊ณ ์ • ํŒจ๋”ฉ ์—†์–ด์•ผ ํ•จ ``` #### โ˜ 0-3. launch_sft.sh ์ˆ˜์ • (~5๋ถ„) ```bash # ๋ณ€๊ฒฝํ•  ๊ฐ’๋“ค: # RUN_NAME=korean_1b_sft_v2 # SFT_DATA=data/sft_v2/train.jsonl # VAL_DATA=data/sft_v2/val.jsonl # MAX_STEPS=10000 (3-4 epoch, ๊ธฐ์กด 5000์—์„œ ์ฆ๊ฐ€) # WARMUP_STEPS=300 (3%) cp scripts/launch_sft.sh scripts/launch_sft_v2.sh # ํŽธ์ง‘ ํ›„ diff ํ™•์ธ ``` #### โ˜ 0-4. Sanity Check (~5๋ถ„) ```bash # 100 steps๋งŒ ๋น ๋ฅด๊ฒŒ ๋Œ๋ ค์„œ ํŒŒ์ดํ”„๋ผ์ธ ์ •์ƒ ํ™•์ธ bash scripts/launch_sft_v2.sh --max_steps 100 # ํ™•์ธ: # - Loss๊ฐ€ 2.0-2.5 ๋ฒ”์œ„์—์„œ ์‹œ์ž‘ํ•˜๋Š”๊ฐ€? โœ… # - ๋ฐฐ์น˜ ๋‚ด ์‹œํ€€์Šค ๊ธธ์ด๊ฐ€ ๊ฐ€๋ณ€์ ์ธ๊ฐ€? (๋กœ๊ทธ์—์„œ ํ™•์ธ) โœ… # - Val loss๊ฐ€ ์ถœ๋ ฅ๋˜๋Š”๊ฐ€? โœ… # - OOM ์—†๋Š”๊ฐ€? โœ… ``` **์™„๋ฃŒ ๊ธฐ์ค€**: 100 steps ์—๋Ÿฌ ์—†์ด ์™„๋ฃŒ, loss ํ•ฉ๋ฆฌ์  ๋ฒ”์œ„, val loss ์ถœ๋ ฅ ํ™•์ธ. 
---

## Phase 1: 1B SFT retraining (today, ~40 min)

### Launch command

```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang

RUN_NAME=korean_1b_sft_v2 \
BASE_CHECKPOINT=checkpoints/korean_1b_fp8_run1/checkpoint-0034000 \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
WARMUP_STEPS=300 \
LR=2.0e-5 \
bash scripts/launch_sft.sh
```

### Monitoring

**Live log**:
```bash
tail -f checkpoints/korean_1b_sft_v2/train.log
```

**TensorBoard**:
```bash
tensorboard --logdir checkpoints/korean_1b_sft_v2/tensorboard --port 6007
```

**Key metrics**:

| Metric | Normal range | Warning | Abort immediately |
|--------|--------------|---------|-------------------|
| Train loss | starts 2.0-2.5, final <1.90 | >2.5 at step 500+ | >3.0 (divergence) |
| Val loss | 1.0-1.1× train | 1.2× train | keeps rising vs. train (overfitting) |
| GNorm | 0.8-1.5 | >2.0 | >5.0 (gradient explosion) |
| Throughput | 2×+ the previous run (dynamic-padding gain) | similar to before | slower than before |

**Checkpoints to watch**:
- Step 500: pipeline stability
- Step 2500: midpoint, loss trend
- Step 5000: compare against the previous run (loss should be < 1.97)
- Step 7500: convergence check
- Step 10000: final

### Success criteria

| Metric | Target | Failure |
|--------|--------|---------|
| Final train loss | < 1.90 | > 2.00 |
| Final val loss | < 2.00 | > 1.2× train |
| Val-loss trend | falling or stable | rises 3 evals in a row (overfitting) |
| Training time | ~40-60 min | >2 h (dynamic padding not working) |

### If it fails

| Symptom | Likely cause | Response |
|---------|--------------|----------|
| Loss diverges (>3.0) | LR too high or a data bug | retry with LR=1e-5 |
| OOM | batch too large | reduce to BATCH_SIZE=2 |
| Loss plateaus (no change past step 2000) | LR too low or a data issue | inspect the data, try LR=3e-5 |
| Val loss diverges (overfitting) | too many epochs | early-stop at the best val checkpoint |
| Throughput same as before | dynamic padding not working | re-inspect sft_dataset.py |

---

## Phase 2: 1B SFT evaluation (~2 h)

### Evaluation order

#### 2-1. Measure the repetition rate (30 min)

```bash
# Generation test with the correct format (<|user|>/<|assistant|>)
python eval/test_generation_params.py \
  --checkpoint checkpoints/korean_1b_sft_v2/checkpoint-0010000

# Sweep rep_penalty values
# rep_penalty=1.0 (none): target <10%
# rep_penalty=1.1: target <3%
# rep_penalty=1.2: target <1%
```

#### 2-2. Subjective generation quality (30 min)

```bash
python eval/generate.py \
  --checkpoint checkpoints/korean_1b_sft_v2/checkpoint-0010000 \
  --prompts_file eval/test_prompts.txt \
  --temperature 0.8 --top_p 0.9
```

**Check**: natural Korean, instruction following, clean EOS termination

#### 2-3. Official benchmarks (1 h)

```bash
# ko_ifeval
lm_eval --model hf \
  --model_args pretrained=checkpoints/korean_1b_sft_v2/checkpoint-0010000,dtype=bfloat16 \
  --tasks ko_ifeval \
  --device cuda:0 \
  --output_path eval/results/sft_v2_ko_ifeval.json

# ko_winogrande (optional)
lm_eval --model hf \
  --model_args pretrained=checkpoints/korean_1b_sft_v2/checkpoint-0010000,dtype=bfloat16 \
  --tasks ko_winogrande \
  --device cuda:0 \
  --output_path eval/results/sft_v2_ko_winogrande.json
```

### Decision criteria & branching

```
                 [Phase 2 results]
                         │
  ┌──────────────────────┼──────────────────────┐
  │                      │                      │
✅ PASS              ⚠️ PARTIAL             ❌ FAIL
rep <5%              rep 5-15%              rep >15%
ko_ifeval >25%       ko_ifeval 15-25%       ko_ifeval <15%
  │                      │                      │
  ▼                      ▼                      ▼
Phase 3B             Phase 3A              root-cause analysis
(go to 3B)           (further tuning)      (re-examine data/code)
```

**Detailed criteria**:

| Metric | ✅ Pass | ⚠️ More tuning | ❌ Retrain |
|--------|---------|----------------|------------|
| Repetition (no rep_penalty) | <10% | 10-20% | >20% |
| Repetition (rep_penalty=1.1) | <5% | 5-15% | >15% |
| ko_ifeval | >25% | 15-25% | <15% |
| Clean EOS termination rate | >85% | 60-85% | <60% |

---

## Phase 3A: 1B further improvement (conditional, ~3-5 h)

> **Enter only if Phase 2 comes out ⚠️ PARTIAL**

### Option A: ORPO training (~3 h)

#### Prepare preference data (1 h)

```bash
# Download Korean preference data
python -c "
from datasets import load_dataset
# Option 1: ko_Ultrafeedback (60K, general domain)
ds = load_dataset('maywell/ko_Ultrafeedback')
# Option 2: self-generate (use the current model to produce rejected samples)
"
```

**Self-generation recipe**:
1. Generate multiple completions per prompt with the current SFT model
2. Repetitive / low-quality outputs → rejected
3. Gold answers from the clean data → chosen
4. Build ~10K-20K pairs

#### ORPO training (1.5 h)

```python
from trl import ORPOConfig, ORPOTrainer

config = ORPOConfig(
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    beta=0.1,  # ORPO coefficient
)
trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_data,
    processing_class=tokenizer,  # the trainer needs the tokenizer for templating
)
trainer.train()
```

#### Evaluation (30 min)

- Re-measure repetition: target <5% (rep_penalty=1.1)
- Re-measure ko_ifeval: target >20%

### Option B: Additional SFT (data reinforcement, ~5 h)

#### Collect additional data (2 h)

```python
from datasets import load_dataset

# Add high-quality Korean data
datasets = {
    "hPark/orca-ko": 200_000,                            # high-quality synthetic
    "nayohan/llama3-instruct-ko-dataset": 58_000,        # Llama3 Korean
    "FreedomIntelligence/evol-instruct-korean": 70_000,  # GPT-4 generated
}
# Existing 120K + ~300K new → ~350K after filtering
```

#### Retrain (2 h)

```bash
# Retrain on the enlarged dataset
RUN_NAME=korean_1b_sft_v3 \
SFT_DATA=data/sft_v3/train.jsonl \
MAX_STEPS=15000 \
bash scripts/launch_sft.sh
```

### Phase 3A success criteria

| Metric | Target |
|--------|--------|
| Repetition (rep_penalty=1.1) | <5% |
| ko_ifeval | >20% |

**If this fails**: accept the 1B's limits and move straight to Phase 3B (the 3B model).
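The self-generation recipe above (repetitive outputs as rejected, gold answers as chosen) could be driven by a simple n-gram heuristic along these lines. Everything here is hypothetical illustration, not project code: `repetition_ratio`, `build_preference_pairs`, and the `generate`/`gold` callables are made-up names.

```python
def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of duplicated word n-grams; a crude proxy for degenerate output."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def build_preference_pairs(prompts, generate, gold, threshold=0.2):
    """Pair each prompt's gold answer (chosen) with one of the model's own
    outputs flagged as repetitive (rejected).

    generate(prompt) -> list of sampled completions (hypothetical callable)
    gold: dict mapping prompt -> reference answer from the clean data
    """
    pairs = []
    for p in prompts:
        for cand in generate(p):  # several samples per prompt
            if repetition_ratio(cand) > threshold:
                pairs.append({"prompt": p, "chosen": gold[p], "rejected": cand})
                break  # one pair per prompt is enough
    return pairs
```

A threshold sweep on held-out generations would be needed to pick a cutoff that separates genuinely degenerate outputs from legitimately repetitive ones (lists, tables).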
---

## Phase 3B: 3B pretraining (after passing Phase 2, ~26 h)

### 3B model architecture

| Parameter | 1B (current) | 3B (target) | Note |
|-----------|--------------|-------------|------|
| d_model | 2048 | 2560 | ~1.25× |
| n_layers | 24 | 32 | ~1.33× |
| n_heads | 16 | 32 | 2× |
| n_kv_heads (GQA) | 4 | 8 | 2× |
| d_ffn | 5472 | 6912 | ~1.26× |
| vocab_size | 64000 | 64000 | same |
| max_seq_len | 4096 | 4096 | same |
| **Total parameters** | **1.19B** | **~3.0B** | ~2.5× |

### Write the config file

```bash
# write configs/korean_3b_fp8.yaml
cat > configs/korean_3b_fp8.yaml << 'EOF'
model:
  d_model: 2560
  n_layers: 32
  n_heads: 32
  n_kv_heads: 8
  d_ffn: 6912
  vocab_size: 64000
  max_seq_len: 4096
  rope_theta: 500000
training:
  lr: 3.0e-4
  min_lr: 3.0e-5
  warmup_steps: 2000
  max_steps: 100000
  batch_size: 4
  grad_accum: 4
  weight_decay: 0.1
  use_fp8: true
data:
  sources:
    - cc100_ko
    - culturax_ko
    - existing_pretrain
EOF
```

### Pretraining data

| Source | Tokens | Status |
|--------|--------|--------|
| CulturaX ko | 24.8B | ✅ on hand |
| cc100 ko (recollected) | ~65-100B | ⚠️ needs recollection (noise filtering) |
| Existing pretrain data | ~8.9B | ✅ on hand |
| Extra collection (Namuwiki, news, etc.) | ~20-50B | optional |
| **Total** | **~120-180B** | meets the Chinchilla minimum of 60B |

**Data prep commands**:

```bash
# re-collect cc100 with quality filtering
python scripts/download_cc100_ko.py --quality_filter --dedup

# MinHash dedup + perplexity filter
python scripts/quality_filter.py --input data/pretrain/ --max_ppl 1000
```

### Run training

```bash
# start the 3B pretrain (8× B200, ~26 h)
bash scripts/run_pretrain.sh --config configs/korean_3b_fp8.yaml

# Expected throughput: ~1.6M tok/s (8× B200)
# 150B tokens / 1.6M tok/s ≈ 26 h
```

### Monitoring

```bash
# logs
tail -f checkpoints/korean_3b_fp8/train.log

# check base quality at intermediate checkpoints (every 10000 steps)
python eval/perplexity.py --checkpoint checkpoints/korean_3b_fp8/checkpoint-0010000
```

**Success criteria**: PPL < 10 on Korean text, loss still falling

---

## Phase 4: 3B SFT (~2 h)

### Apply every lesson learned on the 1B

| Lesson | Applied |
|--------|---------|
| Verify dynamic padding works | ✅ sft_dataset.py already fixed, reuse as-is |
| Preserve EOS | ✅ same code |
| Val split is mandatory | ✅ 10% split |
| 3-4 epochs | ✅ compute MAX_STEPS accordingly |
| Avoid over-weighting OpenOrca | ✅ ≤2.0× |
| Data quality filtering | ✅ reuse the clean data built in Phase 0 |
| Correct prompt format | ✅ `<\|user\|>/<\|assistant\|>` |

### Run

```bash
RUN_NAME=korean_3b_sft \
BASE_CHECKPOINT=checkpoints/korean_3b_fp8/checkpoint-BEST \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
LR=2.0e-5 \
WARMUP_STEPS=300 \
bash scripts/launch_sft.sh
```

**Expected duration**: ~2 h (the 3B is ~2.5× slower than the 1B)

### Success criteria

| Metric | Target |
|--------|--------|
| Train loss | < 1.85 |
| Val loss | ≤ 1.1× train |
| Repetition (no rep_penalty) | < 10% |
| Repetition (rep_penalty=1.1) | < 3% |

---

## Phase 5: Evaluation & deployment (~4 h)

### 5-1. Full benchmarks (~2 h)

```bash
# ko_ifeval
lm_eval --model hf \
  --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
  --tasks ko_ifeval --device cuda:0

# ko_winogrande
lm_eval --model hf \
  --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
  --tasks ko_winogrande --device cuda:0

# KoBEST (optional)
lm_eval --model hf \
  --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
  --tasks kobest_boolq,kobest_copa,kobest_wic,kobest_hellaswag,kobest_sentineg \
  --device cuda:0
```

**3B target numbers**:

| Benchmark | 1B expected | 3B target |
|-----------|-------------|-----------|
| ko_ifeval | 20-30% | **35-45%** |
| ko_winogrande | 53-58% | **60-68%** |
| KoBEST (avg) | 55-60% | **65-75%** |
| Repetition | <5% | **<3%** |

### 5-2. Upload to the HuggingFace Hub (~1 h)

```bash
# convert to HF format
python scripts/convert_to_hf.py \
  --checkpoint checkpoints/korean_3b_sft/checkpoint-BEST \
  --output_dir hf_models/korean-3b-instruct

# write the model card
cat > hf_models/korean-3b-instruct/README.md << 'EOF'
---
language: ko
license: apache-2.0
tags:
  - korean
  - llm
  - instruction-tuning
---
# Korean 3B Instruct
...benchmark results, usage, etc...
EOF

# upload
huggingface-cli upload ghong/korean-3b-instruct hf_models/korean-3b-instruct
```

### 5-3. vLLM serving setup (~1 h)

```bash
# start the vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model hf_models/korean-3b-instruct \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --max-model-len 4096 \
  --port 8000

# smoke test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "korean-3b-instruct",
    "messages": [{"role": "user", "content": "한국의 수도는?"}],
    "temperature": 0.7
  }'
```

**FP8 serving (optimal on B200)**:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model hf_models/korean-3b-instruct \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --max-model-len 4096
```

**GGUF conversion (Ollama / local deployment)**:

```bash
bash scripts/convert_to_gguf.sh checkpoints/korean_3b_sft/checkpoint-BEST
# write an Ollama Modelfile, then:
ollama create korean-3b -f Modelfile
```

---

## 📋 Per-phase summary table

| Phase | Time | Needs | Success criteria | On failure |
|-------|------|-------|------------------|------------|
| **0: Prep** | 30 min | prepare_sft_data.py, sft_dataset.py fixes | 120K+ clean samples, 100-step sanity pass | debug the code |
| **1: 1B SFT** | 40 min | 8×B200, clean data, fixed code | loss<1.90, stable val loss | tune LR or re-inspect data |
| **2: 1B eval** | 2 h | lm-eval-harness, eval scripts | repetition<5%, ko_ifeval>25% | Phase 3A |
| **3A: Improve** | 3-5 h | preference data, ORPO / extra SFT | repetition<5% reached | accept 1B limits → 3B |
| **3B: 3B PT** | 26 h | 150B+ tokens, configs/korean_3b_fp8.yaml | PPL<10, loss falling | add data or adjust architecture |
| **4: 3B SFT** | 2 h | reuse the Phase 0 clean data | loss<1.85, repetition<3% | tune LR/epochs |
| **5: Deploy** | 4 h | HF account, vLLM | ko_ifeval>35%, serving healthy | improve the model, redeploy |

---

## 🔥 The first command to run right now

```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python data/prepare_sft_data.py --output_dir data/sft_v2/ --val_split 0.1
```

This single command kicks off the most important Phase 0 task: building the clean data.

---

## ⚡ The decision points that matter most

### Decision 1: after Phase 1 completes (step 10000)
- **Val loss ≥ 1.2× train loss?** → overfitting; use the best checkpoint.
- **Train loss > 2.0?** → something is wrong; re-inspect code/data.

### Decision 2: after the Phase 2 evaluation (the critical one!)
- **Repetition <5% AND ko_ifeval >25%?** → ✅ go to 3B (Phase 3B)
- **Repetition 5-15%?** → ⚠️ try ORPO (Phase 3A)
- **Repetition >15%?** → ❌ root-cause analysis; re-examine data/code.

### Decision 3: mid Phase 3B (3B pretrain step 50000)
- **Loss stopped falling?** → data-quality problem; tighten filtering.
- **PPL > 15?** → not enough data; collect more.

---

## 🛡️ Risk matrix

| Risk | Prob. | Impact | Mitigation |
|------|-------|--------|------------|
| Dynamic padding still broken | 10% | high (wastes 3-8× compute) | verify batch lengths in the sanity check |
| Over-filtering (under 100K samples) | 15% | medium | relax the filter (80 → 50 chars) |
| Repetition still >15% after the 1B retrain | 15% | medium | ORPO or switch to 3B |
| OOM during 3B pretrain | 10% | high | smaller batch size, gradient checkpointing |
| cc100 recollection overruns | 20% | low | start with CulturaX alone (24.8B) |
| Out of disk space | 5% | high | 19TB currently free; sufficient |

---

*"Don't take on technical debt to save 40 minutes. Invest 3 hours and build a clean foundation."*

*Update this document with results as each phase completes.*