# 🗺️ MASTER PLAN: Korean LLM 1B Retraining → 3B → Deployment
**Date written**: 2026-02-27
**Project**: `/PROJECT/0325120031_A/ghong/taketimes/llm-bang/`
**Decision**: Restart (clean retraining from the base checkpoint)
**Total estimated time**: ~35 hours (1B: 3 hours → 3B pretrain: 26 hours → 3B SFT + evaluation: 6 hours)
---
## 📊 Full Timeline at a Glance
```
Phase 0   ██░░░░░░░░░░░░░░░░░░░░░░  30 min     Data/code preparation
Phase 1   ████░░░░░░░░░░░░░░░░░░░░  40 min     1B SFT retraining
Phase 2   ██████░░░░░░░░░░░░░░░░░░  2 hours    1B evaluation
          ────── decision point ──────
Phase 3A  ████████░░░░░░░░░░░░░░░░  3-5 hours  (conditional) 1B further improvement
Phase 3B  ████████████████████████  26 hours   3B pretraining
Phase 4   ████░░░░░░░░░░░░░░░░░░░░  2 hours    3B SFT
Phase 5   ██████░░░░░░░░░░░░░░░░░░  4 hours    Evaluation & deployment
```
---
## Phase 0: Preparation right before retraining (today, ~30 min)
### Checklist
#### ☐ 0-1. Regenerate the data (~20 min)
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
# Re-run prepare_sft_data.py (stricter filters + revised weights)
python data/prepare_sft_data.py \
--output_dir data/sft_v2/ \
--val_split 0.1
```
**Checks**:
- After filtering, **120K-135K samples** should remain (low-quality samples removed from the original 159K)
- Confirm removal of 113 `</s>` literals, ~550 Q/A markers, and 57 self-repetition cases
- Confirm the OpenOrca weight is reduced from 5.0 to 2.0
- Val split: ~12-13K samples (10%)
- Confirm short outputs (<80 characters) are removed
```bash
# Check the results
wc -l data/sft_v2/train.jsonl data/sft_v2/val.jsonl
# Expected: train ~108K-120K, val ~12K-13K
```
**Completion criteria**: train has 100K+ samples and val has 10K+ samples. A spot check of the removed samples confirms they really are low quality.
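The completion criteria above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming each JSONL record carries an `output` string field; the real schema written by `prepare_sft_data.py` is not shown in this plan, and the function name `audit_split` is hypothetical.

```python
import json


def audit_split(lines, min_output_chars=80):
    """Count records and flag ones the Phase 0 filters should have removed.

    Assumes every non-empty line is a JSON object with an "output" string
    (hypothetical schema; adjust to what prepare_sft_data.py really writes).
    """
    total = 0
    eos_literal = 0
    too_short = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        output = json.loads(line).get("output", "")
        if "</s>" in output:          # literal EOS text left inside a sample
            eos_literal += 1
        if len(output) < min_output_chars:
            too_short += 1
    return {"total": total, "eos_literal": eos_literal, "too_short": too_short}
```

Passing an open file handle for `data/sft_v2/train.jsonl` as `lines` should report a `total` above 100K with both flag counts at zero if the filters behaved as intended.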
#### ☐ 0-2. Verify the sft_dataset.py fixes (~5 min)
Confirm the items that have already been fixed:
| Fix | File | Check |
|-----------|------|------|
| Dynamic padding actually works | `data/sft_dataset.py` `__getitem__` | ☐ Returns tensors at their actual length, without padding |
| EOS preserved | `data/sft_dataset.py` L130-134 | ☐ `response_ids[:allowed-1] + [eos_id]` |
| Collate fn | `data/sft_dataset.py` `dynamic_collate_fn` | ☐ Variable padding per batch |
```bash
# Check the key code
grep -n "allowed_response" data/sft_dataset.py
grep -n "eos_token_id" data/sft_dataset.py
grep -n "torch.full" data/sft_dataset.py # there should be no fixed 4096 padding
```
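To make the two fixes concrete, here is a plain-Python sketch of the behavior being checked. It is not the code in `data/sft_dataset.py` (which presumably works on torch tensors); the function names, the `pad_id` default, and the assumption that `response_ids` does not already end in EOS are all illustrative.

```python
def truncate_keep_eos(response_ids, allowed, eos_id):
    """Fit a response into `allowed` tokens while always ending in EOS.

    Assumes response_ids does not already contain the trailing EOS.
    """
    if allowed <= 0:
        return []
    if len(response_ids) < allowed:
        return response_ids + [eos_id]
    # Drop one extra token so the EOS still fits inside the budget.
    return response_ids[:allowed - 1] + [eos_id]


def dynamic_collate(batch, pad_id=0):
    """Pad each sequence only to the longest one in this batch, not to 4096."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return input_ids, attention_mask
```

With this scheme a batch of short samples costs only as many positions as its longest member, which is where the expected speedup over fixed 4096-token padding would come from.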
#### ☐ 0-3. Edit launch_sft.sh (~5 min)
```bash
# Values to change:
# RUN_NAME=korean_1b_sft_v2
# SFT_DATA=data/sft_v2/train.jsonl
# VAL_DATA=data/sft_v2/val.jsonl
# MAX_STEPS=10000 (3-4 epochs, up from the previous 5000)
# WARMUP_STEPS=300 (3%)
cp scripts/launch_sft.sh scripts/launch_sft_v2.sh
# After editing, review the diff
```
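The relationship between `MAX_STEPS` and the epoch count depends on the effective batch size, which this plan does not state. The sketch below shows the arithmetic under assumed values: 8 GPUs (from the 8× B200 mentioned later in the plan), a per-GPU batch of 4, and no gradient accumulation. The last two are guesses for illustration only.

```python
import math


def steps_for_epochs(num_samples, epochs, per_gpu_batch, num_gpus, grad_accum=1):
    """Optimizer steps needed to see the dataset `epochs` times."""
    samples_per_step = per_gpu_batch * num_gpus * grad_accum
    return math.ceil(num_samples * epochs / samples_per_step)


def epochs_for_steps(num_samples, steps, per_gpu_batch, num_gpus, grad_accum=1):
    """Number of passes over the dataset covered by `steps` optimizer steps."""
    return steps * per_gpu_batch * num_gpus * grad_accum / num_samples


# Assumed: 110K training samples, per-GPU batch 4, 8 GPUs.
print(epochs_for_steps(110_000, 10_000, per_gpu_batch=4, num_gpus=8))
print(300 / 10_000)  # warmup fraction for WARMUP_STEPS=300
```

Under those assumptions 10,000 steps cover roughly 2.9 epochs, so the real batch settings in `launch_sft.sh` are worth confirming before relying on the "3-4 epochs" figure.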
#### ☐ 0-4. Sanity Check (~5 min)
```bash
# Run just 100 steps to confirm the pipeline works
bash scripts/launch_sft_v2.sh --max_steps 100
# Check:
# - Does the loss start in the 2.0-2.5 range? ✅
# - Is the sequence length in each batch variable? (check the log) ✅
# - Is val loss printed? ✅
# - No OOM? ✅
```
**Completion criteria**: 100 steps finish without errors, the loss is in a reasonable range, and val loss is printed.
---
## Phase 1: 1B SFT retraining (today, ~40 min)
### Run command
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
RUN_NAME=korean_1b_sft_v2 \
BASE_CHECKPOINT=checkpoints/korean_1b_fp8_run1/checkpoint-0034000 \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
WARMUP_STEPS=300 \
LR=2.0e-5 \
bash scripts/launch_sft.sh
```
### Monitoring
**Live log**:
```bash
tail -f checkpoints/korean_1b_sft_v2/train.log
```
**TensorBoard**:
```bash
tensorboard --logdir checkpoints/korean_1b_sft_v2/tensorboard --port 6007
```
**Key metrics**:
| Metric | Normal range | Warning | Stop immediately |
|------|----------|------|----------|
| Train Loss | Starts at 2.0-2.5, ends <1.90 | >2.5 at step 500+ | >3.0 (divergence) |
| Val Loss | 1.0-1.1x train | 1.2x train | Keeps rising relative to train (overfitting) |
| GNorm | 0.8-1.5 | >2.0 | >5.0 (gradient explosion) |
| Training speed | 2x+ faster than before (dynamic padding effect) | About the same as before | Slower than before |
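The loss and gradient-norm thresholds in this table translate directly into a small check that could run on parsed log lines. This is a sketch only: the function name is hypothetical, and the trend-based rules (val loss that keeps rising, training speed) are left out because they need history rather than a single step.

```python
def classify_step(step, train_loss, val_loss, gnorm):
    """Return 'stop', 'warn', or 'ok' from the single-step thresholds above."""
    # Immediate-stop conditions: divergence or gradient explosion.
    if train_loss > 3.0 or gnorm > 5.0:
        return "stop"
    warn = (
        (step >= 500 and train_loss > 2.5)   # loss still high after step 500
        or gnorm > 2.0
        or val_loss >= 1.2 * train_loss      # widening train/val gap
    )
    return "warn" if warn else "ok"
```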
**Checkpoints to watch**:
- Step 500: confirm pipeline stability
- Step 2500: intermediate point, check the loss trend
- Step 5000: compare with the previous run (loss should be < 1.97)
- Step 7500: check for convergence
- Step 10000: final
### Success criteria
| Metric | Target | Failure criterion |
|------|------|----------|
| Final Train Loss | < 1.90 | > 2.00 |
| Final Val Loss | < 2.00 | More than 1.2x train |
| Val Loss trend | Falling or stable | 3 consecutive increases (overfitting) |
| Training time | ~40-60 min | >2 hours (dynamic padding not working) |
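The "3 consecutive increases" rule is easy to get off by one, so a short sketch may help. It assumes `val_losses` is the list of validation losses in evaluation order; the function name is illustrative.

```python
def rising_streak(val_losses, n=3):
    """True if each of the last n evaluations was higher than the one before it."""
    if len(val_losses) < n + 1:
        return False  # n increases need n + 1 data points
    tail = val_losses[-(n + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))
```

When this returns `True`, the checkpoint with the lowest validation loss so far is the natural candidate to keep.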
### Response on failure
| Situation | Likely cause | Response |
|------|----------|------|
| Loss diverges (>3.0) | LR too high or a data bug | Retry with LR=1e-5 |
| OOM | Batch size too large | Reduce to BATCH_SIZE=2 |
| Loss plateaus (no change after step 2000+) | LR too low or a data problem | Inspect the data, try LR=3e-5 |
| Val Loss diverges (overfitting) | Too many epochs | Early stop at best val checkpoint |
| Training speed same as before | Dynamic padding not working | Re-inspect sft_dataset.py |
---
## Phase 2: 1B SFT evaluation (~2 hours)
### Evaluation order
#### 2-1. Measure the repetition rate (30 min)
```bash
# Generation test using the correct format (<|user|>/<|assistant|>)
python eval/test_generation_params.py \
--checkpoint checkpoints/korean_1b_sft_v2/checkpoint-0010000
# Test several rep_penalty values
# rep_penalty=1.0 (none): target <10%
# rep_penalty=1.1: target <3%
# rep_penalty=1.2: target <1%
```
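The plan does not say how `eval/test_generation_params.py` defines the repetition rate. One common choice, assumed here purely for illustration, is the share of n-grams in an output that duplicate an earlier n-gram in the same output.

```python
def ngram_repetition_rate(tokens, n=4):
    """Fraction of n-grams that repeat an n-gram seen earlier in the output."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # output shorter than n tokens: nothing to repeat
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

Averaging this over all test prompts and multiplying by 100 gives a percentage comparable to the targets above, provided the real script uses a similar definition.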
#### 2-2. Subjective evaluation of generation quality (30 min)
```bash
python eval/generate.py \
--checkpoint checkpoints/korean_1b_sft_v2/checkpoint-0010000 \
--prompts_file eval/test_prompts.txt \
--temperature 0.8 --top_p 0.9
```
**Check**: naturalness of the Korean, instruction following, clean EOS termination
#### 2-3. Official benchmarks (1 hour)
```bash
# ko_ifeval
lm_eval --model hf \
--model_args pretrained=checkpoints/korean_1b_sft_v2/checkpoint-0010000,dtype=bfloat16 \
--tasks ko_ifeval \
--device cuda:0 \
--output_path eval/results/sft_v2_ko_ifeval.json
# ko_winogrande (optional)
lm_eval --model hf \
--model_args pretrained=checkpoints/korean_1b_sft_v2/checkpoint-0010000,dtype=bfloat16 \
--tasks ko_winogrande \
--device cuda:0 \
--output_path eval/results/sft_v2_ko_winogrande.json
```
### Decision criteria & branching
```
                  [Phase 2 evaluation results]
                               │
         ┌─────────────────────┼─────────────────────┐
         │                     │                     │
      ✅ PASS             ⚠️ PARTIAL              ❌ FAIL
  Repetition <5%      Repetition 5-15%       Repetition >15%
  ko_ifeval >25%      ko_ifeval 15-25%       ko_ifeval <15%
         │                     │                     │
         ▼                     ▼                     ▼
     Phase 3B              Phase 3A         Root-cause analysis
   (move to 3B)      (further improvement)  (re-examine data/code)
```
**Detailed criteria**:
| Metric | ✅ Pass | ⚠️ Further tuning | ❌ Retrain |
|------|---------|-------------|----------|
| Repetition rate (no rep_penalty) | <10% | 10-20% | >20% |
| Repetition rate (rep_penalty=1.1) | <5% | 5-15% | >15% |
| ko_ifeval | >25% | 15-25% | <15% |
| EOS clean termination rate | >85% | 60-85% | <60% |
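The branching can be written down as a function so the verdict is not argued after the fact. The criteria do not say what happens when the two headline metrics disagree (for example repetition <5% but ko_ifeval at 20%); this sketch assumes the worse of the two decides, which is an interpretation rather than something the plan states.

```python
def phase2_verdict(rep_rate_pct, ko_ifeval_pct):
    """Map Phase 2 metrics to 'PASS', 'PARTIAL', or 'FAIL'.

    rep_rate_pct is the repetition rate measured with rep_penalty=1.1.
    Assumption: the worse of the two metrics determines the branch.
    """
    if rep_rate_pct > 15 or ko_ifeval_pct < 15:
        return "FAIL"      # root-cause analysis
    if rep_rate_pct < 5 and ko_ifeval_pct > 25:
        return "PASS"      # move to Phase 3B
    return "PARTIAL"       # enter Phase 3A
```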
---
## Phase 3A: 1B further improvement (conditional, ~3-5 hours)
> **Enter only when the Phase 2 result is ⚠️ PARTIAL**
### Option A: ORPO training (~3 hours)
#### Prepare preference data (1 hour)
```bash
# Download Korean preference data
python -c "
from datasets import load_dataset
# Option 1: ko_Ultrafeedback (60K, general domain)
ds = load_dataset('maywell/ko_Ultrafeedback')
# Option 2: self-generated (produce rejected samples with the current model)
"
```
**Self-generation method**:
1. Generate several outputs for the same prompt with the current SFT model
2. Repetitive/low-quality outputs → rejected
3. Reference answers from the clean data → chosen
4. Produce ~10K-20K pairs
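The four steps above can be sketched as a small loop. Everything named here is hypothetical: `generate` and `is_low_quality` are caller-supplied callables standing in for the SFT model and a quality filter, and the `prompt`/`reference` input keys are an assumed schema. The output keys `prompt`, `chosen`, and `rejected` follow the usual preference-dataset layout.

```python
def build_preference_pairs(examples, generate, is_low_quality, samples_per_prompt=4):
    """Pair clean reference answers (chosen) with bad model outputs (rejected).

    examples: iterable of dicts with 'prompt' and 'reference' keys (assumed schema).
    generate(prompt) -> str and is_low_quality(text) -> bool are supplied by the caller.
    """
    pairs = []
    for ex in examples:
        for _ in range(samples_per_prompt):
            candidate = generate(ex["prompt"])
            if is_low_quality(candidate):
                pairs.append({
                    "prompt": ex["prompt"],
                    "chosen": ex["reference"],
                    "rejected": candidate,
                })
                break  # keep at most one pair per prompt
    return pairs
```

Prompts for which the model never produces a low-quality sample contribute no pair, so the number of pairs is also a rough signal of how often the model still fails.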
#### ORPO training (1.5 hours)
```python
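from trl import ORPOConfig, ORPOTrainer

# `model` and `preference_data` (a dataset with prompt/chosen/rejected columns)
# must be loaded before this point. The trainer also needs the tokenizer; the
# keyword for it differs between TRL versions, so check the installed release.
config = ORPOConfig(
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    beta=0.1,  # ORPO coefficient (weight of the odds-ratio term)
)
trainer = ORPOTrainer(model=model, args=config, train_dataset=preference_data)
trainer.train()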
```
#### Evaluation (30 min)
- Re-measure the repetition rate: target <5% (rep_penalty=1.1)
- Re-measure ko_ifeval: target >20%
### Option B: additional SFT (data augmentation, ~5 hours)
#### Collect additional data (2 hours)
```python
from datasets import load_dataset
# Add high-quality Korean data
datasets = {
    "hPark/orca-ko": 200_000,  # high-quality synthetic
    "nayohan/llama3-instruct-ko-dataset": 58_000,  # Llama3 Korean
    "FreedomIntelligence/evol-instruct-korean": 70_000,  # GPT-4 generated
}
# Existing 120K + ~300K additional → ~350K after filtering
```
#### Retraining (2 hours)
```bash
# Retrain on the expanded data
RUN_NAME=korean_1b_sft_v3 \
SFT_DATA=data/sft_v3/train.jsonl \
MAX_STEPS=15000 \
bash scripts/launch_sft.sh
```
### Phase 3A success criteria
| Metric | Target |
|------|------|
| Repetition rate (rep_penalty=1.1) | <5% |
| ko_ifeval | >20% |
**On failure**: accept the limits of the 1B model and move straight to Phase 3B (3B transition).
---
## Phase 3B: 3B pretraining (after passing Phase 2, ~26 hours)
### 3B model architecture
| Parameter | 1B (current) | 3B (target) | Notes |
|---------|----------|----------|------|
| d_model | 2048 | 2560 | ~1.25x |
| n_layers | 24 | 32 | ~1.33x |
| n_heads | 16 | 32 | 2x |
| n_kv_heads (GQA) | 4 | 8 | 2x |
| d_ffn | 5472 | 6912 | ~1.26x |
| vocab_size | 64000 | 64000 | Same |
| max_seq_len | 4096 | 4096 | Same |
| **Total parameters** | **1.19B** | **~3.0B** | ~2.5x |
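A quick parameter count is a useful cross-check on the table. The sketch below assumes a LLaMA-style block: GQA attention with `head_dim = d_model / n_heads`, a three-matrix SwiGLU FFN, no biases, tied input/output embeddings, and norm weights ignored. None of these details are confirmed by this plan, but with them the 1B column comes out at 1.19B, matching the table.

```python
def count_params(d_model, n_layers, n_heads, n_kv_heads, d_ffn, vocab_size,
                 tied_embeddings=True):
    """Approximate parameter count for a GQA + SwiGLU decoder (norms ignored)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O, plus K and V
    ffn = 3 * d_model * d_ffn                            # gate, up, down projections
    embed = vocab_size * d_model * (1 if tied_embeddings else 2)
    return n_layers * (attn + ffn) + embed


print(count_params(2048, 24, 16, 4, 5472, 64000))  # 1B column
print(count_params(2560, 32, 32, 8, 6912, 64000))  # 3B column
```

Under the same assumptions the 3B column gives about 2.39B parameters (about 2.55B with untied embeddings), which is below the ~3.0B listed. If 3.0B is a hard target, the width or depth may need to grow, or the real architecture differs from what is assumed here.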
### Write the config file
```bash
# Write configs/korean_3b_fp8.yaml
cat > configs/korean_3b_fp8.yaml << 'EOF'
model:
  d_model: 2560
  n_layers: 32
  n_heads: 32
  n_kv_heads: 8
  d_ffn: 6912
  vocab_size: 64000
  max_seq_len: 4096
  rope_theta: 500000
training:
  lr: 3.0e-4
  min_lr: 3.0e-5
  warmup_steps: 2000
  max_steps: 100000
  batch_size: 4
  grad_accum: 4
  weight_decay: 0.1
  use_fp8: true
data:
  sources:
    - cc100_ko
    - culturax_ko
    - existing_pretrain
EOF
```
### Pretraining data
| Source | Tokens | Status |
|------|---------|------|
| CulturaX ko | 24.8B | ✅ On hand |
| cc100 ko (re-collected) | ~65-100B | ⚠️ Needs re-collection (noise filtering) |
| Existing pretrain data | ~8.9B | ✅ On hand |
| Additional collection (Namuwiki, news, etc.) | ~20-50B | Optional |
| **Total** | **~120-180B** | Meets the Chinchilla minimum of 60B (~20 tokens per parameter) |
**Data preparation commands**:
```bash
# Re-collect cc100 + quality filtering
python scripts/download_cc100_ko.py --quality_filter --dedup
# MinHash dedup + perplexity filter
python scripts/quality_filter.py --input data/pretrain/ --max_ppl 1000
```
### Run training
```bash
# Start 3B pretraining (8× B200, ~26 hours)
bash scripts/run_pretrain.sh --config configs/korean_3b_fp8.yaml
# Expected throughput: ~1.6M tok/s (8× B200)
# 150B tokens / 1.6M tok/s ≈ 26 hours
```
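The 26-hour estimate and the step budget can be cross-checked with two lines of arithmetic. The second function assumes `batch_size` in the config is per GPU and that 8 GPUs are used; if `batch_size` is already a global value the result changes, so treat this as a sketch of the calculation rather than a statement about the training script.

```python
def training_hours(total_tokens, tokens_per_second):
    """Wall-clock hours to process total_tokens at a steady throughput."""
    return total_tokens / tokens_per_second / 3600


def tokens_seen(max_steps, per_gpu_batch, grad_accum, num_gpus, seq_len):
    """Tokens consumed after max_steps optimizer steps (full-length sequences)."""
    return max_steps * per_gpu_batch * grad_accum * num_gpus * seq_len


print(training_hours(150e9, 1.6e6))            # ~26.0 hours
print(tokens_seen(100_000, 4, 4, 8, 4096))     # tokens covered by max_steps=100000
```

Under these assumptions 100,000 steps cover about 52B tokens, not 150B; reaching 150B would take roughly 286K steps or a larger effective batch, so `max_steps` and the batch settings deserve a second look before launch.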
### Monitoring
```bash
# Check the log
tail -f checkpoints/korean_3b_fp8/train.log
# Check base-model quality at intermediate checkpoints (every 10000 steps)
python eval/perplexity.py --checkpoint checkpoints/korean_3b_fp8/checkpoint-0010000
```
**Success criteria**: PPL < 10 (Korean text), loss keeps falling
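Perplexity and loss are two views of the same number, which makes the PPL target easy to read off the training log. The sketch assumes the logged loss is the mean per-token cross-entropy in nats, evaluated on the same text and tokenizer as `eval/perplexity.py`; that is the usual convention but is not confirmed here.

```python
import math


def perplexity(mean_nll):
    """Perplexity from mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)


def loss_for_perplexity(ppl):
    """Loss value that corresponds to a given perplexity."""
    return math.log(ppl)


print(loss_for_perplexity(10))  # loss needed for PPL < 10
print(loss_for_perplexity(15))  # loss at the PPL > 15 warning level
```

So PPL < 10 corresponds to a loss below about 2.30, and PPL 15 to a loss of about 2.71.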
---
## Phase 4: 3B SFT (~2 hours)
### Apply every lesson learned from 1B
| Lesson | How it is applied |
|------|------|
| Verify dynamic padding works | ✅ sft_dataset.py already fixed, used as is |
| EOS preserved | ✅ Same code |
| Val split is mandatory | ✅ 10% split |
| 3-4 epochs | ✅ Compute MAX_STEPS and set it accordingly |
| Avoid overweighting OpenOrca | ✅ 2.0x or less |
| Data quality filtering | ✅ Use the clean data generated in Phase 0 |
| Correct prompt format | ✅ `<\|user\|>/<\|assistant\|>` |
### Run
```bash
RUN_NAME=korean_3b_sft \
BASE_CHECKPOINT=checkpoints/korean_3b_fp8/checkpoint-BEST \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
LR=2.0e-5 \
WARMUP_STEPS=300 \
bash scripts/launch_sft.sh
```
**Estimated time**: ~2 hours (3B is ~2.5x slower than 1B)
### Success criteria
| Metric | Target |
|------|------|
| Train Loss | < 1.85 |
| Val Loss | Within 1.1x train |
| Repetition rate (no rep_penalty) | < 10% |
| Repetition rate (rep_penalty=1.1) | < 3% |
---
## Phase 5: Evaluation and deployment (~4 hours)
### 5-1. Full benchmarks (~2 hours)
```bash
# ko_ifeval
lm_eval --model hf \
--model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
--tasks ko_ifeval --device cuda:0
# ko_winogrande
lm_eval --model hf \
--model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
--tasks ko_winogrande --device cuda:0
# KoBEST (optional)
lm_eval --model hf \
--model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
--tasks kobest_boolq,kobest_copa,kobest_wic,kobest_hellaswag,kobest_sentineg \
--device cuda:0
```
**3B target numbers**:
| Benchmark | 1B expected | 3B target |
|---------|---------|---------|
| ko_ifeval | 20-30% | **35-45%** |
| ko_winogrande | 53-58% | **60-68%** |
| KoBEST (avg) | 55-60% | **65-75%** |
| Repetition rate | <5% | **<3%** |
### 5-2. Upload to the HuggingFace Hub (~1 hour)
```bash
# Convert to HF format
python scripts/convert_to_hf.py \
--checkpoint checkpoints/korean_3b_sft/checkpoint-BEST \
--output_dir hf_models/korean-3b-instruct
# Write the model card
cat > hf_models/korean-3b-instruct/README.md << 'EOF'
---
language: ko
license: apache-2.0
tags:
- korean
- llm
- instruction-tuning
---
# Korean 3B Instruct
...benchmark results, usage, etc...
EOF
# Upload
huggingface-cli upload ghong/korean-3b-instruct hf_models/korean-3b-instruct
```
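### 5-3. vLLM serving setup (~1 hour)
```bash
# Start the vLLM server
# --served-model-name lets requests use the short name below; without it the
# server registers the model under the --model path.
python -m vllm.entrypoints.openai.api_server \
--model hf_models/korean-3b-instruct \
--served-model-name korean-3b-instruct \
--dtype bfloat16 \
--tensor-parallel-size 1 \
--max-model-len 4096 \
--port 8000
# Test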
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "korean-3b-instruct",
"messages": [{"role": "user", "content": "한국의 수도는?"}],
"temperature": 0.7
}'
```
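The same smoke test can be scripted with the Python standard library, which is handy for repeating it across prompts. This is a sketch: the function names are illustrative, and the `model` value must match the name the server actually registers (by default the `--model` path, unless `--served-model-name` is given).

```python
import json
import urllib.request


def build_chat_request(prompt, model="korean-3b-instruct",
                       base_url="http://localhost:8000", temperature=0.7):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("한국의 수도는?")` against a running server should return the model's answer as a string.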
**FP8 serving (best suited to B200)**:
```bash
python -m vllm.entrypoints.openai.api_server \
--model hf_models/korean-3b-instruct \
--quantization fp8 \
--tensor-parallel-size 1 \
--max-model-len 4096
```
**GGUF conversion (Ollama/local deployment)**:
```bash
bash scripts/convert_to_gguf.sh checkpoints/korean_3b_sft/checkpoint-BEST
# After writing the Ollama Modelfile
ollama create korean-3b -f Modelfile
```
---
## 📋 Per-Phase Summary Table
| Phase | Time | Requirements | Success criteria | On failure |
|-------|----------|----------|----------|---------|
| **0: Preparation** | 30 min | prepare_sft_data.py, sft_dataset.py fixes | 120K+ clean samples, 100-step sanity run passes | Debug the code |
| **1: 1B SFT** | 40 min | 8×B200, clean data, fixed code | Loss<1.90, stable ValLoss | Adjust LR or re-inspect the data |
| **2: 1B evaluation** | 2 hours | lm-eval-harness, evaluation scripts | Repetition <5%, ko_ifeval>25% | Phase 3A |
| **3A: Further improvement** | 3-5 hours | Preference data, ORPO/additional SFT | Repetition <5% reached | Accept 1B limits → 3B |
| **3B: 3B PT** | 26 hours | 150B+ tokens, configs/korean_3b_fp8.yaml | PPL<10, loss falling | Add data or adjust the architecture |
| **4: 3B SFT** | 2 hours | Reuse the clean data from Phase 0 | Loss<1.85, repetition <3% | Adjust LR/epochs |
| **5: Deployment** | 4 hours | HF account, vLLM | ko_ifeval>35%, serving works | Improve the model, then redeploy |
---
## 🔥 The First Command to Run Today
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python data/prepare_sft_data.py --output_dir data/sft_v2/ --val_split 0.1
```
This single command starts the most important task in Phase 0: generating the clean data.
---
## ⚡ The Most Important Decision Points
### Decision 1: after Phase 1 completes (Step 10000)
- **Val Loss at 1.2x Train Loss or more?** → Overfitting. Use the best checkpoint.
- **Train Loss > 2.0?** → Something is wrong. Re-inspect the code and data.
### Decision 2: after the Phase 2 evaluation (the most important!)
- **Repetition <5% AND ko_ifeval >25%?** → ✅ Move to 3B (Phase 3B)
- **Repetition 5-15%?** → ⚠️ Try ORPO (Phase 3A)
- **Repetition >15%?** → ❌ Root-cause analysis. Re-examine the data and code.
### Decision 3: midway through Phase 3B (3B pretrain step 50000)
- **Loss stopped falling?** → Data quality problem. Tighten the filtering.
- **PPL > 15?** → Not enough data. Collect more.
---
## 🛡️ Risk Matrix
| Risk | Probability | Impact | Prevention/response |
|--------|------|------|----------|
| Dynamic padding still not working | 10% | High (3-8x speed wasted) | Check batch lengths during the sanity check |
| Over-filtering the data (under 100K) | 15% | Medium | Relax the filter threshold (80 → 50 characters) |
| Repetition still >15% after 1B retraining | 15% | Medium | ORPO or move to 3B |
| OOM during 3B pretraining | 10% | High | Reduce batch_size, gradient checkpointing |
| cc100 re-collection runs over time | 20% | Low | Start with CulturaX alone (24.8B) |
| Out of disk space | 5% | High | 19TB currently available, which is enough |
---
*"Don't carry technical debt just to save 40 minutes. Invest 3 hours and build a clean foundation."*
*Update this document with the results as each Phase completes.*