# 🗺️ MASTER PLAN: Korean LLM 1B Retraining → 3B → Deployment

**Date written**: 2026-02-27
**Project**: `/PROJECT/0325120031_A/ghong/taketimes/llm-bang/`
**Decision**: Restart (clean retraining from the base checkpoint)
**Total estimated duration**: ~35 hours (1B: 3 hours → 3B pretrain: 26 hours → 3B SFT + evaluation: 6 hours)

---

## Full timeline at a glance

```
Phase 0    30 min      Data/code preparation
Phase 1    40 min      1B SFT retraining
Phase 2    2 hours     1B evaluation
────── decision point ──────
Phase 3A   3-5 hours   (conditional) additional 1B improvements
Phase 3B   26 hours    3B pretraining
Phase 4    2 hours     3B SFT
Phase 5    4 hours     Evaluation & deployment
```

---

## Phase 0: Pre-retraining preparation (today, ~30 min)

### Checklist

#### 0-1. Regenerate the data (~20 min)
```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang

# Re-run prepare_sft_data.py (stricter filters + revised weights)
python data/prepare_sft_data.py \
    --output_dir data/sft_v2/ \
    --val_split 0.1
```

**Items to check**:
- **120K-135K samples** should remain after filtering (low-quality samples removed from the original 159K)
- Confirm removal of 113 literal `</s>` cases, ~550 Q/A marker cases, and 57 self-repetition cases
- Confirm the OpenOrca weight is reduced from 5.0 to 2.0
- Val split: ~12-13K samples (10%)
- Confirm short outputs (<80 characters) are removed

```bash
# Check the results
wc -l data/sft_v2/train.jsonl data/sft_v2/val.jsonl
# Expected: train ~108K-120K, val ~12K-13K
```

**Completion criteria**: train 100K+ samples, val 10K+ samples. A spot check of the removed samples confirms they really are low quality.
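
The removal rules listed above can be spot-checked with a short script. This is a minimal sketch, not the logic of `prepare_sft_data.py`: the `output` field name, the Q/A-marker pattern, and the self-repetition heuristic are all assumptions made for illustration.

```python
import re

MIN_OUTPUT_CHARS = 80  # from the checklist: outputs shorter than 80 characters are dropped
# Hypothetical marker pattern; the real filter may use a different one.
QA_MARKER = re.compile(r"^(Q|A|질문|답변)\s*[:.]", re.MULTILINE)


def rejection_reason(record):
    """Return why a record would be filtered out, or None if it passes."""
    output = record.get("output", "")  # assumed field name
    if "</s>" in output:
        return "literal_eos"
    if QA_MARKER.search(output):
        return "qa_marker"
    if len(output) < MIN_OUTPUT_CHARS:
        return "too_short"
    # Hypothetical self-repetition heuristic: any sentence appearing 3+ times.
    sentences = [s.strip() for s in re.split(r"[.!?\n]", output) if s.strip()]
    if any(sentences.count(s) >= 3 for s in set(sentences)):
        return "self_repetition"
    return None


def summarize(records):
    """Count rejection reasons (or 'kept') over an iterable of records."""
    counts = {}
    for rec in records:
        reason = rejection_reason(rec) or "kept"
        counts[reason] = counts.get(reason, 0) + 1
    return counts
```

Running `summarize` over the original 159K records and comparing the counts with the figures in the checklist (113, ~550, 57) would show whether the filters behave as intended.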

#### 0-2. Verify the sft_dataset.py fixes (~5 min)

Verify the items that have already been fixed:

| Fix | File | Check |
|-----|------|-------|
| Dynamic padding actually works | `data/sft_dataset.py` `__getitem__` | Returns tensors at their actual length, without padding |
| EOS preservation | `data/sft_dataset.py` L130-134 | `response_ids[:allowed-1] + [eos_id]` |
| Collate fn | `data/sft_dataset.py` `dynamic_collate_fn` | Variable padding per batch |

```bash
# Check the key code
grep -n "allowed_response" data/sft_dataset.py
grep -n "eos_token_id" data/sft_dataset.py
grep -n "torch.full" data/sft_dataset.py   # there should be no fixed 4096 padding
```
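
The two behaviors checked above, EOS-preserving truncation and per-batch padding, can be sketched in plain Python. This is a simplified illustration using lists instead of tensors, not the code in `data/sft_dataset.py`. The function names, `pad_id`, and the dict layout are assumptions. `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
def truncate_response(response_ids, allowed, eos_id):
    """Truncate to `allowed` tokens (assumed >= 1) while keeping EOS as the last token."""
    if len(response_ids) <= allowed:
        return list(response_ids)
    # Drop one extra token so the sequence still ends with EOS after truncation.
    return list(response_ids[:allowed - 1]) + [eos_id]


def collate_dynamic(batch, pad_id=0, ignore_index=-100):
    """Pad every sample only up to the longest sequence in this batch."""
    max_len = max(len(sample["input_ids"]) for sample in batch)
    input_ids, labels = [], []
    for sample in batch:
        pad = max_len - len(sample["input_ids"])
        input_ids.append(list(sample["input_ids"]) + [pad_id] * pad)
        # Padded positions are excluded from the loss.
        labels.append(list(sample["labels"]) + [ignore_index] * pad)
    return {"input_ids": input_ids, "labels": labels}
```

With fixed padding every row would be 4096 tokens long. Here the padded length follows the longest sample in each batch, which is where the expected speedup comes from.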

#### 0-3. Modify launch_sft.sh (~5 min)

```bash
# Values to change:
#   RUN_NAME=korean_1b_sft_v2
#   SFT_DATA=data/sft_v2/train.jsonl
#   VAL_DATA=data/sft_v2/val.jsonl
#   MAX_STEPS=10000    (3-4 epochs, up from the previous 5000)
#   WARMUP_STEPS=300   (3%)

cp scripts/launch_sft.sh scripts/launch_sft_v2.sh
# After editing, check the diff
```
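
The step settings above can be cross-checked with a little arithmetic. This is a rough sketch: the plan does not state the global batch size, so the helper only derives the batch size that "3-4 epochs in 10,000 steps" would imply for the expected 108K-120K training samples. The function names are illustrative.

```python
def implied_global_batch(train_samples, epochs, max_steps):
    """Sequences per optimizer step needed to see `epochs` passes in `max_steps`."""
    return train_samples * epochs / max_steps


def warmup_fraction(warmup_steps, max_steps):
    """Share of training spent in learning-rate warmup."""
    return warmup_steps / max_steps


low = implied_global_batch(108_000, 3, 10_000)   # lower bound: 3 epochs of 108K samples
high = implied_global_batch(120_000, 4, 10_000)  # upper bound: 4 epochs of 120K samples
print(f"implied global batch: {low:.1f}-{high:.1f} sequences/step")
print(f"warmup fraction: {warmup_fraction(300, 10_000):.0%}")
```

If the global batch configured in `launch_sft.sh` falls well outside this range, `MAX_STEPS` would need adjusting to keep the run at 3-4 epochs.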

#### 0-4. Sanity Check (~5 min)

```bash
# Run just 100 steps quickly to confirm the pipeline works
bash scripts/launch_sft_v2.sh --max_steps 100

# Check:
# - Does the loss start in the 2.0-2.5 range? ✅
# - Are sequence lengths within a batch variable? (check the log) ✅
# - Is val loss printed? ✅
# - No OOM? ✅
```

**Completion criteria**: 100 steps finish without errors, loss is in a reasonable range, and val loss output is confirmed.

---

## Phase 1: 1B SFT retraining (today, ~40 min)

### Run command

```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang

RUN_NAME=korean_1b_sft_v2 \
BASE_CHECKPOINT=checkpoints/korean_1b_fp8_run1/checkpoint-0034000 \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
WARMUP_STEPS=300 \
LR=2.0e-5 \
bash scripts/launch_sft.sh
```

### Monitoring

**Live log**:
```bash
tail -f checkpoints/korean_1b_sft_v2/train.log
```

**TensorBoard**:
```bash
tensorboard --logdir checkpoints/korean_1b_sft_v2/tensorboard --port 6007
```

**Key metrics**:

| Metric | Normal range | Warning | Stop immediately |
|--------|--------------|---------|------------------|
| Train Loss | Starts at 2.0-2.5, final <1.90 | >2.5 at step 500+ | >3.0 (divergence) |
| Val Loss | 1.0-1.1x train | 1.2x train | Keeps rising relative to train (overfitting) |
| GNorm | 0.8-1.5 | >2.0 | >5.0 (gradient explosion) |
| Training speed | 2x+ vs. previous run (dynamic padding effect) | Similar to previous run | Slower than previous run |

**Checkpoints to watch**:
- Step 500: confirm pipeline stability
- Step 2500: intermediate point, check the loss trend
- Step 5000: compare with the previous run (loss must be < 1.97)
- Step 7500: check for convergence
- Step 10000: final
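
The thresholds in the metric table above can be turned into a small check that runs on each logged step. This is a sketch under assumptions: the function name and return labels are illustrative, and the two rules that need history or a baseline (val loss that keeps rising, training speed) are deliberately left out.

```python
def classify(step, train_loss, val_loss, grad_norm):
    """Return 'stop', 'warn', or 'ok' using the thresholds from the metric table."""
    # Immediate-stop conditions: divergence or gradient explosion.
    if train_loss > 3.0 or grad_norm > 5.0:
        return "stop"
    # Warning: loss still above 2.5 once past step 500.
    if step >= 500 and train_loss > 2.5:
        return "warn"
    # Warning: large gradient norm or val loss at 1.2x train loss or more.
    if grad_norm > 2.0 or val_loss >= 1.2 * train_loss:
        return "warn"
    return "ok"
```

For example, `classify(600, 2.6, 2.7, 1.0)` falls in the warning band because the loss is still above 2.5 after step 500.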

### Success criteria

| Metric | Target | Failure threshold |
|--------|--------|-------------------|
| Final Train Loss | < 1.90 | > 2.00 |
| Final Val Loss | < 2.00 | More than 1.2x train |
| Val Loss trend | Falling or stable | 3 consecutive increases (overfitting) |
| Training time | ~40-60 min | >2 hours (dynamic padding not working) |

### Handling failures

| Situation | Likely cause | Response |
|-----------|--------------|----------|
| Loss diverges (>3.0) | LR too high or a data bug | Retry with LR=1e-5 |
| OOM | Batch size too large | Reduce to BATCH_SIZE=2 |
| Loss plateaus (no change after step 2000) | LR too low or a data problem | Inspect the data, try LR=3e-5 |
| Val Loss diverges (overfitting) | Too many epochs | Early stop at best val checkpoint |
| Training speed same as before | Dynamic padding not working | Re-inspect sft_dataset.py |
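
The "3 consecutive increases" rule and the "early stop at best val checkpoint" response can be expressed as two small helpers. This is a sketch that assumes validation results are available as a list of losses or `(step, val_loss)` pairs. The names are illustrative and not taken from the training scripts.

```python
def rises_n_times(val_losses, n=3):
    """True if each of the last n evaluations increased over the one before it."""
    if len(val_losses) < n + 1:
        return False
    tail = val_losses[-(n + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


def best_checkpoint(history):
    """history: list of (step, val_loss) pairs; returns the step with the lowest val loss."""
    return min(history, key=lambda item: item[1])[0]
```

If `rises_n_times` fires, `best_checkpoint` identifies which saved step to carry into Phase 2 instead of the final one.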

---

## Phase 2: 1B SFT evaluation (~2 hours)

### Evaluation order

#### 2-1. Measure the repetition rate (30 min)

```bash
# Generation test using the correct format (<|user|>/<|assistant|>)
python eval/test_generation_params.py \
    --checkpoint checkpoints/korean_1b_sft_v2/checkpoint-0010000

# Test several rep_penalty values
#   rep_penalty=1.0 (none): target <10%
#   rep_penalty=1.1: target <3%
#   rep_penalty=1.2: target <1%
```
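
The plan does not spell out how the repetition rate is computed, so the following is only one possible definition, offered as a sketch: an output counts as repetitive when a large share of its whitespace-separated 4-grams are duplicates, and the repetition rate is the fraction of outputs flagged that way. The n-gram size and the 0.2 threshold are assumptions, and `eval/test_generation_params.py` may measure something different.

```python
def repeated_ngram_ratio(text, n=4):
    """Share of n-grams in `text` that duplicate another n-gram in the same text."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def repetition_rate(outputs, n=4, threshold=0.2):
    """Fraction of outputs whose repeated n-gram ratio exceeds `threshold`."""
    if not outputs:
        return 0.0
    flagged = sum(repeated_ngram_ratio(text, n) > threshold for text in outputs)
    return flagged / len(outputs)
```

Whatever definition is used, it should stay the same across all `rep_penalty` settings and across the 1B and 3B runs so that the targets remain comparable.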

#### 2-2. Subjective evaluation of generation quality (30 min)

```bash
python eval/generate.py \
    --checkpoint checkpoints/korean_1b_sft_v2/checkpoint-0010000 \
    --prompts_file eval/test_prompts.txt \
    --temperature 0.8 --top_p 0.9
```

**Check**: naturalness of the Korean, instruction following, clean termination with EOS

#### 2-3. Official benchmarks (1 hour)

```bash
# ko_ifeval
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_1b_sft_v2/checkpoint-0010000,dtype=bfloat16 \
    --tasks ko_ifeval \
    --device cuda:0 \
    --output_path eval/results/sft_v2_ko_ifeval.json

# ko_winogrande (optional)
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_1b_sft_v2/checkpoint-0010000,dtype=bfloat16 \
    --tasks ko_winogrande \
    --device cuda:0 \
    --output_path eval/results/sft_v2_ko_winogrande.json
```

### Decision criteria & branching

```
                    [Phase 2 evaluation results]
                                │
          ┌─────────────────────┼─────────────────────┐
          │                     │                     │
       ✅ PASS             ⚠️ PARTIAL              ❌ FAIL
  Repetition <5%       Repetition 5-15%       Repetition >15%
  ko_ifeval >25%       ko_ifeval 15-25%       ko_ifeval <15%
          │                     │                     │
          ▼                     ▼                     ▼
      Phase 3B              Phase 3A          Root-cause analysis
   (move to 3B)          (further work)      (re-review data/code)
```

**Detailed criteria**:

| Metric | ✅ Pass | ⚠️ Further tuning | ❌ Retrain |
|--------|---------|-------------------|------------|
| Repetition rate (no rep_penalty) | <10% | 10-20% | >20% |
| Repetition rate (rep_penalty=1.1) | <5% | 5-15% | >15% |
| ko_ifeval | >25% | 15-25% | <15% |
| Clean EOS termination rate | >85% | 60-85% | <60% |

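The branching above can be written as a small function. This is a sketch: the plan does not say how to treat mixed results (for example a low repetition rate with a mid-range ko_ifeval score), so this version assumes either metric in the failing band means FAIL and anything short of a full pass means PARTIAL. The function name is illustrative.

```python
def decide_branch(rep_rate_pct, ko_ifeval_pct):
    """Map Phase 2 results (percent, repetition measured at rep_penalty=1.1) to a branch."""
    # Either metric in the failing band sends the run to root-cause analysis.
    if rep_rate_pct > 15 or ko_ifeval_pct < 15:
        return "FAIL"
    # Both metrics must clear the pass thresholds to move on to 3B.
    if rep_rate_pct < 5 and ko_ifeval_pct > 25:
        return "PASS"
    # Everything in between goes to Phase 3A.
    return "PARTIAL"


NEXT_STEP = {
    "PASS": "Phase 3B (3B pretraining)",
    "PARTIAL": "Phase 3A (additional 1B improvements)",
    "FAIL": "Root-cause analysis of data/code",
}
```
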
---

## Phase 3A: Additional 1B improvements (conditional, ~3-5 hours)

> **Enter only when the Phase 2 result is ⚠️ PARTIAL**

### Option A: ORPO training (~3 hours)

#### Prepare preference data (1 hour)
```bash
# Download Korean preference data
python -c "
from datasets import load_dataset
# Option 1: ko_Ultrafeedback (60K, general domain)
ds = load_dataset('maywell/ko_Ultrafeedback')
# Option 2: self-generated (produce rejected samples with the current model)
"
```

**Self-generation method**:
1. Generate several completions for the same prompt with the current SFT model
2. Repetitive/low-quality outputs → rejected
3. Reference answers from the clean dataset → chosen
4. Generate ~10K-20K pairs
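
The self-generation steps above can be sketched as a loop. This is a minimal illustration under stated assumptions: `generate` and `is_low_quality` are hypothetical callables standing in for the SFT model's sampling function and a quality check, and the `prompt`/`reference` field names are invented for the example.

```python
def build_preference_pairs(examples, generate, is_low_quality, samples_per_prompt=4):
    """Pair each clean reference answer (chosen) with a low-quality model output (rejected)."""
    pairs = []
    for example in examples:
        for _ in range(samples_per_prompt):
            candidate = generate(example["prompt"])  # step 1: sample from the SFT model
            if is_low_quality(candidate) and candidate != example["reference"]:
                pairs.append({
                    "prompt": example["prompt"],
                    "chosen": example["reference"],   # step 3: clean dataset answer
                    "rejected": candidate,            # step 2: repetitive/low-quality output
                })
                break  # keep at most one pair per prompt
    return pairs
```

Prompts for which the model never produces a low-quality output contribute no pair, so reaching 10K-20K pairs may require more prompts than pairs.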
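
#### ORPO training (1.5 hours)
```python
from trl import ORPOConfig, ORPOTrainer

# `model` (the SFT checkpoint) and `preference_data` (prompt/chosen/rejected
# records) must be loaded before this point.
config = ORPOConfig(
    output_dir="checkpoints/korean_1b_orpo",  # hypothetical output path
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    beta=0.1,  # ORPO coefficient (weight of the odds-ratio term)
)
# The tokenizer must also be passed; the keyword is `tokenizer=` or
# `processing_class=` depending on the TRL version.
trainer = ORPOTrainer(model=model, args=config, train_dataset=preference_data)
trainer.train()
```

#### Evaluation (30 min)
- Re-measure the repetition rate: target <5% (rep_penalty=1.1)
- Re-measure ko_ifeval: target >20%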

### Option B: Additional SFT (data augmentation, ~5 hours)

#### Collect additional data (2 hours)
```python
from datasets import load_dataset

# Add high-quality Korean data
datasets = {
    "hPark/orca-ko": 200_000,                              # high-quality synthetic
    "nayohan/llama3-instruct-ko-dataset": 58_000,          # Llama3 Korean
    "FreedomIntelligence/evol-instruct-korean": 70_000,    # GPT-4 generated
}
# Existing 120K + ~300K additional → ~350K after filtering
```

#### Retraining (2 hours)
```bash
# Retrain on the augmented data
RUN_NAME=korean_1b_sft_v3 \
SFT_DATA=data/sft_v3/train.jsonl \
MAX_STEPS=15000 \
bash scripts/launch_sft.sh
```

### Phase 3A success criteria

| Metric | Target |
|--------|--------|
| Repetition rate (rep_penalty=1.1) | <5% |
| ko_ifeval | >20% |

**On failure**: accept the limits of the 1B model and move straight to Phase 3B (3B).

---

## Phase 3B: 3B pretraining (after passing Phase 2, ~26 hours)

### 3B model architecture

| Parameter | 1B (current) | 3B (target) | Notes |
|-----------|--------------|-------------|-------|
| d_model | 2048 | 2560 | ~1.25x |
| n_layers | 24 | 32 | ~1.33x |
| n_heads | 16 | 32 | 2x |
| n_kv_heads (GQA) | 4 | 8 | 2x |
| d_ffn | 5472 | 6912 | ~1.26x |
| vocab_size | 64000 | 64000 | Same |
| max_seq_len | 4096 | 4096 | Same |
| **Total parameters** | **1.19B** | **~3.0B** | ~2.5x |

### Create the config file

```bash
# Create configs/korean_3b_fp8.yaml
cat > configs/korean_3b_fp8.yaml << 'EOF'
model:
  d_model: 2560
  n_layers: 32
  n_heads: 32
  n_kv_heads: 8
  d_ffn: 6912
  vocab_size: 64000
  max_seq_len: 4096
  rope_theta: 500000

training:
  lr: 3.0e-4
  min_lr: 3.0e-5
  warmup_steps: 2000
  max_steps: 100000
  batch_size: 4
  grad_accum: 4
  weight_decay: 0.1
  use_fp8: true

data:
  sources:
    - cc100_ko
    - culturax_ko
    - existing_pretrain
EOF
```

### Pretraining data

| Source | Tokens | Status |
|--------|--------|--------|
| CulturaX ko | 24.8B | ✅ On hand |
| cc100 ko (re-collected) | ~65-100B | ⚠️ Needs re-collection (noise filtering) |
| Existing pretrain data | ~8.9B | ✅ On hand |
| Additional collection (Namuwiki, news, etc.) | ~20-50B | Optional |
| **Total** | **~120-180B** | Meets the Chinchilla minimum of 60B |

**Data preparation commands**:
```bash
# Re-collect cc100 + quality filtering
python scripts/download_cc100_ko.py --quality_filter --dedup
# MinHash dedup + perplexity filter
python scripts/quality_filter.py --input data/pretrain/ --max_ppl 1000
```

### Run training

```bash
# Start 3B pretraining (8× B200, ~26 hours)
bash scripts/run_pretrain.sh --config configs/korean_3b_fp8.yaml

# Expected throughput: ~1.6M tok/s (8× B200)
# 150B tokens / 1.6M tok/s ≈ 26 hours
```
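
The 26-hour estimate and the step budget in the config can be cross-checked with a few lines of arithmetic. This is a sketch that assumes `batch_size` in the YAML is the per-GPU number of sequences and that every sequence is packed to `max_seq_len`. Neither assumption is stated in the plan. Under those assumptions, `max_steps: 100000` covers roughly 52B tokens, and 150B tokens would need about 286K steps or a larger global batch.

```python
import math


def training_hours(total_tokens, tokens_per_second):
    """Wall-clock hours to process `total_tokens` at a given throughput."""
    return total_tokens / tokens_per_second / 3600


def tokens_per_step(batch_size, grad_accum, n_gpus, seq_len):
    """Tokens consumed per optimizer step (per-GPU batch assumed)."""
    return batch_size * grad_accum * n_gpus * seq_len


def steps_for(total_tokens, per_step):
    """Optimizer steps needed to consume `total_tokens`."""
    return math.ceil(total_tokens / per_step)


per_step = tokens_per_step(batch_size=4, grad_accum=4, n_gpus=8, seq_len=4096)
print(f"hours for 150B tokens: {training_hours(150e9, 1.6e6):.1f}")
print(f"tokens per step: {per_step:,}")
print(f"tokens in 100,000 steps: {per_step * 100_000 / 1e9:.1f}B")
print(f"steps for 150B tokens: {steps_for(150_000_000_000, per_step):,}")
```

If `batch_size` is instead a global value, the token count per step is eight times smaller and the gap to 150B widens accordingly.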

### Monitoring

```bash
# Check the log
tail -f checkpoints/korean_3b_fp8/train.log

# Check base-model quality at intermediate checkpoints (every 10000 steps)
python eval/perplexity.py --checkpoint checkpoints/korean_3b_fp8/checkpoint-0010000
```

**Success criteria**: PPL < 10 (Korean text), loss keeps falling

---

## Phase 4: 3B SFT (~2 hours)

### Apply every lesson learned from 1B

| Lesson | How it is applied |
|--------|-------------------|
| Confirm dynamic padding works | ✅ sft_dataset.py already fixed, used as-is |
| EOS preservation | ✅ Same code |
| Val split is mandatory | ✅ 10% split |
| 3-4 epochs | ✅ MAX_STEPS set by calculation |
| Avoid over-weighting OpenOrca | ✅ 2.0x or less |
| Data quality filtering | ✅ Use the clean data generated in Phase 0 |
| Correct prompt format | ✅ `<\|user\|>/<\|assistant\|>` |

### Run

```bash
RUN_NAME=korean_3b_sft \
BASE_CHECKPOINT=checkpoints/korean_3b_fp8/checkpoint-BEST \
SFT_DATA=data/sft_v2/train.jsonl \
VAL_DATA=data/sft_v2/val.jsonl \
MAX_STEPS=10000 \
LR=2.0e-5 \
WARMUP_STEPS=300 \
bash scripts/launch_sft.sh
```

**Estimated time**: ~2 hours (3B is ~2.5x slower than 1B)

### Success criteria

| Metric | Target |
|--------|--------|
| Train Loss | < 1.85 |
| Val Loss | Within 1.1x of train |
| Repetition rate (no rep_penalty) | < 10% |
| Repetition rate (rep_penalty=1.1) | < 3% |

---

## Phase 5: Evaluation and deployment (~4 hours)

### 5-1. Full benchmarks (~2 hours)

```bash
# ko_ifeval
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
    --tasks ko_ifeval --device cuda:0

# ko_winogrande
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
    --tasks ko_winogrande --device cuda:0

# KoBEST (optional)
lm_eval --model hf \
    --model_args pretrained=checkpoints/korean_3b_sft/checkpoint-BEST,dtype=bfloat16 \
    --tasks kobest_boolq,kobest_copa,kobest_wic,kobest_hellaswag,kobest_sentineg \
    --device cuda:0
```

**3B target numbers**:

| Benchmark | 1B expected | 3B target |
|-----------|-------------|-----------|
| ko_ifeval | 20-30% | **35-45%** |
| ko_winogrande | 53-58% | **60-68%** |
| KoBEST (avg) | 55-60% | **65-75%** |
| Repetition rate | <5% | **<3%** |

### 5-2. Upload to the HuggingFace Hub (~1 hour)

```bash
# Convert to HF format
python scripts/convert_to_hf.py \
    --checkpoint checkpoints/korean_3b_sft/checkpoint-BEST \
    --output_dir hf_models/korean-3b-instruct

# Create the model card
cat > hf_models/korean-3b-instruct/README.md << 'EOF'
---
language: ko
license: apache-2.0
tags:
- korean
- llm
- instruction-tuning
---
# Korean 3B Instruct
...benchmark results, usage, etc...
EOF

# Upload
huggingface-cli upload ghong/korean-3b-instruct hf_models/korean-3b-instruct
```

### 5-3. vLLM serving setup (~1 hour)

```bash
# Start the vLLM server
# --served-model-name makes the server accept the short name used in the request below;
# without it, the model is registered under the --model path.
python -m vllm.entrypoints.openai.api_server \
    --model hf_models/korean-3b-instruct \
    --served-model-name korean-3b-instruct \
    --dtype bfloat16 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --port 8000

# Test
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "korean-3b-instruct",
        "messages": [{"role": "user", "content": "한국의 수도는?"}],
        "temperature": 0.7
    }'
```
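
The same smoke test can be run from Python with only the standard library. This is a sketch that assumes the server from the block above is listening on `localhost:8000`. The helper names are illustrative, and the `model` value must match the name the server registers (by default the `--model` argument).

```python
import json
import urllib.request


def build_chat_request(prompt, model="korean-3b-instruct",
                       base_url="http://localhost:8000", temperature=0.7):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the prompt to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("한국의 수도는?")` while the server is up should return a short Korean answer. A 404 response usually points to a model-name mismatch.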

**FP8 serving (best fit for B200)**:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model hf_models/korean-3b-instruct \
    --quantization fp8 \
    --tensor-parallel-size 1 \
    --max-model-len 4096
```

**GGUF conversion (Ollama/local deployment)**:
```bash
bash scripts/convert_to_gguf.sh checkpoints/korean_3b_sft/checkpoint-BEST
# After writing the Ollama Modelfile
ollama create korean-3b -f Modelfile
```

---

## Summary table by phase

| Phase | Estimated time | Requirements | Success criteria | On failure |
|-------|----------------|--------------|------------------|------------|
| **0: Preparation** | 30 min | prepare_sft_data.py, sft_dataset.py fixes | 120K+ clean samples, 100-step sanity check passes | Debug the code |
| **1: 1B SFT** | 40 min | 8×B200, clean data, fixed code | Loss<1.90, stable ValLoss | Adjust LR or re-inspect the data |
| **2: 1B evaluation** | 2 hours | lm-eval-harness, evaluation scripts | Repetition rate<5%, ko_ifeval>25% | Phase 3A |
| **3A: Further improvement** | 3-5 hours | Preference data, ORPO/additional SFT | Repetition rate<5% achieved | Accept 1B limits → 3B |
| **3B: 3B PT** | 26 hours | 150B+ tokens, configs/korean_3b_fp8.yaml | PPL<10, loss falling | Add data or adjust the architecture |
| **4: 3B SFT** | 2 hours | Reuse the clean data from Phase 0 | Loss<1.85, repetition rate<3% | Adjust LR/epochs |
| **5: Deployment** | 4 hours | HF account, vLLM | ko_ifeval>35%, serving works | Redeploy after improving the model |

---

## 🔥 The first command to run right now

```bash
cd /PROJECT/0325120031_A/ghong/taketimes/llm-bang
python data/prepare_sft_data.py --output_dir data/sft_v2/ --val_split 0.1
```

This single command starts the most important task in Phase 0: generating the clean data.

---

## ⚡ The most important decision points

### Decision 1: after Phase 1 completes (Step 10000)
- **Val Loss at 1.2x Train Loss or more?** → Overfitting. Use the best checkpoint.
- **Train Loss > 2.0?** → Something is wrong. Re-inspect the code/data.

### Decision 2: after the Phase 2 evaluation (most important!)
- **Repetition rate <5% AND ko_ifeval >25%?** → ✅ Move to 3B (Phase 3B)
- **Repetition rate 5-15%?** → ⚠️ Try ORPO (Phase 3A)
- **Repetition rate >15%?** → ❌ Root-cause analysis. Re-review the data/code.

### Decision 3: midway through Phase 3B (3B pretrain step 50000)
- **Loss stopped falling?** → Data quality problem. Tighten the filtering.
- **PPL > 15?** → Not enough data. More collection is needed.

---

## 🛡️ Risk matrix

| Risk | Probability | Impact | Prevention/response |
|------|-------------|--------|---------------------|
| Dynamic padding still not working | 10% | High (3-8x speed wasted) | Check batch lengths in the sanity check |
| Over-filtering the data (under 100K) | 15% | Medium | Relax the filter threshold (80 → 50 characters) |
| Repetition >15% even after 1B retraining | 15% | Medium | ORPO or move to 3B |
| OOM during 3B pretraining | 10% | High | Reduce batch_size, gradient checkpointing |
| cc100 re-collection runs over time | 20% | Low | Start with CulturaX alone (24.8B) |
| Out of disk space | 5% | High | 19TB currently available, sufficient |

---

*"Don't carry technical debt just to save 40 minutes. Invest 3 hours and build a clean foundation."*

*This document is to be updated with the results as each Phase completes.*