# λ‹€μš΄λ‘œλ“œ μš°μ„ μˆœμœ„ κ³„νš
> 생성일: 2026-02-27 | λ””μŠ€ν¬ μ—¬μœ : 19TB

## μ¦‰μ‹œ λ‹€μš΄λ‘œλ“œ Top 5 (μš°μ„ μˆœμœ„μˆœ)

---

### πŸ₯‡ Priority 1: FineWeb-Edu (Korean subset)
- **데이터셋:** `HuggingFaceFW/fineweb-edu`
- **μ™œ:** ꡐ윑 ν’ˆμ§ˆ ν•„ν„°λ§λœ μ›Ή 데이터, κ³ ν’ˆμ§ˆ(AκΈ‰). ν•œκ΅­μ–΄ μ„œλΈŒμ…‹λ§Œ μΆ”μΆœ κ°€λŠ₯
- **μ˜ˆμƒ:** 5~15B tokens (ν•œκ΅­μ–΄ λΆ€λΆ„)
- **μ ‘κ·Ό:** βœ… 무료, gated μ•„λ‹˜
- **μž„νŒ©νŠΈ:** κ³ ν’ˆμ§ˆ pretrain 토큰 λŒ€λŸ‰ 확보 + ꡐ윑 도메인 κ°•ν™”
```bash
# ν•œκ΅­μ–΄ μ„œλΈŒμ…‹ λ‹€μš΄λ‘œλ“œ
pip install datasets
python3 -c "
from datasets import load_dataset
ds = load_dataset('HuggingFaceFW/fineweb-edu', 'CC-MAIN-2024-10', split='train', streaming=True)
# language filter needed - fineweb-edu is primarily English
# Alternative: fineweb-edu-score filtered Korean web data
"
```
> ⚠️ Note: fineweb-edu is mostly English, so the Korean share may be small. It is still valuable as a high-quality English supplement.
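Since fineweb-edu carries no Korean-language label to filter on, one workable sketch is to stream and keep documents by Hangul character ratio. This assumes only that each record has a `text` field; the dump name and the 0.3 threshold are illustrative choices, not tested values.

```python
def hangul_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Hangul syllable block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum("\uac00" <= c <= "\ud7a3" for c in chars) / len(chars)

def stream_korean(threshold: float = 0.3):
    """Yield fineweb-edu documents whose text looks Korean (hypothetical filter)."""
    from datasets import load_dataset  # deferred so the ratio helper is testable offline
    ds = load_dataset("HuggingFaceFW/fineweb-edu", "CC-MAIN-2024-10",
                      split="train", streaming=True)
    for doc in ds:
        if hangul_ratio(doc["text"]) >= threshold:
            yield doc
```

A character-range heuristic like this is crude but dependency-free; a fastText language-ID model would be the heavier alternative.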

---

### πŸ₯ˆ Priority 2: Korean Preference/DPO 데이터 (λ‹€μˆ˜ μ†ŒμŠ€)
- **데이터셋듀:**
  - `kuotient/orca-math-korean-preference` βœ…
  - `kuotient/orca-math-korean-dpo-pairs` βœ…  
  - `heegyu/orca-math-korean-preference-cleaned` βœ…
  - `ohsuz/dpo-v1010-korean` βœ…
  - `ChuGyouk/argilla-distilabel-math-preference-dpo-korean` βœ…
- **μ™œ:** Preference 데이터 **0건**인 ν˜„μž¬ μƒνƒœμ—μ„œ ORPO ν•™μŠ΅ 자체 λΆˆκ°€ β†’ κ°€μž₯ μ‹œκΈ‰
- **μ˜ˆμƒ:** 합계 30~60K 쌍
- **μ ‘κ·Ό:** βœ… λͺ¨λ‘ 무료
- **μž„νŒ©νŠΈ:** ORPO/DPO ν•™μŠ΅ νŒŒμ΄ν”„λΌμΈ ν™œμ„±ν™”
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import json, os

out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/preference"
os.makedirs(out_dir, exist_ok=True)

datasets_to_dl = [
    ("kuotient/orca-math-korean-preference", None),
    ("kuotient/orca-math-korean-dpo-pairs", None),
    ("heegyu/orca-math-korean-preference-cleaned", None),
    ("ohsuz/dpo-v1010-korean", None),
    ("ChuGyouk/argilla-distilabel-math-preference-dpo-korean", None),
]

for name, config in datasets_to_dl:
    try:
        ds = load_dataset(name, config, split="train")
        safe_name = name.replace("/", "_")
        ds.to_json(f"{out_dir}/{safe_name}.jsonl")
        print(f"βœ… {name}: {len(ds)} samples")
    except Exception as e:
        print(f"❌ {name}: {e}")
PYEOF
```
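Before ORPO/DPO training, the downloaded sets must be reshaped into a single (prompt, chosen, rejected) schema, since their column names differ. A minimal sketch of such a normalizer; the candidate field names below are common conventions, not verified against each dataset's actual columns.

```python
# Common column-name conventions seen in preference datasets (assumed, not verified).
PROMPT_KEYS = ("prompt", "question", "instruction")
CHOSEN_KEYS = ("chosen", "chosen_response")
REJECTED_KEYS = ("rejected", "rejected_response")

def first_key(record: dict, keys: tuple):
    """Return the value of the first candidate key present in the record."""
    for k in keys:
        if k in record:
            return record[k]
    raise KeyError(f"none of {keys} found in record")

def normalize(record: dict) -> dict:
    """Map one raw preference record to the (prompt, chosen, rejected) triple."""
    return {
        "prompt": first_key(record, PROMPT_KEYS),
        "chosen": first_key(record, CHOSEN_KEYS),
        "rejected": first_key(record, REJECTED_KEYS),
    }
```

Records that raise `KeyError` should be logged and dropped rather than silently skipped, so schema drift in any one source is caught early.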

---

### πŸ₯‰ Priority 3: RedPajama-Data-1T (μ˜μ–΄ κ³ ν’ˆμ§ˆ μ„œλΈŒμ…‹)
- **데이터셋:** `togethercomputer/RedPajama-Data-1T`
- **μ™œ:** μ˜μ–΄ 데이터 극히 λΆ€μ‘± (0.6B). μ½”λ“œ/ArXiv/Book/StackExchange μ„œλΈŒμ…‹ 선별 λ‹€μš΄λ‘œλ“œ
- **μ˜ˆμƒ:** 선별 10~20B tokens (μ½”λ“œ 5B + ArXiv 3B + Book 2B + SE 2B)
- **μ ‘κ·Ό:** βœ… 무료
- **μž„νŒ©νŠΈ:** μ½”λ“œ/κ³Όν•™/μΆ”λ‘  λŠ₯λ ₯ + cross-lingual transfer λŒ€ν­ κ°•ν™”
```bash
python3 << 'PYEOF'
from datasets import load_dataset

# μ½”λ“œ μ„œλΈŒμ…‹λ§Œ λ¨Όμ € (github subset)
ds = load_dataset("togethercomputer/RedPajama-Data-1T", "github", 
                   split="train", streaming=True,
                   cache_dir="/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama")
# ArXiv subset
ds_arxiv = load_dataset("togethercomputer/RedPajama-Data-1T", "arxiv",
                         split="train", streaming=True,
                         cache_dir="/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/redpajama")
PYEOF
```

---

### 4️⃣ Priority 4: Broaden Korean SFT diversity
- **Datasets:**
  - `kyujinpy/KOR-OpenOrca-Platypus-v3` ✅ (reasoning/math)
  - `maywell/ko_wikidata_QA` ✅ (knowledge QA)
  - `nlpai-lab/kullm-v2` ✅ (general instructions)
- **Why:** The current 170K SFT samples are sufficient in volume but thin in the code/math/reasoning domains
- **Expected:** +50~100K samples across diverse domains
- **Access:** ✅ all free
```bash
python3 << 'PYEOF'
from datasets import load_dataset
import os

out_dir = "/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/sft_extra"
os.makedirs(out_dir, exist_ok=True)

for name in ["kyujinpy/KOR-OpenOrca-Platypus-v3", "maywell/ko_wikidata_QA", "nlpai-lab/kullm-v2"]:
    try:
        ds = load_dataset(name, split="train")
        safe = name.replace("/","_")
        ds.to_json(f"{out_dir}/{safe}.jsonl")
        print(f"βœ… {name}: {len(ds)}")
    except Exception as e:
        print(f"❌ {name}: {e}")
PYEOF
```

---

### 5️⃣ Priority 5: Open-Web-Math (math-specialized)
- **Dataset:** `open-web-math/open-web-math`
- **Why:** No math data at all yet, and math ability is a core area of LLM benchmarks
- **Expected:** ~14B tokens (English math)
- **Access:** ✅ free
- **Impact:** Establishes a foundation for mathematical reasoning
```bash
python3 << 'PYEOF'
import json
from datasets import load_dataset

ds = load_dataset('open-web-math/open-web-math', split='train', streaming=True,
                  cache_dir='/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/open-web-math')
# Stream and save: write documents out incrementally instead of
# materializing the full ~14B-token dataset in memory.
with open('/PROJECT/0325120031_A/ghong/taketimes/llm-bang/data/open-web-math/open-web-math.jsonl', 'w') as f:
    for doc in ds:
        f.write(json.dumps({'text': doc['text']}, ensure_ascii=False) + '\n')
PYEOF
```

---

## λ‹€μš΄λ‘œλ“œ ν›„ μ˜ˆμƒ 토큰 뢄포

| μΉ΄ν…Œκ³ λ¦¬ | ν˜„μž¬ | μΆ”κ°€ | 합계 |
|---------|------|------|------|
| ν•œκ΅­μ–΄ Pretrain | 39B | +5~10B (fineweb-edu ko) | 44~49B |
| μ˜μ–΄ μ½”λ“œ | 0 | +5B (RedPajama github) | 5B |
| μ˜μ–΄ κ³Όν•™/ArXiv | 0 | +3B (RedPajama arxiv) | 3B |
| μ˜μ–΄ μˆ˜ν•™ | 0 | +10B (open-web-math) | 10B |
| μ˜μ–΄ 기타 κ³ ν’ˆμ§ˆ | 0.6B | +5B (RedPajama book+SE) | 5.6B |
| **Pretrain 합계** | **~39B** | **+28~33B** | **~67~72B** |
| SFT | 170K | +50~100K | 220~270K |
| Preference | 0 | +30~60K 쌍 | 30~60K 쌍 |

### λͺ©ν‘œ 달성 μ—¬λΆ€
- βœ… Chinchilla minimum (60B) 달성 κ°€λŠ₯
- βœ… ORPO/DPO ν•™μŠ΅ κ°€λŠ₯
- βœ… μ½”λ“œ/μˆ˜ν•™/κ³Όν•™ 도메인 컀버
- 🟑 Chinchilla optimal (210B)μ—λŠ” μ—¬μ „νžˆ λΆ€μ‘± β†’ μΆ”ν›„ CulturaX 전체, SlimPajama λ“± μΆ”κ°€ κ²€ν† 

---

## Recommended Data Mix Ratios (for training)

```
Korean text:          50% (~35B tokens)
English code:         15% (~10B tokens)
English math/science: 15% (~10B tokens)
English general:      15% (~10B tokens)
Korean education:      5% (~3B tokens)
```
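The mix above has to be realized as per-source sampling weights in the data loader. A minimal sketch of that conversion; the token counts are rough midpoints of the table's estimates and the source names are illustrative, not the project's actual loader config.

```python
# Available tokens per source in billions: rough midpoints of the
# post-download estimates above (illustrative, not measured).
tokens_b = {
    "korean_text": 44.0,
    "english_code": 5.0,
    "english_math_sci": 13.0,
    "english_general": 5.6,
    "korean_edu": 3.0,
}
# Target shares from the recommended mix.
target_ratio = {
    "korean_text": 0.50,
    "english_code": 0.15,
    "english_math_sci": 0.15,
    "english_general": 0.15,
    "korean_edu": 0.05,
}

def sampling_weights(tokens: dict, ratios: dict) -> dict:
    """Weight per source so the sampled token stream matches the target ratios.

    weight = target share / natural share; > 1 means the source is
    oversampled (repeated within an epoch), < 1 means downsampled.
    """
    total = sum(tokens.values())
    return {k: ratios[k] / (tokens[k] / total) for k in tokens}

weights = sampling_weights(tokens_b, target_ratio)
```

Weights well above ~3 signal that a small source would repeat many times per epoch, which is worth flagging before training.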

## μ£Όμ˜μ‚¬ν•­
1. CulturaXλŠ” gated(auto) β†’ HuggingFaceμ—μ„œ λ™μ˜ ν•„μš” (이미 λ‹€μš΄λ°›μ€ 60GB ν™œμš©)
2. the-stack-dedup도 gated β†’ 승인 ν•„μš”, RedPajama github둜 λŒ€μ²΄
3. λ‹€μš΄λ‘œλ“œ μ „ `huggingface-cli login --token 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` μ‹€ν–‰
4. λŒ€μš©λŸ‰ λ‹€μš΄λ‘œλ“œ μ‹œ `HF_HUB_ENABLE_HF_TRANSFER=1` ν™˜κ²½λ³€μˆ˜ μ„€μ • ꢌμž₯