mjkmain committed on
Commit
26b0cf9
·
verified ·
1 Parent(s): a859fa0

Update README.md

Files changed (1)
  1. README.md +19 -57
README.md CHANGED
@@ -7,7 +7,7 @@ license: apache-2.0
 </p> -->
 
 <p align="center">
- <img src="https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.svg?raw=true" style="width: 40%; max-width: 1100px;">
 </p>
 
 
@@ -25,21 +25,27 @@ The model, training code, and training data are all **fully open**, allowing any
 - 🧪 **License**: Apache 2.0 (commercial use permitted)
 
 ```bash
- KORMo is the first fully open-source LLM from a non-English-speaking region, built with a public-interest mission.
- We want to create an environment where anyone can build and advance a world-class language model on their own.
- KORMo's key features are:
 
- 1. A 10B-class Korean–English reasoning language model trained from scratch.
- 2. The training data, code, all intermediate models, and tutorials are 100% open, so anyone can reproduce and extend a near-SOTA model on their own.
- 3. We release a total of 3.7T tokens of training data, including never-before-released, very high-quality full-cycle Korean data (pretraining, post-training, general, reasoning, reinforcement learning, and more).
- 4. All of this work was carried out by eight undergraduate and master's students in the MLP Lab at the KAIST Graduate School of Culture Technology, and is documented in a 45-page paper.
 
- If you have used Korean models before, you have probably seen benchmark scores that look good while real-world behavior feels off, or a model that breaks as soon as you fine-tune it. Frustrating, right?
 
- KORMo tackles these problems head-on.
- Because all intermediate models and post-training data are released together, users can add their own data on top of the base model and run reinforcement learning or fine-tuning in whatever direction they want.
- 👉 "If you want a good Korean model, now build one yourself. It can even be fine-tuned on a free Colab GPU! 🤗"
 ```
 
 ---
@@ -180,50 +186,6 @@ chat_prompt = tokenizer.apply_chat_template(
 ```
 ---
 
- ## 🪄 Using Specific Revisions (Training Checkpoints)
- 
- KORMo provides multiple model revisions corresponding to different training stages and checkpoints.
- You can load a specific revision with the `revision` parameter in `from_pretrained`.
- 
- ### 📝 Stage 1 Model (sft-stage1)
- 
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
- 
- model_name = "KORMo-Team/KORMo-10B-sft"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(
-     model_name,
-     revision="sft-stage1",  # Load Stage 1 checkpoint
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True
- )
- ```
- 
- ### 🚀 Main Model (Final Checkpoint: sft-stage2-ckpt2)
- 
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- import torch
- 
- model_name = "KORMo-Team/KORMo-10B-sft"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(
-     model_name,
-     revision="sft-stage2-ckpt2",  # Load Final Main Checkpoint
-     torch_dtype=torch.bfloat16,
-     device_map="auto",
-     trust_remote_code=True
- )
- ```
- 
- > 💡 **Tip**:
- > - Use `sft-stage1` for ablation studies or comparison experiments.
- > - Use `sft-stage2-ckpt2` as the **main production model**.
- 
- ---
 
 
 ## Contact
 
 </p> -->
 
 <p align="center">
+ <img src="https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.svg?raw=true" style="width: 100%; max-width: 1100px;">
 </p>
 
 
 - 🧪 **License**: Apache 2.0 (commercial use permitted)
 
 ```bash
+ KORMo: The First Fully Open-Source LLM from a Non-English Region
+ 
+ KORMo was created with a public-interest mission: to make world-class language models accessible to everyone.
+ Our goal is to empower anyone to build and advance their own large language models at a global standard.
+ 
+ Key Features:
+ 
+ 1. A 10B-parameter Korean–English reasoning model trained entirely from scratch.
+ 
+ 2. 100% open resources, including all training data, code, intermediate checkpoints, and tutorials, allowing anyone to reproduce and extend a near-SOTA model on their own.
+ 
+ 3. 3 trillion tokens of training data released publicly, featuring never-before-shared, high-quality full-cycle Korean datasets (for pretraining, post-training, general, reasoning, and reinforcement learning).
+ 
+ 4. A collaborative effort by eight undergraduate and master's students at the KAIST Graduate School of Culture Technology (MLP Lab), documented in a 45-page research paper.
+ 
+ If you've ever used a Korean language model that performs well on benchmarks but feels strange in real use, or if fine-tuning only made it worse, you're not alone.
+ 
+ KORMo solves these problems head-on.
+ By releasing every intermediate model and post-training dataset, we give users the freedom to build on the base model with their own data, customizing and fine-tuning it in any direction they want.
+ 
+ 👉 "If you want a great Korean language model, now you can build it yourself. It even works with free Colab GPUs!" 🤗
 ```
 
 ---
 
 ```
 ---
 
 
 
 ## Contact
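
The hunk header `@@ -180,50 +186,6 @@` above shows that the README's inference example builds its prompt with `tokenizer.apply_chat_template(...)`. As a rough, self-contained illustration of the message structure that call consumes, here is a sketch using a hypothetical stand-in renderer; the real KORMo chat template ships with the tokenizer config and may format turns differently:

```python
# Sketch only: render_chat is a hypothetical stand-in for the template that
# tokenizer.apply_chat_template would apply. It shows the {role, content}
# message list shape and a ChatML-style rendering; KORMo's actual markers
# are defined by its tokenizer and are not assumed here to match.
def render_chat(messages, add_generation_prompt=True):
    """Render a list of {role, content} dicts into a single prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Introduce KORMo in one sentence."},
]

chat_prompt = render_chat(messages)
print(chat_prompt)
```

With a real checkpoint, the equivalent call would be `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which applies the template bundled with the model instead of this stand-in.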