Update README.md
README.md
CHANGED
@@ -7,7 +7,7 @@ license: apache-2.0
 </p> -->
 
 <p align="center">
-<img src="https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.svg?raw=true" style="width:
+<img src="https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.svg?raw=true" style="width: 100%; max-width: 1100px;">
 </p>
 
@@ -25,21 +25,27 @@ The model, training code, and training data are all **fully open**, allowing any
 - 🧪 **License**: Apache 2.0 (commercial use permitted)
 
 ```bash
-KORMo
-We want to create an environment in which anyone can build and advance world-class language models themselves.
-KORMo's key features are as follows:
-
-1. …
-2. …
-3. We release training data totaling 3.7T tokens. In particular, we provide ultra-high-quality, full-cycle Korean data (pretraining, post-training, general, reasoning, reinforcement learning, etc.) that has never been released before.
-4. All of this work was done collaboratively by eight undergraduate and master's students in the MLP Lab at the KAIST Graduate School of Culture Technology, and is written up in a paper running to 45 pages.
-
-You have probably had the experience of a model breaking after just one fine-tune. Frustrating, right?
+KORMo: The First Fully Open-Source LLM from a Non-English Region
+
+KORMo was created with a public-interest mission: to make world-class language models accessible to everyone.
+Our goal is to empower anyone to build and advance their own large language models at a global standard.
+
+Key Features:
+
+1. A 10B-parameter Korean-English reasoning model trained entirely from scratch.
+
+2. 100% open resources, including all training data, code, intermediate checkpoints, and tutorials, allowing anyone to reproduce and extend a near-SOTA model on their own.
+
+3. 3 trillion tokens of training data released publicly, featuring never-before-shared, high-quality, full-cycle Korean datasets (for pretraining, post-training, general, reasoning, and reinforcement learning).
+
+4. A collaborative effort by eight undergraduate and master's students at the KAIST Graduate School of Culture Technology (MLP Lab), documented in a 45-page research paper.
+
+If you've ever used a Korean language model that performs well on benchmarks but feels strange in real use, or if fine-tuning only made it worse, you're not alone.
+
+KORMo solves these problems head-on.
+By releasing every intermediate model and post-training dataset, we give users the freedom to build on the base model with their own data, customizing and fine-tuning it in any direction they want.
+
+"If you want a great Korean language model, now you can build it yourself. It even works with free Colab GPUs!"
 ```
 
 ---
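
The announcement above says users can build on the base model with their own data, even on free Colab GPUs. As a rough sketch of what that could look like in practice, the snippet below loads the released SFT checkpoint in 4-bit and attaches LoRA adapters so that only a small set of weights is trained. This is not from the KORMo repository: the quantization settings, LoRA hyperparameters, and `target_modules` names are assumptions that would need to be checked against the actual model architecture.

```python
# Hypothetical sketch: parameter-efficient fine-tuning of the released
# checkpoint on a small GPU. The library calls are standard
# transformers / peft / bitsandbytes APIs; the LoRA settings are guesses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "KORMo-Team/KORMo-10B-sft"

# Load weights in 4-bit so a ~10B model fits in a free-tier GPU's memory
# (assumption: bitsandbytes quantization works with this custom architecture).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach small trainable LoRA adapters instead of updating all 10B weights.
# target_modules is a guess based on common Llama-style projection names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From there, any standard `transformers` training loop over your own dataset applies; with 4-bit base weights plus LoRA, optimizer state covers only the adapters, which is roughly what makes a 10B model workable on a 16 GB GPU.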
@@ -180,50 +186,6 @@ chat_prompt = tokenizer.apply_chat_template(
 ```
 ---
 
-## 🪜 Using Specific Revisions (Training Checkpoints)
-
-KORMo provides multiple model revisions corresponding to different training stages and checkpoints.
-You can load a specific revision with the `revision` parameter in `from_pretrained`.
-
-### Stage 1 Model (sft-stage1)
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-
-model_name = "KORMo-Team/KORMo-10B-sft"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    revision="sft-stage1",  # load the Stage 1 checkpoint
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
-)
-```
-
-### Main Model (Final Checkpoint: sft-stage2-ckpt2)
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-
-model_name = "KORMo-Team/KORMo-10B-sft"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    revision="sft-stage2-ckpt2",  # load the final main checkpoint
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
-)
-```
-
-> 💡 **Tip**:
-> - Use `sft-stage1` for ablation studies or comparison experiments.
-> - Use `sft-stage2-ckpt2` as the **main production model**.
-
----
 
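The section above pins checkpoints by name with the `revision` parameter. To see which revisions actually exist in the repository before pinning one, you can enumerate its refs. A minimal sketch: `list_repo_refs` is a real `huggingface_hub` API, but whether these checkpoints are published as branches or as tags is an assumption.

```python
# Sketch: list the branches/tags of the model repo to see which values
# can be passed as `revision` to from_pretrained.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("KORMo-Team/KORMo-10B-sft")
print([b.name for b in refs.branches])  # expected to include the checkpoint names above
print([t.name for t in refs.tags])
```
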
## Contact