Update README.md
README.md
CHANGED
@@ -7,7 +7,7 @@ license: apache-2.0
 </p> -->
 
 <p align="center">
-<img src="https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.svg?raw=true" style="width:
+<img src="https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.svg?raw=true" style="width: 100%; max-width: 1100px;">
 </p>
 
@@ -25,21 +25,27 @@ The model, training code, and training data are all **fully open**, allowing any
 - 🧪 **License**: Apache 2.0 (commercial use permitted)
 
 ```bash
-KORMo
-We want to create an environment in which anyone can build and advance world-class language models themselves.
-KORMo's key features are as follows:
-
-1. …
-2. …
-3. We release training data totaling 3.7T tokens. In particular, we provide ultra-high-quality, full-cycle Korean data (pretraining, post-training, general, reasoning, reinforcement learning, etc.) that has never been released before.
-4. All of this work was done collaboratively by eight undergraduate and master's students in the MLP Lab at the KAIST Graduate School of Culture Technology, and is written up in a paper running to 45 pages.
-
-You have probably had the experience of a model breaking after just one fine-tune. Frustrating, right?
+KORMo: The First Fully Open-Source LLM from a Non-English Region
+
+KORMo was created with a public-interest mission: to make world-class language models accessible to everyone.
+Our goal is to empower anyone to build and advance their own large language models at a global standard.
+
+Key Features:
+
+1. A 10B-parameter Korean-English reasoning model trained entirely from scratch.
+
+2. 100% open resources, including all training data, code, intermediate checkpoints, and tutorials, allowing anyone to reproduce and extend a near-SOTA model on their own.
+
+3. 3 trillion tokens of training data released publicly, featuring never-before-shared, high-quality, full-cycle Korean datasets (for pretraining, post-training, general, reasoning, and reinforcement learning).
+
+4. A collaborative effort by eight undergraduate and master's students at the KAIST Graduate School of Culture Technology (MLP Lab), documented in a 45-page research paper.
+
+If you've ever used a Korean language model that performs well on benchmarks but feels strange in real use, or if fine-tuning only made it worse, you're not alone.
+
+KORMo solves these problems head-on.
+By releasing every intermediate model and post-training dataset, we give users the freedom to build on the base model with their own data, customizing and fine-tuning it in any direction they want.
+
+"If you want a great Korean language model, now you can build it yourself. It even works with free Colab GPUs!"
 ```
 
 ---
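
The announcement above says users can build on the base model with their own data, even on free Colab GPUs. As a rough sketch of what that could look like in practice, the snippet below loads the released SFT checkpoint in 4-bit and attaches LoRA adapters so that only a small set of weights is trained. This is not from the KORMo repository: the quantization settings, LoRA hyperparameters, and `target_modules` names are assumptions that would need to be checked against the actual model architecture.

```python
# Hypothetical sketch: parameter-efficient fine-tuning of the released
# checkpoint on a small GPU. The library calls are standard
# transformers / peft / bitsandbytes APIs; the LoRA settings are guesses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "KORMo-Team/KORMo-10B-sft"

# Load weights in 4-bit so a ~10B model fits in a free-tier GPU's memory
# (assumption: bitsandbytes quantization works with this custom architecture).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach small trainable LoRA adapters instead of updating all 10B weights.
# target_modules is a guess based on common Llama-style projection names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

From there, any standard `transformers` training loop over your own dataset applies; with 4-bit base weights plus LoRA, optimizer state covers only the adapters, which is roughly what makes a 10B model workable on a 16 GB GPU.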
@@ -180,50 +186,6 @@ chat_prompt = tokenizer.apply_chat_template(
 ```
 ---
 
-## 🪜 Using Specific Revisions (Training Checkpoints)
-
-KORMo provides multiple model revisions corresponding to different training stages and checkpoints.
-You can load a specific revision with the `revision` parameter in `from_pretrained`.
-
-### Stage 1 Model (sft-stage1)
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-
-model_name = "KORMo-Team/KORMo-10B-sft"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    revision="sft-stage1",  # load the Stage 1 checkpoint
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
-)
-```
-
-### Main Model (Final Checkpoint: sft-stage2-ckpt2)
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-
-model_name = "KORMo-Team/KORMo-10B-sft"
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    revision="sft-stage2-ckpt2",  # load the final main checkpoint
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
-)
-```
-
-> 💡 **Tip**:
-> - Use `sft-stage1` for ablation studies or comparison experiments.
-> - Use `sft-stage2-ckpt2` as the **main production model**.
-
----
 
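The section above pins checkpoints by name with the `revision` parameter. To see which revisions actually exist in the repository before pinning one, you can enumerate its refs. A minimal sketch: `list_repo_refs` is a real `huggingface_hub` API, but whether these checkpoints are published as branches or as tags is an assumption.

```python
# Sketch: list the branches/tags of the model repo to see which values
# can be passed as `revision` to from_pretrained.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("KORMo-Team/KORMo-10B-sft")
print([b.name for b in refs.branches])  # expected to include the checkpoint names above
print([t.name for t in refs.tags])
```
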
## Contact