gmkim committed (verified)
Commit 14df015 · Parent: 96b1664

Update README.md

Files changed (1): README.md (+16 -8)
README.md CHANGED
@@ -1,20 +1,28 @@
 ---
 license: apache-2.0
 ---
-# Continual Post-training method from state of the art (SOTA) LLMs for MATH.
+# Continual Post-Training of LLMs via Offline GRPO for Mathematical Reasoning
 
-## Affliation
-**KRAFTON** & **SKT**
+## Affiliation
+
+KRAFTON & SKT
+
+## Overview
+
+In this post, we explore a new approach to enhancing the reasoning capabilities of LLMs through continual post-training. While pre-training equips LLMs with broad linguistic knowledge, it often falls short in complex reasoning tasks like math or code. Recent models have shown that Reinforcement Learning with Verifiable Rewards (RLVR) can help bridge this gap, but existing methods rely on slow and limited online training. We propose an offline alternative using teacher-generated trajectories and introduce a novel variant of Group Relative Policy Optimization (GRPO) that better captures high-quality reasoning traces—even when all outputs are positive. Our experiments on mathematical reasoning show that this method leads to consistent improvements.
+
+For more details, please refer to our blog
+- [English Version](https://krafton-ai.github.io/blog/llm_post_training_en/)
+- [Korean Version](https://krafton-ai.github.io/blog/llm_post_training_kr/)
 
-## Summary
-In this post, we explore a new approach to enhancing the reasoning capabilities of LLMs through continual post-training. While pre-training equips LLMs with broad linguistic knowledge, it often falls short in complex reasoning tasks like math or code. Recent models have shown that Reinforcement Learning with Verifiable Rewards (RLVR) can help bridge this gap, but existing methods rely on slow and limited on-policy training. We propose an off-policy alternative using teacher-generated trajectories and introduce a novel variant of Group Relative Policy Optimization (GRPO) that better captures high-quality reasoning traces—even when all outputs are positive. Our experiments on mathematical reasoning show that this method leads to consistent improvements.
 
 ## Results
+
 | Model | Method | AIME25 | AMC23 | LiveCodeBench | GPQA-Diamond | IFEval |
 |--------------------------------|--------------------------------|--------|-------|---------------|--------------|--------|
 | Openthinker3-7B | Base | 57.2915 | 92.617 | 63.968 | 50.947 | 50.09 |
-| | Off-policy GRPO (+bias) | 59.5315 | 93.516 | 64.995 | 49.684 | 51.66 |
+| | Offline GRPO (+bias) | 59.5315 | 93.516 | 64.995 | 49.684 | 51.66 |
 | Openthinker2-7B | Base | 39.792 | 88.633 | 56.115 | 45.833 | 53.3 |
-| | Off-policy GRPO (+bias) | 40.3645 | 87.656 | 55.944 | 46.843 | 52.20 |
+| | Offline GRPO (+bias) | 40.3645 | 87.656 | 55.944 | 46.843 | 52.20 |
 | AceReason-Nemetron-1.1-7B | Base | 64.635 | 92.93 | 72.383 | 52.462 | 36.02 |
-| | Off-policy GRPO (+bias) | 65.521 | 93.164 | 72.603 | 54.356 | 38.23 |
+| | Offline GRPO (+bias) | 65.521 | 93.164 | 72.603 | 54.356 | 38.23 |
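
The Overview above describes a GRPO variant that still extracts a learning signal when every output in a group is positive, which is exactly where standard group-relative advantages collapse to zero. Below is a minimal sketch of that idea under stated assumptions: the function name `grpo_advantages` and the additive-bias formulation are illustrative, not the authors' released implementation; the linked blog posts describe the actual method.

```python
import numpy as np

def grpo_advantages(rewards, bias=0.1, eps=1e-6):
    """Group-relative advantages with an additive bias (illustrative sketch).

    Standard GRPO normalizes rewards within a sampled group:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    If every trajectory in the group earns the same reward (e.g. all
    verified correct), the normalized advantage is zero and no gradient
    flows. Adding a small positive bias to positively rewarded outputs
    keeps such all-positive groups contributing a learning signal.
    The bias form used here is an assumption, not the paper's formula.
    """
    r = np.asarray(rewards, dtype=np.float64)
    centered = (r - r.mean()) / (r.std() + eps)
    return centered + bias * (r > 0)

# All-correct group: plain GRPO would yield zero everywhere; the bias does not.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0.1 0.1 0.1 0.1]

# Mixed group: the relative ranking dominates; the bias nudges correct outputs.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # -> approx. [1.1 -1.0 1.1 -1.0]
```

The first example shows the motivation: with plain GRPO normalization, an all-correct group of teacher trajectories would contribute zero gradient, so an off-the-shelf objective would ignore precisely the high-quality traces the offline setup is built around.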