---
pipeline_tag: text-generation
license: other
language:
- en
- zh
tags:
- math
base_model: internlm/internlm2-math-plus-1_8b
---

# InternLM-Math-Plus-GGUF
This is a quantized version of [internlm/internlm2-math-plus-1_8b]() created using llama.cpp.

# Model Description
<div align="center">

<img src="https://raw.githubusercontent.com/InternLM/InternLM/main/assets/logo.svg" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM-Math</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">Plus</font></i>
</a>
</sup>
<div> </div>
</div>

State-of-the-art bilingual open-sourced Math reasoning LLMs.
A **solver**, **prover**, **verifier**, **augmentor**.
</div>


# News
- [2024.05.24] We release an updated version, InternLM2-Math-Plus, in four sizes (1.8B, 7B, 20B, and 8x22B) with state-of-the-art performance. It significantly improves informal math reasoning (chain-of-thought and code interpreter) and formal math reasoning (LEAN 4 translation and LEAN 4 theorem proving).
- [2024.02.10] We add tech reports and a citation reference.
- [2024.01.31] We add MiniF2F results with evaluation code!
- [2024.01.29] We add checkpoints from ModelScope and update the results for majority voting and the code interpreter. The tech report is on the way!
- [2024.01.26] We add checkpoints from OpenXLab, which makes downloading easier for users in China!

# Performance

## Formal Math Reasoning
We evaluate InternLM2-Math-Plus on the formal math reasoning benchmark MiniF2F-test. The evaluation setting is the same as Llemma's, using LEAN 4.
| Model                            | MiniF2F-test |
| -------------------------------- | ------------ |
| ReProver                         | 26.5         |
| LLMStep                          | 27.9         |
| GPT-F                            | 36.6         |
| HTPS                             | 41.0         |
| Llemma-7B                        | 26.2         |
| Llemma-34B                       | 25.8         |
| InternLM2-Math-7B-Base           | 30.3         |
| InternLM2-Math-20B-Base          | 29.5         |
| InternLM2-Math-Plus-1.8B         | 38.9         |
| InternLM2-Math-Plus-7B           | **43.4**     |
| InternLM2-Math-Plus-20B          | 42.6         |
| InternLM2-Math-Plus-Mixtral8x22B | 37.3         |

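For readers unfamiliar with formal theorem proving, the snippet below is a minimal sketch of the kind of object a LEAN 4 prover has to produce: a theorem statement followed by a proof that the Lean kernel can check. The theorem is a hypothetical toy example and is not taken from MiniF2F; its name `toy_add_comm` is made up for illustration.

```lean
-- Toy statement (hypothetical, not a MiniF2F problem):
-- addition of natural numbers is commutative.
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  -- Close the goal with the core library lemma `Nat.add_comm`.
  exact Nat.add_comm a b
```

A benchmark such as MiniF2F-test supplies the statements; a problem counts as solved only when the generated proof is accepted by Lean.
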
## Informal Math Reasoning
We evaluate InternLM2-Math-Plus on the informal math reasoning benchmarks MATH and GSM8K. InternLM2-Math-Plus-1.8B outperforms MiniCPM-2B in the smallest size setting. InternLM2-Math-Plus-7B outperforms Deepseek-Math-7B-RL, the state-of-the-art open-source math reasoning model, on MATH and MATH-Python (Deepseek-Math-7B-RL remains ahead on GSM8K). InternLM2-Math-Plus-Mixtral8x22B achieves 68.5 on MATH (with Python) and 91.8 on GSM8K.
| Model                            | MATH     | MATH-Python | GSM8K    |
| -------------------------------- | -------- | ----------- | -------- |
| MiniCPM-2B                       | 10.2     | -           | 53.8     |
| InternLM2-Math-Plus-1.8B         | **37.0** | **41.5**    | **58.8** |
| InternLM2-Math-7B                | 34.6     | 50.9        | 78.1     |
| Deepseek-Math-7B-RL              | 51.7     | 58.8        | **88.2** |
| InternLM2-Math-Plus-7B           | **53.0** | **59.7**    | 85.8     |
| InternLM2-Math-20B               | 37.7     | 54.3        | 82.6     |
| InternLM2-Math-Plus-20B          | **53.8** | **61.8**    | **87.7** |
| Mixtral8x22B-Instruct-v0.1       | 41.8     | -           | 78.6     |
| Eurux-8x22B-NCA                  | 49.0     | -           | -        |
| InternLM2-Math-Plus-Mixtral8x22B | **58.1** | **68.5**    | **91.8** |

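The MATH-Python column reports accuracy when the model is allowed to write and run Python (the code-interpreter setting) instead of reasoning in text alone. As a purely illustrative sketch, not actual model output, a question such as "What is the sum of the positive divisors of 36?" could be answered with a short program like the one below. The function name `sum_of_divisors` is hypothetical.

```python
def sum_of_divisors(n: int) -> int:
    """Return the sum of all positive divisors of n by direct enumeration."""
    return sum(d for d in range(1, n + 1) if n % d == 0)


# Divisors of 36 are 1, 2, 3, 4, 6, 9, 12, 18, 36.
answer = sum_of_divisors(36)
print(answer)  # prints 91
```

Delegating the arithmetic to an interpreter removes calculation slips from the final answer, which is one plausible reason the MATH-Python scores in the table are higher than the plain MATH scores for every model that reports both.
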
We also evaluate models on [MathBench-A](https://github.com/open-compass/MathBench). InternLM2-Math-Plus-Mixtral8x22B performs comparably to Claude 3 Opus (62.0 vs. 63.0 average).
| Model                            | Arithmetic | Primary | Middle | High | College | Average |
| -------------------------------- | ---------- | ------- | ------ | ---- | ------- | ------- |
| GPT-4o-0513                      | 77.7       | 87.7    | 76.3   | 59.0 | 54.0    | 70.9    |
| Claude 3 Opus                    | 85.7       | 85.0    | 58.0   | 42.7 | 43.7    | 63.0    |
| Qwen-Max-0428                    | 72.3       | 86.3    | 65.0   | 45.0 | 27.3    | 59.2    |
| Qwen-1.5-110B                    | 70.3       | 82.3    | 64.0   | 47.3 | 28.0    | 58.4    |
| Deepseek-V2                      | 82.7       | 89.3    | 59.0   | 39.3 | 29.3    | 59.9    |
| Llama-3-70B-Instruct             | 70.3       | 86.0    | 53.0   | 38.7 | 34.7    | 56.5    |
| InternLM2-Math-Plus-Mixtral8x22B | 77.5       | 82.0    | 63.6   | 50.3 | 36.8    | 62.0    |
| InternLM2-Math-20B               | 58.7       | 70.0    | 43.7   | 24.7 | 12.7    | 42.0    |
| InternLM2-Math-Plus-20B          | 65.8       | 79.7    | 59.5   | 47.6 | 24.8    | 55.5    |
| Llama3-8B-Instruct               | 54.7       | 71.0    | 25.0   | 19.0 | 14.0    | 36.7    |
| InternLM2-Math-7B                | 53.7       | 67.0    | 41.3   | 18.3 | 8.0     | 37.7    |
| Deepseek-Math-7B-RL              | 68.0       | 83.3    | 44.3   | 33.0 | 23.0    | 50.3    |
| InternLM2-Math-Plus-7B           | 61.4       | 78.3    | 52.5   | 40.5 | 21.7    | 50.9    |
| MiniCPM-2B                       | 49.3       | 51.7    | 18.0   | 8.7  | 3.7     | 26.3    |
| InternLM2-Math-Plus-1.8B         | 43.0       | 43.3    | 25.4   | 18.9 | 4.7     | 27.1    |
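
The Average column appears to be the unweighted mean of the five level scores, rounded to one decimal. This is an inference from the numbers, not something the table states. The sketch below recomputes it for a few rows copied from the MathBench-A table; the names `rows` and `mean_score` are hypothetical.

```python
# (Arithmetic, Primary, Middle, High, College) scores and the reported Average,
# copied from the MathBench-A table.
rows = {
    "GPT-4o-0513": ((77.7, 87.7, 76.3, 59.0, 54.0), 70.9),
    "Claude 3 Opus": ((85.7, 85.0, 58.0, 42.7, 43.7), 63.0),
    "InternLM2-Math-Plus-Mixtral8x22B": ((77.5, 82.0, 63.6, 50.3, 36.8), 62.0),
    "InternLM2-Math-Plus-1.8B": ((43.0, 43.3, 25.4, 18.9, 4.7), 27.1),
}


def mean_score(levels):
    """Unweighted mean of the per-level scores, rounded to one decimal."""
    return round(sum(levels) / len(levels), 1)


for model, (levels, reported) in rows.items():
    print(f"{model}: recomputed {mean_score(levels)}, reported {reported}")
```

Because every level carries equal weight, a model that is strong on Arithmetic and Primary but weak on College can still land near a model with a flatter profile, so the per-level columns are worth reading alongside the average.
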