Update README.md

update Evaluation Results

README.md

TableGPT-R1 is under the Apache-2.0 license.

**Research Paper**

TableGPT-R1 is introduced and validated in the paper "[TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning](https://arxiv.org/abs/2512.20312)", available on arXiv.

**Where to send questions or comments about the model**

Inquiries and feedback are welcome at [j.zhao@zju.edu.cn](mailto:j.zhao@zju.edu.cn).

## Evaluation Results

TableGPT-R1 demonstrates substantial advances over its predecessor, TableGPT2-7B, particularly in table comprehension and reasoning. Detailed comparisons follow:

* **TableBench Benchmark**: TableGPT-R1 demonstrates strong performance, achieving an average gain of 6.9% over Qwen3-8B across the four core sub-tasks. Compared with TableGPT2-7B, it records an average improvement of 3.12%, validating its enhanced reasoning capability despite a trade-off on the PoT task.
* **Natural Language to SQL**: TableGPT-R1 exhibits strong generalization. It shows consistent improvements over Qwen3-8B on Spider 1.0 (+0.66%) and BIRD (+1.5%), and a significant leap over TableGPT2-7B, with gains of 12.35% and 13.89%, respectively.
* **RealHitBench Test**: On this highly challenging benchmark, TableGPT-R1 achieves outstanding results, notably surpassing the strongest closed-source baseline, GPT-4o, highlighting its capability in hierarchical table reasoning. Quantitatively, it matches or outperforms Qwen3-8B across sub-tasks, with an average improvement of 11.81% and a peak gain of 31.17% on the Chart Generation task. Compared with TableGPT2-7B, it registers an average improvement of 19.85% across all sub-tasks.
* **Internal Benchmark**: Evaluation on our internal benchmark further attests to the model's robustness: TableGPT-R1 surpasses Qwen3-8B by substantial margins of 10.8% on Table Info and 8.8% on Table Path.

| Benchmark | Task | Metric | Q3-8B | T-LLM | Llama | T-R1-Z | TGPT2 | **TGPT-R1** | Q3-14B | Q3-32B | Q3-30B | QwQ | GPT-4o | DS-V3 | Q-Plus | vs. Q3-8B | vs. TGPT2 |
| :--- | :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Internal Bench** | | | | | | | | | | | | | | | | | |
| Table Info | | Acc | 69.20 | 0.97 | 37.26 | 15.97 | - | **80.00** | 66.10 | 72.58 | 51.10 | 69.68 | 67.26 | 66.00 | **76.90** | 10.80 | - |
| Table Path | | Acc | 73.90 | 0.65 | 31.77 | 9.19 | - | **82.70** | 74.70 | 78.55 | 60.50 | 75.00 | - | 72.90 | **81.50** | 8.80 | - |
| **NL2SQL** | | | | | | | | | | | | | | | | | |
| Spider | | EX | 86.07 | 65.30 | 73.59 | 82.63 | 74.38 | **86.73** | 87.61 | 87.80 | 61.71 | 85.33 | 87.98 | 88.54 | **89.19** | 0.66 | 12.35 |
| BIRD | | EX | 61.67 | 30.64 | 40.03 | 50.98 | 49.28 | **63.17** | 61.80 | 63.04 | 53.91 | 54.30 | 65.25 | 65.65 | **68.32** | 1.50 | 13.89 |
| **Holistic Table Evaluation** | | | | | | | | | | | | | | | | | |
| TableBench | DP | Rge | 42.10 | 3.63 | 18.04 | 39.40 | 42.10 | **48.35** | 47.41 | **52.18** | 48.61 | 49.33 | 40.91 | 36.56 | 31.01 | 6.25 | 6.25 |
| | PoT | Rge | 28.01 | 0.00 | 6.73 | 7.54 | **39.80** | 35.12 | 36.61 | 37.78 | 27.72 | 40.03 | **51.96** | 33.05 | 41.79 | 7.11 | -4.68 |
| | SCoT | Rge | 41.86 | 1.99 | 21.94 | 28.89 | 40.70 | **49.53** | 47.36 | 47.47 | 45.68 | 44.84 | 41.43 | **50.11** | 44.06 | 7.67 | 8.83 |
| | TCoT | Rge | 41.71 | 3.18 | 15.26 | 39.52 | 46.19 | **48.28** | 46.07 | 51.74 | 47.63 | 48.83 | 45.71 | **54.28** | 52.07 | 6.57 | 2.09 |
| RealHitBench | FC | EM | 58.83 | 33.44 | 30.32 | 0.00 | 43.06 | **63.85** | 62.36 | 65.00 | 60.23 | **66.31** | 55.22 | 65.08 | 56.53 | 5.01 | 20.79 |
| | NR | EM | 39.43 | 13.36 | 14.53 | 0.00 | 24.90 | **49.03** | 43.70 | 47.34 | 46.95 | **55.38** | 38.91 | 52.53 | 31.25 | 9.60 | 24.13 |
| | SC | EM | 64.12 | 53.28 | 35.90 | 28.50 | 34.86 | **64.12** | 73.02 | 71.76 | 69.47 | **76.08** | 61.83 | 71.25 | 62.85 | 0.00 | 29.26 |
| | DA | GPT | 53.28 | 47.86 | 60.12 | 36.24 | 53.16 | **66.53** | 63.03 | **66.67** | 53.27 | 64.99 | 55.54 | 66.29 | 62.04 | 13.25 | 13.37 |
| | CG | ECR | 24.67 | 22.73 | 13.64 | 16.00 | 44.16 | **55.84** | 23.38 | 25.00 | 20.78 | 20.13 | 34.42 | 18.18 | **48.05** | 31.17 | 11.68 |
| **Agent-based Data Analysis** | | | | | | | | | | | | | | | | | |
| InfiAgent-DA | | Acc | 56.81 | 11.67 | 55.08 | 70.82 | 73.15 | **80.54** | 59.92 | 54.86 | 41.63 | 37.74 | **87.10** | 77.43 | 67.32 | 23.73 | 7.39 |
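As a quick sanity check (not part of the release), the aggregate gains quoted in the bullets above can be recomputed from the per-task scores and delta columns of the table. The snippet below is a minimal sketch with values transcribed from the table:

```python
# Recompute the reported average gains from the table's per-task deltas.

def avg(xs):
    """Mean of a list of deltas, rounded to two decimals as in the text."""
    return round(sum(xs) / len(xs), 2)

# TableBench deltas (DP, PoT, SCoT, TCoT)
assert avg([6.25, 7.11, 7.67, 6.57]) == 6.90       # vs. Qwen3-8B -> "6.9%"
assert avg([6.25, -4.68, 8.83, 2.09]) == 3.12      # vs. TableGPT2-7B

# RealHitBench deltas (FC, NR, SC, DA, CG)
assert avg([5.01, 9.60, 0.00, 13.25, 31.17]) == 11.81   # vs. Qwen3-8B
assert avg([20.79, 24.13, 29.26, 13.37, 11.68]) == 19.85  # vs. TableGPT2-7B

# NL2SQL: execution-accuracy gaps follow from the raw scores
assert round(86.73 - 86.07, 2) == 0.66    # Spider, vs. Qwen3-8B
assert round(63.17 - 61.67, 2) == 1.50    # BIRD, vs. Qwen3-8B
assert round(86.73 - 74.38, 2) == 12.35   # Spider, vs. TableGPT2-7B
assert round(63.17 - 49.28, 2) == 13.89   # BIRD, vs. TableGPT2-7B

print("all reported aggregates check out")
```

All asserts pass, so the prose figures and the table are internally consistent.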

## Citation