yangsaisai committed
Commit efa7197 · verified · 1 Parent(s): ae16a45

Update README.md


update Evaluation Results

Files changed (1)
  1. README.md +33 -20
README.md CHANGED
@@ -170,7 +170,7 @@ TableGPT-R1 is under apache-2.0 license.
 
 **Research Paper**
 
-TableGPT-R1 is introduced and validated in the paper "[TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning](https://arxiv.org/xxxx)" available on arXiv.
+TableGPT-R1 is introduced and validated in the paper "[TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning](https://arxiv.org/abs/2512.20312)" available on arXiv.
 
 **Where to send questions or comments about the model**
 
@@ -178,25 +178,38 @@ Inquiries and feedback are welcome at [j.zhao@zju.edu.cn](mailto:j.zhao@zju.edu.
 
 ## Evaluation Results
 
-Performance comparison grouped by model scale. Left Group: Models with comparable
-parameters to TableGPT-R1. Right Group: Significantly larger models and proprietary closed-source
-models. Bold indicates the best result within each group. Gray background highlights TableGPT-R1. Abbreviations: Q3: Qwen3; QwQ: QwQ-32B; DS-V3: DeepSeek-V3; Q-Plus: Qwen-Plus;
-T-LLM: TableLLM; Llama: Llama-3.1-8B; TGPT2: TableGPT2-7B; TGPT-R1: TableGPT-R1-8B;
-FC: Fact Checking; NR: Numerical Reasoning; SC: Structure Comprehending; DA: Data Analysis;
-CG: Chart Generation.
-
-| Benchmark | Task | Met. | Q3-8B | T-LLM | Llama | TGPT2 | **TGPT-R1 (8B)** | Q3-14B | Q3-32B | Q3-70B | QwQ | GPT-4o | DS-V3 | Q-Plus |
-| :--- | :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
-| **Internal Bench** | Table Info | Acc | 69.20 | 0.97 | 37.26 | - | **82.00** | 66.10 | 72.58 | 51.10 | 69.68 | 67.26 | 66.00 | **76.90** |
-| | Table Path | Acc | 73.90 | 0.65 | 31.77 | - | **85.00** | 74.70 | 78.55 | 60.50 | 75.00 | - | 72.90 | **81.50** |
-| **NL2SQL** | Spider | EX | 86.07 | 65.30 | 73.59 | 74.38 | **86.73** | 87.61 | 87.80 | 61.71 | 85.33 | 87.98 | 88.54 | **89.19** |
-| | BIRD | EX | 61.67 | 30.64 | 40.03 | 49.28 | **63.04** | 61.80 | 63.04 | 53.91 | 54.30 | 65.25 | 65.65 | **68.32** |
-| **Holistic Table** | TableBench DP | Rge | 42.10 | 3.63 | 18.04 | 42.10 | **47.58** | 47.41 | **52.18** | 48.61 | 49.33 | 40.91 | 36.56 | 31.01 |
-| **Evaluation** | PoT | Rge | 28.01 | 0.00 | 6.73 | **39.80** | 34.86 | 36.61 | 37.78 | 27.72 | 40.03 | **51.96** | 33.05 | 41.79 |
-| | SCoT | Rge | 41.86 | 1.99 | 21.94 | 40.70 | **48.68** | 47.36 | 47.47 | 45.68 | 44.84 | 41.43 | **50.11** | 44.06 |
-| | TCoT | Rge | 41.71 | 3.18 | 15.26 | 46.19 | **48.16** | 46.07 | 51.74 | 47.63 | 48.83 | 45.71 | **54.28** | 52.07 |
-| **RealHitBench** | FC | EM | 58.83 | 33.44 | 30.32 | 43.06 | **62.85** | 62.36 | **65.00** | 60.23 | 28.95 | 55.22 | **65.08** | 56.53 |
-| | NR | EM | 39.43 | 13.51 | 18.25 | 31.75 | **44.91** | 45.62 | 47.45 | 42.82 | 42.61 | 48.66 | **53.89** | 49.88 |
+TableGPT-R1 demonstrates substantial advancements over its predecessor, TableGPT2-7B, particularly in table comprehension and reasoning. Detailed comparisons follow:
+
+* **TableBench**: TableGPT-R1 achieves an average gain of 6.9% over Qwen3-8B across the four core sub-tasks. Compared to TableGPT2-7B, it records an average improvement of 3.12%, validating its enhanced reasoning capability despite a trade-off on the PoT sub-task.
+
+* **Natural Language to SQL**: TableGPT-R1 exhibits strong generalization. It improves consistently over Qwen3-8B on Spider 1.0 (+0.66%) and BIRD (+1.50%), and marks a significant leap over TableGPT2-7B, with gains of 12.35% and 13.89%, respectively.
+
+* **RealHitBench**: On this highly challenging benchmark, TableGPT-R1 delivers outstanding results, notably surpassing the strongest closed-source baseline, GPT-4o, which highlights its capability in hierarchical table reasoning. It matches or outperforms Qwen3-8B across all sub-tasks, with an average improvement of 11.81% and a peak gain of 31.17% on Chart Generation; compared to TableGPT2-7B, it registers an average improvement of 19.85% across all sub-tasks.
+
+* **Internal Benchmark**: Evaluation further attests to the model's robustness: TableGPT-R1 surpasses Qwen3-8B by substantial margins of 10.8% on Table Info and 8.8% on Table Path.
+
+
+| Benchmark | Task | Met. | Q3-8B | T-LLM | Llama | T-R1-Z | TGPT2 | **TGPT-R1** | Q3-14B | Q3-32B | Q3-30B | QwQ | GPT-4o | DS-V3 | Q-Plus | vs.Q3-8B | vs.TGPT2 |
+| :--- | :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| **Internal Bench** | | | | | | | | | | | | | | | | | |
+| Table Info | | Acc | 69.20 | 0.97 | 37.26 | 15.97 | - | **80.00** | 66.10 | 72.58 | 51.10 | 69.68 | 67.26 | 66.00 | **76.90** | 10.80 | - |
+| Table Path | | Acc | 73.90 | 0.65 | 31.77 | 9.19 | - | **82.70** | 74.70 | 78.55 | 60.50 | 75.00 | - | 72.90 | **81.50** | 8.80 | - |
+| **NL2SQL** | | | | | | | | | | | | | | | | | |
+| Spider | | EX | 86.07 | 65.30 | 73.59 | 82.63 | 74.38 | **86.73** | 87.61 | 87.80 | 61.71 | 85.33 | 87.98 | 88.54 | **89.19** | 0.66 | 12.35 |
+| BIRD | | EX | 61.67 | 30.64 | 40.03 | 50.98 | 49.28 | **63.17** | 61.80 | 63.04 | 53.91 | 54.30 | 65.25 | 65.65 | **68.32** | 1.50 | 13.89 |
+| **Holistic Table Evaluation** | | | | | | | | | | | | | | | | | |
+| TableBench | DP | Rge | 42.10 | 3.63 | 18.04 | 39.40 | 42.10 | **48.35** | 47.41 | **52.18** | 48.61 | 49.33 | 40.91 | 36.56 | 31.01 | 6.25 | 6.25 |
+| | PoT | Rge | 28.01 | 0.00 | 6.73 | 7.54 | **39.80** | 35.12 | 36.61 | 37.78 | 27.72 | 40.03 | **51.96** | 33.05 | 41.79 | 7.11 | -4.68 |
+| | SCoT | Rge | 41.86 | 1.99 | 21.94 | 28.89 | 40.70 | **49.53** | 47.36 | 47.47 | 45.68 | 44.84 | 41.43 | **50.11** | 44.06 | 7.67 | 8.83 |
+| | TCoT | Rge | 41.71 | 3.18 | 15.26 | 39.52 | 46.19 | **48.28** | 46.07 | 51.74 | 47.63 | 48.83 | 45.71 | **54.28** | 52.07 | 6.57 | 2.09 |
+| RealHitBench | FC | EM | 58.83 | 33.44 | 30.32 | 0.00 | 43.06 | **63.85** | 62.36 | 65.00 | 60.23 | **66.31** | 55.22 | 65.08 | 56.53 | 5.01 | 20.79 |
+| | NR | EM | 39.43 | 13.36 | 14.53 | 0.00 | 24.90 | **49.03** | 43.70 | 47.34 | 46.95 | **55.38** | 38.91 | 52.53 | 31.25 | 9.60 | 24.13 |
+| | SC | EM | 64.12 | 53.28 | 35.90 | 28.50 | 34.86 | **64.12** | 73.02 | 71.76 | 69.47 | **76.08** | 61.83 | 71.25 | 62.85 | 0.00 | 29.26 |
+| | DA | GPT | 53.28 | 47.86 | 60.12 | 36.24 | 53.16 | **66.53** | 63.03 | **66.67** | 53.27 | 64.99 | 55.54 | 66.29 | 62.04 | 13.25 | 13.37 |
+| | CG | ECR | 24.67 | 22.73 | 13.64 | 16.00 | 44.16 | **55.84** | 23.38 | 25.00 | 20.78 | 20.13 | 34.42 | 18.18 | **48.05** | 31.17 | 11.68 |
+| **Agent-based Data Analysis** | | | | | | | | | | | | | | | | | |
+| InfiAgent-DA | | Acc | 56.81 | 11.67 | 55.08 | 70.82 | 73.15 | **80.54** | 59.92 | 54.86 | 41.63 | 37.74 | **87.10** | 77.43 | 67.32 | 23.73 | 7.39 |
+
 
 ## Citation
 
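The averaged gains quoted in the commit's new bullet points follow from the per-task scores in the added table. A minimal sketch that recomputes them (scores transcribed by hand from the diff; the labels `q3`, `tgpt2`, and `r1` are ad-hoc names, not from the README):

```python
# Recompute the averaged per-task deltas quoted in the updated README text.

def avg_gain(rows, model, baseline):
    """Mean of (model - baseline) score deltas across sub-tasks, in points."""
    return round(sum(r[model] - r[baseline] for r in rows) / len(rows), 2)

# TableBench core sub-tasks: DP, PoT, SCoT, TCoT
tablebench = [
    {"q3": 42.10, "tgpt2": 42.10, "r1": 48.35},  # DP
    {"q3": 28.01, "tgpt2": 39.80, "r1": 35.12},  # PoT (the trade-off vs. TGPT2)
    {"q3": 41.86, "tgpt2": 40.70, "r1": 49.53},  # SCoT
    {"q3": 41.71, "tgpt2": 46.19, "r1": 48.28},  # TCoT
]

# RealHitBench sub-tasks: FC, NR, SC, DA, CG
realhitbench = [
    {"q3": 58.83, "tgpt2": 43.06, "r1": 63.85},  # FC
    {"q3": 39.43, "tgpt2": 24.90, "r1": 49.03},  # NR
    {"q3": 64.12, "tgpt2": 34.86, "r1": 64.12},  # SC
    {"q3": 53.28, "tgpt2": 53.16, "r1": 66.53},  # DA
    {"q3": 24.67, "tgpt2": 44.16, "r1": 55.84},  # CG (the 31.17-point peak gain)
]

print(avg_gain(tablebench, "r1", "q3"))       # ~6.9  (quoted: 6.9%)
print(avg_gain(tablebench, "r1", "tgpt2"))    # ~3.12 (quoted: 3.12%)
print(avg_gain(realhitbench, "r1", "q3"))     # ~11.81 (quoted: 11.81%)
print(avg_gain(realhitbench, "r1", "tgpt2"))  # ~19.85 (quoted: 19.85%)
```

Note the "%" figures in the text are absolute score-point differences averaged over sub-tasks, not relative improvements.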