Dongfu Jiang committed · Commit 8a4150c · Parent: 3c25291 · Update README.md
README.md
CHANGED
@@ -29,12 +29,12 @@ The models are fine-tuned with the MetricInstruct dataset using the original Lla
 Experiments show that TIGERScore surpasses existing baseline metrics in correlation with human ratings on all 6 held-in tasks and 1 held-out task, achieving the highest overall performance. We hope the emergence of TIGERScore can promote research in the LLM community as a powerful, interpretable, and easy-to-use metric.
 
 ### Kendall Results
-| Tasks
+| Tasks⟶ | Summ | Trans | D2T | LF-QA | MathQA | Instruct | Story-Gen | Average |
 |----------------------------------------|-----------|-----------|-----------------|-----------|-----------|-----------|-----------|-----------|
 | | | | GPT-based Metrics | | | | | |
 | GPT-3.5-turbo (few-shot) | **30.45** | 32.3 | 30.38 | 20.91 | **58.57** | 17.73 | 3.26 | 27.65 |
 | GPT-4 (zero-shot) | 29.32 | **35.38** | **32.26** | **35.85** | 46.63 | **49.5** | **25.69** | **36.38** |
-| | | | Reference
+| | | | Reference-based Metrics | | | | |
 | BLEU | 8.71 | 14.5 | 23.13 | 7.73 | 17.25 | 35.92 | -0.89 | 15.19 |
 | ROUGE-2f | 10.67 | 13.19 | 24.74 | 11.73 | 18.07 | 34.59 | 1.78 | 16.4 |
 | InstructScore | 20.86 | 40.44 | 30.21 | 15.64 | -3.87 | 13.87 | 13.5 | 18.66 |
@@ -45,7 +45,7 @@ Experiments show that TIGERScore surpasses existing baseline metrics in correlat
 | BLEURT | 12.69 | 36.12 | **34.48** | 23.11 | 2.88 | 27.94 | 19.18 | 22.34 |
 | UniEval (summ) | **35.89** | 16.08 | 28.56 | **29.32** | 16.15 | 11.93 | **31.22** | 24.17 |
 | COMET-22 | 25.01 | **42.79** | 23.43 | 24.66 | -4.52 | **36.17** | 27.52 | **25.01** |
-| | | | Reference
+| | | | Reference-free Metrics | | | | |
 | BARTScore-para (src-hypo) | 29.12 | 7.01 | 22.32 | 18.8 | -2.21 | 4.26 | 14.15 | 13.35 |
 | BARTScore-cnn (src-hypo) | 26.63 | 9.4 | 23.69 | 28.93 | 1.23 | 19.09 | 23.29 | 18.89 |
 | Llama-2-13b-chat-0-shot | 25.22 | 11.79 | 23.45 | 15.96 | 1.08 | 19.5 | 21.52 | 16.93 |
@@ -53,8 +53,8 @@ Experiments show that TIGERScore surpasses existing baseline metrics in correlat
 | GPTScore-src | 28.2 | 6.5 | 19.81 | 27.64 | 11.64 | 20.04 | 16.36 | 18.6 |
 | TigerScore-7B | 28.79 | 33.65 | 32.44 | 33.93 | 19.98 | 38.13 | 29.72 | 30.95 |
 | TigerScore-13B | **31.29** | **36.5** | **36.43** | **33.17** | **21.58** | **41.84** | **35.33** | **33.73** |
-
-
+| ∆ (ours - best reference-free) | 2 | 0 | 13 | 4 | 10 | 15 | 14 | 15 |
+| ∆ (ours - best reference-based) | -4 | -6 | 2 | 4 | 2 | 5 | 4 | 8 |
 
 ### Pearson Results
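The Kendall results above report rank correlation between each metric's scores and human ratings over the same examples. As a minimal sketch of the statistic (tau-a, assuming no ties, on hypothetical paired scores — not data from the table):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:       # pair ranked the same way by both scorers
            concordant += 1
        elif s < 0:     # pair ranked in opposite ways
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical human ratings and metric scores for five model outputs
human = [4.0, 2.5, 5.0, 3.0, 1.0]
metric = [0.62, 0.35, 0.88, 0.10, 0.41]
print(kendall_tau(human, metric))  # → 0.4  (7 concordant, 3 discordant pairs)
```

The table values appear to be scaled by 100; for real evaluations with tied ratings, `scipy.stats.kendalltau` computes the tie-aware tau-b variant.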