Dongfu Jiang committed · Commit 8a4150c · Parent: 3c25291 · Update README.md
README.md
CHANGED
@@ -29,12 +29,12 @@ The models are fine-tuned with the MetricInstruct dataset using the original Lla
 Experiments show that TIGERScore surpasses existing baseline metrics in correlation with human ratings on all 6 held-in tasks and 1 held-out task, achieving the highest overall performance. We hope the emergence of TIGERScore can promote research in the LLM community as a powerful, interpretable, and easy-to-use metric.
 
 ### Kendall Results
-| Tasks
+| Tasks⟶ | Summ | Trans | D2T | LF-QA | MathQA | Instruct | Story-Gen | Average |
 |----------------------------------------|-----------|-----------|-----------------|-----------|-----------|-----------|-----------|-----------|
 | | | | GPT-based Metrics | | | | | |
 | GPT-3.5-turbo (few-shot) | **30.45** | 32.3 | 30.38 | 20.91 | **58.57** | 17.73 | 3.26 | 27.65 |
 | GPT-4 (zero-shot) | 29.32 | **35.38** | **32.26** | **35.85** | 46.63 | **49.5** | **25.69** | **36.38** |
-| | | | Reference
+| | | | Reference-based Metrics | | | | |
 | BLEU | 8.71 | 14.5 | 23.13 | 7.73 | 17.25 | 35.92 | -0.89 | 15.19 |
 | ROUGE-2f | 10.67 | 13.19 | 24.74 | 11.73 | 18.07 | 34.59 | 1.78 | 16.4 |
 | InstructScore | 20.86 | 40.44 | 30.21 | 15.64 | -3.87 | 13.87 | 13.5 | 18.66 |
@@ -45,7 +45,7 @@ Experiments show that TIGERScore surpasses existing baseline metrics in correlat
 | BLEURT | 12.69 | 36.12 | **34.48** | 23.11 | 2.88 | 27.94 | 19.18 | 22.34 |
 | UniEval (summ) | **35.89** | 16.08 | 28.56 | **29.32** | 16.15 | 11.93 | **31.22** | 24.17 |
 | COMET-22 | 25.01 | **42.79** | 23.43 | 24.66 | -4.52 | **36.17** | 27.52 | **25.01** |
-| | | | Reference
+| | | | Reference-free Metrics | | | | |
 | BARTScore-para (src-hypo) | 29.12 | 7.01 | 22.32 | 18.8 | -2.21 | 4.26 | 14.15 | 13.35 |
 | BARTScore-cnn (src-hypo) | 26.63 | 9.4 | 23.69 | 28.93 | 1.23 | 19.09 | 23.29 | 18.89 |
 | Llama-2-13b-chat-0-shot | 25.22 | 11.79 | 23.45 | 15.96 | 1.08 | 19.5 | 21.52 | 16.93 |
@@ -53,8 +53,8 @@ Experiments show that TIGERScore surpasses existing baseline metrics in correlat
 | GPTScore-src | 28.2 | 6.5 | 19.81 | 27.64 | 11.64 | 20.04 | 16.36 | 18.6 |
 | TigerScore-7B | 28.79 | 33.65 | 32.44 | 33.93 | 19.98 | 38.13 | 29.72 | 30.95 |
 | TigerScore-13B | **31.29** | **36.5** | **36.43** | **33.17** | **21.58** | **41.84** | **35.33** | **33.73** |
-
-
+| ∆ (ours - best reference-free) | 2 | 0 | 13 | 4 | 10 | 15 | 14 | 15 |
+| ∆ (ours - best reference-based) | -4 | -6 | 2 | 4 | 2 | 5 | 4 | 8 |
 
 ### Pearson Results
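The Kendall results above report rank correlation between each metric's scores and human ratings over the same examples. As a minimal sketch of the statistic (tau-a, assuming no ties, on hypothetical paired scores — not data from the table):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:       # pair ranked the same way by both scorers
            concordant += 1
        elif s < 0:     # pair ranked in opposite ways
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical human ratings and metric scores for five model outputs
human = [4.0, 2.5, 5.0, 3.0, 1.0]
metric = [0.62, 0.35, 0.88, 0.10, 0.41]
print(kendall_tau(human, metric))  # → 0.4  (7 concordant, 3 discordant pairs)
```

The table values appear to be scaled by 100; for real evaluations with tied ratings, `scipy.stats.kendalltau` computes the tie-aware tau-b variant.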