Dongfu Jiang committed · Commit be15112 · Parent(s): 8a4150c

Update README.md

README.md CHANGED
Experiments show that TIGERScore surpasses existing baseline metrics in correlation with human ratings on all 6 held-in tasks and 1 held-out task, achieving the highest overall performance. We hope the emergence of TIGERScore can promote research in the LLM community as a powerful, interpretable, and easy-to-use metric.

### Kendall Results

| Tasks⟶ | Summarization | Translation | Data2Text | Long-form QA | MathQA | Instruction Following | Story-Gen | Average |
|---|---|---|---|---|---|---|---|---|
| | | | GPT-based Metrics | | | | | |
| GPT-3.5-turbo (few-shot) | **30.45** | 32.3 | 30.38 | 20.91 | **58.57** | 17.73 | 3.26 | 27.65 |
| GPT-4 (zero-shot) | 29.32 | **35.38** | **32.26** | **35.85** | 46.63 | **49.5** | **25.69** | **36.38** |
| | | | Reference-based Metrics | | | | | |
| BLEU | 8.71 | 14.5 | 23.13 | 7.73 | 17.25 | 35.92 | -0.89 | 15.19 |
| ROUGE-2f | 10.67 | 13.19 | 24.74 | 11.73 | 18.07 | 34.59 | 1.78 | 16.4 |
| InstructScore | 20.86 | 40.44 | 30.21 | 15.64 | -3.87 | 13.87 | 13.5 | 18.66 |
| … | | | | | | | | |
| BLEURT | 12.69 | 36.12 | **34.48** | 23.11 | 2.88 | 27.94 | 19.18 | 22.34 |
| UniEval (summ) | **35.89** | 16.08 | 28.56 | **29.32** | 16.15 | 11.93 | **31.22** | 24.17 |
| COMET-22 | 25.01 | **42.79** | 23.43 | 24.66 | -4.52 | **36.17** | 27.52 | **25.01** |
| | | | Reference-free Metrics | | | | | |
| BARTScore-para (src-hypo) | 29.12 | 7.01 | 22.32 | 18.8 | -2.21 | 4.26 | 14.15 | 13.35 |
| BARTScore-cnn (src-hypo) | 26.63 | 9.4 | 23.69 | 28.93 | 1.23 | 19.09 | 23.29 | 18.89 |
| Llama-2-13b-chat-0-shot | 25.22 | 11.79 | 23.45 | 15.96 | 1.08 | 19.5 | 21.52 | 16.93 |
| … | | | | | | | | |
| GPTScore-src | 28.2 | 6.5 | 19.81 | 27.64 | 11.64 | 20.04 | 16.36 | 18.6 |
| TIGERScore-7B | 28.79 | 33.65 | 32.44 | 33.93 | 19.98 | 38.13 | 29.72 | 30.95 |
| TIGERScore-13B | **31.29** | **36.5** | **36.43** | **33.17** | **21.58** | **41.84** | **35.33** | **33.73** |
| Δ (ours - best reference-free) | +2 | +0 | +13 | +4 | +10 | +15 | +14 | +15 |
| Δ (ours - best reference-based) | -4 | -6 | +2 | +4 | +2 | +5 | +4 | +8 |
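The Δ rows are rounded differences between TIGERScore-13B's scores and the best baseline in each group. A minimal sketch of that arithmetic for the Average column, using only the reference-free Kendall baselines visible in this diff (hidden rows are not included, so this is illustrative rather than exhaustive):

```python
# Illustrative check of the Kendall "Δ (ours - best reference-free)" Average cell,
# using only the reference-free baseline averages shown in this diff.
ref_free_avgs = {
    "BARTScore-para (src-hypo)": 13.35,
    "BARTScore-cnn (src-hypo)": 18.89,
    "Llama-2-13b-chat-0-shot": 16.93,
    "GPTScore-src": 18.6,
}
ours = 33.73  # TIGERScore-13B average Kendall correlation

# Best visible baseline is BARTScore-cnn at 18.89; 33.73 - 18.89 ≈ 14.84 → +15.
delta = round(ours - max(ref_free_avgs.values()))
print(f"+{delta}")  # matches the table's +15
```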

### Pearson Results

| Tasks⟶ | Summarization | Translation | Data2Text | Long-form QA | MathQA | Instruction Following | Story-Gen | Average |
|---|---|---|---|---|---|---|---|---|
| | | | GPT-based Metrics | | | | | |
| GPT-3.5-turbo (few-shot) | **45.53** | **43.77** | **47.76** | 29.84 | **61.26** | 15.36 | 7.8 | 35.9 |
| GPT-4 (zero-shot) | 40.75 | 33.92 | 46.83 | **49.3** | 54.98 | **60.45** | **37.74** | **46.28** |
| | | | Reference-based Metrics | | | | | |
| BLEU | 11.66 | 17.47 | 34.29 | 18.21 | 18.12 | 29.47 | -0.64 | 18.37 |
| ROUGE-2f | 16.03 | 16.26 | 35.85 | 19.66 | 20.69 | 33.49 | 2.88 | 20.69 |
| InstructScore | 27.4 | 51.55 | 47.28 | 20.59 | 0.36 | 20.98 | 12.81 | 25.85 |
| GPTScore-ref | 13.47 | 21.05 | 48.7 | 33.4 | 18.22 | 29.66 | 18.94 | 26.2 |
| BARTScore-cnn (hypo-ref) | 16.67 | 23.56 | 45.08 | 32.78 | **23.09** | 26.57 | 27.61 | 27.91 |
| BARTScore-para (hypo-ref) | 19.73 | 29.04 | 47.89 | 32.7 | 17.33 | 30.2 | 17.76 | 27.81 |
| BERTScore | 26.26 | 37.65 | 48.22 | 26.39 | 11.19 | 45.58 | 4.08 | 28.48 |
| BLEURT | 17.27 | 43 | **54.32** | 34.26 | 3.98 | 39.15 | 27.89 | 31.41 |
| UniEval (summ) | **53.22** | 23.11 | 51.14 | **36.95** | 17.69 | 30.87 | **44.88** | 36.84 |
| COMET-22 | 35.32 | **58.46** | 43.82 | 36.79 | -5.58 | **49.68** | 40.12 | **36.94** |
| | | | Reference-free Metrics | | | | | |
| BARTScore-para (src-hypo) | 43.11 | 6.96 | 37.82 | 29.86 | -0.41 | 19.37 | 19.99 | 22.38 |
| BARTScore-cnn (src-hypo) | 39.72 | 9.53 | 45.43 | 41.48 | 3.28 | 34.97 | 33.51 | 29.7 |
| Llama-2-13b-chat-0-shot | 29.59 | 9.09 | 41.32 | 21.67 | 2.8 | 22.71 | 21.13 | 21.19 |
| COMETKiwi | 14.22 | **50.91** | 23.63 | 22.59 | -13.35 | 34.46 | 19.12 | 21.65 |
| GPTScore-src | 41.71 | 6.82 | 41.19 | 39.79 | 13.99 | 27.59 | 23.22 | 27.76 |
| TIGERScore-7B | 43.95 | 37.7 | 49.13 | **46.1** | 21.77 | 38.26 | 39.9 | 39.54 |
| TIGERScore-13B | **44.21** | 41.54 | **52.87** | 44.76 | **24.41** | **47.52** | **47.66** | **43.28** |
| Δ (ours - best reference-free) | +1 | -9 | +7 | +5 | +10 | +20 | +14 | +13 |
| Δ (ours - best reference-based) | -9 | -17 | -2 | +9 | +1 | -2 | +3 | +6 |

### Spearman Results

| Tasks⟶ | Summarization | Translation | Data2Text | Long-form QA | MathQA | Instruction Following | Story-Gen | Average |
|---|---|---|---|---|---|---|---|---|
| | | | GPT-based Metrics | | | | | |
| GPT-3.5-turbo (few-shot) | **38.50** | 40.53 | 40.20 | 29.33 | **66.46** | 23.20 | 4.77 | 34.71 |
| GPT-4 (zero-shot) | 36.46 | **43.87** | **44.04** | **48.95** | 51.71 | **58.53** | **32.48** | **45.15** |
| | | | Reference-based Metrics | | | | | |
| BLEU | 11.98 | 19.73 | 33.29 | 11.38 | 21.12 | **46.61** | -1.17 | 20.42 |
| ROUGE-2f | 14.53 | 17.83 | 35.49 | 16.83 | 22.12 | 44.56 | 2.34 | 21.96 |
| InstructScore | 26.33 | 47.30 | 43.93 | 21.62 | -4.15 | 16.19 | 16.13 | 23.91 |
| … | | | | | | | | |
| BLEURT | 17.30 | 48.41 | **48.76** | 33.26 | 3.53 | 36.46 | 27.52 | 30.75 |
| UniEval (summ) | **47.52** | 21.90 | 38.38 | **41.83** | 19.78 | 16.02 | **44.46** | 32.84 |
| COMET-22 | 33.75 | **56.35** | 33.92 | 35.28 | -5.53 | 46.13 | 39.20 | **34.16** |
| | | | Reference-free Metrics | | | | | |
| BARTScore-para (src-hypo) | **38.68** | 9.60 | 32.26 | 26.86 | -2.70 | 5.92 | 20.55 | 18.74 |
| BARTScore-cnn (src-hypo) | 35.50 | 12.83 | 34.33 | 40.96 | 1.50 | 25.43 | 33.48 | 26.29 |
| Llama-2-13b-chat-0-shot | 28.53 | 14.38 | 29.24 | 19.91 | 1.08 | 21.37 | 26.78 | 20.18 |
| … | | | | | | | | |
| TIGERScore-7B (ours) | 35.11 | 41.50 | 42.39 | **47.11** | 21.23 | 43.57 | 39.26 | 38.60 |
| TIGERScore-13B (ours) | 36.81 | 44.99 | **45.88** | 46.22 | **23.32** | **47.03** | **46.36** | **41.52** |
| Δ (ours - best reference-free) | -2 | -3 | +12 | +5 | +9 | +14 | +13 | +16 |
| Δ (ours - best reference-based) | -9 | -11 | -3 | +5 | -0 | +0 | +2 | +7 |
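For readers unfamiliar with the three statistics reported above, here is a minimal self-contained sketch of how Kendall, Spearman, and Pearson correlations relate metric scores to human ratings. The example scores below are made up for illustration, not taken from the paper:

```python
def pearson(x, y):
    # Pearson r: covariance divided by the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # Average ranks (1-based), with ties sharing their mean rank.
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Spearman rho is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    # Kendall tau-a: (concordant - discordant) pairs over all pairs.
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)

human = [1, 2, 3, 4, 5]             # hypothetical human ratings
metric = [0.2, 0.1, 0.5, 0.7, 0.9]  # hypothetical metric scores

print(round(kendall(human, metric), 2))   # 0.8  (one discordant pair of ten)
print(round(spearman(human, metric), 2))  # 0.9
print(round(pearson(human, metric), 2))   # 0.94
```

A metric with high values on all three tracks human judgments both in ordering (Kendall, Spearman) and in linear agreement (Pearson), which is why the tables report each separately.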

## Usage