Dongfu Jiang committed on
Commit
8a4150c
·
1 Parent(s): 3c25291

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -29,12 +29,12 @@ The models are fine-tuned with the MetricInstruct dataset using the original Lla
29
  Experiments show that TIGERScore surpasses existing baseline metrics in correlation with human ratings on all 6 held-in tasks and 1 held-out task, achiving the highest overall performance. We hope the emergence of TIGERScore can promote the research in the LLM community as a powerful, interpretable, and easy-to-use metric.
30
 
31
  ### Kendall Results
32
- | Tasks$\rightarrow$ | Summ | Trans | D2T | LF-QA | MathQA | Instruct | Story-Gen | Average |
33
  |----------------------------------------|-----------|-----------|-----------------|-----------|-----------|-----------|-----------|-----------|
34
  | | | | GPT-based Metrics | | | | | |
35
  | GPT-3.5-turbo (few-shot) | **30.45** | 32.3 | 30.38 | 20.91 | **58.57** | 17.73 | 3.26 | 27.65 |
36
  | GPT-4 (zero-shot) | 29.32 | **35.38** | **32.26** | **35.85** | 46.63 | **49.5** | **25.69** | **36.38** |
37
- | | | | Reference-based Metrics | | | | | |
38
  | BLEU | 8.71 | 14.5 | 23.13 | 7.73 | 17.25 | 35.92 | -0.89 | 15.19 |
39
  | ROUGE-2f | 10.67 | 13.19 | 24.74 | 11.73 | 18.07 | 34.59 | 1.78 | 16.4 |
40
  | InstructScore | 20.86 | 40.44 | 30.21 | 15.64 | -3.87 | 13.87 | 13.5 | 18.66 |
@@ -45,7 +45,7 @@ Experiments show that TIGERScore surpasses existing baseline metrics in correlat
45
  | BLEURT | 12.69 | 36.12 | **34.48** | 23.11 | 2.88 | 27.94 | 19.18 | 22.34 |
46
  | UniEval (summ) | **35.89** | 16.08 | 28.56 | **29.32** | 16.15 | 11.93 | **31.22** | 24.17 |
47
  | COMET-22 | 25.01 | **42.79** | 23.43 | 24.66 | -4.52 | **36.17** | 27.52 | **25.01** |
48
- | | | | Reference-free Metrics | | | | | |
49
  | BARTScore-para (src-hypo) | 29.12 | 7.01 | 22.32 | 18.8 | -2.21 | 4.26 | 14.15 | 13.35 |
50
  | BARTScore-cnn (src-hypo) | 26.63 | 9.4 | 23.69 | 28.93 | 1.23 | 19.09 | 23.29 | 18.89 |
51
  | Llama-2-13b-chat-0-shot | 25.22 | 11.79 | 23.45 | 15.96 | 1.08 | 19.5 | 21.52 | 16.93 |
@@ -53,8 +53,8 @@ Experiments show that TIGERScore surpasses existing baseline metrics in correlat
53
  | GPTScore-src | 28.2 | 6.5 | 19.81 | 27.64 | 11.64 | 20.04 | 16.36 | 18.6 |
54
  | TigerScore-7B | 28.79 | 33.65 | 32.44 | 33.93 | 19.98 | 38.13 | 29.72 | 30.95 |
55
  | TigerScore-13B | **31.29** | **36.5** | **36.43** | **33.17** | **21.58** | **41.84** | **35.33** | **33.73** |
56
- | $\delta$ (ours - best reference-free) | 2 | 0 | 13 | 4 | 10 | 15 | 14 | 15 |
57
- | $\delta$ (ours - best reference-based) | -4 | -6 | 2 | 4 | 2 | 5 | 4 | 8 |
58
 
59
  ### Pearson Results
60
 
 
29
  Experiments show that TIGERScore surpasses existing baseline metrics in correlation with human ratings on all 6 held-in tasks and 1 held-out task, achiving the highest overall performance. We hope the emergence of TIGERScore can promote the research in the LLM community as a powerful, interpretable, and easy-to-use metric.
30
 
31
  ### Kendall Results
32
+ | Tasks | Summ | Trans | D2T | LF-QA | MathQA | Instruct | Story-Gen | Average |
33
  |----------------------------------------|-----------|-----------|-----------------|-----------|-----------|-----------|-----------|-----------|
34
  | | | | GPT-based Metrics | | | | | |
35
  | GPT-3.5-turbo (few-shot) | **30.45** | 32.3 | 30.38 | 20.91 | **58.57** | 17.73 | 3.26 | 27.65 |
36
  | GPT-4 (zero-shot) | 29.32 | **35.38** | **32.26** | **35.85** | 46.63 | **49.5** | **25.69** | **36.38** |
37
+ | | | | Reference|-based Metrics | | | | |
38
  | BLEU | 8.71 | 14.5 | 23.13 | 7.73 | 17.25 | 35.92 | -0.89 | 15.19 |
39
  | ROUGE-2f | 10.67 | 13.19 | 24.74 | 11.73 | 18.07 | 34.59 | 1.78 | 16.4 |
40
  | InstructScore | 20.86 | 40.44 | 30.21 | 15.64 | -3.87 | 13.87 | 13.5 | 18.66 |
 
45
  | BLEURT | 12.69 | 36.12 | **34.48** | 23.11 | 2.88 | 27.94 | 19.18 | 22.34 |
46
  | UniEval (summ) | **35.89** | 16.08 | 28.56 | **29.32** | 16.15 | 11.93 | **31.22** | 24.17 |
47
  | COMET-22 | 25.01 | **42.79** | 23.43 | 24.66 | -4.52 | **36.17** | 27.52 | **25.01** |
48
+ | | | | Reference|-free Metrics | | | | |
49
  | BARTScore-para (src-hypo) | 29.12 | 7.01 | 22.32 | 18.8 | -2.21 | 4.26 | 14.15 | 13.35 |
50
  | BARTScore-cnn (src-hypo) | 26.63 | 9.4 | 23.69 | 28.93 | 1.23 | 19.09 | 23.29 | 18.89 |
51
  | Llama-2-13b-chat-0-shot | 25.22 | 11.79 | 23.45 | 15.96 | 1.08 | 19.5 | 21.52 | 16.93 |
 
53
  | GPTScore-src | 28.2 | 6.5 | 19.81 | 27.64 | 11.64 | 20.04 | 16.36 | 18.6 |
54
  | TigerScore-7B | 28.79 | 33.65 | 32.44 | 33.93 | 19.98 | 38.13 | 29.72 | 30.95 |
55
  | TigerScore-13B | **31.29** | **36.5** | **36.43** | **33.17** | **21.58** | **41.84** | **35.33** | **33.73** |
56
+ | (ours - best reference-free) | 2 | 0 | 13 | 4 | 10 | 15 | 14 | 15 |
57
+ | (ours - best reference-based) | -4 | -6 | 2 | 4 | 2 | 5 | 4 | 8 |
58
 
59
  ### Pearson Results
60