TIGER-Lab/TIGERScore-7B

wenhu committed on Dec 5, 2023

Commit 6f413ca · 1 Parent(s): 440f6e6

Update README.md

Files changed (1)

README.md +1 -1

@@ -24,7 +24,7 @@ The models are fine-tuned with the MetricInstruct dataset using the original Lla
 TIGERScore significantly surpasses traditional metrics, i.e. BLUE, ROUGE, BARTScore, and BLEURT, and emerging LLM-based metrics as reference-free metrics. Though our dataset was originally sourced from ChatGPT, our distilled model actually outperforms ChatGPT itself, which proves the effectiveness of our filtering strategy. On the unseen task of story generation, TIGERScore also demonstrates reasonable generalization capability.
-| Tasks→                                    | Summarization  | Translation    | Data2Text      | Long-form QA    | MathQA         | Instruction    | Story-Gen      | Average        |
+| Tasks→                                    | Summarization  | Translation    | Data2Text      | Long-form QA    | MathQA         | Instruction Following   | Story-Gen      | Average        |
 |-------------------------------------------|----------------|----------------|----------------|-----------------|----------------|----------------|----------------|----------------|
 | GPT-3.5-turbo (few-shot)                  | **38.50**      | 40.53          | 40.20          | 29.33           | **66.46**      | 23.20          | 4.77           | 34.71          |
 | GPT-4 (zero-shot)                         | 36.46          | **43.87**      | **44.04**      | **48.95**       | 51.71          | **58.53**      | **32.48**      | **45.15**      |