Victor Sanh committed
Commit 3449b08 · Parent(s): 68ebcd0
update training datasets list

README.md CHANGED
@@ -61,15 +61,17 @@ We trained different variants T0 with different mixtures of datasets.
 
 |Model|Training datasets|
 |--|--|
-|T0_11B|- Multiple-Choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop<br>- Extractive QA: Adversarial QA, Quoref, TyDiQA, DuoRC, ROPES<br>- Closed-Book QA: Hotpot QA
-|T0p_11B|Same as T0_11B with
-|T0pp_11B|Same as T0p_11B with a few additional datasets from SuperGLUE:<br>- BoolQ<br>- COPA<br>- MultiRC<br>- ReCoRD<br>- WiC<br>- WSC|
+|T0_11B|- Multiple-Choice QA: CommonsenseQA, DREAM, QUAIL, QuaRTz, Social IQA, WiQA, Cosmos, QASC, Quarel, SciQ, Wiki Hop<br>- Extractive QA: Adversarial QA, Quoref, TyDiQA, DuoRC, ROPES<br>- Closed-Book QA: Hotpot QA*, Wiki QA<br>- Structure-To-Text: Common Gen, Wiki Bio<br>- Sentiment: Amazon, App Reviews, IMDB, Rotten Tomatoes, Yelp<br>- Summarization: CNN Daily Mail, Gigaword, MultiNews, SamSum, XSum<br>- Topic Classification: AG News, DBPedia, TREC<br>- Paraphrase Identification: MRPC, PAWS, QQP|
+|T0p_11B|Same as T0_11B with additional datasets from GPT-3's evaluation suite:<br>- Multiple-Choice QA: ARC, OpenBook QA, PiQA, RACE, HellaSwag<br>- Extractive QA: SQuAD v2<br>- Closed-Book QA: Trivia QA, Web Questions|
+|T0pp_11B|Same as T0p_11B with a few additional datasets from SuperGLUE (excluding NLI sets):<br>- BoolQ<br>- COPA<br>- MultiRC<br>- ReCoRD<br>- WiC<br>- WSC|
 |T0_11B_single_prompt|Same as T0_11B but only one prompt per training dataset|
 |T0_11B_original_task_only|Same as T0_11B but only original task templates|
 |T0_3B|Same as T0_11B but starting from a T5-LM XL (3B parameters) pre-trained model|
 
 For reproducibility, we release the data we used for training (and evaluation) in the [P3 dataset](TODO). Prompt examples can be found on the dataset page.
 
+*: We recast Hotpot QA as closed-book QA due to its long input sequence length.
+
 # Evaluation data
 
 We systematically evaluate our models on a suite of held-out tasks:
@@ -82,20 +84,20 @@ We systematically evaluate our models on a suite of held-out tasks:
 |Sentence completion|COPA, HellaSwag, Story Cloze|
 
 We also evaluate T0_11B, T0p_11B and T0pp_11B on a subset of the [BIG-bench benchmark](https://github.com/google/BIG-bench):
--
--
--
--
-- Language
--
--
--
--
--
--
--
+- Code description task
+- Conceptual combinations
+- Hindu knowledge json
+- Known unknowns
+- Language identification
+- Logic grid puzzle task
+- Logical deduction
+- Common misconceptions
+- Movie dialog same or different
+- Novel concepts
+- Strategyqa
+- Formal fallacies syllogisms negation
 - VitaminC
--
+- Winowhy multiple choice
 
 # Limitations
 