beyond committed 745db76 (1 parent: 145cd18): Update README.md

Files changed (1): README.md (+49 -27)

README.md CHANGED
@@ -42,13 +42,27 @@ inference:
 - Paper: [coming soon](to_be_added)
 - GitHub: [SEGA](https://github.com/beyondguo/SEGA).
 
- **SEGA** is able to write complete paragraphs given a sketch (or framework), which can be composed of:
- - keywords /key-phrases, like [NLP | AI | computer science]
- - spans, like [Conference on Empirical Methods | submission of research papers]
- - sentences, like [I really like machine learning | I work at Google since last year]
- - all mixup~
 
 ### How to use
 
 ```python
 from transformers import pipeline
 # 1. load the model with the huggingface `pipeline`
@@ -64,34 +78,42 @@ Output:
 'The Conference on Empirical Methods welcomes the submission of research papers. Abstracts should be in the form of a paper or presentation. Please submit abstracts to the following email address: eemml.stanford.edu. The conference will be held at Stanford University on April 1618, 2019. The theme of the conference is Deep Learning.'
 ```
 
- ## Model variations
-
-
- | Model | #params | Language |
- |------------------------|--------------------------------|-------|
- | [`sega-large`](https://huggingface.co/beyond/sega-large) | xM | English |
- | [`sega-base`(coming soon)]() | xM | English |
- | [`sega-large-chinese`(coming soon)]() | xM | Chinese |
- | [`sega-base-chinese`(New!)](https://huggingface.co/beyond/sega-base-chinese) | xM | Chinese |
 
- ## Data Augmentation for Text Classification Tasks:
 - Setting: Low-resource setting, where only n={50,100,200,500,1000} labeled samples are available for training. The results below are averaged over all training sizes.
 - Datasets: [HuffPost](https://huggingface.co/datasets/khalidalt/HuffPost), [BBC](https://huggingface.co/datasets/SetFit/bbc-news), [SST2](https://huggingface.co/datasets/glue), [IMDB](https://huggingface.co/datasets/imdb), [Yahoo](https://huggingface.co/datasets/yahoo_answers_topics), [20NG](https://huggingface.co/datasets/newsgroup).
 - Base classifier: [DistilBERT](https://huggingface.co/distilbert-base-cased)
 
- | Method | HuffPost | BBC | SST2 | IMDB | Yahoo | 20NG | avg. |
- |---------|:------------------:|:------------------:|:----------------------:|:----------------------:|:----------:|:----------:|:----------:|
- | | ID / OOD (BBC) | ID / OOD (Huff) | ID / OOD (IMDB) | ID / OOD (SST2) | | | |
- | none | 79.17 / 62.32 | **96.16** / 62.00 | 76.67 / 73.16 | 77.87 / 74.43 | 45.77 | 46.67 | 69.42 |
- | EDA | 79.63 / 67.48 | 95.11 / 58.92 | 75.52 / 69.46 | 77.88 / 75.88 | 45.10 | 46.15 | 69.11 |
- | STA | 80.74 / 69.31 | 95.64 / 64.82 | 77.80 / 73.66 | 77.88 / 74.77 | 46.96 | 47.27 | 70.88 |
- | Back | 80.48 / 67.75 | 95.28 / 63.10 | 76.96 / 72.23 | 78.35 / 75.96 | 46.10 | 46.61 | 70.28 |
- | MLM | 80.04 / 66.80 | 96.07 / 65.39 | 76.61 / 73.11 | 75.73 / 73.70 | 45.35 | 46.53 | 69.93 |
- | C-MLM | 79.96 / 65.10 | 96.13 / **67.80** | 76.91 / 71.83 | 77.31 / 75.02 | 45.29 | 46.36 | 70.17 |
- | LAMBADA | 81.03 / 68.89 | 93.75 / 52.79 | 77.87 / 74.54 | 77.49 / 74.33 | 50.66 | 47.72 | 69.91 |
- | **SEGA (Ours)** | 81.43 / 74.87 | 95.61 / 67.79 | 77.87 / 72.94 | **79.51** / **76.75** | 49.43 | 50.47 | 72.67 |
- | **SEGA-f (Ours)** | **81.82** / **76.18** | 95.78 / 67.79 | **80.59** / **80.32** | 79.37 / 76.61 | **50.12** | **50.81** | **73.94** |
 
 - Paper: [coming soon](to_be_added)
 - GitHub: [SEGA](https://github.com/beyondguo/SEGA).
 
+ **SEGA** is able to write complete paragraphs given a *sketch*, which can be composed of:
+ - keywords / key-phrases, like "––NLP––AI––computer––science––"
+ - spans, like "Conference on Empirical Methods––submission of research papers––"
+ - sentences, like "I really like machine learning––I have worked at Google since last year––"
+ - or any mixture of the above~
+
+ **Model variations:**
+
+ | Model | #params | Language | Comment |
+ |------------------------|---------|----------|---------|
+ | [`sega-large`](https://huggingface.co/beyond/sega-large) | 406M | English | the version used in the paper |
+ | [`sega-large-k2t`](https://huggingface.co/beyond/sega-large-k2t) | 406M | English | keywords-to-text |
+ | [`sega-base`](https://huggingface.co/beyond/sega-base) | 139M | English | smaller version |
+ | [`sega-base-ps`](https://huggingface.co/beyond/sega-base) | 139M | English | pre-trained on both paragraphs and short sentences |
+ | [`sega-base-chinese`](https://huggingface.co/beyond/sega-base-chinese) | 116M | Chinese | pre-trained on 10 million clean Chinese paragraphs |
+
+ ---
 
 ### How to use
+ #### 1. If you want to generate sentences given a **sketch**
 ```python
 from transformers import pipeline
 # 1. load the model with the huggingface `pipeline`
@@ -64,34 +78,42 @@ Output:
 'The Conference on Empirical Methods welcomes the submission of research papers. Abstracts should be in the form of a paper or presentation. Please submit abstracts to the following email address: eemml.stanford.edu. The conference will be held at Stanford University on April 1618, 2019. The theme of the conference is Deep Learning.'
 ```
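To make the sketch format concrete, here is a minimal, illustrative helper for assembling a sketch string from keywords or spans before handing it to the pipeline. The `<mask>` separator and the `make_sketch` helper are assumptions for illustration only; check the model card and tokenizer for the exact sketch format SEGA expects.

```python
def make_sketch(parts, sep="<mask>"):
    """Join keywords/spans/sentences into a single sketch string.

    `sep` is a placeholder the model is assumed to fill in -- verify
    the real mask token against the model's tokenizer.
    """
    return f" {sep} ".join(parts)

sketch = make_sketch(["Conference on Empirical Methods",
                      "submission of research papers"])
print(sketch)  # Conference on Empirical Methods <mask> submission of research papers

# To actually generate (downloads the model):
# sega = pipeline("text2text-generation", model="beyond/sega-large")
# print(sega(sketch, max_length=100)[0]["generated_text"])
```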
 
+ #### 2. If you want to do **data augmentation** to generate new training samples
+ Please check our GitHub page: [github.com/beyondguo/SEGA](https://github.com/beyondguo/SEGA), where we provide ready-to-run data augmentation scripts for text classification/NER/MRC tasks.
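Conceptually, the augmentation loop in those scripts extracts keywords/spans from each labeled sample, turns them into a sketch, and lets SEGA write a new sample that keeps the same label. A minimal sketch of that loop, where the naive `extract_keywords` heuristic and the `<mask>` separator are illustrative assumptions (the real scripts live in the GitHub repo above):

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "in", "on", "for"}

def extract_keywords(text, k=4):
    # Naive stand-in for a real keyword extractor: keep the first k
    # non-stopword tokens, preserving their original order.
    words = [w for w in re.findall(r"[A-Za-z']+", text) if w.lower() not in STOPWORDS]
    return words[:k]

def to_sketch(text, sep="<mask>"):
    # Turn a training sample into a sketch the generator can expand.
    return f" {sep} ".join(extract_keywords(text))

sample = ("The submission of research papers to the conference "
          "is open to everyone in machine learning.")
print(to_sketch(sample))  # submission <mask> research <mask> papers <mask> conference

# The generated text would inherit the original sample's label:
# new_text = sega(to_sketch(sample), max_length=100)[0]["generated_text"]
# augmented.append((new_text, label_of_sample))
```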
 
 
 
 
 
 
 
 
+ ---
 
+ ## SEGA as a Strong Data Augmentation Tool
 - Setting: Low-resource setting, where only n={50,100,200,500,1000} labeled samples are available for training. The results below are averaged over all training sizes.
 - Datasets: [HuffPost](https://huggingface.co/datasets/khalidalt/HuffPost), [BBC](https://huggingface.co/datasets/SetFit/bbc-news), [SST2](https://huggingface.co/datasets/glue), [IMDB](https://huggingface.co/datasets/imdb), [Yahoo](https://huggingface.co/datasets/yahoo_answers_topics), [20NG](https://huggingface.co/datasets/newsgroup).
 - Base classifier: [DistilBERT](https://huggingface.co/distilbert-base-cased)
 
+ In-distribution (ID) evaluations:
+
+ | Method | Huff | BBC | Yahoo | 20NG | IMDB | SST2 | avg. |
+ |:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
+ | none | 79.17 | **96.16** | 45.77 | 46.67 | 77.87 | 76.67 | 70.39 |
+ | EDA | 79.20 | 95.11 | 45.10 | 46.15 | 77.88 | 75.52 | 69.83 |
+ | BackT | 80.48 | 95.28 | 46.10 | 46.61 | 78.35 | 76.96 | 70.63 |
+ | MLM | 80.04 | 96.07 | 45.35 | 46.53 | 75.73 | 76.61 | 70.06 |
+ | C-MLM | 80.60 | 96.13 | 45.40 | 46.36 | 77.31 | 76.91 | 70.45 |
+ | LAMBADA | 81.46 | 93.74 | 50.49 | 47.72 | 78.22 | 78.31 | 71.66 |
+ | STA | 80.74 | 95.64 | 46.96 | 47.27 | 77.88 | 77.80 | 71.05 |
+ | **SEGA** | 81.43 | 95.74 | 49.60 | 50.38 | **80.16** | 78.82 | 72.68 |
+ | **SEGA-f** | **81.82** | 95.99 | **50.42** | **50.81** | 79.40 | **80.57** | **73.17** |
+
+ Out-of-distribution (OOD) evaluations:
+
+ | Method | Huff->BBC | BBC->Huff | IMDB->SST2 | SST2->IMDB | avg. |
+ |------------|:----------:|:----------:|:----------:|:----------:|:----------:|
+ | none | 62.32 | 62.00 | 74.37 | 73.11 | 67.95 |
+ | EDA | 67.48 | 58.92 | 75.83 | 69.42 | 67.91 |
+ | BackT | 67.75 | 63.10 | 75.91 | 72.19 | 69.74 |
+ | MLM | 66.80 | 65.39 | 73.66 | 73.06 | 69.73 |
+ | C-MLM | 64.94 | **67.80** | 74.98 | 71.78 | 69.87 |
+ | LAMBADA | 68.57 | 52.79 | 75.24 | 76.04 | 68.16 |
+ | STA | 69.31 | 64.82 | 74.72 | 73.62 | 70.61 |
+ | **SEGA** | 74.87 | 66.85 | 76.02 | 74.76 | 73.13 |
+ | **SEGA-f** | **76.18** | 66.89 | **77.45** | **80.36** | **75.22** |