| | --- |
| | language: |
| | - en |
| | - zh |
| | tags: |
| | - GENIUS |
| |
|
| | license: apache-2.0 |
| | datasets: |
| | - c4 |
| | - beyond/chinese_clean_passages_80m |
| |
|
| |
|
| | widget: |
| | - text: "[MASK]酸菜鱼火锅[MASK]很美味,味道绝了[MASK]周末真开心[MASK]" |
| |   example_title: "Sketch 1" |
| | - text: "自然语言处理[MASK]谷歌公司[MASK]通用人工智能[MASK]" |
| |   example_title: "Sketch 2" |
| | - text: "[MASK]疫情[MASK]公园[MASK]散步[MASK]" |
| |   example_title: "Sketch 3" |
| |
|
| | inference: |
| |   parameters: |
| |     max_length: 1000 |
| |     num_beams: 3 |
| |     do_sample: True |
| | --- |
| | |
| | # GENIUS: generating text using sketches! |
| |
|
| |
|
| | - **Paper: [GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation](https://arxiv.org/abs/2211.10330)** |
| | - **GitHub: [GENIUS, Pre-training/Data Augmentation Tutorial](https://github.com/beyondguo/genius)** |
| |
|
| |
|
| |
|
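| | **GENIUS Chinese version** expands a **sketch** you provide into full text by filling in the gaps between its fragments. A sketch can be: |
| | - a combination of keywords, e.g. “今天[MASK]篮球[MASK]学校[MASK]” |
| | - a combination of phrases, e.g. “自然语言处理[MASK]谷歌[MASK]通用人工智能[MASK]” |
| | - a combination of short sentences, e.g. “我昨天做了一个梦[MASK]又遇见了她[MASK]曾经那段时光让人怀恋[MASK]” |
| | - a mixture of the above |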
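
The sketch formats listed above all follow the same pattern: text fragments separated by `[MASK]`, optionally with a leading `[MASK]`. As a minimal sketch of how such an input could be assembled, the helper below joins fragments into that shape. The name `build_sketch` and its `lead_mask` parameter are hypothetical and not part of the GENIUS code base.

```python
MASK = "[MASK]"  # mask token as it appears in the example sketches


def build_sketch(fragments, lead_mask=False):
    """Join text fragments into a sketch, placing [MASK] after every fragment.

    Hypothetical helper for illustration only; it is not a GENIUS API.
    """
    # Drop empty fragments and surrounding whitespace.
    parts = [frag.strip() for frag in fragments if frag and frag.strip()]
    if not parts:
        raise ValueError("a sketch needs at least one non-empty fragment")
    sketch = "".join(part + MASK for part in parts)
    # Optionally let the model also generate text before the first fragment.
    return MASK + sketch if lead_mask else sketch


if __name__ == "__main__":
    print(build_sketch(["今天", "篮球", "学校"]))
    print(build_sketch(["疫情", "公园", "散步"], lead_mask=True))
```

Keywords, phrases, and short sentences can be mixed freely in `fragments`, since the model only sees the joined string.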
| |
|
| | ### How to use |
| | ```python |
| | # genius-chinese |
| | from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline |
| | |
| | checkpoint = 'leelearn/genius-yiyan-prompt-generator' |
| | tokenizer = BertTokenizer.from_pretrained(checkpoint) |
| | genius_model = BartForConditionalGeneration.from_pretrained(checkpoint) |
| | # device=0 runs on the first GPU; use device=-1 to run on CPU |
| | genius_generator = Text2TextGenerationPipeline(genius_model, tokenizer, device=0) |
| | |
| | sketches = [ |
| |     "今天[MASK]篮球[MASK]学校[MASK]", |
| |     "自然语言处理[MASK]谷歌[MASK]通用人工智能[MASK]", |
| |     "我昨天做了一个梦[MASK]又遇见了她[MASK]曾经那段时光让人怀恋[MASK]", |
| |     "[MASK]疫情[MASK]公园[MASK]散步[MASK]", |
| |     "[MASK]酸菜鱼火锅[MASK]很美味,味道绝了[MASK]周末真开心[MASK]", |
| | ] |
| | for sketch in sketches: |
| |     print('input sketch:\n>>> ', sketch) |
| |     output = genius_generator(sketch, max_length=100, do_sample=True, num_beams=3)[0]['generated_text'] |
| |     # The decoded text has spaces between characters, so strip them |
| |     print('genius-chinese output:\n>>> ', output.replace(' ', ''), '\n') |
| | ``` |
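
The example above removes every space with `.replace(' ', '')`, which would also glue together any Latin-script words in the output. As a hedged alternative, the sketch below removes whitespace only where it touches a CJK character or full-width punctuation mark. The helper name `strip_cjk_spaces` and the chosen Unicode ranges are assumptions for illustration, not part of GENIUS or `transformers`.

```python
import re

# CJK Unified Ideographs, CJK punctuation, and full-width forms.
# Assumption: extend these ranges if your outputs use other scripts or symbols.
_CJK = r"\u4e00-\u9fff\u3000-\u303f\uff00-\uffef"

# Whitespace that directly follows or directly precedes a CJK character.
_SPACE_NEXT_TO_CJK = re.compile(rf"(?<=[{_CJK}])\s+|\s+(?=[{_CJK}])")


def strip_cjk_spaces(text):
    """Remove whitespace adjacent to CJK characters, keeping spaces between Latin words."""
    return _SPACE_NEXT_TO_CJK.sub("", text).strip()


if __name__ == "__main__":
    print(strip_cjk_spaces("今 天 我 们 去 打 篮 球 。"))
    print(strip_cjk_spaces("我 喜 欢 deep learning 模 型"))
```

For purely Chinese output the result is the same as `.replace(' ', '')`; the difference only shows up when the generated text mixes scripts.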
| |
|
| | ## Model variations |
| |
|
| | | Model | #params | Language | Comment | |
| | |-------|---------|----------|---------| |
| | | [`genius-large`](https://huggingface.co/beyond/genius-large) | 406M | English | The version used in the paper | |
| | | [`genius-large-k2t`](https://huggingface.co/beyond/genius-large-k2t) | 406M | English | keywords-to-text | |
| | | [`genius-base`](https://huggingface.co/beyond/genius-base) | 139M | English | smaller version | |
| | | [`genius-base-ps`](https://huggingface.co/beyond/genius-base) | 139M | English | pre-trained on both paragraphs and short sentences | |
| | | [`genius-base-chinese`](https://huggingface.co/beyond/genius-base-chinese) | 116M | Chinese | pre-trained on 10 million clean Chinese paragraphs | |
| |
|
| |
|
| | ## Comparison |
| | The following compares [BART-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) with our proposed [GENIUS-base-chinese](https://huggingface.co/beyond/genius-base-chinese) on filling in and expanding sketches: |
| |
|
| | ``` |
| | input sketch: |
| | >>> 今天[MASK]篮球[MASK]上海财经大学[MASK] |
| | BART-chinese output: |
| | >>> 今天的篮球是上海财经大学篮球 |
| | GENIUS-chinese output: |
| | >>> 今天,我们邀请到了中国篮球联盟主席、上海财经大学校长孙建国先生作为主题发言。 |
| | |
| | input sketch: |
| | >>> 自然语言处理[MASK]谷歌[MASK]通用人工智能[MASK] |
| | BART-chinese output: |
| | >>> 自然语言处理是谷歌的通用人工智能技术 |
| | GENIUS-chinese output: |
| | >>> 自然语言处理是谷歌在通用人工智能领域的一个重要研究方向,其目的是为了促进人类智能的发展。 |
| | |
| | input sketch: |
| | >>> 我昨天做了一个梦[MASK]又遇见了她[MASK]曾经那段时光让人怀恋[MASK] |
| | BART-chinese output: |
| | >>> 我昨天做了一个梦今天又遇见了她我曾经那段时光让人怀恋不已 |
| | GENIUS-chinese output: |
| | >>> 我昨天做了一个梦,梦见了我的妈妈,又遇见了她,我知道她曾经那段时光让人怀恋,但是现在,我不知道该怎么回事了,我只是想告诉她,不要再回去了。 |
| | |
| | input sketch: |
| | >>> [MASK]疫情[MASK]公园[MASK]漫步[MASK] |
| | BART-chinese output: |
| | >>> 在疫情防控公园内漫步徜徉 |
| | GENIUS-chinese output: |
| | >>> 为了防止疫情扩散,公园内还设置了漫步区。 |
| | |
| | input sketch: |
| | >>> [MASK]酸菜鱼火锅[MASK]很美味,味道绝了[MASK]周末真开心[MASK] |
| | BART-chinese output: |
| | >>> 这酸菜鱼火锅真的很美味,味道绝了这周末真开心啊 |
| | GENIUS-chinese output: |
| | >>> 这个酸菜鱼火锅真的很美味,味道绝了,吃的时间也长了,周末真开心,吃完以后就回家了,很满意的一次,很喜欢的一个品牌。 |
| | ``` |
| |
|
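| | As these examples show, BART fills in only a few simple words and does not connect the fragments coherently, whereas GENIUS expands them into coherent sentences or even paragraphs. |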
| |
|
| | --- |
| |
|
| | If you find our paper/code/demo useful, please cite our paper: |
| | ``` |
| | @article{guo2022genius, |
| | title={GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation}, |
| | author={Guo, Biyang and Gong, Yeyun and Shen, Yelong and Han, Songqiao and Huang, Hailiang and Duan, Nan and Chen, Weizhu}, |
| | journal={arXiv preprint arXiv:2211.10330}, |
| | year={2022} |
| | } |
| | ``` |