| --- |
| language: |
| - zh |
| license: apache-2.0 |
| tags: |
| - t5 |
| - text error correction |
| widget: |
| - text: "今天天气不太好,我的心情也不是很偷快" |
|   example_title: "Example 1" |
| - text: "能不能帮我买点淇淋,好久没吃了。" |
|   example_title: "Example 2" |
| - text: "脑子有点胡涂了,这道题冥冥学过还没有做出来" |
|   example_title: "Example 3" |
| inference: |
|   parameters: |
|     max_length: 256 |
|     num_beams: 10 |
|     no_repeat_ngram_size: 5 |
|     do_sample: True |
|     early_stopping: True |
| --- |
| |
| ## Overview |
|
|
| T5Corrector: a Chinese correction model for phonetic and character-shape errors |
|
|
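| This model is fine-tuned from mengzi-t5-base for text error correction. Starting from more than 20 million sentences, a parallel correction corpus was built by replacing words with homophones, near-homophones, and visually similar characters, by randomly inserting phrases, by deleting some characters from phrases, and by shuffling characters and words. The resulting corpus contains more than 200 million sentence pairs, and the model was trained for 66,000 steps. |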
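The corpus construction described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual pipeline: the confusion tables `HOMOPHONES` and `SIMILAR_SHAPES`, the helper names, and the choice to apply exactly one character-level operation per sentence are all assumptions made for the example. The original work operates on words and phrases and uses far larger dictionaries.

```python
import random

# Hypothetical, tiny confusion tables for illustration only.
HOMOPHONES = {"明": ["冥", "名"], "糊": ["胡", "湖"], "茅": ["毛"]}
SIMILAR_SHAPES = {"愉": ["偷"], "末": ["未"]}


def substitute(chars, rng, table):
    """Replace one character that has an entry in the confusion table."""
    candidates = [i for i, c in enumerate(chars) if c in table]
    if candidates:
        i = rng.choice(candidates)
        chars[i] = rng.choice(table[chars[i]])
    return chars


def delete_char(chars, rng):
    """Drop one random character (simulates a missing character)."""
    if len(chars) > 1:
        del chars[rng.randrange(len(chars))]
    return chars


def insert_char(chars, rng):
    """Duplicate one random character in place (simulates a redundant character)."""
    if chars:
        i = rng.randrange(len(chars))
        chars.insert(i, chars[i])
    return chars


def swap_adjacent(chars, rng):
    """Swap two neighbouring characters (simulates a word-order error)."""
    if len(chars) > 1:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return chars


def make_pair(sentence, seed=None):
    """Return (noisy, clean): noisy is the model input, clean is the target."""
    rng = random.Random(seed)
    chars = list(sentence)
    operation = rng.choice(["homophone", "shape", "delete", "insert", "swap"])
    if operation == "homophone":
        chars = substitute(chars, rng, HOMOPHONES)
    elif operation == "shape":
        chars = substitute(chars, rng, SIMILAR_SHAPES)
    elif operation == "delete":
        chars = delete_char(chars, rng)
    elif operation == "insert":
        chars = insert_char(chars, rng)
    else:
        chars = swap_adjacent(chars, rng)
    return "".join(chars), sentence


noisy, clean = make_pair("今天心情很愉快", seed=0)
print(noisy, "->", clean)
```

Because the clean sentence is always kept as the target, each source sentence can yield several distinct training pairs by varying the seed, which is one plausible way a corpus of 20 million sentences grows into 200 million pairs.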
|
|
| <a href='https://github.com/Macielyoung/T5Corrector'>GitHub repository</a> |
|
|
|
|
|
|
| Load the model: |
|
|
| ```python |
| # Load the tokenizer and model |
| from transformers import T5Tokenizer, T5ForConditionalGeneration |
| pretrained = "Maciel/T5Corrector-base-v2" |
| tokenizer = T5Tokenizer.from_pretrained(pretrained) |
| model = T5ForConditionalGeneration.from_pretrained(pretrained) |
| ``` |
|
|
| Run inference with the model: |
| ```python |
| import torch |
| device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
| model.to(device) |
| |
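| def correct(text, max_length): |
|     """Return the corrected version of `text`, truncated to `max_length` tokens.""" |
|     model_inputs = tokenizer(text, |
|                              max_length=max_length, |
|                              truncation=True, |
|                              return_tensors="pt").to(device) |
|     # do_sample=True combined with num_beams > 1 samples within beam search, |
|     # so the output can vary between runs. |
|     output = model.generate(**model_inputs, |
|                             num_beams=5, |
|                             no_repeat_ngram_size=4, |
|                             do_sample=True, |
|                             early_stopping=True, |
|                             max_length=max_length, |
|                             return_dict_in_generate=True, |
|                             output_scores=True) |
|     # return_dict_in_generate=True exposes the generated ids as output.sequences. |
|     pred_output = tokenizer.batch_decode(output.sequences, skip_special_tokens=True)[0] |
|     return pred_output |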
| |
| text = "贵州毛台现在多少钱一瓶啊,想买两瓶尝尝味道。" |
| correction = correct(text, max_length=32) |
| print(correction) |
| ``` |
|
|
|
|
|
|
| ### Examples |
|
|
| ``` |
| Example 1: |
| input: 能不能帮我买点淇淋,好久没吃了。 |
| output: 能不能帮我买点冰淇淋,好久没吃了。 |
| |
| Example 2: |
| input: 脑子有点胡涂了,这道题冥冥学过还没有做出来 |
| output: 脑子有点糊涂了,这道题明明学过还没有做出来 |
| |
| Example 3: |
| input: 今天天气不太好,我的心情也不是很偷快 |
| output: 今天天气不太好,我的心情也不是很愉快 |
| ``` |
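To see exactly which characters changed in input/output pairs like the ones above, the two strings can be compared with Python's standard `difflib`. The helper name `list_corrections` is hypothetical and not part of the T5Corrector project; this is only a small post-processing sketch that needs neither the model nor a GPU.

```python
from difflib import SequenceMatcher


def list_corrections(source, corrected):
    """Return (operation, original_span, corrected_span) for every changed span."""
    matcher = SequenceMatcher(None, source, corrected, autojunk=False)
    return [
        (tag, source[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Compare a model input with its corrected output.
print(list_corrections("今天天气不太好,我的心情也不是很偷快",
                       "今天天气不太好,我的心情也不是很愉快"))
```

Each tuple names the operation (`replace`, `insert`, or `delete`) along with the original and corrected spans, which is convenient for highlighting corrections in a user interface or for counting error types.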