---
license: apache-2.0
---
|
|
A comprehensive platform for Chinese text error correction that integrates academic research, model training, model evaluation, and inference deployment, covering the two core tasks of spelling correction and grammatical error correction.
|
|
|
|
|
🔥 Project repository: https://github.com/TW-NLP/ChineseErrorCorrector
|
|
## Model List
|
|
|
|
|
| Model Name | Correction Type | Description |
|:--------------------------------------------------------------------------------------------|:------------------|:-------------------------------------------|
| [twnlp/ChineseErrorCorrector3-4B](https://huggingface.co/twnlp/ChineseErrorCorrector3-4B) | Grammar + spelling | Fully fine-tuned on 2M correction samples; handles both grammatical and spelling errors. Best performance, recommended. |
| [twnlp/ChineseErrorCorrector2-7B](https://huggingface.co/twnlp/ChineseErrorCorrector2-7B) | Grammar + spelling | Trained with multiple rounds of iteration on 2M correction samples; handles both grammatical and spelling errors. Strong performance. |
|
|
## Model Evaluation (NaCGEC Data)
|
|
| Model Name | Model Link | Base Model | Avg | SIGHAN-2015 | EC-LAW | MCSC | Device | QPS |
|:------------------|:------------------------------------------------------------------------------------------------------------------------|:-------------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|
| Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | 0.3409 | 0.3147 | 0.3763 | 0.3317 | CPU | 9 |
| Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | 0.3984 | 0.7758 | 0.3156 | 0.1039 | GPU | 214 |
| ERNIE-CSC | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353 | 0.8383 | 0.3357 | 0.1318 | GPU | 114 |
| MacBERT-CSC | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | 0.3993 | 0.8314 | 0.1610 | 0.2055 | GPU | **224** |
| ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | 0.4538 | 0.6572 | 0.4369 | 0.2672 | GPU | 3 |
| Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) | Qwen/Qwen2.5-1.5B-Instruct | 0.6802 | 0.3032 | 0.7846 | 0.9529 | GPU | 6 |
| Qwen2.5-7B-CTC | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) | Qwen/Qwen2.5-7B-Instruct | 0.8225 | 0.4917 | 0.9798 | 0.9959 | GPU | 3 |
| **Qwen3-4B-CTC(Our)** | [twnlp/ChineseErrorCorrector3-4B](https://huggingface.co/twnlp/ChineseErrorCorrector3-4B) | Qwen/Qwen3-4B | **0.8521** | 0.6340 | 0.9360 | 0.9864 | GPU | 5 |
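The Avg column appears to be the unweighted arithmetic mean of the three benchmark scores (SIGHAN-2015, EC-LAW, MCSC); this is an assumption inferred from the numbers, not stated by the source. A quick check against the last two rows:

```python
# Assumption: Avg = mean of the three benchmark scores, rounded to 4 decimals.
def avg(scores):
    return round(sum(scores) / len(scores), 4)

print(avg([0.6340, 0.9360, 0.9864]))  # Qwen3-4B-CTC row -> 0.8521
print(avg([0.4917, 0.9798, 0.9959]))  # Qwen2.5-7B-CTC row -> 0.8225
```

Both values match the Avg column above.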
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Without the [ChineseErrorCorrector](https://github.com/TW-NLP/ChineseErrorCorrector) toolkit, you can use the model directly with `transformers` like this:
|
|
|
|
|
First, format the input with the chat template and pass it through the model; then decode the corrected sentence from the newly generated tokens.
|
|
|
|
|
Install the required packages (`accelerate` is needed for `device_map="auto"`):

```
pip install transformers accelerate
```
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "twnlp/ChineseErrorCorrector3-4B"

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build the correction prompt: instruction prefix + the sentence to correct
prompt = "你是一个文本纠错专家,纠正输入句子中的语法错误,并输出正确的句子,输入句子为:"
text_input = "对待每一项工作都要一丝不够。"
messages = [
    {"role": "user", "content": prompt + text_input}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the corrected sentence
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)
# Strip the prompt tokens, keeping only the newly generated ones
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
|
|
|
|
|
output:

```shell
对待每一项工作都要一丝不苟。
```
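To inspect exactly what the model changed, you can compute a character-level diff between the input and the corrected output. A minimal sketch using Python's standard `difflib`, applied to the example sentences above:

```python
import difflib

def char_edits(src: str, tgt: str):
    """Return (op, source_span, target_span) for each non-matching region."""
    sm = difflib.SequenceMatcher(a=src, b=tgt)
    return [(op, src[i1:i2], tgt[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != "equal"]

print(char_edits("对待每一项工作都要一丝不够。", "对待每一项工作都要一丝不苟。"))
# → [('replace', '够', '苟')]
```

This makes it easy to highlight or log the model's edits, e.g. for spot-checking corrections in a batch pipeline.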