	---
	license: apache-2.0
	---

	# chinese_text_correction

	A Chinese text correction dataset containing spelling and grammar correction data, suitable for training Chinese proofreading models.

	**Repository:** [zejunwang1/CTCDataset](https://github.com/zejunwang1/CTCDataset)

	## Data Distribution

	| Source | Type | Samples |
	| --------- | ------- | ------- |
	| CCTC | grammar | 4470 |
	| cscd-ns | spell | 40000 |
	| CTC2021 | grammar | 969 |
	| ECSpell | spell | 8180 |
	| lemon | spell | 22252 |
	| MCSCSet | spell | 39302 |
	| midu2022 | grammar | 2014 |
	| NLPCC2023 | spell | 1000 |
	| Total | - | 118187 |
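The per-type totals are not listed in the table, but they follow directly from its rows. A minimal sketch that tallies them, assuming the counts are copied verbatim from the table (the names `DISTRIBUTION` and `samples_per_type` are illustrative and not part of the dataset):

```python
from collections import Counter

# (source, error type, sample count) rows copied from the distribution table.
DISTRIBUTION = [
    ("CCTC", "grammar", 4470),
    ("cscd-ns", "spell", 40000),
    ("CTC2021", "grammar", 969),
    ("ECSpell", "spell", 8180),
    ("lemon", "spell", 22252),
    ("MCSCSet", "spell", 39302),
    ("midu2022", "grammar", 2014),
    ("NLPCC2023", "spell", 1000),
]


def samples_per_type(rows):
    """Sum the sample counts for each error type."""
    totals = Counter()
    for _source, error_type, count in rows:
        totals[error_type] += count
    return dict(totals)


totals = samples_per_type(DISTRIBUTION)
print(totals)                # {'grammar': 7453, 'spell': 110734}
print(sum(totals.values()))  # 118187
```

Spelling data makes up roughly 94% of the samples, which may be worth accounting for when sampling batches for a model that should also handle grammar errors.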

	## Data Fields

	| Field | Type | Description |
	| ------ | ------ | ----------------------------- |
	| source | string | Source sentence that may contain spelling or grammatical errors |
	| target | string | Corrected target sentence |
	| label | int | Whether the source sentence contains errors: 1 if it does; otherwise it contains none. |

	```json
	{
	  "source": "完善农产品上行发展机智。",
	  "target": "完善农产品上行发展机制。",
	  "label": 1
	}
	```
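To see exactly what a record corrects, the `source` and `target` strings can be aligned character by character. The sketch below uses the standard library's `difflib.SequenceMatcher`; the helper name `list_edits` is hypothetical and not part of the dataset or its repository. This is one possible alignment approach, not necessarily the one used to build the data.

```python
from difflib import SequenceMatcher


def list_edits(source, target):
    """Return (operation, source_index, source_text, target_text) for each differing span."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, i1, source[i1:i2], target[j1:j2]))
    return edits


# The example record from the Data Fields section.
record = {
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1,
}

edits = list_edits(record["source"], record["target"])
print(edits)  # [('replace', 10, '智', '制')]
```

Because the alignment also reports `insert` and `delete` operations, the same helper works for grammar corrections where the two sentences differ in length.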

	## How to use it

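	```python
	from datasets import load_dataset

	# Download the dataset from the Hugging Face Hub (requires network access).
	data = load_dataset('WangZeJun/chinese_text_correction')
	print(data)
	# Expected output:
	# DatasetDict({
	#     train: Dataset({
	#         features: ['source', 'target', 'label'],
	#         num_rows: 118187
	#     })
	# })
	```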
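Once loaded, the `label` field can be used to separate sentences that need correction from those that do not. The following is a minimal sketch in plain Python: `split_by_label` is a hypothetical helper, and the second sample record (with `label` 0) is invented for illustration, since the dataset card only states that 1 marks an erroneous source.

```python
def split_by_label(rows):
    """Partition records into (with_errors, without_errors) using the label field."""
    with_errors, without_errors = [], []
    for row in rows:
        # label == 1 means the source sentence contains an error.
        (with_errors if row["label"] == 1 else without_errors).append(row)
    return with_errors, without_errors


# Small in-memory stand-in for the real data; the second record is made up.
sample_rows = [
    {"source": "完善农产品上行发展机智。", "target": "完善农产品上行发展机制。", "label": 1},
    {"source": "今天天气很好。", "target": "今天天气很好。", "label": 0},
]

with_errors, without_errors = split_by_label(sample_rows)
print(len(with_errors), len(without_errors))  # 1 1
```

With the real dataset, `data["train"]` can be passed in place of `sample_rows`, since iterating over a `datasets.Dataset` yields one dict per row.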

	## License/Terms of Use

	### License

	[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)

	### Data Developer

	[Zejun Wang](https://github.com/zejunwang1)

	### Use Case

	This dataset can be used to train Chinese text correction models.

	### Release Date

	04/17/2025

	## Data Version

	1.0 (04/17/2025)