---

license: apache-2.0
---


# chinese_text_correction

A Chinese text correction dataset containing spelling and grammar correction data, suitable for training Chinese proofreading models.

**Repository:** [zejunwang1/CTCDataset](https://github.com/zejunwang1/CTCDataset)

## Data Distribution

| Source    | Type    | Samples |
| --------- | ------- | ------- |
| CCTC      | grammar | 4470   |
| cscd-ns   | spell   | 40000  |
| CTC2021   | grammar | 969    |
| ECSpell   | spell   | 8180   |
| lemon     | spell   | 22252  |
| MCSCSet   | spell   | 39302  |
| midu2022  | grammar | 2014   |
| NLPCC2023 | spell   | 1000   |
| Total     | —       | 118187 |

## Data Fields

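| Field  | Type   | Description                                                                    |
| ------ | ------ | ------------------------------------------------------------------------------ |
| source | string | Source sentence, which may contain spelling or grammatical errors              |
| target | string | Corrected target sentence                                                      |
| label  | int    | Whether the source sentence contains errors: 1 if it does, otherwise it does not |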

```json
{
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1
}
```
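
To see how the three fields relate, the following minimal sketch diffs `source` against `target` for the record above using the standard-library `difflib`. The helper name `find_edits` is hypothetical, and the check that `label` is 1 exactly when the two sentences differ is an assumption about the data rather than something the dataset card guarantees.

```python
from difflib import SequenceMatcher


def find_edits(source, target):
    """Return (operation, source_span, target_span) for each differing span.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (tag, source[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The example record shown above.
record = {
    "source": "完善农产品上行发展机智。",
    "target": "完善农产品上行发展机制。",
    "label": 1,
}

edits = find_edits(record["source"], record["target"])
print(edits)

# Assumption: label == 1 exactly when source and target differ.
label_matches_diff = bool(edits) == bool(record["label"])
print(label_matches_diff)
```

For this record the only edit is the replacement of `智` with `制`, a single-character spelling correction.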

## How to use it

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
data = load_dataset('WangZeJun/chinese_text_correction')
print(data)
# Output:
# DatasetDict({
#     train: Dataset({
#         features: ['source', 'target', 'label'],
#         num_rows: 118187
#     })
# })
```
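
The dataset ships with a single `train` split, so a held-out validation set has to be carved out by the user. The sketch below shows one way to do that on plain Python records. The helper `split_records`, the 20% ratio, the seed, and the toy rows (the example record plus an error-free copy of it) are all illustrative assumptions, not part of the dataset.

```python
import random


def split_records(records, val_ratio=0.1, seed=42):
    """Shuffle records deterministically and hold out a validation slice.

    Hypothetical helper for illustration only.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_ratio))
    return shuffled[n_val:], shuffled[:n_val]


# Toy stand-in for rows of the dataset; with the real data one could
# iterate over data["train"] instead.
toy_rows = [
    {"source": "完善农产品上行发展机智。", "target": "完善农产品上行发展机制。", "label": 1},
    {"source": "完善农产品上行发展机制。", "target": "完善农产品上行发展机制。", "label": 0},
] * 5

train_rows, val_rows = split_records(toy_rows, val_ratio=0.2)

# Count how many rows are flagged as containing an error.
num_with_errors = sum(row["label"] for row in train_rows + val_rows)
print(len(train_rows), len(val_rows), num_with_errors)
```

When working with the loaded `Dataset` object directly, the `datasets` library's `train_test_split` method offers a similar split without converting to Python lists.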

## License/Terms of Use

### License

[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)

### Data Developer

[Zejun Wang](https://github.com/zejunwang1)

### Use Case

This dataset can be used to train Chinese text correction models.

### Release Date

04/17/2025

## Data Version

1.0 (04/17/2025)