Update README.md
# LM-Combiner
All code and models are released at [wyxstriker/LM-Combiner](https://github.com/wyxstriker/LM-Combiner). Thank you for your patience!

# Requirements

The model code is implemented with the Hugging Face framework and requires the following environment:
- Python
- torch
- transformers
- datasets
- tqdm

For evaluation, follow the environment configuration of [ChERRANT](https://github.com/HillZhang1999/MuCGEC/tree/main/scorers/ChERRANT).
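
Before training, it can help to confirm that the packages listed above are importable in the active environment. The following is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and since the README pins no versions, the script only reports what is installed.

```python
from importlib import metadata

# Distribution names taken from the requirements list above.
REQUIRED = ["torch", "transformers", "datasets", "tqdm"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed before running the scripts below.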

# Training Stage
## Preprocessing
### Baseline Model
- First, we train a baseline model (Chinese-BART-large) for LM-Combiner on the FCGEC dataset in the Seq2Seq format.
```bash
sh ./script/run_bart_baseline.sh
```
### Candidate Datasets
1. Candidate Sentence Generation
- We use the baseline model to generate candidate sentences for the training and test sets.
- On tasks that the model fits more easily (spelling correction, etc.), we recommend using the K-fold cross-inference from the paper to generate the candidate sentences separately.
```bash
python ./src/predict_bl_tsv.py
```
2. Golden Labels Merging
- We use the ChERRANT tool to merge the gold labels into the candidates, which fully decouples the error correction task from the rewriting task.
```bash
python ./scorer_wapper/golden_label_merging.py
```
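
The K-fold cross-inference recommended in step 1 can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: `k_fold_cross_inference`, `train_fn`, and `toy_train_fn` are hypothetical names, and the interleaved fold assignment is only one possible way to split the data. The point is that every training sentence gets its candidate from a model that never saw that sentence during training, so the candidates retain realistic errors.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, gold sentence)


def k_fold_cross_inference(
    pairs: Sequence[Pair],
    k: int,
    train_fn: Callable[[List[Pair]], Callable[[str], str]],
) -> List[str]:
    """Return one candidate per pair, produced by a model not trained on that pair."""
    if not 2 <= k <= len(pairs):
        raise ValueError("k must be between 2 and len(pairs)")
    candidates = [""] * len(pairs)
    for fold in range(k):
        # Held-out indices for this fold: every k-th example starting at `fold`.
        held_out = set(range(fold, len(pairs), k))
        train_split = [p for i, p in enumerate(pairs) if i not in held_out]
        # Hypothetical hook: trains on the split and returns an inference function.
        predict = train_fn(train_split)
        for i in held_out:
            candidates[i] = predict(pairs[i][0])
    return candidates


def toy_train_fn(train_split: List[Pair]) -> Callable[[str], str]:
    """Toy stand-in for a baseline: memorizes seen pairs, marks unseen sources with '?'."""
    memory = dict(train_split)
    return lambda src: memory.get(src, src + "?")


data = [(f"src{i}", f"gold{i}") for i in range(6)]
result = k_fold_cross_inference(data, k=3, train_fn=toy_train_fn)
print(result)
```

With the toy model, every output carries the `?` marker, which shows that no sentence was predicted by a model that had memorized it. In practice `train_fn` would wrap the baseline training script and return its inference routine.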
## LM-Combiner (GPT-2)
- We then train LM-Combiner on the constructed candidate dataset.
- In particular, we extend the GPT-2 vocabulary (mainly with **double quotes**) to better fit the FCGEC dataset; see `./pt_model/gpt2-base/vocab.txt` for details.
```bash
sh ./script/run_lm_combiner.sh
```
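
To see why a vocabulary supplement such as the double quotes matters, one can count which characters of a corpus have no entry in the vocabulary file. The sketch below assumes a `vocab.txt` with one token per line and character-level tokens for Chinese text; the helper names `load_vocab` and `missing_characters` are hypothetical and do not come from the repository.

```python
from collections import Counter
from typing import Iterable, Set


def load_vocab(lines: Iterable[str]) -> Set[str]:
    """Read a vocab.txt-style listing with one token per line."""
    return {line.rstrip("\n") for line in lines if line.strip()}


def missing_characters(sentences: Iterable[str], vocab: Set[str]) -> Counter:
    """Count non-space characters that have no single-character token in the vocab."""
    counts = Counter()
    for sentence in sentences:
        for ch in sentence:
            if not ch.isspace() and ch not in vocab:
                counts[ch] += 1
    return counts


# Toy vocabulary without the full-width double quotes.
vocab = load_vocab(["[UNK]", "他", "说", "好", "。"])
missing = missing_characters(["他说“好”。"], vocab)
print(missing)
```

To run the same check against the real file, pass an open file handle to `load_vocab`, for example `open("./pt_model/gpt2-base/vocab.txt", encoding="utf-8")`, and feed in the FCGEC sentences.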

# Evaluation
- We use the official ChERRANT script to evaluate the model on FCGEC-dev.
```bash
sh ./script/compute_score.sh
```
| Method | Prec | Rec | F0.5 |
|-|-|-|-|
| bart_baseline | 28.88 | **38.95** | 40.46 |
| +lm_combiner | **52.15** | 37.41 | **48.34** |
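
The F0.5 column weights precision more heavily than recall. As a minimal sketch of the standard F-beta formula (the function name `f_beta` is illustrative and not taken from the ChERRANT scorer), the score can be recomputed from the precision and recall columns:

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall from the +lm_combiner row of the table above.
score = round(f_beta(52.15, 37.41), 2)
print(score)  # prints 48.34
```

Applying the same function to any row is a quick consistency check on reported numbers.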
# Citation

If you find this work useful for your research, please cite our paper:

```
@inproceedings{wang-etal-2024-lm-combiner,
    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
    author = "Wang, Yixuan and
      Wang, Baoxin and
      Liu, Yijun and
      Wu, Dayong and
      Che, Wanxiang",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.934",
    pages = "10675--10685",
}
```