DecoderImmortal
/

LM-Combiner

Model card Files Files and versions

DecoderImmortal commited on Nov 22, 2024

Commit

c0c6002

·

verified ·

1 Parent(s): a30b942

Update README.md

Files changed (1) hide show

README.md +76 -3

README.md CHANGED Viewed

@@ -1,3 +1,76 @@
----
-license: apache-2.0
----

+# LM-Combiner
+All the code and model are released [link](https://github.com/wyxstriker/LM-Combiner). Thank you for your patience!
+# Requirements
+The part of the model is implemented using the huggingface framework and the required environment is as follows:
+- Python
+- torch
+- transformers
+- datasets
+- tqdm
+For the evaluation, we refer to the relevant environment configurations of [ChERRANT](https://github.com/HillZhang1999/MuCGEC/tree/main/scorers/ChERRANT).
+# Training Stage
+## Preprocessing
+### Baseline Model
+- Firstly, we train a baseline model (Chinese-Bart-large) for LM-Combiner on the FCGEC dataset using the Seq2Seq format.
+```bash
+sh ./script/run_bart_baseline.sh
+```
+### Candidate Datasets
+1. Candidate Sentence Generation
+- We use the baseline model to generate candidate sentences for the training and test sets
+- On tasks where the model fits better (spelling correction, etc.), we recommend using the K-fold cross-inference from the paper to generate candidate sentences separately.
+```bash
+python ./src/predict_bl_tsv.py
+```
+2. Golden Labels Merging
+- We use the ChERRANT tool to fully decouple the error correction task and the rewriting task by merging the correct labels.
+```bash
+python ./scorer_wapper/golden_label_merging.py
+```
+## LM-combiner (gpt2)
+- Subsequently, we train LM-Combiner on the constructed candidate dataset
+- In particular, we supplement the gpt2 vocab (mainly **double quotes**) to better fit the FCGEC dataset, see ```./pt_model/gpt2-base/vocab.txt``` for details.
+```bash
+sh ./script/run_lm_combiner.py
+```
+# Evaluation
+- We use the official ChERRANT script to evaluate the model on the FCGEC-dev.
+```shell
+sh ./script/compute_score.sh
+```
+|method|Prec|Rec|F0.5|
+|-|-|-|-|
+| bart_baseline|28.88|**38.95**|40.46|
+|+lm_combiner|**52.15**|37.41|**48.34**|
+# Citation
+If you find this work is useful for your research, please cite our paper:
+```
+@inproceedings{wang-etal-2024-lm-combiner,
+    title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
+    author = "Wang, Yixuan  and
+      Wang, Baoxin  and
+      Liu, Yijun  and
+      Wu, Dayong  and
+      Che, Wanxiang",
+    editor = "Calzolari, Nicoletta  and
+      Kan, Min-Yen  and
+      Hoste, Veronique  and
+      Lenci, Alessandro  and
+      Sakti, Sakriani  and
+      Xue, Nianwen",
+    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
+    month = may,
+    year = "2024",
+    address = "Torino, Italia",
+    publisher = "ELRA and ICCL",
+    url = "https://aclanthology.org/2024.lrec-main.934",
+    pages = "10675--10685",
+}
+```