Commit fea7bb4 · Parent: 14a3793 · Update README.md

README.md (changed):
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("conan1024hao/cjkbert-small")
model = AutoModelForMaskedLM.from_pretrained("conan1024hao/cjkbert-small")
```
You don't need any text segmentation before fine-tuning on downstream tasks. (Though you may obtain better results if you apply morphological analysis to the data first.)
### Morphological analysis tools

- ZH: For Chinese, we use [LTP](https://github.com/HIT-SCIR/ltp).
- JA: For Japanese, we use [Juman++](https://github.com/ku-nlp/jumanpp).
- KO: For Korean, we use [KoNLPy](https://github.com/konlpy/konlpy) (Kkma class).

### Tokenization

We use character-based tokenization with a whole-word-masking strategy.
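The combination above — single-character tokens, but masking decided per word — can be sketched in plain Python. This is an illustrative sketch, not the repository's actual preprocessing code: the function names, the masking rate, and the fixed seed are assumptions for the demo.

```python
import random

MASK = "[MASK]"

def char_tokenize(words):
    """Split each word (from a morphological analyzer) into single
    characters, remembering which word each character came from."""
    tokens, word_ids = [], []
    for wid, word in enumerate(words):
        for ch in word:
            tokens.append(ch)
            word_ids.append(wid)
    return tokens, word_ids

def whole_word_mask(tokens, word_ids, mask_prob=0.15, rng=None):
    """Pick words at random, then mask every character of each picked
    word, so mask decisions never split a word."""
    rng = rng or random.Random(0)  # seeded for a reproducible demo
    n_words = max(word_ids) + 1
    masked = {wid for wid in range(n_words) if rng.random() < mask_prob}
    return [MASK if wid in masked else tok
            for tok, wid in zip(tokens, word_ids)]

# Hypothetical segmenter output for 東京は日本の首都です (truncated)
words = ["東京", "は", "日本", "の", "首都"]
tokens, word_ids = char_tokenize(words)
print(tokens)  # ['東', '京', 'は', '日', '本', 'の', '首', '都']
print(whole_word_mask(tokens, word_ids, mask_prob=0.5))
# ['東', '京', 'は', '[MASK]', '[MASK]', '[MASK]', '首', '都']
```

Note how both characters of 日本 are masked together: the mask unit is the word boundary supplied by the morphological analyzer, even though the model only ever sees character tokens.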