Create README.md
README.md
ADDED
@@ -0,0 +1,12 @@
## CC_FILTER
This is a ja CC filter that separates pages referenced from ja wiki from random ja mC4 pages. It was built with the following procedure.
1. Get the ja wiki dump file and extract all URLs in it, which gives about 4M URLs.
2. Crawl 300K of the 4M web pages at those URLs.
3. Extract the plain text and remove pages whose content length is less than 1K.
4. Use langdetect to identify the language of each page.
   This leaves **16K** pages in total: **10K** ja pages, **5K** en pages, and **1K** pages in other languages.
5. Randomly sample 16K samples from ja mC4 and concatenate them with all 16K pages to get the lang_all.txt data.
6. Randomly sample 10K samples from ja mC4 and concatenate them with the 10K ja pages to get the lang_ja.txt data.
7. Tokenize all text with "cl-tohoku/bert-base-japanese".
8. Feed lang_all.txt to fastText to get model_all.bin.
9. Feed lang_ja.txt to fastText to get model_ja.bin.
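
The data preparation in steps 3, 5, and 6 could be sketched roughly as below. This is only an illustration under several assumptions that the README does not state: that "1K" means 1,000 characters, that fastText is used in supervised mode, and that the two classes are marked with the hypothetical labels `wiki` and `mc4`. The helper names `to_fasttext_line` and `build_training_lines` are made up for this example. The `__label__` prefix is fastText's default label prefix.

```python
import random

MIN_CHARS = 1000  # assumption: "1K" in step 3 is read as 1,000 characters


def to_fasttext_line(label, text):
    """Format one example as a fastText supervised line: '__label__<label> <text>'."""
    # fastText reads one example per line, so collapse all whitespace runs,
    # including newlines, into single spaces.
    return "__label__" + label + " " + " ".join(text.split())


def build_training_lines(wiki_pages, mc4_pages, seed=0):
    """Drop short crawled pages, sample as many mC4 pages, label and concatenate."""
    rng = random.Random(seed)
    # Step 3: keep only crawled pages with enough content.
    kept = [page for page in wiki_pages if len(page) >= MIN_CHARS]
    # Steps 5 and 6: randomly sample the same number of mC4 pages.
    sampled = rng.sample(mc4_pages, min(len(kept), len(mc4_pages)))
    # Concatenate both groups; the label names here are hypothetical.
    lines = [to_fasttext_line("wiki", page) for page in kept]
    lines += [to_fasttext_line("mc4", page) for page in sampled]
    return lines


if __name__ == "__main__":
    wiki = ["あ" * 1200, "short page"]
    mc4 = ["い" * 50, "う" * 50, "え" * 50]
    for line in build_training_lines(wiki, mc4):
        print(line[:30])
```

If the lines are written to lang_ja.txt, one possible way to do step 9 with the fastText Python module is `fasttext.train_supervised(input="lang_ja.txt")` followed by `model.save_model("model_ja.bin")`. Whether the original models were trained this way is not confirmed by the README.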