lfsm committed on
Commit 73a9dbb · 1 Parent(s): 660b0ab

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -4,7 +4,7 @@ this is ja cc filter for reference from ja wiki vs random ja mc4, and build with
   2. crawl 300K of 4M webpages from the urls
   3. get pure text and remove content len less than 1k,
   4. use langdetect to tell the lang of the pages,
- we finally get total **16K**pages : **10K** ja pages, **5K** en pages, and **1K** other lang pages
+ we finally get total **160K**pages : **100K** ja pages, **50K** en pages, and **10K** other lang pages
   5. random sample 16K from ja mc4, concat with all 16k pages to get lang_all.txt data
   6. random sample 10K from ja mc4, concat with ja 10k pages to get lang_ja.txt data
   7. tokenize all text with "cl-tohoku/bert-base-japanese"
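
Steps 3 to 6 of the README touched by this diff could be sketched roughly as below. This is a minimal illustration, not the author's script: the function names `bucket_by_language` and `build_dataset` are hypothetical, "1k" is assumed to mean 1,000 characters, and the language detector is passed in as a callable so the sketch does not depend on any particular library.

```python
import random
from collections import defaultdict


def bucket_by_language(pages, detect, min_len=1000):
    """Drop short pages, then group the rest into ja / en / other.

    `detect` is any callable mapping text to a language code.
    `min_len` is an assumed reading of "content len less than 1k".
    """
    buckets = defaultdict(list)
    for text in pages:
        if len(text) < min_len:
            continue  # step 3: remove pages with too little content
        lang = detect(text)  # step 4: language identification
        key = lang if lang in ("ja", "en") else "other"
        buckets[key].append(text)
    return buckets


def build_dataset(crawled_pages, reference_pool, seed=0):
    """Concatenate crawled pages with an equal-sized random sample
    drawn from a reference pool (steps 5 and 6)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(reference_pool, len(crawled_pages))
    return list(crawled_pages) + sampled
```

With the `langdetect` package installed, `from langdetect import detect` provides a function that can be passed as the `detect` argument. The sample size is tied to the number of crawled pages here, so it follows whatever page counts the crawl actually yields.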