Update README.md
README.md
CHANGED
@@ -1,5 +1,5 @@
 ## CC_FILTER
-this is ja cc filter
+this is ja cc filter for reference from ja wiki vs random ja mc4, and build with following procedure.
 1. get ja wiki dump file, and extract the all url inside, get about 4M urls
 2. crawl 300K of 4M webpages from the urls
 3. get pure text and remove content len less than 1k,
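
Steps 1 and 3 of the procedure in this README can be sketched roughly as follows. This is a minimal illustration, not the code the author used: the function names (`extract_urls`, `html_to_text`, `keep_long_pages`), the URL regex, and the reading of "1k" as 1000 characters are all assumptions. The crawl in step 2 is left out, so `pages` is assumed to be an already-fetched mapping from URL to raw HTML.

```python
import re
from html.parser import HTMLParser

# Assumption: "content len less than 1k" is read as fewer than 1000 characters.
MIN_CONTENT_LEN = 1000

# Assumption: a simple pattern for external links in dump text; the real
# extraction may rely on the dump's external-links table instead.
URL_PATTERN = re.compile(r"https?://[^\s\[\]<>|{}\"]+")


def extract_urls(wikitext):
    """Step 1: return unique URLs in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(wikitext):
        url = match.group(0).rstrip(".,;)")  # drop trailing punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Step 3a: reduce an HTML page to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def keep_long_pages(pages, min_len=MIN_CONTENT_LEN):
    """Step 3b: keep only pages whose plain text has at least min_len characters."""
    kept = {}
    for url, html in pages.items():
        text = html_to_text(html)
        if len(text) >= min_len:
            kept[url] = text
    return kept
```

A typical use would be to call `extract_urls` on the dump text, fetch a sample of the resulting URLs with whatever crawler is at hand, and pass the fetched pages through `keep_long_pages`.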