Update web.py
web.py
CHANGED
@@ -217,7 +217,6 @@ def web_data():
  ),
  H3("1. Document Preparation"),
 
- button( Div(
  H4("1.1 Text Extraction"),
  P("""
  Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
@@ -226,7 +225,7 @@ def web_data():
  we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
  Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
  """),
- DV2("data/sample_wet.json", "data/sample_warc.json", 3),
+ DV2("data/sample_wet.json", "data/sample_warc.json", 3),
 
  H4("1.2 Language Identification"),
  P("""