Spaces: LLM360/TxT360
victormiller committed on Sep 26, 2024

Commit b488013 (verified) · 1 Parent(s): 146aa07

Update web.py

Files changed (1)

web.py CHANGED

@@ -217,7 +217,6 @@ def web_data():
         ),
         H3("1. Document Preparation"),
-        button( Div(
         H4("1.1 Text Extraction"),
         P("""
         Common Crawl provides webpage texts via two formats: WARC (Web ARChive format) and WET (WARC Encapsulated Text).
@@ -226,7 +225,7 @@ def web_data():
         we found WET files to include boilerplate content like navigation menus, ads, and other irrelevant texts.
         Accordingly, our pipeline starts from raw WARC files, reads with the warcio library, and extracts texts using trafilatura.
         """),
-        DV2("data/sample_wet.json", "data/sample_warc.json", 3),), cls="collapsible"),
+        DV2("data/sample_wet.json", "data/sample_warc.json", 3),
         H4("1.2 Language Identification"),
         P("""