Spaces: hari3485/DiveIntoML

hari3485 committed on Dec 19, 2024

Commit 140a95d (verified) · 1 Parent(s): b8484bf

Update pages/hari.py

Files changed (1)

pages/hari.py +66 -3

@@ -232,10 +232,73 @@ def html_details_page():
     - Semi-structured data with nested tags.
     - Libraries like `BeautifulSoup` help parse and extract information.
     """)
-    st.code('from bs4 import BeautifulSoup\nsoup = BeautifulSoup(open("file.html"))', language="python")
-    if st.button("Back to Home"):
-        st.session_state['page'] = "home"
+    st.write("""
+    - **HTML (HyperText Markup Language)** is a semi-structured data format.
+    - HTML uses tags like `<table>`, `<tr>`, `<th>`, and `<td>` to structure tabular data.
+    - Unlike XML, HTML works with a fixed set of standard tags rather than freely defined custom tags.
+    - Not all HTML content can be converted into dataframes: paragraph text and other free-form content has no row and column structure.
+    - Typically, only table-related elements (`<table>`, `<tr>`, `<th>`, `<td>`) can be converted into dataframes.
+    """)
+    # Reading HTML Files Section
+    st.header("Reading HTML Files into DataFrames")
+    st.write("**Reading HTML Files:**")
+    st.code("""
+    import pandas as pd
+    tables = pd.read_html(path_or_buffer)
+    """, language="python")
+    st.write("""
+    - **`pd.read_html(path_or_buffer)`** reads tables from an HTML file, a URL, or a file-like object.
+    - Returns every table it finds as a list of dataframes, even when there is only one, and raises `ValueError` when it finds none.
+    """)
+    st.write("**Accessing Specific Tables:**")
+    st.code("""
+    # Accessing the first table from the list
+    table = tables[0]
+    """, language="python")
+    st.write("""
+    - Each table is stored in the list by index.
+    - Use indexing to select the table you want to work with.
+    """)
+    st.write("**Limitations:**")
+    st.write("""
+    - Not every website or HTML file can be read, even if it displays tables: `read_html` only sees tables present in the HTML it receives.
+    - Pages that require authorization or block automated requests cannot be read directly.
+    """)
+    st.write("**Using the `match` Parameter:**")
+    st.code("""
+    # Keeping only the tables whose text matches a keyword
+    tables = pd.read_html(path, match="keyword")
+    """, language="python")
+    st.write("""
+    - To locate specific tables, use `match="keyword"` while reading HTML.
+    - `match` accepts a string or regular expression; only tables containing matching text are returned, still as a list.
+    """)
+    # Exporting DataFrames Section
+    st.header("Exporting DataFrames to HTML")
+    st.write("**Exporting DataFrame to HTML:**")
+    st.code("""
+    # Exporting a dataframe to an HTML file
+    df.to_html("output.html")
+    """, language="python")
+    st.write("""
+    - Renders the dataframe as an HTML `<table>` and writes it to the specified path.
+    - The index is written as well; pass `index=False` to leave it out.
+    """)
+    st.code('from bs4 import BeautifulSoup\nwith open("file.html") as f:\n    soup = BeautifulSoup(f, "html.parser")', language="python")
+    if st.button("Back to Home"):
+        st.session_state['page'] = "home"
 # Unstructured Data - Image Page
 def image_details_page():
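
The page text added in this commit says that only table-related elements (`<table>`, `<tr>`, `<th>`, `<td>`) can be converted into dataframes, while paragraph text cannot. The sketch below illustrates why, using only the standard library's `html.parser`. It is a simplified illustration, not how `pd.read_html` is implemented; the `TableRows` class and the `SAMPLE_HTML` string are hypothetical names and data introduced for this example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <th>/<td> cells, grouped by <tr> row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example a <p> paragraph) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical sample page: one paragraph and one small table.
SAMPLE_HTML = """
<p>Quarterly summary, not part of any table.</p>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>Asha</td><td>91</td></tr>
  <tr><td>Ravi</td><td>84</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
parser.close()

# The first row holds the column names, the rest are records.
header, *records = parser.rows
print(header)
print(records)
```

The paragraph never reaches `rows` because it has no row or cell boundaries to attach to. The collected rows could then be handed to `pd.DataFrame(records, columns=header)`, which is roughly the shape of result that `pd.read_html` produces for each table it finds.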