Spaces: hari3485/DiveIntoML

hari3485 committed on Dec 19, 2024

Commit

2ecffec

verified ·

1 Parent(s): 140a95d

Update pages/Data Collection.py


Files changed (1)

pages/Data Collection.py +64 -213

pages/Data Collection.py CHANGED

@@ -1,220 +1,71 @@
 import streamlit as st
-import pandas as pd
-# Function for the Excel details page
-def excel_details_page():
-    st.title("Structured Data - Excel Details")
-    st.markdown("<h3 style='color: #4a90e2;'>1. Handling Excel Files (.xlsx)</h3>", unsafe_allow_html=True)
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>Excel files (XLSX) are created using the Microsoft Excel application.</li>
-        <li>Structured data format.</li>
-        <li>Excel files automatically handle encoding during creation, so no encoding issues arise.</li>
-        <li>If there are extra values in a row, Excel creates a new column and fills it with <b>null values</b> instead of throwing a <b>parsing error</b>.</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    st.markdown("<h3 style='color: #ffa500;'>2. Reading Excel Files (.xlsx)</h3>", unsafe_allow_html=True)
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>Use the <b>pandas</b> function, <b>pd.read_excel("path")</b>, to read an Excel file.</li>
-        <li>By default, it reads only one sheet.</li>
-        <li>To read multiple sheets, specify the <b>sheet_name</b> parameter with a list of sheet indices.</li>
-    </ul>""", unsafe_allow_html=True)
-    st.code('df = pd.read_excel("path", sheet_name=[0, 1, 2])', language="python")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li><b>The Result is a Dictionary</b></li>
-        <li>Keys: the entries passed in <b>sheet_name</b> (here the sheet indices 0, 1, 2).</li>
-        <li>Values: DataFrames corresponding to each sheet.</li>
-    </ul>""", unsafe_allow_html=True)
-    st.code('df_first_sheet = df[0]  # First sheet\n'
-            'df_second_sheet = df[1]  # Second sheet\n'
-            'df_third_sheet = df[2]  # Third sheet', language="python")
-    st.markdown("<h3 style='color: #dda0dd;'>3. Converting Data to Excel Files (.xlsx)</h3>", unsafe_allow_html=True)
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>To save a single DataFrame to an Excel file</li>
-    </ul>""", unsafe_allow_html=True)
-    st.code('df[0].to_excel("path")', language="python")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>To save multiple sheets, use <b>pd.ExcelWriter</b></li>
-    </ul>""", unsafe_allow_html=True)
-    st.code("""with pd.ExcelWriter("path") as writer:
-    df[0].to_excel(writer, sheet_name="Sheet1")
-    df[1].to_excel(writer, sheet_name="Sheet2")""", language="python")
-    # Button to go back to the main page
-    if st.button("Back to Home"):
-        st.session_state['page'] = "home"
-# Function for the CSV details page
-def csv_details_page():
-    import streamlit as st
-# App header
-# Create a button
-    # Display the content about semi-structured data
-    st.header("1. What is Semi-Structured Data?")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>Semi-structured data does not follow a strict tabular format but still has some organizational properties.</li>
-        <li>Examples include CSV files, JSON, and XML.</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    st.header("2. Working with CSV Files")
-    st.subheader("a) Reading a CSV File")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>Use the <b>pandas</b> function, <code>pd.read_csv("file.csv")</code>, to read a CSV file.</li>
-        <li>This function loads the file into a DataFrame.</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    # Code example for reading CSV
-    st.code("""
 import pandas as pd
-df = pd.read_csv("file.csv")
-print(df.head())
-    """, language="python")
-    st.subheader("b) Handling Parse Errors")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>If a row contains more values than the header, pandas raises a parsing error (<code>ParserError</code>).</li>
-        <li>This typically happens when the CSV was created or edited by hand in a <code>text editor</code>.</li>
-        <li>When a CSV is exported from Excel, an extra value in a row does not raise an error; Excel creates a new column for it and fills the remaining cells with <b>null</b>.</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    st.markdown("""
-    <p><b>Solution:</b> Use the <code>on_bad_lines</code> parameter in pandas:</p>
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li><code>"error"</code>: Stops the program and raises an error.</li>
-        <li><code>"skip"</code>: Skips rows with errors.</li>
-        <li><code>"warn"</code>: Skips rows with errors and shows the line numbers.</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    # Code example for handling parse errors
-    st.code("""
-# Skip bad lines
-df = pd.read_csv("file.csv", on_bad_lines="skip")
-# Warn about bad lines
-df = pd.read_csv("file.csv", on_bad_lines="warn")
-    """, language="python")
-    st.subheader("c) Unicode Decode Error")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>Each character, when saved, is represented by a unique number (ASCII/Unicode code point).</li>
-        <li> ord("a") → 97 , bin(97) → 0b1100001 (Binary representation of 'a') </li>
-        <li>Characters are saved in memory using a specific encoding, typically UTF-8 by default.</li>
-        <li>Unicode Decode Error: Occurs when the system is unable to decode a file due to an incorrect or incompatible encoding. To solve this, you need to find the appropriate encoding for the file.</li>
-        <li>Python uses utf-8 by default for encoding, but files may be saved with other encodings.</li>
-        <li>Using the <code>encodings</code> module: To explore the available encodings, you can <code>import encodings</code> in Python.</li>
-        <li>There are around <code>326</code> encoding aliases available in Python (the exact count depends on the Python version), which can be accessed via <code>encodings.aliases.aliases</code>.</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    # Code example for trying multiple encodings
-    st.code("""
-import encodings
-# Get all encodings
-encodings_list = list(encodings.aliases.aliases.keys())
-# Try reading the file with different encodings
-for encoding in encodings_list:
-    try:
-        df = pd.read_csv("file.csv", encoding=encoding)
-        print(f"Success with encoding: {encoding}")
-        break
-    except Exception:
-        pass  # Skip to the next encoding
-    """, language="python")
-    st.subheader("Lookup Error:")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>Occurs if you try to access an encoding that is not available or supported.</li>
-        <li>Use a try-except block to handle it gracefully</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    st.code('''
-          except LookupError:
-              print("Incorrect encoding: {}".format(y))
-              ''')
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>If a <code>parse error</code> still occurs once the encoding is resolved, add the <code>on_bad_lines="skip"</code> parameter.</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    st.subheader("d) Handling Large CSV Files")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li>When working with large CSV files, the file might not fit into memory, leading to a <code>MemoryError</code>.</li>
-        <li>Solution: Use <code>chunksize</code> to break the file into smaller chunks.</li>
-        <li>To handle each chunk, iterate through the chunks and process them as needed.</li>
-    </ul>
-    """, unsafe_allow_html=True)
-    # Code example for handling large files
-    st.code("""
-chunk_size = 100
-chunks = pd.read_csv("large_file.csv", chunksize=chunk_size)
-for i, chunk in enumerate(chunks):
-    print(f"Processing chunk {i + 1} with {chunk.shape[0]} rows")
-    """, language="python")
-    st.header("3. Summary")
-    st.markdown("""
-    <ul style="font-family: Arial; line-height: 1.6;">
-        <li><b>Parse Errors:</b> Use <code>on_bad_lines</code> to handle them (<code>skip</code> or <code>warn</code>).</li>
-        <li><b>Encoding Issues:</b> Try different encodings to fix <b>UnicodeDecodeError</b>.</li>
-        <li><b>Large Files:</b> Use <code>chunksize</code> to process files in smaller parts.</li>
-    </ul>
-    """, unsafe_allow_html=True)
- # Button to go back to the main page
-    if st.button("Back to Home"):
-        st.session_state['page'] = "home"
-# Main page function
-def main_page():
-    # Buttons for navigation
-    if st.button("Go to Structured Data - Excel"):
-        st.session_state['page'] = "excel_details"
-    if st.button("Go to Semi-Structured Data - CSV"):
-        st.session_state['page'] = "csv_details"
-# Initialize session state
-if 'page' not in st.session_state:
-    st.session_state['page'] = "home"
-# Route to the appropriate page
-if st.session_state['page'] == "home":
-    main_page()
-elif st.session_state['page'] == "excel_details":
-    excel_details_page()
-elif st.session_state['page'] == "csv_details":
-    csv_details_page()

 import streamlit as st
+# App title
+st.title("Working with HTML Data using Python")
+# HTML and DataFrames Section
+st.header("HTML and DataFrames")
+st.write("""
+- **HTML (HyperText Markup Language)** is a semi-structured data format.
+- HTML uses tags like `<table>`, `<tr>`, `<th>`, and `<td>` to structure tabular data.
+- Unlike XML, HTML does not allow creating custom tags freely.
+- Not all HTML content can be converted into dataframes, especially paragraph text or unstructured data.
+- Typically, only table-related elements (`<table>`, `<tr>`, `<th>`, `<td>`) can be converted into dataframes.
+""")
+# Reading HTML Files Section
+st.header("Reading HTML Files into DataFrames")
+st.write("**Reading HTML Files:**")
+st.code("""
 import pandas as pd
+tables = pd.read_html(path_or_buffer)
+""", language="python")
+st.write("""
+- **`pd.read_html(path_or_buffer)`** reads HTML files or websites containing tables.
+- Extracts all tables and returns them as a list of dataframes.
+""")
+st.write("**Accessing Specific Tables:**")
+st.code("""
+# Accessing the first table from the list
+table = tables[0]
+""", language="python")
+st.write("""
+- Each table is stored in the list by index.
+- Use indexing to select the table you want to work with.
+""")
+st.write("**Limitations:**")
+st.write("""
+- Not all websites or HTML files can be read, even if they have tables.
+- Issues like authorization restrictions can prevent reading certain tables.
+""")
+st.write("**Using the `match` Parameter:**")
+st.code("""
+# Reading a specific table using the match parameter
+tables = pd.read_html(path, match="keyword")
+""", language="python")
+st.write("""
+- To locate specific tables, pass `match="keyword"` when reading the HTML.
+- Only tables containing text that matches the given string or regular expression are returned, still as a list.
+""")
+# Exporting DataFrames Section
+st.header("Exporting DataFrames to HTML")
+st.write("**Exporting DataFrame to HTML:**")
+st.code("""
+# Exporting a dataframe to an HTML file
+df.to_html("output.html")
+""", language="python")
+st.write("""
+- `to_html` renders the dataframe as an HTML `<table>` element.
+- When a path is given, the markup is written to that file; with no argument, it is returned as a string.
+""")
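
Note on the new page: it states that only table elements (`<table>`, `<tr>`, `<th>`, `<td>`) map to dataframes, while paragraph text does not. A rough, standard-library-only sketch of that idea follows. It is not how `pd.read_html` is implemented; the `TableRows` class and `SAMPLE_HTML` string are hypothetical names made up for illustration, and real pages need the more robust parsing that pandas delegates to its HTML backends.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <th>/<td> cells, one list per <tr>; ignore everything else."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (for example inside <p>) is dropped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample page: one paragraph and one small table.
SAMPLE_HTML = """
<p>Intro text that has no tabular structure.</p>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>Ada</td><td>90</td></tr>
  <tr><td>Lin</td><td>85</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)   # column names taken from the first row
print(records)  # remaining rows as lists of strings
```

The paragraph text never reaches `rows`, which mirrors the limitation described in the page. `header` and `records` could then be passed to `pd.DataFrame(records, columns=header)`.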