| import streamlit as st | |
| import pandas as pd | |
| st.markdown(""" | |
| <style> | |
| /* Set a soft background color */ | |
| body { | |
| background-color: #eef2f7; | |
| } | |
| /* Style for main title */ | |
| h1 { | |
| color: black; | |
| font-family: 'Roboto', sans-serif; | |
| font-weight: 700; | |
| text-align: center; | |
| margin-bottom: 25px; | |
| } | |
| /* Style for headers */ | |
| h2 { | |
| color: black; | |
| font-family: 'Roboto', sans-serif; | |
| font-weight: 600; | |
| margin-top: 30px; | |
| } | |
| /* Style for subheaders */ | |
| h3 { | |
| color: red; | |
| font-family: 'Roboto', sans-serif; | |
| font-weight: 500; | |
| margin-top: 20px; | |
| } | |
| .custom-subheader { | |
| color: black; | |
| font-family: 'Roboto', sans-serif; | |
| font-weight: 600; | |
| margin-bottom: 15px; | |
| } | |
| /* Paragraph styling */ | |
| p { | |
| font-family: 'Georgia', serif; | |
| line-height: 1.8; | |
| color: black; | |
| margin-bottom: 20px; | |
| } | |
| /* List styling with checkmark bullets */ | |
| .icon-bullet { | |
| list-style-type: none; | |
| padding-left: 20px; | |
| } | |
| .icon-bullet li { | |
| font-family: 'Georgia', serif; | |
| font-size: 1.1em; | |
| margin-bottom: 10px; | |
| color: black; | |
| } | |
| .icon-bullet li::before { | |
| content: "✔"; | |
| padding-right: 10px; | |
| color: black; | |
| } | |
| /* Sidebar styling */ | |
| .sidebar .sidebar-content { | |
| background-color: #ffffff; | |
| border-radius: 10px; | |
| padding: 15px; | |
| } | |
| .sidebar h2 { | |
| color: #495057; | |
| } | |
| /* Custom button style */ | |
| .streamlit-button { | |
| background-color: #00FFFF; | |
| color: #000000; | |
| font-weight: bold; | |
| } | |
| </style> | |
| """, unsafe_allow_html=True) | |
| st.subheader("Semi-Structured Data") | |
| st.markdown(""" | |
| Semi-structured data does not conform to a strict schema but still has organizational properties, such as tags or markers, that separate its elements. Examples: | |
| <ul class="icon-bullet"> | |
| <li>CSV </li> | |
| <li>JSON </li> | |
| <li>HTML </li> | |
| <li>XML</li> | |
| </ul> | |
| """, unsafe_allow_html=True) | |
| st.sidebar.title("Navigation 🧭") | |
| file_type = st.sidebar.radio( | |
| "Choose a file type:", | |
| ("CSV", "XML", "JSON", "HTML")) | |
| if file_type == "CSV": | |
| st.title("CSV") | |
| st.markdown(''' | |
| - **CSV (Comma-Separated Values)** | |
| - CSV is a simple file format for tabular data: each line is a row, and the columns within a row are separated by commas. | |
| - It is commonly used to exchange data between applications such as spreadsheets and databases. | |
| - CSV files are saved with the `.csv` extension. | |
| ''') | |
| st.header('**Issues in CSV:**') | |
| st.subheader('''**1. ParserError:**''') | |
| st.markdown('''- This error occurs when a row has more fields than the header defines (an extra column). | |
| - It mostly occurs when the CSV was created by hand in a text editor. | |
| - To handle it, `pd.read_csv` takes the `on_bad_lines` parameter, which defaults to `"error"`: | |
|     - `on_bad_lines = "error"` -- default; raises `ParserError` at the first bad row | |
|     - `on_bad_lines = "skip"` -- silently skips the bad rows | |
|     - `on_bad_lines = "warn"` -- skips the bad rows but emits a warning for each one | |
| ''') | |
| st.subheader('**Solution:**') | |
| st.code('''import pandas as pd | |
| pd.read_csv('sample.csv', on_bad_lines='warn')  # or 'skip' / 'error' (the default) | |
| ''') | |
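st.markdown('''
A minimal sketch of the three `on_bad_lines` settings, assuming a small in-memory CSV in which one row has an extra field. The `raw` text and its column names are made up for illustration:

```python
import io
import warnings

import pandas as pd

# Hypothetical sample: the header has three fields, but the "ravi" row has four.
raw = """name,age,city
asha,21,Pune
ravi,22,Delhi,extra
meena,23,Chennai
"""

# "skip" silently drops the malformed row and keeps the rest.
skipped = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")

# "warn" drops the same row but also reports it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    warned = pd.read_csv(io.StringIO(raw), on_bad_lines="warn")

# "error" (the default) raises pandas.errors.ParserError instead.
try:
    pd.read_csv(io.StringIO(raw), on_bad_lines="error")
    raised = False
except pd.errors.ParserError:
    raised = True

print(skipped)
print("ParserError raised with on_bad_lines='error':", raised)
```

With `"skip"` or `"warn"` the resulting DataFrame keeps only the two well-formed rows.
''')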
| st.write('------------------------------------------------') | |
| st.subheader('''**2. Encoding:**''') | |
| st.markdown('''- Encoding is the process of translating characters into numeric codes (for example ASCII or Unicode code points) and then into bytes. | |
| - Reading a file with the wrong encoding either raises a `UnicodeDecodeError` or silently decodes the bytes into the wrong characters. | |
| - In both cases the original information is lost, so the encoding used to read a CSV must match the one used to write it. | |
| - Most CSV files are UTF-8, but not all of them.''') | |
| st.subheader('**Solution:**') | |
| st.code('''import pandas as pd | |
| import encodings.aliases | |
| # unique codec names known to Python | |
| codecs = sorted(set(encodings.aliases.aliases.values())) | |
| for enc in codecs: | |
|     try: | |
|         pd.read_csv('sample.csv', encoding=enc) | |
|         print('{} is a correct encoding'.format(enc)) | |
|     except UnicodeError:  # includes UnicodeDecodeError | |
|         print('{} is not a correct encoding'.format(enc)) | |
|     except LookupError: | |
|         print('{} is not supported'.format(enc)) | |
| ''') | |
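st.markdown('''
The same try-each-encoding idea can be sketched without pandas, on raw bytes. This is only an illustration: the sample text, the `candidates` list and the `results` dictionary are assumptions, not part of the lesson above.

```python
# Hypothetical CSV text with one non-ASCII character, stored as Latin-1 bytes.
raw_bytes = """name,drink
asha,café
""".encode("latin-1")

candidates = ["utf-8", "latin-1", "no-such-codec"]
results = {}
for enc in candidates:
    try:
        raw_bytes.decode(enc)
        results[enc] = "ok"
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding.
        results[enc] = "wrong encoding"
    except LookupError:
        # Python does not know a codec with this name.
        results[enc] = "not supported"

print(results)
```

A successful decode is not proof that the encoding is right: Latin-1 maps every possible byte to some character, so it never fails, and the decoded text still has to be checked by eye.
''')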
| st.write('------------------------------------------------') | |
| st.subheader('''**3. Out of memory:**''') | |
| st.markdown('''- `pd.read_csv` loads the whole file into RAM, so a dataset larger than the available memory cannot be loaded in one go. | |
| - In that case we read the dataset in chunks instead. | |
| - A chunk is a part of the data; `chunksize` sets the number of rows per chunk. | |
| - If the file has 10,000,000 rows and `chunksize = 1000`, the data is read as 10,000 chunks of 1000 rows each. | |
| - With `chunksize` set, `read_csv` returns an iterator (a `TextFileReader`) instead of a DataFrame; it hands over the chunks lazily, like a generator. | |
| - A generator can produce multiple values one at a time because it uses `yield` instead of `return`. | |
| - Each chunk is a DataFrame, and the reader object is iterable.''') | |
| st.subheader('**Solution:**') | |
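| st.code(''' | |
| import pandas as pd | |
| chunks = pd.read_csv('spam.csv', encoding='latin', chunksize=100) | |
| for chunk in chunks:  # each chunk is a DataFrame of up to 100 rows | |
|     print(chunk.shape) | |
| ''') | |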
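st.markdown('''
To make the generator idea concrete, here is a sketch in plain Python of a reader that yields a CSV a few rows at a time. It is not how pandas implements `chunksize`; the helper name `read_in_chunks` and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def read_in_chunks(handle, chunksize):
    """Yield lists of at most `chunksize` rows (as dicts) from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return  # no rows left, so the generator stops
        yield chunk


# Hypothetical in-memory file with 5 data rows.
handle = io.StringIO("""id,label
1,ham
2,spam
3,ham
4,spam
5,ham
""")

sizes = [len(chunk) for chunk in read_in_chunks(handle, chunksize=2)]
print(sizes)
```

Only one chunk is held in memory at a time, which is why the same pattern works for files that are larger than RAM.
''')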
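| st.subheader('''**4. Takes a long time to load a huge dataset:**''') | |
| st.markdown("By default pandas parses a CSV on a single thread, so loading a huge dataset can take a long time.") | |
| st.subheader('**Solution:**') | |
| st.markdown('''Polars: a DataFrame library with an API similar to pandas | |
| - It is often faster than pandas on large files because it reads and processes data in parallel | |
| ''') | |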
| elif file_type == "XML": | |
| st.title("XML") | |
| st.markdown(''' | |
| - XML stands for **Extensible** Markup Language | |
| - It is called extensible because we can define our own tags | |
| - XML is a flexible, text-based format used for storing and transporting structured data. | |
| - It uses tags to define elements and attributes, making it both human-readable and machine-readable. | |
| ''') | |
| # Example : XML Structure | |
| st.subheader('**XML Structure**') | |
| st.markdown(''' | |
| A simple XML file | |
| ''') | |
| st.code(''' | |
| <data> | |
| <person> | |
| <name>Harika</name> | |
| <age>21</age> | |
| <height>145</height> | |
| </person> | |
| <person> | |
| <name>sreeja</name> | |
| <age>22</age> | |
| <height>153</height> | |
| </person> | |
| </data> | |
| ''') | |
| st.code(''' | |
| import pandas as pd | |
| # Example: Reading a XML file | |
| df = pd.read_xml('data.xml', xpath='/data/person') | |
| print(df) | |
| ''') | |
| st.markdown(''' | |
| The output DataFrame will look like this: | |
| | name | age | height | | |
| |----------------|------------|------ | | |
| | Harika | 21 | 145 | | |
| | sreeja | 22 | 153 | | |
| ''') | |
| st.markdown(''' | |
| **`xpath` parameter**: | |
| - Specifies the XML path to extract specific elements. | |
| - For example: | |
| - `xpath='/data/person'`: Extracts all `<person>` elements from `<data>`. ''') | |
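st.markdown('''
A self-contained sketch of the same read, assuming the XML is held in memory instead of in `data.xml`. It uses `parser="etree"` (Python's built-in XML parser) so that it runs without `lxml`; that parser only supports relative paths, which is why the `xpath` here is `"./person"` and not `"/data/person"`.

```python
import io

import pandas as pd

xml_text = """<data>
  <person><name>Harika</name><age>21</age><height>145</height></person>
  <person><name>sreeja</name><age>22</age><height>153</height></person>
</data>"""

# Each <person> selected by xpath becomes one row; its child tags become columns.
df = pd.read_xml(
    io.BytesIO(xml_text.encode("utf-8")),
    xpath="./person",
    parser="etree",
)
print(df)
```
''')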
| # Example 2: Nested XML Structure | |
| st.subheader('**Nested XML Structure**') | |
| st.markdown(''' | |
| A more complex XML file with nested elements and attributes. | |
| ''') | |
| st.code(''' | |
| <company> | |
| <department id="1" name="HR"> | |
| <employee> | |
| <name>John Doe</name> | |
| <position>Manager</position> | |
| </employee> | |
| <employee> | |
| <name>Jane Smith</name> | |
| <position>Assistant</position> | |
| </employee> | |
| </department> | |
| <department id="2" name="Engineering"> | |
| <employee> | |
| <name>Emily Johnson</name> | |
| <position>Engineer</position> | |
| </employee> | |
| </department> | |
| </company> | |
| ''') | |
| st.code(''' | |
| import pandas as pd | |
| import xml.etree.ElementTree as ET | |
| # Example: Reading a nested XML file | |
| # pd.read_xml does not copy parent attributes onto child rows, | |
| # so walk the tree and build one row per <employee> | |
| root = ET.parse('nested.xml').getroot() | |
| rows = [ | |
|     {'id': int(dept.get('id')), 'department name': dept.get('name'), | |
|      'name': emp.findtext('name'), 'position': emp.findtext('position')} | |
|     for dept in root.iter('department') | |
|     for emp in dept.iter('employee') | |
| ] | |
| df = pd.DataFrame(rows) | |
| print(df) | |
| ''') | |
| st.markdown(''' | |
| The output DataFrame will look like this: | |
| | id | department name | name | position | | |
| |----|-----------------|---------------|------------| | |
| | 1 | HR | John Doe | Manager | | |
| | 1 | HR | Jane Smith | Assistant | | |
| | 2 | Engineering | Emily Johnson | Engineer | | |
| ''') | |
| st.markdown(''' | |
| 1. **Parent attributes**: | |
| - `pd.read_xml` flattens only the nodes selected by `xpath`; it has no `elem_cols` or `attr_cols` parameters (those belong to `DataFrame.to_xml`). | |
| - Here the `id` and `name` attributes of each `<department>` tag are read with `dept.get(...)`. | |
| 2. **Child elements**: | |
| - The `<name>` and `<position>` tags of each `<employee>` are read with `emp.findtext(...)`. | |
| ''') | |
| st.markdown(''' | |
| By using `pd.read_xml` with `xpath` for flat files and `xml.etree.ElementTree` for nested ones, you can parse complex XML files into structured DataFrames. | |
| ''') | |
| elif file_type == "JSON": | |
| st.title("JSON") | |
| st.markdown(''' | |
| - JSON **(JavaScript Object Notation)** | |
| - It is a text-based data format used to store and exchange data, structured as key-value pairs and arrays | |
| - JSON can be both | |
| - **Structured** and **Semi-structured** | |
| - A JSON object looks like a Python dictionary | |
| - Keys must always be strings | |
| - JSON is read with `pd.read_json()`, which takes a file path, URL, or file-like object; wrap a literal JSON string in `io.StringIO` | |
| ''') | |
| st.header("Structured JSON Format") | |
| st.markdown(''' | |
| - A structured JSON format organizes data hierarchically using key-value pairs, arrays, and nested objects in a regular, predictable layout | |
| - In pandas, `orient` describes how tabular data such as a DataFrame is laid out when it is converted to or from JSON | |
| - It determines the layout of the JSON representation | |
| - Common orient types are | |
|     - `index` | |
|     - `columns` | |
|     - `values` | |
|     - `split` | |
| ''') | |
| st.subheader("How to read Structured JSON Format?...") | |
| st.code('''import pandas as pd | |
| from io import StringIO | |
| a = '{"name":["harii","sree"],"age":[12,13]}' | |
| pd.read_json(StringIO(a))  # wrap the literal JSON string in a file-like object | |
| ''') | |
| st.header("Converting DataFrame into JSON...") | |
| st.subheader("Orient as Index...") | |
| st.markdown(''' | |
| - With `orient="index"`, the keys of the JSON object are the index labels and each value is a dictionary | |
| - Inside each dictionary, the keys are the column names and the values are that row's data | |
| ''') | |
| st.code('''import pandas as pd | |
| from io import StringIO | |
| a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}' | |
| data = pd.read_json(StringIO(a)) | |
| ind = data.to_json(orient="index") | |
| ind | |
| ''') | |
| st.markdown(''' | |
| - Output will be : | |
| ''') | |
| st.code(''' | |
| output = '{"0":{"name":"harii","age":21,"weight":34},"1":{"name":"sree","age":24,"weight":45},"2":{"name":"gowtham","age":25,"weight":67}}' | |
| ''') | |
| st.subheader("Orient as Columns...") | |
| st.markdown(''' | |
| - With `orient="columns"`, the keys of the JSON object are the column names and each value is a dictionary | |
| - Inside each dictionary, the keys are the index labels and the values are that column's data | |
| ''') | |
| st.code('''import pandas as pd | |
| from io import StringIO | |
| a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}' | |
| data = pd.read_json(StringIO(a)) | |
| col = data.to_json(orient="columns") | |
| col | |
| ''') | |
| st.markdown(''' | |
| - Output will be: | |
| ''') | |
| st.code(''' | |
| output = '{"name":{"0":"harii","1":"sree","2":"gowtham"},"age":{"0":21,"1":24,"2":25},"weight":{"0":34,"1":45,"2":67}}' | |
| ''') | |
| st.subheader("Orient as Values...") | |
| st.markdown(''' | |
| - With `orient="values"`, the JSON is a list of lists (a nested list) with one inner list per row; index labels and column names are dropped | |
| ''') | |
| st.code('''import pandas as pd | |
| from io import StringIO | |
| a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}' | |
| data = pd.read_json(StringIO(a)) | |
| val = data.to_json(orient="values") | |
| val | |
| ''') | |
| st.markdown(''' | |
| - Output will be: | |
| ''') | |
| st.code(''' | |
| output = '[["harii",21,34],["sree",24,45],["gowtham",25,67]]' | |
| ''') | |
| st.subheader("Orient as Split...") | |
| st.markdown(''' | |
| - With `orient="split"`, the JSON holds the column names, the index labels, and the data separately | |
| - It is a dictionary of lists with the keys `columns`, `index`, and `data` | |
| ''') | |
| st.code('''import pandas as pd | |
| from io import StringIO | |
| a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}' | |
| data = pd.read_json(StringIO(a)) | |
| spl = data.to_json(orient="split") | |
| spl | |
| ''') | |
| st.markdown(''' | |
| - Output will be: | |
| ''') | |
| st.code(''' | |
| output = '{"columns":["name","age","weight"],"index":[0,1,2],"data":[["harii",21,34],["sree",24,45],["gowtham",25,67]]}' | |
| ''') | |
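st.markdown('''
As a rough check of how `orient` affects reading as well as writing, the sketch below writes a small DataFrame with `orient="split"` and reads it back with the same setting. It then does the same with `orient="values"` to show that the labels are lost. The two-row DataFrame is a shortened, assumed version of the data used above.

```python
import io
import json

import pandas as pd

data = pd.DataFrame({"name": ["harii", "sree"], "age": [21, 24]})

# "split" keeps column names, index labels and data, so the frame can be rebuilt.
split_text = data.to_json(orient="split")
restored = pd.read_json(io.StringIO(split_text), orient="split")

# "values" keeps only the data, so the column names come back as 0 and 1.
values_text = data.to_json(orient="values")
values_frame = pd.read_json(io.StringIO(values_text), orient="values")

print(json.loads(split_text))
print(list(restored.columns))
print(list(values_frame.columns))
```

Passing the same `orient` to `read_json` that was used in `to_json` is what lets pandas interpret the layout correctly.
''')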
| st.subheader("**Issues in Structured JSON Format**") | |
| st.markdown(''' | |
| - `pd.read_json()` expects a regular layout; when the data is heterogeneous, such as a dictionary of dictionaries or a list of dictionaries with nested values, it cannot flatten the nested parts into columns | |
| - To handle this we treat the data as semi-structured JSON and use `pd.json_normalize()`, which can flatten nested structures | |
| ''') | |
| st.header("Semi-structured JSON Format") | |
| st.markdown(''' | |
| - A semi-structured JSON format lacks a fixed schema, allowing irregular or nested structures | |
| - `pd.json_normalize()` takes a dictionary or a list of dictionaries, where each dictionary becomes a single row | |
| - It has several parameters that control how nested JSON is flattened into a DataFrame | |
|     - max_level ---> how many levels of nesting are flattened into columns | |
|     - record_path ---> path to a list of dictionaries that should become the rows | |
|     - meta ---> fields of the enclosing record to repeat as extra columns on each row | |
| ''') | |
| st.subheader("How to read Semi-structured JSON Format?...") | |
| st.code('''import pandas as pd | |
| b = {"name":"a","marks":{"sem1":{"maths":22,"science":23},"sem2":{"maths":24,"science":25}}} | |
| pd.json_normalize(b) | |
| ''') | |
| st.header("Flattening nested JSON into a DataFrame...") | |
| st.subheader("Using max_level... ") | |
| st.code('''import pandas as pd | |
| a = {"name":"harii","age":23,"marks":{"sem1":{"hindi":10,"science":39},"sem2":{"hindi":12,"science":32}}} | |
| pd.json_normalize(a) | |
| pd.json_normalize(a,max_level=1) | |
| ''') | |
| st.markdown(''' | |
| - **max_level** sets how many levels of nesting are flattened into columns; anything deeper stays in the cell as a dictionary | |
| ''') | |
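st.markdown('''
To see the effect of `max_level`, the sketch below compares the column names produced with and without it. It reuses the dictionary from the example above; the variable names `full` and `shallow` are only illustrative.

```python
import pandas as pd

a = {
    "name": "harii",
    "age": 23,
    "marks": {
        "sem1": {"hindi": 10, "science": 39},
        "sem2": {"hindi": 12, "science": 32},
    },
}

full = pd.json_normalize(a)                  # flatten every level of nesting
shallow = pd.json_normalize(a, max_level=1)  # stop after one level of nesting

print(sorted(full.columns))
print(sorted(shallow.columns))
print(shallow["marks.sem1"].iloc[0])
```

Without `max_level`, each subject gets its own column such as `marks.sem1.hindi`. With `max_level=1`, flattening stops at `marks.sem1` and `marks.sem2`, and each of those cells still holds a dictionary.
''')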
| st.subheader("Using record_path and meta...") | |
| st.code('''import pandas as pd | |
| x=[{"name":"p1","age":22,"marks":[{"maths":11,"hindi":41}]},{"name":"p1","age":21,"marks":[{"maths":22,"hindi":31}]}] | |
| pd.json_normalize(x,record_path="marks",meta=["name","age"]) | |
| ''') | |
| st.markdown(''' | |
| - **record_path** names the key whose value is a list of dictionaries; each dictionary in that list becomes one row | |
| - **meta** lists the remaining fields of the enclosing record to add as columns on each row | |
| ''') | |
| st.markdown(''' | |
| - Output will be: | |
| ''') | |
| st.markdown(''' | |
| | maths | hindi | name | age | | |
| |-------|-------|------|-----| | |
| | 11 | 41 | p1 | 22 | | |
| | 22 | 31 | p1 | 21 | | |
| ''') | |
| elif file_type == "HTML": | |
| st.title("HTML") | |
| st.markdown(''' | |
| - HTML **(HyperText Markup Language)** | |
| - HTML is the standard language used to create and structure content on the web, using tags to define elements such as text, images, links, and other multimedia. | |
| ''') | |
| st.subheader("How to read and get the tabular data from the URLs?...") | |
| st.code('''import pandas as pd | |
| data = pd.read_html("https://en.wikipedia.org/wiki/Indian_Premier_League") | |
| data | |
| ''') | |
| st.markdown(''' | |
| - `pd.read_html` returns a list of DataFrames, one for each table on the Indian_Premier_League page | |
| - To get one particular table, pass `match` a word (or regular expression) that appears only in that table | |
| ''') | |
| st.code('''import pandas as pd | |
| data = pd.read_html("https://en.wikipedia.org/wiki/Indian_Premier_League",match="Mitchell Starc") | |
| data | |
| ''') | |
| st.markdown(''' | |
| - It returns a list containing only the tables whose text matches "Mitchell Starc" | |
| ''') | |