# pages/6_Semi_structured_data.py
import streamlit as st
import pandas as pd
st.markdown("""
<style>
/* Set a soft background color */
body {
background-color: #eef2f7;
}
/* Style for main title */
h1 {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 700;
text-align: center;
margin-bottom: 25px;
}
/* Style for headers */
h2 {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 600;
margin-top: 30px;
}
/* Style for subheaders */
h3 {
color: red;
font-family: 'Roboto', sans-serif;
font-weight: 500;
margin-top: 20px;
}
.custom-subheader {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 600;
margin-bottom: 15px;
}
/* Paragraph styling */
p {
font-family: 'Georgia', serif;
line-height: 1.8;
color: black;
margin-bottom: 20px;
}
/* List styling with checkmark bullets */
.icon-bullet {
list-style-type: none;
padding-left: 20px;
}
.icon-bullet li {
font-family: 'Georgia', serif;
font-size: 1.1em;
margin-bottom: 10px;
color: black;
}
.icon-bullet li::before {
content: "◆";
padding-right: 10px;
color: black;
}
/* Sidebar styling */
.sidebar .sidebar-content {
background-color: #ffffff;
border-radius: 10px;
padding: 15px;
}
.sidebar h2 {
color: #495057;
}
/* Custom button style */
.streamlit-button {
background-color: #00FFFF;
color: #000000;
font-weight: bold;
}
</style>
""", unsafe_allow_html=True)
st.subheader("Semi-Structured Data")
st.markdown("""
Semi-structured data doesn't conform to a strict schema but has organizational properties, such as tags or markers, that separate elements. Examples:
<ul class="icon-bullet">
<li>CSV </li>
<li>JSON </li>
<li>HTML </li>
<li>XML</li>
</ul>
""", unsafe_allow_html=True)
st.sidebar.title("Navigation 🧭")
file_type = st.sidebar.radio(
"Choose a file type:",
("CSV", "XML", "JSON", "HTML"))
if file_type == "CSV":
st.title("CSV")
st.markdown('''
- **CSV (Comma-Separated Values)**
- A simple file format used to store tabular data, where each line represents a row and columns are separated by commas.
- It is commonly used for data exchange between applications, such as spreadsheets and databases.
- CSV files are saved with the `.csv` extension.
''')
st.header('**Issues in CSV:**')
st.subheader('''**1. ParserError:**''')
st.markdown('''- This error occurs when a row has more columns than expected.
- It mostly occurs when the CSV was created or edited in a text editor.
- To overcome a parse error we use the `on_bad_lines` parameter, whose default is "error"
- on_bad_lines = "error" -- default, raises an error on a bad line
- on_bad_lines = "skip" -- skips bad lines silently
- on_bad_lines = "warn" -- skips bad lines but prints a warning
''')
st.subheader('**Solution:**')
st.code('''import pandas as pd

# choose one of 'error' (default), 'skip', or 'warn'
pd.read_csv('sample.csv', on_bad_lines='skip')
''')
st.write('------------------------------------------------')
st.subheader('''**2. Encoding:**''')
st.markdown('''- Encoding is the process of translating characters into code points (such as ASCII or Unicode) and then into binary numbers.
- If the wrong encoding is used while reading a CSV, the bytes are decoded into different characters and information is lost; this raises a UnicodeDecodeError.
- Most CSV files use UTF-8, but not all of them.''')
st.subheader('**Solution:**')
st.code('''import pandas as pd
import encodings

# try every known encoding alias until one decodes the file cleanly
for y in encodings.aliases.aliases.keys():
    try:
        pd.read_csv('sample.csv', encoding=y)
        print('{} is a correct encoding'.format(y))
    except UnicodeDecodeError:
        print('{} is not a correct encoding'.format(y))
    except LookupError:
        print('{} is not supported'.format(y))
''')
st.write('------------------------------------------------')
st.subheader('''**3. Out of memory:**''')
st.markdown('''- If we don't have enough memory to load the whole dataset at once, we read it in chunks.
- A CSV is loaded into RAM, so a file larger than memory is not supported unless the data is broken into chunks.
- A chunk is a part of the data; `chunksize` sets the number of rows per chunk.
- If we have 10,000,000 rows and `chunksize=1000`, the data is read as chunks of 1000 rows each.
- The result is a generator-like iterator, not a single DataFrame.
- A generator can return multiple values; it uses `yield` instead of `return`.
- The chunks are produced by an iterable object, so we can loop over them.''')
st.subheader('**Solution:**')
st.code('''
import pandas as pd

# chunksize returns an iterator of DataFrames instead of one big DataFrame
chunks = pd.read_csv('spam.csv', encoding='latin', chunksize=100)
for chunk in chunks:
    print(chunk.shape)  # each chunk holds up to 100 rows
''')
st.subheader('''**4. Takes long time to load a huge dataset:**''')
st.markdown("It takes long time to load a huge dataset")
st.subheader('**Solution:**')
st.markdown('''Polars: a DataFrame library with a pandas-like API
- It is generally faster than pandas because its engine is written in Rust and uses multiple threads
''')
elif file_type == "XML":
st.title("XML")
st.markdown('''
- XML stands for **Extensible** Markup Language
- In XML, we can define our own tags
- It is a flexible, text-based format used for storing and transporting structured data.
- It uses tags to define elements and attributes, making it both human-readable and machine-readable.
''')
# Example : XML Structure
st.subheader('**XML Structure**')
st.markdown('''
A simple XML file
''')
st.code('''
<data>
<person>
<name>Harika</name>
<age>21</age>
<height>145</height>
</person>
<person>
<name>sreeja</name>
<age>22</age>
<height>153</height>
</person>
</data>
''')
st.code('''
import pandas as pd
# Example: Reading a XML file
df = pd.read_xml('data.xml', xpath='/data/person')
print(df)
''')
st.markdown('''
The output DataFrame will look like this:
| name | age | height |
|----------------|------------|------ |
| Harika | 21 | 145 |
| sreeja | 22 | 153 |
''')
st.markdown('''
**`xpath` parameter**:
- Specifies the XML path to extract specific elements.
- For example:
- `xpath='/data/person'`: Extracts all `<person>` elements from `<data>`. ''')
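st.markdown('''
The example above can be run end-to-end from an in-memory string (a sketch; `parser="etree"` uses the standard library, so no extra dependency such as `lxml` is needed, but it only supports simple XPath like `.//person`):

```python
import io
import pandas as pd

xml = """<data>
<person><name>Harika</name><age>21</age><height>145</height></person>
<person><name>sreeja</name><age>22</age><height>153</height></person>
</data>"""

# xpath='.//person' turns each <person> element into one row
df = pd.read_xml(io.StringIO(xml), xpath=".//person", parser="etree")
print(df)
```
''')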
# Example 2: Nested XML Structure
st.subheader('**Nested XML Structure**')
st.markdown('''
A more complex XML file with nested elements and attributes.
''')
st.code('''
<company>
<department id="1" name="HR">
<employee>
<name>John Doe</name>
<position>Manager</position>
</employee>
<employee>
<name>Jane Smith</name>
<position>Assistant</position>
</employee>
</department>
<department id="2" name="Engineering">
<employee>
<name>Emily Johnson</name>
<position>Engineer</position>
</employee>
</department>
</company>
''')
st.code('''
import pandas as pd

# Example: Reading a nested XML file
# xpath='.//employee' selects every <employee> element at any depth
df = pd.read_xml('nested.xml', xpath='.//employee')
print(df)
''')
st.markdown('''
The output DataFrame will look like this:
| name          | position  |
|---------------|-----------|
| John Doe      | Manager   |
| Jane Smith    | Assistant |
| Emily Johnson | Engineer  |
''')
st.markdown('''
1. **`elems_only` parameter**:
- When `True`, only the child elements of the matched nodes are parsed into columns, and attributes are ignored.
- Example:
- `pd.read_xml('nested.xml', xpath='.//employee', elems_only=True)`: keeps only `<name>` and `<position>`.
2. **`attrs_only` parameter**:
- When `True`, only the attributes of the matched nodes are parsed into columns.
- Example:
- `pd.read_xml('nested.xml', xpath='.//department', attrs_only=True)`: keeps the `id` and `name` attributes of each `<department>`.
''')
st.markdown('''
By combining `xpath` with `elems_only` and `attrs_only`, you can parse complex XML files into structured DataFrames.
''')
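st.markdown('''
A minimal sketch of pulling only the attributes (the `id` and `name` attributes of each `<department>`), again from an in-memory string with the standard-library parser:

```python
import io
import pandas as pd

xml = """<company>
<department id="1" name="HR"/>
<department id="2" name="Engineering"/>
</company>"""

# attrs_only=True keeps only the attributes of the matched elements
df = pd.read_xml(io.StringIO(xml), xpath=".//department", parser="etree", attrs_only=True)
print(df)
```
''')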
elif file_type == "JSON":
st.title("JSON")
st.markdown('''
- JSON stands for **JavaScript Object Notation**
- It is a text-based data format used to store and exchange data, structured as key-value pairs and arrays
- JSON can be both
- **Structured** and **Semi-structured**
- The default JSON format is a dictionary-like format
- Keys must always be strings
- JSON is read with `pd.read_json()`; it takes a path or file-like object (recent pandas deprecates passing a raw JSON string, so wrap it in `io.StringIO`)
''')
st.header("Structured JSON Format")
st.markdown('''
- A structured JSON format organizes data hierarchically using key-value pairs, arrays, and nested objects, ensuring readability
- In structured JSON format, orient refers to the way data is organized or structured, particularly when converting between tabular data like DataFrames and JSON.
- It determines the layout of the JSON representation.
- Common orient types are
- ◆ Index
- ◆ Columns
- ◆ Values
- ◆ Split
''')
st.subheader("How to read Structured JSON Format?...")
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree"],"age":[12,13]}'
pd.read_json(io.StringIO(a))
''')
st.header("Converting DataFrame into JSON...")
st.subheader("Orient as Index...")
st.markdown('''
- With `orient="index"`, the JSON keys are the DataFrame's index labels and the values are dictionaries
- Inside each dictionary, the keys are column names and the values are the cell values of that row
''')
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
ind = data.to_json(orient="index")
ind
''')
st.markdown('''
- Output will be :
''')
st.code('''
output = '{"0":{"name":"harii","age":21,"weight":34},"1":{"name":"sree","age":24,"weight":45},"2":{"name":"gowtham","age":25,"weight":67}}'
''')
st.subheader("Orient as Columns...")
st.markdown('''
- With `orient="columns"`, the JSON keys are the column names and the values are dictionaries
- Inside each dictionary, the keys are index labels and the values are the cell values of that column
''')
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
col = data.to_json(orient="columns")
col
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '{"name":{"0":"harii","1":"sree","2":"gowtham"},"age":{"0":21,"1":24,"2":25},"weight":{"0":34,"1":45,"2":67}}'
''')
st.subheader("Orient as Values...")
st.markdown('''
- With `orient="values"`, the DataFrame is converted to a list of lists (a nested list), one inner list per row
''')
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
val = data.to_json(orient="values")
val
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '[["harii",21,34],["sree",24,45],["gowtham",25,67]]'
''')
st.subheader("Orient as Split...")
st.markdown('''
- With `orient="split"`, the column names, index labels, and data are given separately
- It is a dictionary of lists
''')
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
spl = data.to_json(orient="split")
spl
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '{"columns":["name","age","weight"],"index":[0,1,2],"data":[["harii",21,34],["sree",24,45],["gowtham",25,67]]}'
''')
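st.markdown('''
A split-oriented string can be read back into an equivalent DataFrame by passing the same orient (a sketch; `io.StringIO` avoids the deprecated raw-string input in recent pandas):

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["harii", "sree"], "age": [21, 24]})
s = df.to_json(orient="split")

# round-trip: read the string back with the matching orient
df2 = pd.read_json(io.StringIO(s), orient="split")
print(df2)
```
''')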
st.subheader("**Issues in Structured JSON Format**")
st.markdown('''
- `pd.read_json()` expects a flat, tabular layout; when the data is heterogeneous, such as a dictionary of dictionaries or a list of dictionaries, it cannot flatten the nesting
- To handle this issue we use the semi-structured approach, `pd.json_normalize()`, which can handle nested structures
''')
st.header("Semi-structured JSON Format")
st.markdown('''
- A semi-structured JSON format lacks a fixed schema, allowing irregular or nested structures
- `pd.json_normalize()` takes a dictionary or a list of dictionaries, where each dictionary acts as a single row
- It has several parameters that control how nested JSON is flattened
- ◆ max_level ---> how many levels of nesting to flatten into columns
- ◆ record_path ---> used when the values are a list of dictionaries
- ◆ meta ---> used to keep the remaining columns
''')
st.subheader("How to read Semi-structured JSON Format?...")
st.code('''import pandas as pd
b = {"name":"a","marks":{"sem1":{"maths":22,"science":23},"sem2":{"maths":24,"science":25}}}
pd.json_normalize(b)
''')
st.header("Converting DataFrame into JSON...")
st.subheader("Using max_level... ")
st.code('''import pandas as pd
a = {"name":"harii","age":23,"marks":{"sem1":{"hindi":10,"science":39},"sem2":{"hindi":12,"science":32}}}
pd.json_normalize(a)
pd.json_normalize(a,max_level=1)
''')
st.markdown('''
- **max_level** sets how many levels of nesting are flattened into columns
''')
st.markdown('''
- Output will be:
''')
st.markdown('''
| name  | age | marks.sem1                   | marks.sem2                   |
|-------|-----|------------------------------|------------------------------|
| harii | 23  | {'hindi': 10, 'science': 39} | {'hindi': 12, 'science': 32} |
''')
st.subheader("Using record_path and meta...")
st.code('''import pandas as pd
x=[{"name":"p1","age":22,"marks":[{"maths":11,"hindi":41}]},{"name":"p1","age":21,"marks":[{"maths":22,"hindi":31}]}]
pd.json_normalize(x,record_path="marks",meta=["name","age"])
''')
st.markdown('''
- **record_path** is only used when the values are a list of dictionaries
- **meta** is used to keep the remaining columns
''')
st.markdown('''
- Output will be:
''')
st.markdown('''
| maths | hindi | name | age |
|-------|-------|------|-----|
| 11 | 41 | p1 | 22 |
| 22 | 31 | p1 | 21 |
''')