# pages/6_Semi_structured_data.py
import streamlit as st
import pandas as pd
st.markdown("""
<style>
/* Set a soft background color */
body {
background-color: #eef2f7;
}
/* Style for main title */
h1 {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 700;
text-align: center;
margin-bottom: 25px;
}
/* Style for headers */
h2 {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 600;
margin-top: 30px;
}
/* Style for subheaders */
h3 {
color: red;
font-family: 'Roboto', sans-serif;
font-weight: 500;
margin-top: 20px;
}
.custom-subheader {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 600;
margin-bottom: 15px;
}
/* Paragraph styling */
p {
font-family: 'Georgia', serif;
line-height: 1.8;
color: black;
margin-bottom: 20px;
}
/* List styling with checkmark bullets */
.icon-bullet {
list-style-type: none;
padding-left: 20px;
}
.icon-bullet li {
font-family: 'Georgia', serif;
font-size: 1.1em;
margin-bottom: 10px;
color: black;
}
.icon-bullet li::before {
content: "β—†";
padding-right: 10px;
color: black;
}
/* Sidebar styling */
.sidebar .sidebar-content {
background-color: #ffffff;
border-radius: 10px;
padding: 15px;
}
.sidebar h2 {
color: #495057;
}
/* Custom button style */
.streamlit-button {
background-color: #00FFFF;
color: #000000;
font-weight: bold;
}
</style>
""", unsafe_allow_html=True)
st.subheader("Semi-Structured Data")
st.markdown("""
Semi-structured data is data that doesn't conform to a strict schema but has organizational properties, such as tags or markers, to separate elements. Common examples:
<ul class="icon-bullet">
<li>CSV </li>
<li>JSON </li>
<li>HTML </li>
<li>XML</li>
</ul>
""", unsafe_allow_html=True)
st.sidebar.title("Navigation 🧭")
file_type = st.sidebar.radio(
"Choose a file type :",
("CSV", "XML", "JSON", "HTML"))
if file_type == "CSV":
st.title("CSV")
st.markdown('''
- **CSV (Comma-Separated Values)**
- CSV is a simple text-based file format used to store tabular data, where each line represents a row and columns are separated by commas.
- It is commonly used for data exchange between applications, such as spreadsheets and databases.
- CSV files are saved with the `.csv` extension.
''')
st.header('**Issues in CSV:**')
st.subheader('''**1. ParserError:**''')
st.markdown('''- This error occurs when a row contains extra columns.
- It mostly happens when the CSV was created or edited by hand in a text editor.
- To handle it, `pd.read_csv` accepts the `on_bad_lines` parameter, whose default is `"error"`:
- on_bad_lines = "error" -- raise an exception (default)
- on_bad_lines = "skip" -- skip malformed rows silently
- on_bad_lines = "warn" -- skip malformed rows but emit a warning
''')
st.subheader('**Solution:**')
st.code('''import pandas as pd
# choose one of 'error', 'skip' or 'warn'
pd.read_csv('sample.csv', on_bad_lines='skip')
''')
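A minimal, self-contained sketch of how `on_bad_lines` behaves, using an in-memory CSV via `io.StringIO` instead of a file on disk:

```python
import io

import pandas as pd

# A CSV where the second data line has an extra column
raw = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines="skip" drops the malformed row instead of raising ParserError
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df.shape)  # the bad row is gone
```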
st.write('------------------------------------------------')
st.subheader('''**2. Encoding:**''')
st.markdown('''- Encoding is the process of translating characters into numeric code points and then into bytes.
- If the wrong encoding is used while reading a CSV, the bytes are decoded into different characters and information is lost; when the bytes are not even valid in the chosen encoding, pandas raises a `UnicodeDecodeError`.
- Most CSV files are UTF-8 encoded, but not all.''')
st.subheader('**Solution:**')
st.code('''import pandas as pd
import encodings

# try every known encoding alias until one decodes the file
for y in encodings.aliases.aliases.keys():
    try:
        pd.read_csv('sample.csv', encoding=y)
        print('{} is a correct encoding'.format(y))
    except UnicodeDecodeError:
        print('{} is not a correct encoding'.format(y))
    except LookupError:
        print('{} is not supported'.format(y))
''')
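To see the failure mode concretely, here is a small in-memory sketch: bytes written as Latin-1 are not valid UTF-8, so the first read raises `UnicodeDecodeError`, and a second attempt with the right encoding succeeds.

```python
import io

import pandas as pd

# "café" encoded as Latin-1 contains byte 0xE9, which is invalid UTF-8
raw = "name\ncafé\n".encode("latin-1")

try:
    pd.read_csv(io.BytesIO(raw), encoding="utf-8")
except UnicodeDecodeError:
    print("utf-8 failed, retrying with latin-1")

# reading with the correct encoding preserves the accented character
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(df)
```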
st.write('------------------------------------------------')
st.subheader('''**3. Out of memory:**''')
st.markdown('''- If we don't have enough memory to load the whole dataset, we read it in chunks.
- A chunk is a piece of the data; the `chunksize` parameter sets the number of rows per chunk.
- For example, with 10,000,000 rows and `chunksize = 1000`, the data is read as 10,000 chunks of 1,000 rows each.
- The result is a generator-like iterator instead of a single DataFrame.
- A generator can produce multiple values one at a time, using `yield` instead of `return`.
- The returned object is iterable, and each iteration yields one chunk as a DataFrame.''')
st.subheader('**Solution:**')
st.code('''
import pandas as pd
# returns an iterator that yields DataFrames of 100 rows each
chunks = pd.read_csv('spam.csv', encoding='latin', chunksize=100)
for chunk in chunks:
    print(chunk.shape)
''')
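A small in-memory sketch of chunked reading: 10 data rows with `chunksize=4` come back as an iterator yielding DataFrames of 4, 4, and 2 rows.

```python
import io

import pandas as pd

raw = "x\n" + "\n".join(str(i) for i in range(10))

# chunksize turns read_csv into an iterator of DataFrames
reader = pd.read_csv(io.StringIO(raw), chunksize=4)
sizes = [len(chunk) for chunk in reader]
print(sizes)
```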
st.subheader('''**4. Takes a long time to load a huge dataset:**''')
st.markdown("Even when a huge dataset fits in memory, reading it with pandas can take a long time.")
st.subheader('**Solution:**')
st.markdown('''Polars: a DataFrame library with a pandas-like API
- It is faster than pandas because it is written in Rust and parallelizes work across CPU cores
''')
elif file_type == "XML":
st.title("XML")
st.markdown('''
- XML stands for **Extensible** Markup Language
- In XML, we can define our own tags
- It is a flexible, text-based format used for storing and transporting structured data.
- It uses tags to define elements and attributes, making it both human-readable and machine-readable.
''')
# Example : XML Structure
st.subheader('**XML Structure**')
st.markdown('''
A simple XML file
''')
st.code('''
<data>
<person>
<name>Harika</name>
<age>21</age>
<height>145</height>
</person>
<person>
<name>sreeja</name>
<age>22</age>
<height>153</height>
</person>
</data>
''')
st.code('''
import pandas as pd
# Example: Reading a XML file
df = pd.read_xml('data.xml', xpath='/data/person')
print(df)
''')
st.markdown('''
The output DataFrame will look like this:
| name   | age | height |
|--------|-----|--------|
| Harika | 21  | 145    |
| sreeja | 22  | 153    |
''')
st.markdown('''
**`xpath` parameter**:
- Specifies the XML path to extract specific elements.
- For example:
- `xpath='/data/person'`: Extracts all `<person>` elements from `<data>`. ''')
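A runnable sketch of the example above, feeding the XML through `io.StringIO` and using the stdlib `etree` parser so no extra dependency is needed:

```python
import io

import pandas as pd

xml = """<data>
  <person><name>Harika</name><age>21</age><height>145</height></person>
  <person><name>sreeja</name><age>22</age><height>153</height></person>
</data>"""

# xpath selects every <person> element; each one becomes a row
df = pd.read_xml(io.StringIO(xml), xpath=".//person", parser="etree")
print(df)
```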
# Example 2: Nested XML Structure
st.subheader('**Nested XML Structure**')
st.markdown('''
A more complex XML file with nested elements and attributes.
''')
st.code('''
<company>
<department id="1" name="HR">
<employee>
<name>John Doe</name>
<position>Manager</position>
</employee>
<employee>
<name>Jane Smith</name>
<position>Assistant</position>
</employee>
</department>
<department id="2" name="Engineering">
<employee>
<name>Emily Johnson</name>
<position>Engineer</position>
</employee>
</department>
</company>
''')
st.code('''
import pandas as pd
# Example: reading every <employee> element from a nested XML file
df = pd.read_xml('nested.xml', xpath='.//employee')
print(df)
''')
st.markdown('''
The output DataFrame will look like this:

| name          | position  |
|---------------|-----------|
| John Doe      | Manager   |
| Jane Smith    | Assistant |
| Emily Johnson | Engineer  |
''')
st.markdown('''
1. **`elems_only` parameter**:
- When `True`, only the child elements of the matched nodes become columns (attributes are ignored).
- Example: `elems_only=True` with `xpath='.//employee'` keeps `<name>` and `<position>`.
2. **`attrs_only` parameter**:
- When `True`, only the attributes of the matched nodes become columns.
- Example: with `xpath='.//department'`, `attrs_only=True` yields the `id` and `name` attributes of each `<department>`.
''')
st.markdown('''
Note that `read_xml` only reads the nodes matched by `xpath`, so parent attributes (such as each employee's department) are not joined in automatically. By combining `xpath` with `elems_only` and `attrs_only`, you control exactly which parts of an XML file become columns.
''')
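To pull the department attributes instead, point `xpath` at the `<department>` nodes. A small in-memory sketch (stdlib `etree` parser, `io.StringIO` in place of a file):

```python
import io

import pandas as pd

xml = """<company>
  <department id="1" name="HR">
    <employee><name>John Doe</name><position>Manager</position></employee>
  </department>
  <department id="2" name="Engineering">
    <employee><name>Emily Johnson</name><position>Engineer</position></employee>
  </department>
</company>"""

# matching <department> surfaces its id and name attributes as columns
dep = pd.read_xml(io.StringIO(xml), xpath=".//department", parser="etree")
print(dep[["id", "name"]])
```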
elif file_type == "JSON":
st.title("JSON")
st.markdown('''
- JSON stands for **JavaScript Object Notation**
- It is a text-based data format used to store and exchange data, structured as key-value pairs and arrays
- JSON can be both **structured** and **semi-structured**
- The default JSON layout is a dictionary (object) of key-value pairs
- Keys must always be strings
- Flat, tabular JSON is read with `pd.read_json()`, which takes a path or file-like object (passing a literal JSON string is deprecated in recent pandas)
''')
st.header("Structured JSON Format")
st.markdown('''
- A structured JSON format organizes data hierarchically using key-value pairs, arrays, and nested objects, ensuring readability
- In structured JSON format, orient refers to the way data is organized or structured, particularly when converting between tabular data like DataFrames and JSON.
- It determines the layout of the JSON representation.
- Common orient types are
- β—† Index
- β—† Columns
- β—† Values
- β—† Split
''')
st.subheader("How to read Structured JSON Format?...")
st.code('''import io
import pandas as pd
a = '{"name":["harii","sree"],"age":[12,13]}'
pd.read_json(io.StringIO(a))
''')
st.header("Converting DataFrame into JSON...")
st.subheader("Orient as Index...")
st.markdown('''
- With `orient="index"`, each row of the DataFrame becomes one entry: the keys are row indices and the values are dictionaries
- Inside each dictionary, the keys are column names and the values are the values present in the data
''')
st.code('''import io
import pandas as pd
a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
ind = data.to_json(orient="index")
ind
''')
st.markdown('''
- Output will be :
''')
st.code('''
output = '{"0":{"name":"harii","age":21,"weight":34},"1":{"name":"sree","age":24,"weight":45},"2":{"name":"gowtham","age":25,"weight":67}}'
''')
st.subheader("Orient as Columns...")
st.markdown('''
- With `orient="columns"`, the keys are column names and the values are dictionaries
- Inside each dictionary, the keys are row indices and the values are the values present in the data
''')
st.code('''import io
import pandas as pd
a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
col = data.to_json(orient="columns")
col
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '{"name":{"0":"harii","1":"sree","2":"gowtham"},"age":{"0":21,"1":24,"2":25},"weight":{"0":34,"1":45,"2":67}}'
''')
st.subheader("Orient as Values...")
st.markdown('''
- With `orient="values"`, the DataFrame is converted into a list of lists (one inner list per row)
''')
st.code('''import io
import pandas as pd
a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
val = data.to_json(orient="values")
val
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '[["harii",21,34],["sree",24,45],["gowtham",25,67]]'
''')
st.subheader("Orient as Split...")
st.markdown('''
- With `orient="split"`, the column names, the indices, and the data are given separately
- The result is a dictionary of lists
''')
st.code('''import io
import pandas as pd
a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
spl = data.to_json(orient="split")
spl
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '{"columns":["name","age","weight"],"index":[0,1,2],"data":[["harii",21,34],["sree",24,45],["gowtham",25,67]]}'
''')
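All four orient values side by side, as a runnable sketch on a smaller DataFrame (the JSON string is wrapped in `io.StringIO`, since literal strings are deprecated in recent pandas):

```python
import io

import pandas as pd

a = '{"name":["harii","sree"],"age":[21,24]}'
data = pd.read_json(io.StringIO(a))

# print the JSON produced by each orient
for orient in ("index", "columns", "values", "split"):
    print(orient, "->", data.to_json(orient=orient))
```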
st.subheader("**Issues in Structured JSON Format**")
st.markdown('''
- `pd.read_json()` expects a flat, tabular structure; when the data is heterogeneous, like a dictionary of dictionaries or a list of dictionaries, it cannot flatten the nesting
- To handle this we switch to the semi-structured approach, `pd.json_normalize()`, which can handle nested structures
''')
st.header("Semi-structured JSON Format")
st.markdown('''
- A semi-structured JSON format lacks a fixed schema, allowing irregular or nested structures
- `pd.json_normalize()` takes a dict or a list of dicts, where each dict acts as a single row
- Its main parameters for controlling the flattening are:
- β—† `max_level` ---> how many levels of nesting to flatten into columns
- β—† `record_path` ---> used when the values are a list of dictionaries
- β—† `meta` ---> used to carry along the remaining columns
''')
st.subheader("How to read Semi-structured JSON Format?...")
st.code('''import pandas as pd
b = {"name":"a","marks":{"sem1":{"maths":22,"science":23},"sem2":{"maths":24,"science":25}}}
pd.json_normalize(b)
''')
st.header("Converting DataFrame into JSON...")
st.subheader("Using max_level... ")
st.code('''import pandas as pd
a = {"name":"harii","age":23,"marks":{"sem1":{"hindi":10,"science":39},"sem2":{"hindi":12,"science":32}}}
pd.json_normalize(a)
pd.json_normalize(a,max_level=1)
''')
st.markdown('''
- **max_level** sets how many levels of nesting are flattened into columns
''')
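A sketch comparing the flattened column names with and without `max_level`:

```python
import pandas as pd

a = {"name": "harii", "age": 23,
     "marks": {"sem1": {"hindi": 10, "science": 39},
               "sem2": {"hindi": 12, "science": 32}}}

full = pd.json_normalize(a)              # flattens every level
one = pd.json_normalize(a, max_level=1)  # stops one level down

print(list(full.columns))  # nested keys joined with "."
print(list(one.columns))   # stops at 'marks.sem1', 'marks.sem2'
```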
st.subheader("Using record_path and meta...")
st.code('''import pandas as pd
x=[{"name":"p1","age":22,"marks":[{"maths":11,"hindi":41}]},{"name":"p1","age":21,"marks":[{"maths":22,"hindi":31}]}]
pd.json_normalize(x,record_path="marks",meta=["name","age"])
''')
st.markdown('''
- **record_path** is used when the values are a list of dictionaries
- **meta** is used to carry along the remaining columns
''')
st.markdown('''
- Output will be:
''')
st.markdown('''
| maths | hindi | name | age |
|-------|-------|------|-----|
| 11 | 41 | p1 | 22 |
| 22 | 31 | p1 | 21 |
''')
elif file_type == "HTML":
st.title("HTML")
st.markdown('''
- HTML stands for **HyperText Markup Language**
- It is the standard language used to create and structure content on the web, using tags to define elements such as text, images, links, and other multimedia.
''')
st.subheader("How to read and get the tabular data from the URLs?...")
st.code('''import pandas as pd
data = pd.read_html("https://en.wikipedia.org/wiki/Indian_Premier_League")
data
''')
st.markdown('''
- It returns a list of all the tables found on the Indian_Premier_League page
- To get one particular table among them, pass `match=` with a word or phrase unique to that table
''')
st.code('''import pandas as pd
data = pd.read_html("https://en.wikipedia.org/wiki/Indian_Premier_League",match="Mitchell Starc")
data
''')
st.markdown('''
- It returns only the tables containing the text "Mitchell Starc"
''')
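A runnable sketch with an in-memory table instead of a live URL (note that `read_html` needs an HTML parser such as `lxml` or `html5lib` installed):

```python
import io

import pandas as pd

html = """<table>
  <tr><th>team</th><th>titles</th></tr>
  <tr><td>CSK</td><td>5</td></tr>
  <tr><td>MI</td><td>5</td></tr>
</table>"""

# read_html always returns a *list* of DataFrames, one per table found
tables = pd.read_html(io.StringIO(html))
print(tables[0])
```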