# pages/6_Semi_structured_data.py
import streamlit as st
import pandas as pd
st.markdown("""
<style>
/* Set a soft background color */
body {
background-color: #eef2f7;
}
/* Style for main title */
h1 {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 700;
text-align: center;
margin-bottom: 25px;
}
/* Style for headers */
h2 {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 600;
margin-top: 30px;
}
/* Style for subheaders */
h3 {
color: red;
font-family: 'Roboto', sans-serif;
font-weight: 500;
margin-top: 20px;
}
.custom-subheader {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 600;
margin-bottom: 15px;
}
/* Paragraph styling */
p {
font-family: 'Georgia', serif;
line-height: 1.8;
color: black;
margin-bottom: 20px;
}
/* List styling with checkmark bullets */
.icon-bullet {
list-style-type: none;
padding-left: 20px;
}
.icon-bullet li {
font-family: 'Georgia', serif;
font-size: 1.1em;
margin-bottom: 10px;
color: black;
}
.icon-bullet li::before {
content: "◆";
padding-right: 10px;
color: black;
}
/* Sidebar styling */
.sidebar .sidebar-content {
background-color: #ffffff;
border-radius: 10px;
padding: 15px;
}
.sidebar h2 {
color: #495057;
}
/* Custom button style */
.streamlit-button {
background-color: #00FFFF;
color: #000000;
font-weight: bold;
}
</style>
""", unsafe_allow_html=True)
st.subheader("Semi-Structured Data")
st.markdown("""
Semi-structured data doesn't conform to a strict schema but has organizational properties, such as tags or markers, that separate elements. Examples:
<ul class="icon-bullet">
<li>CSV </li>
<li>JSON </li>
<li>HTML </li>
<li>XML</li>
</ul>
""", unsafe_allow_html=True)
st.sidebar.title("Navigation 🧭")
file_type = st.sidebar.radio(
"Choose a file type:",
("CSV", "XML", "JSON", "HTML"))
if file_type == "CSV":
st.title("CSV")
st.markdown('''
- **CSV (Comma-Separated Values)**
- A simple file format used to store tabular data, where each line represents a row and columns are separated by commas.
- It is commonly used for data exchange between applications, such as spreadsheets and databases.
- CSV files are saved with the `.csv` extension.
''')
st.header('**Issues in CSV:**')
st.subheader('''**1. ParserError:**''')
st.markdown('''- This error occurs when a row has more columns than expected.
- It mostly occurs when the CSV was created or edited in a text editor.
- To overcome a parse error we use the `on_bad_lines` parameter, whose default is "error"
- on_bad_lines = "error" -- default, raises an error on a bad line
- on_bad_lines = "skip" -- skips bad lines silently
- on_bad_lines = "warn" -- skips bad lines but prints a warning
''')
st.subheader('**Solution:**')
st.code('''import pandas as pd

# choose one of 'error' (default), 'skip', or 'warn'
pd.read_csv('sample.csv', on_bad_lines='skip')
''')
st.write('------------------------------------------------')
st.subheader('''**2. Encoding:**''')
st.markdown('''- Encoding is the process of translating characters into code points (such as ASCII or Unicode) and then into binary numbers.
- If the wrong encoding is used while reading a CSV, the bytes are decoded into different characters and information is lost; this raises a UnicodeDecodeError.
- Most CSV files use UTF-8, but not all of them.''')
st.subheader('**Solution:**')
st.code('''import pandas as pd
import encodings

# try every known encoding alias until one decodes the file cleanly
for y in encodings.aliases.aliases.keys():
    try:
        pd.read_csv('sample.csv', encoding=y)
        print('{} is a correct encoding'.format(y))
    except UnicodeDecodeError:
        print('{} is not a correct encoding'.format(y))
    except LookupError:
        print('{} is not supported'.format(y))
''')
st.write('------------------------------------------------')
st.subheader('''**3. Out of memory:**''')
st.markdown('''- If we don't have enough memory to load the whole dataset at once, we read it in chunks.
- A CSV is loaded into RAM, so a file larger than memory is not supported unless the data is broken into chunks.
- A chunk is a part of the data; `chunksize` sets the number of rows per chunk.
- If we have 10,000,000 rows and `chunksize=1000`, the data is read as chunks of 1000 rows each.
- The result is a generator-like iterator, not a single DataFrame.
- A generator can return multiple values; it uses `yield` instead of `return`.
- The chunks are produced by an iterable object, so we can loop over them.''')
st.subheader('**Solution:**')
st.code('''
import pandas as pd

# chunksize returns an iterator of DataFrames instead of one big DataFrame
chunks = pd.read_csv('spam.csv', encoding='latin', chunksize=100)
for chunk in chunks:
    print(chunk.shape)  # each chunk holds up to 100 rows
''')
st.subheader('''**4. Takes long time to load a huge dataset:**''')
st.markdown("It takes long time to load a huge dataset")
st.subheader('**Solution:**')
st.markdown('''Polars: a DataFrame library with a pandas-like API
- It is generally faster than pandas because its engine is written in Rust and uses multiple threads
''')
elif file_type == "XML":
st.title("XML")
st.markdown('''
- XML stands for **Extensible** Markup Language
- In XML, we can define our own tags
- It is a flexible, text-based format used for storing and transporting structured data.
- It uses tags to define elements and attributes, making it both human-readable and machine-readable.
''')
# Example : XML Structure
st.subheader('**XML Structure**')
st.markdown('''
A simple XML file
''')
st.code('''
<data>
<person>
<name>Harika</name>
<age>21</age>
<height>145</height>
</person>
<person>
<name>sreeja</name>
<age>22</age>
<height>153</height>
</person>
</data>
''')
st.code('''
import pandas as pd
# Example: Reading a XML file
df = pd.read_xml('data.xml', xpath='/data/person')
print(df)
''')
st.markdown('''
The output DataFrame will look like this:
| name | age | height |
|----------------|------------|------ |
| Harika | 21 | 145 |
| sreeja | 22 | 153 |
''')
st.markdown('''
**`xpath` parameter**:
- Specifies the XML path to extract specific elements.
- For example:
- `xpath='/data/person'`: Extracts all `<person>` elements from `<data>`. ''')
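st.markdown('''
The example above can be run end-to-end from an in-memory string (a sketch; `parser="etree"` uses the standard library, so no extra dependency such as `lxml` is needed, but it only supports simple XPath like `.//person`):

```python
import io
import pandas as pd

xml = """<data>
<person><name>Harika</name><age>21</age><height>145</height></person>
<person><name>sreeja</name><age>22</age><height>153</height></person>
</data>"""

# xpath='.//person' turns each <person> element into one row
df = pd.read_xml(io.StringIO(xml), xpath=".//person", parser="etree")
print(df)
```
''')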
# Example 2: Nested XML Structure
st.subheader('**Nested XML Structure**')
st.markdown('''
A more complex XML file with nested elements and attributes.
''')
st.code('''
<company>
<department id="1" name="HR">
<employee>
<name>John Doe</name>
<position>Manager</position>
</employee>
<employee>
<name>Jane Smith</name>
<position>Assistant</position>
</employee>
</department>
<department id="2" name="Engineering">
<employee>
<name>Emily Johnson</name>
<position>Engineer</position>
</employee>
</department>
</company>
''')
st.code('''
import pandas as pd

# Example: Reading a nested XML file
# xpath='.//employee' selects every <employee> element at any depth
df = pd.read_xml('nested.xml', xpath='.//employee')
print(df)
''')
st.markdown('''
The output DataFrame will look like this:
| name          | position  |
|---------------|-----------|
| John Doe      | Manager   |
| Jane Smith    | Assistant |
| Emily Johnson | Engineer  |
''')
st.markdown('''
1. **`elems_only` parameter**:
- When `True`, only the child elements of the matched nodes are parsed into columns, and attributes are ignored.
- Example:
- `pd.read_xml('nested.xml', xpath='.//employee', elems_only=True)`: keeps only `<name>` and `<position>`.
2. **`attrs_only` parameter**:
- When `True`, only the attributes of the matched nodes are parsed into columns.
- Example:
- `pd.read_xml('nested.xml', xpath='.//department', attrs_only=True)`: keeps the `id` and `name` attributes of each `<department>`.
''')
st.markdown('''
By combining `xpath` with `elems_only` and `attrs_only`, you can parse complex XML files into structured DataFrames.
''')
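st.markdown('''
A minimal sketch of pulling only the attributes (the `id` and `name` attributes of each `<department>`), again from an in-memory string with the standard-library parser:

```python
import io
import pandas as pd

xml = """<company>
<department id="1" name="HR"/>
<department id="2" name="Engineering"/>
</company>"""

# attrs_only=True keeps only the attributes of the matched elements
df = pd.read_xml(io.StringIO(xml), xpath=".//department", parser="etree", attrs_only=True)
print(df)
```
''')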
elif file_type == "JSON":
st.title("JSON")
st.markdown('''
- JSON stands for **JavaScript Object Notation**
- It is a text-based data format used to store and exchange data, structured as key-value pairs and arrays
- JSON can be both
- **Structured** and **Semi-structured**
- The default JSON format is a dictionary-like format
- Keys must always be strings
- JSON is read with `pd.read_json()`; it takes a path or file-like object (recent pandas deprecates passing a raw JSON string, so wrap it in `io.StringIO`)
''')
st.header("Structured JSON Format")
st.markdown('''
- A structured JSON format organizes data hierarchically using key-value pairs, arrays, and nested objects, ensuring readability
- In structured JSON format, orient refers to the way data is organized or structured, particularly when converting between tabular data like DataFrames and JSON.
- It determines the layout of the JSON representation.
- Common orient types are
- ◆ Index
- ◆ Columns
- ◆ Values
- ◆ Split
''')
st.subheader("How to read Structured JSON Format?...")
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree"],"age":[12,13]}'
pd.read_json(io.StringIO(a))
''')
st.header("Converting DataFrame into JSON...")
st.subheader("Orient as Index...")
st.markdown('''
- With `orient="index"`, the JSON keys are the DataFrame's index labels and the values are dictionaries
- Inside each dictionary, the keys are column names and the values are the cell values of that row
''')
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
ind = data.to_json(orient="index")
ind
''')
st.markdown('''
- Output will be :
''')
st.code('''
output = '{"0":{"name":"harii","age":21,"weight":34},"1":{"name":"sree","age":24,"weight":45},"2":{"name":"gowtham","age":25,"weight":67}}'
''')
st.subheader("Orient as Columns...")
st.markdown('''
- With `orient="columns"`, the JSON keys are the column names and the values are dictionaries
- Inside each dictionary, the keys are index labels and the values are the cell values of that column
''')
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
col = data.to_json(orient="columns")
col
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '{"name":{"0":"harii","1":"sree","2":"gowtham"},"age":{"0":21,"1":24,"2":25},"weight":{"0":34,"1":45,"2":67}}'
''')
st.subheader("Orient as Values...")
st.markdown('''
- With `orient="values"`, the DataFrame is converted to a list of lists (a nested list), one inner list per row
''')
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
val = data.to_json(orient="values")
val
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '[["harii",21,34],["sree",24,45],["gowtham",25,67]]'
''')
st.subheader("Orient as Split...")
st.markdown('''
- With `orient="split"`, the column names, index labels, and data are given separately
- It is a dictionary of lists
''')
st.code('''import pandas as pd
import io

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(io.StringIO(a))
spl = data.to_json(orient="split")
spl
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '{"columns":["name","age","weight"],"index":[0,1,2],"data":[["harii",21,34],["sree",24,45],["gowtham",25,67]]}'
''')
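st.markdown('''
A split-oriented string can be read back into an equivalent DataFrame by passing the same orient (a sketch; `io.StringIO` avoids the deprecated raw-string input in recent pandas):

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["harii", "sree"], "age": [21, 24]})
s = df.to_json(orient="split")

# round-trip: read the string back with the matching orient
df2 = pd.read_json(io.StringIO(s), orient="split")
print(df2)
```
''')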
st.subheader("**Issues in Structured JSON Format**")
st.markdown('''
- `pd.read_json()` expects a flat, tabular layout; when the data is heterogeneous, such as a dictionary of dictionaries or a list of dictionaries, it cannot flatten the nesting
- To handle this issue we use the semi-structured approach, `pd.json_normalize()`, which can handle nested structures
''')
st.header("Semi-structured JSON Format")
st.markdown('''
- A semi-structured JSON format lacks a fixed schema, allowing irregular or nested structures
- `pd.json_normalize()` takes a dictionary or a list of dictionaries, where each dictionary acts as a single row
- It has several parameters that control how nested JSON is flattened
- ◆ max_level ---> how many levels of nesting to flatten into columns
- ◆ record_path ---> used when the values are a list of dictionaries
- ◆ meta ---> used to keep the remaining columns
''')
st.subheader("How to read Semi-structured JSON Format?...")
st.code('''import pandas as pd
b = {"name":"a","marks":{"sem1":{"maths":22,"science":23},"sem2":{"maths":24,"science":25}}}
pd.json_normalize(b)
''')
st.header("Converting DataFrame into JSON...")
st.subheader("Using max_level... ")
st.code('''import pandas as pd
a = {"name":"harii","age":23,"marks":{"sem1":{"hindi":10,"science":39},"sem2":{"hindi":12,"science":32}}}
pd.json_normalize(a)
pd.json_normalize(a,max_level=1)
''')
st.markdown('''
- **max_level** sets how many levels of nesting are flattened into columns
''')
st.markdown('''
- Output will be:
''')
st.markdown('''
| name  | age | marks.sem1                   | marks.sem2                   |
|-------|-----|------------------------------|------------------------------|
| harii | 23  | {'hindi': 10, 'science': 39} | {'hindi': 12, 'science': 32} |
''')
st.subheader("Using record_path and meta...")
st.code('''import pandas as pd
x=[{"name":"p1","age":22,"marks":[{"maths":11,"hindi":41}]},{"name":"p1","age":21,"marks":[{"maths":22,"hindi":31}]}]
pd.json_normalize(x,record_path="marks",meta=["name","age"])
''')
st.markdown('''
- **record_path** is only used when the values are a list of dictionaries
- **meta** is used to keep the remaining columns
''')
st.markdown('''
- Output will be:
''')
st.markdown('''
| maths | hindi | name | age |
|-------|-------|------|-----|
| 11 | 41 | p1 | 22 |
| 22 | 31 | p1 | 21 |
''')