import streamlit as st
import pandas as pd

st.markdown(""" """, unsafe_allow_html=True)

st.subheader("Semi-Structured Data")
st.markdown("""
Semi-structured data does not conform to a strict schema, but it has organizational properties, such as tags or markers, that separate its elements.

Examples include the file types available in the sidebar.
""", unsafe_allow_html=True)

st.sidebar.title("Navigation 🧭")
file_type = st.sidebar.radio(
    "Choose a file type:",
    ("CSV", "XML", "JSON", "HTML"))

if file_type == "CSV":
    st.title("CSV")
    st.markdown('''
- **CSV (Comma-Separated Values)**
    - CSV is a simple file format used to store tabular data, where each line represents a row and columns are separated by commas.
    - It is commonly used for data exchange between applications, such as spreadsheets and databases.
    - CSV files are saved with the `.csv` extension.
''')

    st.header('**Issues in CSV:**')
    st.subheader('''**1. ParserError:**''')
    st.markdown('''
- This error occurs when a row has more columns than expected.
- It mostly occurs when the CSV was created by hand in a text editor.
- To handle it, use the `on_bad_lines` parameter of `pd.read_csv`, which defaults to `"error"`:
    - `on_bad_lines="error"` -- default, raises `ParserError`
    - `on_bad_lines="skip"` -- skips the bad rows
    - `on_bad_lines="warn"` -- skips the bad rows and emits a warning
''')
    st.subheader('**Solution:**')
    st.code('''import pandas as pd

# on_bad_lines accepts 'error' (default), 'skip', or 'warn'
pd.read_csv('sample.csv', on_bad_lines='warn')
''')
    st.write('------------------------------------------------')

    st.subheader('''**2. Encoding:**''')
    st.markdown('''- Encoding is the process of translating characters into numeric codes (such as ASCII or Unicode), which are then stored as binary.
- Decoding reverses this to recover the characters; when the bytes are not valid for the chosen encoding, pandas raises a `UnicodeDecodeError`.
- If the wrong encoding is used while reading a CSV, characters can be decoded to the wrong values, which loses information.
- Most CSV files are encoded in UTF-8, but not all of them.''')
    st.subheader('**Solution:**')
    st.code('''import pandas as pd
import encodings.aliases

candidates = encodings.aliases.aliases.keys()  # aliases of all known encodings
for enc in candidates:
    try:
        pd.read_csv('sample.csv', encoding=enc)
        print('{} decodes the file without errors'.format(enc))
    except (UnicodeError, pd.errors.ParserError):
        print('{} is not a correct encoding'.format(enc))
    except LookupError:
        print('{} is not supported'.format(enc))
''')
    st.write('------------------------------------------------')

    st.subheader('''**3. Out of memory:**''')
    st.markdown('''- If there is not enough memory to load the whole dataset, read it in chunks.
- `pd.read_csv` loads the file into RAM, so a file larger than the available memory has to be broken into chunks.
- A chunk is a part of the data; `chunksize` sets the number of rows per chunk.
- For a file with 10,000,000 rows and `chunksize=1000`, the data is read as 10,000 chunks of 1000 rows each.
- With `chunksize`, `pd.read_csv` returns an iterator instead of a DataFrame.
- Like a generator, which uses `yield` instead of `return` to produce multiple values, it yields one chunk at a time.
- Each chunk is a DataFrame, and the returned object is iterable.''')
    st.subheader('**Solution:**')
    st.code('''
import pandas as pd

chunks = pd.read_csv('spam.csv', encoding='latin', chunksize=100)
for chunk in chunks:
    print(chunk.shape)  # each chunk is a DataFrame with up to 100 rows
''')

    st.subheader('''**4. Takes a long time to load a huge dataset:**''')
    st.markdown("Loading a huge dataset with pandas can take a long time.")
    st.subheader('**Solution:**')
    st.markdown('''Polars: a DataFrame library with an interface similar to pandas
- It is often faster than pandas on large datasets
''')

elif file_type == "XML":
    st.title("XML")
    st.markdown('''
- XML stands for Extensible Markup Language.
- In XML, we can define our own tags.
- It is a flexible, text-based format used for storing and transporting structured data.
- It uses tags to define elements and attributes, making it both human-readable and machine-readable.
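- Because its tag set can be extended with user-defined tags, it is known as **Extensible** Markup Language.
''')

    # Example: XML structure
    st.subheader('**XML Structure**')
    st.markdown('''
A simple XML file:
''')
    st.code('''
<data>
    <person>
        <name>Harika</name>
        <age>21</age>
        <height>145</height>
    </person>
    <person>
        <name>sreeja</name>
        <age>22</age>
        <height>153</height>
    </person>
</data>
''')
    st.code('''
import pandas as pd

# Example: reading an XML file
df = pd.read_xml('data.xml', xpath='/data/person')
print(df)
''')
    st.markdown('''
The output DataFrame will look like this:

| name   | age | height |
|--------|-----|--------|
| Harika | 21  | 145    |
| sreeja | 22  | 153    |
''')
    st.markdown('''
**`xpath` parameter**:
- Specifies the XML path to extract specific elements.
- For example:
    - `xpath='/data/person'`: Extracts all `<person>` elements from `<data>`.
''')

    # Example 2: Nested XML structure
    st.subheader('**Nested XML Structure**')
    st.markdown('''
A more complex XML file with nested elements and attributes.
''')
    st.code('''
<!-- the root element name is assumed; the parsing code does not depend on it -->
<company>
    <department id="1" name="HR">
        <employee>
            <name>John Doe</name>
            <position>Manager</position>
        </employee>
        <employee>
            <name>Jane Smith</name>
            <position>Assistant</position>
        </employee>
    </department>
    <department id="2" name="Engineering">
        <employee>
            <name>Emily Johnson</name>
            <position>Engineer</position>
        </employee>
    </department>
</company>
''')
    st.code('''
import pandas as pd
import xml.etree.ElementTree as ET

# Example: reading a nested XML file.
# pd.read_xml has no elem_cols/attr_cols parameters (they belong to
# DataFrame.to_xml), so parent attributes are collected with ElementTree.
rows = []
for dept in ET.parse('nested.xml').getroot().iter('department'):
    for emp in dept.iter('employee'):
        rows.append({
            'id': dept.get('id'),                 # attribute of <department>
            'department name': dept.get('name'),  # attribute of <department>
            'name': emp.findtext('name'),         # child of <employee>
            'position': emp.findtext('position'),
        })
df = pd.DataFrame(rows)
print(df)
''')
    st.markdown('''
The output DataFrame will look like this:

| id | department name | name          | position  |
|----|-----------------|---------------|-----------|
| 1  | HR              | John Doe      | Manager   |
| 1  | HR              | Jane Smith    | Assistant |
| 2  | Engineering     | Emily Johnson | Engineer  |
''')
    st.markdown('''
1. **Child elements**:
    - The child tags (elements) you want to include in the DataFrame.
    - Example:
        - `emp.findtext('name')` and `emp.findtext('position')`: Extract `<name>` and `<position>` from `<employee>` tags.

2. **Parent attributes**:
    - The attributes of the parent elements to include in the DataFrame.
    - Example:
        - `dept.get('id')` and `dept.get('name')`: Extract the `id` and `name` attributes from the `<department>` tag.

Note that `elem_cols` and `attr_cols` are parameters of `DataFrame.to_xml`, used when writing XML; `pd.read_xml` does not accept them.
''')
    st.markdown('''
By combining `xpath` for flat files with a short ElementTree loop for parent attributes, you can parse complex XML files into structured DataFrames.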
''')

elif file_type == "JSON":
    st.title("JSON")
    st.markdown('''
- JSON **(JavaScript Object Notation)**
- It is a text-based data format used to store and exchange data, structured as key-value pairs and arrays.
- JSON can be both **Structured** and **Semi-structured**.
- The default JSON format resembles a Python dictionary.
- Keys must always be strings.
- JSON is read with **pd.read_json()**, which takes JSON text (a file or a string buffer), not a Python dictionary.
''')

    st.header("Structured JSON Format")
    st.markdown('''
- A structured JSON format organizes data hierarchically using key-value pairs, arrays, and nested objects, which keeps it readable.
- In structured JSON format, `orient` refers to the way data is laid out when converting between tabular data, such as DataFrames, and JSON.
- It determines the layout of the JSON representation.
- Common orient types are:
    - ◆ `index`
    - ◆ `columns`
    - ◆ `values`
    - ◆ `split`
''')

    st.subheader("How to read Structured JSON Format?...")
    st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree"],"age":[12,13]}'
# Wrap the JSON text in StringIO; newer pandas versions deprecate passing it directly
pd.read_json(StringIO(a))
''')

    st.header("Converting DataFrame into JSON...")
    st.subheader("Orient as Index...")
    st.markdown('''
- With `orient="index"`, the keys of the JSON are the index labels and each value is a dictionary.
- Inside each dictionary, the keys are column names and the values are the data in that row.
''')
    st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(StringIO(a))
ind = data.to_json(orient="index")
ind
''')
    st.markdown('''
- Output will be:
''')
    st.code('''
output = '{"0":{"name":"harii","age":21,"weight":34},"1":{"name":"sree","age":24,"weight":45},"2":{"name":"gowtham","age":25,"weight":67}}'
''')

    st.subheader("Orient as Columns...")
    st.markdown('''
- With `orient="columns"`, the keys of the JSON are the column names and each value is a dictionary.
- Inside each dictionary, the keys are index labels and the values are the data in that column.
''')
    st.code('''import pandas as pd
a = \
'{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
from io import StringIO  # newer pandas versions expect a file-like object

data = pd.read_json(StringIO(a))
col = data.to_json(orient="columns")
col
''')
    st.markdown('''
- Output will be:
''')
    st.code('''
output = '{"name":{"0":"harii","1":"sree","2":"gowtham"},"age":{"0":21,"1":24,"2":25},"weight":{"0":34,"1":45,"2":67}}'
''')

    st.subheader("Orient as Values...")
    st.markdown('''
- With `orient="values"`, the JSON is a list of lists (a nested list), with one inner list per row.
''')
    st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(StringIO(a))
val = data.to_json(orient="values")
val
''')
    st.markdown('''
- Output will be:
''')
    st.code('''
output = '[["harii",21,34],["sree",24,45],["gowtham",25,67]]'
''')

    st.subheader("Orient as Split...")
    st.markdown('''
- With `orient="split"`, the JSON gives the column names, index labels, and data separately.
- It is a dictionary of lists.
''')
    st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(StringIO(a))
spl = data.to_json(orient="split")
spl
''')
    st.markdown('''
- Output will be:
''')
    st.code('''
output = '{"columns":["name","age","weight"],"index":[0,1,2],"data":[["harii",21,34],["sree",24,45],["gowtham",25,67]]}'
''')

    st.subheader("**Issues in Structured JSON Format**")
    st.markdown('''
- `pd.read_json()` reads only JSON text with a regular layout; when the data is heterogeneous, such as a dictionary of dictionaries or a list of dictionaries, it cannot flatten the nested values into columns.
- To handle this, we use the semi-structured approach with `pd.json_normalize()`, which can handle nested structures.
''')

    st.header("Semi-structured JSON Format")
    st.markdown('''
- A semi-structured JSON format lacks a fixed schema, allowing irregular or nested structures.
- `pd.json_normalize()` takes a dictionary or a list of dictionaries, where each dictionary acts as a single row.
- It has parameters that control how the JSON is converted into a DataFrame:
    - ◆ `max_level` ---> how
many levels of nesting are flattened into columns
    - ◆ `record_path` ---> only used when the values are a list of dictionaries
    - ◆ `meta` ---> used to keep the remaining columns alongside each record
''')

    st.subheader("How to read Semi-structured JSON Format?...")
    st.code('''import pandas as pd

b = {"name":"a","marks":{"sem1":{"maths":22,"science":23},"sem2":{"maths":24,"science":25}}}
pd.json_normalize(b)
''')

    st.header("Converting JSON into DataFrame...")
    st.subheader("Using max_level... ")
    st.code('''import pandas as pd

a = {"name":"harii","age":23,"marks":{"sem1":{"hindi":10,"science":39},"sem2":{"hindi":12,"science":32}}}
pd.json_normalize(a)               # flattens every level: marks.sem1.hindi, ...
pd.json_normalize(a, max_level=1)  # stops after one level: marks.sem1, marks.sem2
''')
    st.markdown('''
- **max_level** sets how many levels of nesting are flattened into columns
''')
    st.markdown('''
- Output with `max_level=1` will be:
''')
    st.markdown('''
| name  | age | marks.sem1                   | marks.sem2                   |
|-------|-----|------------------------------|------------------------------|
| harii | 23  | {'hindi': 10, 'science': 39} | {'hindi': 12, 'science': 32} |
''')

    st.subheader("Using record_path and meta...")
    st.code('''import pandas as pd

x = [{"name":"p1","age":22,"marks":[{"maths":11,"hindi":41}]},{"name":"p1","age":21,"marks":[{"maths":22,"hindi":31}]}]
pd.json_normalize(x, record_path="marks", meta=["name","age"])
''')
    st.markdown('''
- **record_path** is only used when the values are a list of dictionaries
- **meta** is used to keep the remaining columns alongside each record
''')
    st.markdown('''
- Output will be:
''')
    st.markdown('''
| maths | hindi | name | age |
|-------|-------|------|-----|
| 11    | 41    | p1   | 22  |
| 22    | 31    | p1   | 21  |
''')