Update pages/6_Semi_structured_data.py
pages/6_Semi_structured_data.py
CHANGED
@@ -77,3 +77,83 @@ st.markdown("""
 </style>
 """, unsafe_allow_html=True)
 
+st.subheader("Semi-Structured Data")
+st.markdown("""
+Semi-structured data does not conform to a strict schema, but it has organizational properties, such as tags or markers, that separate its elements. Examples:
+<ul class="icon-bullet">
+    <li>CSV</li>
+    <li>JSON</li>
+    <li>HTML</li>
+    <li>XML</li>
+</ul>
+""", unsafe_allow_html=True)
+
+st.sidebar.title("Navigation 🧭")
+file_type = st.sidebar.radio(
+    "Choose a file type:",
+    ("CSV", "XML", "JSON", "HTML"))
+
+if file_type == "CSV":
+    st.title("CSV")
+    st.markdown('''
+- **CSV (Comma-Separated Values)**
+- CSV is a simple file format for storing tabular data: each line represents a row, and columns are separated by commas.
+- It is commonly used for data exchange between applications, such as spreadsheets and databases.
+- CSV files are saved with the `.csv` extension.
+''')
+
+    st.header('**Issues in CSV:**')
+    st.subheader('''**1. ParserError:**''')
+    st.markdown('''- This error occurs when a row contains more columns than expected.
+- It mostly occurs when the CSV was created by hand in a text editor.
+- To handle it, `read_csv` has an `on_bad_lines` parameter, which defaults to `"error"`:
+    - on_bad_lines = "error" -- raise an exception (default)
+    - on_bad_lines = "skip" -- skip malformed rows
+    - on_bad_lines = "warn" -- skip malformed rows but emit a warning
+''')
+    st.subheader('**Solution:**')
+    st.code('''import pandas as pd
+# on_bad_lines accepts one of 'error', 'skip', or 'warn'
+pd.read_csv('sample.csv', on_bad_lines='warn')
+''')
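Outside the diff, the `on_bad_lines` options described above can be demonstrated with a small in-memory CSV (the data below is invented for illustration):

```python
import io

import pandas as pd

# A CSV whose second record has an extra column (4 fields instead of 3).
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# The default, on_bad_lines='error', raises ParserError.
try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as exc:
    print("default:", type(exc).__name__)

# on_bad_lines='skip' silently drops the malformed record.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df.shape)  # (2, 3): the bad row is gone
```

In recent pandas versions `on_bad_lines` can also (with the Python engine) take a callable that receives each bad row, allowing it to be repaired rather than dropped.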
+    st.write('------------------------------------------------')
+    st.subheader('''**2. Encoding:**''')
+    st.markdown('''- Encoding is the process of translating characters into code points and then into binary numbers.
+- Reading a file with the wrong encoding raises a `UnicodeDecodeError`.
+- If the proper encoding is not used while reading a CSV, the bytes are decoded into the wrong characters and information is lost.
+- Most CSV files are encoded in UTF-8, but not all.''')
+    st.subheader('**Solution:**')
+    st.code('''import pandas as pd
+import encodings
+
+candidates = encodings.aliases.aliases.keys()  # all known encoding aliases
+for y in candidates:
+    try:
+        pd.read_csv('sample.csv', encoding=y)
+        print('{} is a correct encoding'.format(y))
+    except UnicodeDecodeError:
+        print('{} is not a correct encoding'.format(y))
+    except LookupError:
+        print('{} is not supported'.format(y))
+''')
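As a sketch of the decoding failure described above, using an in-memory byte string instead of a file (the names are invented):

```python
import io

import pandas as pd

# 'é' in Latin-1 is the single byte 0xE9, which is not valid UTF-8.
raw = "name,city\nRené,Liège\n".encode("latin-1")

try:
    pd.read_csv(io.BytesIO(raw))  # pandas assumes UTF-8 by default
except UnicodeDecodeError:
    print("UTF-8 decoding failed")

# Supplying the right encoding preserves the characters.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(df.loc[0, "name"])  # René
```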
+    st.write('------------------------------------------------')
+    st.subheader('''**3. Out of memory:**''')
+    st.markdown('''- If there is not enough memory to load the whole dataset, we split it into chunks.
+- A CSV is loaded into RAM, so a huge file may not fit; reading it in chunks avoids this.
+- A chunk is a part of the data; `chunksize` sets the number of rows per chunk.
+- If we have 10,000,000 rows & chunksize = 1000, the data is read in pieces of 1000 rows each, called chunks.
+- The output is an iterator, similar to a generator.
+- A generator can produce multiple values; it uses `yield` instead of `return`.
+- The chunks come from a single object, and that object is iterable.''')
+    st.subheader('**Solution:**')
+    st.code('''
+import pandas as pd
+pd.read_csv('spam.csv', encoding='latin', chunksize=100)
+''')
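The chunked read returns an iterator of DataFrames rather than one big frame; a minimal sketch with in-memory data:

```python
import io

import pandas as pd

# 10 data rows; chunksize=4 yields DataFrames of 4, 4, and 2 rows.
raw = "x\n" + "\n".join(str(i) for i in range(10)) + "\n"

reader = pd.read_csv(io.StringIO(raw), chunksize=4)
sizes = [len(chunk) for chunk in reader]  # each chunk is a DataFrame
print(sizes)  # [4, 4, 2]
```

Only one chunk needs to be in memory at a time, which is what keeps a huge file from exhausting RAM.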
+    st.subheader('''**4. Takes a long time to load a huge dataset:**''')
+    st.markdown("Loading a huge dataset with pandas can take a long time.")
+    st.subheader('**Solution:**')
+    st.markdown('''Polars: a DataFrame library with a pandas-like API
+- It is faster than pandas
+''')
+
+