Update pages/6_Semi_structured_data.py
pages/6_Semi_structured_data.py
CHANGED
@@ -77,3 +77,83 @@ st.markdown("""
 </style>
 """, unsafe_allow_html=True)
 
+st.subheader("Semi-Structured Data")
+st.markdown("""
+Semi-structured data does not conform to a strict schema, but it has organizational properties, such as tags or markers, that separate its elements. Examples:
+<ul class="icon-bullet">
+    <li>CSV</li>
+    <li>JSON</li>
+    <li>HTML</li>
+    <li>XML</li>
+</ul>
+""", unsafe_allow_html=True)
+
+st.sidebar.title("Navigation 🧭")
+file_type = st.sidebar.radio(
+    "Choose a file type:",
+    ("CSV", "XML", "JSON", "HTML"))
+
+if file_type == "CSV":
+    st.title("CSV")
+    st.markdown('''
+- **CSV (Comma-Separated Values)**
+- CSV is a simple file format for storing tabular data: each line represents a row, and columns are separated by commas.
+- It is commonly used for data exchange between applications, such as spreadsheets and databases.
+- CSV files are saved with the `.csv` extension.
+''')
+
+    st.header('**Issues in CSV:**')
+    st.subheader('''**1. ParserError:**''')
+    st.markdown('''- This error occurs when a row contains more columns than expected.
+- It mostly occurs when the CSV was created by hand in a text editor.
+- To handle it, `read_csv` has an `on_bad_lines` parameter, which defaults to `"error"`:
+    - on_bad_lines = "error" -- raise an exception (default)
+    - on_bad_lines = "skip" -- skip malformed rows
+    - on_bad_lines = "warn" -- skip malformed rows but emit a warning
+''')
+    st.subheader('**Solution:**')
+    st.code('''import pandas as pd
+# on_bad_lines accepts one of 'error', 'skip', or 'warn'
+pd.read_csv('sample.csv', on_bad_lines='warn')
+''')
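Outside the diff, the `on_bad_lines` options described above can be demonstrated with a small in-memory CSV (the data below is invented for illustration):

```python
import io

import pandas as pd

# A CSV whose second record has an extra column (4 fields instead of 3).
raw = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# The default, on_bad_lines='error', raises ParserError.
try:
    pd.read_csv(io.StringIO(raw))
except pd.errors.ParserError as exc:
    print("default:", type(exc).__name__)

# on_bad_lines='skip' silently drops the malformed record.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df.shape)  # (2, 3): the bad row is gone
```

In recent pandas versions `on_bad_lines` can also (with the Python engine) take a callable that receives each bad row, allowing it to be repaired rather than dropped.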
+    st.write('------------------------------------------------')
+    st.subheader('''**2. Encoding:**''')
+    st.markdown('''- Encoding is the process of translating characters into code points and then into binary numbers.
+- Reading a file with the wrong encoding raises a `UnicodeDecodeError`.
+- If the proper encoding is not used while reading a CSV, the bytes are decoded into the wrong characters and information is lost.
+- Most CSV files are encoded in UTF-8, but not all.''')
+    st.subheader('**Solution:**')
+    st.code('''import pandas as pd
+import encodings
+
+candidates = encodings.aliases.aliases.keys()  # all known encoding aliases
+for y in candidates:
+    try:
+        pd.read_csv('sample.csv', encoding=y)
+        print('{} is a correct encoding'.format(y))
+    except UnicodeDecodeError:
+        print('{} is not a correct encoding'.format(y))
+    except LookupError:
+        print('{} is not supported'.format(y))
+''')
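As a sketch of the decoding failure described above, using an in-memory byte string instead of a file (the names are invented):

```python
import io

import pandas as pd

# 'é' in Latin-1 is the single byte 0xE9, which is not valid UTF-8.
raw = "name,city\nRené,Liège\n".encode("latin-1")

try:
    pd.read_csv(io.BytesIO(raw))  # pandas assumes UTF-8 by default
except UnicodeDecodeError:
    print("UTF-8 decoding failed")

# Supplying the right encoding preserves the characters.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(df.loc[0, "name"])  # René
```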
+    st.write('------------------------------------------------')
+    st.subheader('''**3. Out of memory:**''')
+    st.markdown('''- If there is not enough memory to load the whole dataset, we split it into chunks.
+- A CSV is loaded into RAM, so a huge file may not fit; reading it in chunks avoids this.
+- A chunk is a part of the data; `chunksize` sets the number of rows per chunk.
+- If we have 10,000,000 rows & chunksize = 1000, the data is read in pieces of 1000 rows each, called chunks.
+- The output is an iterator, similar to a generator.
+- A generator can produce multiple values; it uses `yield` instead of `return`.
+- The chunks come from a single object, and that object is iterable.''')
+    st.subheader('**Solution:**')
+    st.code('''
+import pandas as pd
+pd.read_csv('spam.csv', encoding='latin', chunksize=100)
+''')
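The chunked read returns an iterator of DataFrames rather than one big frame; a minimal sketch with in-memory data:

```python
import io

import pandas as pd

# 10 data rows; chunksize=4 yields DataFrames of 4, 4, and 2 rows.
raw = "x\n" + "\n".join(str(i) for i in range(10)) + "\n"

reader = pd.read_csv(io.StringIO(raw), chunksize=4)
sizes = [len(chunk) for chunk in reader]  # each chunk is a DataFrame
print(sizes)  # [4, 4, 2]
```

Only one chunk needs to be in memory at a time, which is what keeps a huge file from exhausting RAM.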
+    st.subheader('''**4. Takes a long time to load a huge dataset:**''')
+    st.markdown("Loading a huge dataset with pandas can take a long time.")
+    st.subheader('**Solution:**')
+    st.markdown('''Polars: a DataFrame library with a pandas-like API
+- It is faster than pandas
+''')
+
+