Harika22 commited on
Commit
63c5d1c
·
verified ·
1 Parent(s): a8fe8ef

Update pages/6_Semi_structured_data.py

Files changed (1)
  1. pages/6_Semi_structured_data.py +80 -0
pages/6_Semi_structured_data.py CHANGED
@@ -77,3 +77,83 @@ st.markdown("""
 </style>
 """, unsafe_allow_html=True)
+ st.subheader("Semi-Structured Data")
+ st.markdown("""
+ Semi-structured data does not conform to a strict schema, but it has organizational properties, such as tags or markers, that separate its elements. Examples:
+ <ul class="icon-bullet">
+ <li>CSV</li>
+ <li>JSON</li>
+ <li>HTML</li>
+ <li>XML</li>
+ </ul>
+ """, unsafe_allow_html=True)
+
+ st.sidebar.title("Navigation 🧭")
+ file_type = st.sidebar.radio(
+     "Choose a file type:",
+     ("CSV", "XML", "JSON", "HTML"))
+
+ if file_type == "CSV":
+     st.title("CSV")
+     st.markdown('''
+     - **CSV (Comma-Separated Values)**
+         - A simple file format used to store tabular data, where each line represents a row and columns are separated by commas.
+         - It is commonly used for data exchange between applications, such as spreadsheets and databases.
+         - CSV files are saved with the `.csv` extension.
+     ''')
+
+     st.header('**Issues in CSV:**')
+     st.subheader('''**1. ParserError:**''')
+     st.markdown('''- This error occurs when a row contains an extra column.
+     - It mostly occurs when the CSV was created by hand in a text editor.
+     - To overcome it, use the `on_bad_lines` parameter, whose default is `"error"`:
+         - on_bad_lines="error" -- raise on malformed rows (default)
+         - on_bad_lines="skip" -- skip malformed rows silently
+         - on_bad_lines="warn" -- skip malformed rows but emit a warning
+     ''')
+     st.subheader('**Solution:**')
+     st.code('''import pandas as pd
+ pd.read_csv('sample.csv', on_bad_lines='skip')  # or 'warn' or 'error'
+ ''')
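The `on_bad_lines` behaviour can be sketched with a small in-memory CSV (hypothetical data via `io.StringIO`, so no file on disk is needed):

```python
import io
import pandas as pd

# Hypothetical CSV held in memory; the row "3,4,5" has one column too
# many, which raises pandas.errors.ParserError under the default
# on_bad_lines='error'.
bad_csv = "a,b\n1,2\n3,4,5\n6,7\n"

# 'skip' silently drops the malformed row; 'warn' would also print a warning.
df = pd.read_csv(io.StringIO(bad_csv), on_bad_lines="skip")
print(df)  # only the two well-formed rows remain
```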
+     st.write('------------------------------------------------')
+     st.subheader('''**2. Encoding:**''')
+     st.markdown('''- Encoding is the process of translating characters into code points (e.g. ASCII) and then into bytes.
+     - Reading a CSV with the wrong encoding decodes bytes into the wrong characters and loses information; pandas raises a `UnicodeDecodeError`.
+     - Most CSV files are UTF-8 encoded, but not all of them.''')
+     st.subheader('**Solution:**')
+     st.code('''import pandas as pd
+ import encodings
+
+ candidates = encodings.aliases.aliases.keys()  # all known encoding aliases
+ for enc in candidates:
+     try:
+         pd.read_csv('sample.csv', encoding=enc)
+         print('{} is a correct encoding'.format(enc))
+     except UnicodeDecodeError:
+         print('{} is not a correct encoding'.format(enc))
+     except LookupError:
+         print('{} is not supported'.format(enc))
+ ''')
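A minimal sketch of the `UnicodeDecodeError` itself, using hypothetical in-memory bytes instead of a file: the text "café" encoded as latin-1 is not valid UTF-8, so only the matching encoding can read it back.

```python
import io
import pandas as pd

# 'café' encoded as latin-1 produces the byte 0xE9, which is invalid UTF-8.
raw = "name\ncafé\n".encode("latin-1")

try:
    pd.read_csv(io.BytesIO(raw), encoding="utf-8")
    result = "decoded"
except UnicodeDecodeError:
    result = "wrong encoding"

# Reading with the matching encoding preserves the character.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(result, "->", df.iloc[0, 0])
```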
+     st.write('------------------------------------------------')
+     st.subheader('''**3. Out of memory:**''')
+     st.markdown('''- If there is not enough memory to load the whole dataset at once, divide it into chunks.
+     - Reading a huge CSV normally loads it all into RAM, so the dataset is broken into chunks instead.
+     - A chunk is a piece of the data; `chunksize` sets the number of rows per chunk.
+     - With 10,000,000 rows and chunksize=1000, the data is split into 10,000 chunks of 1000 rows each.
+     - The output behaves like a generator: it yields one chunk at a time instead of returning everything at once.
+     - Each chunk is a DataFrame, and the object holding them is iterable.''')
+     st.subheader('**Solution:**')
+     st.code('''
+ import pandas as pd
+ pd.read_csv('spam.csv', encoding='latin', chunksize=100)
+ ''')
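Since the snippet above only creates the reader, here is a sketch of actually iterating the chunks, with hypothetical in-memory data via `io.StringIO` so it is self-contained:

```python
import io
import pandas as pd

# Hypothetical 10-row CSV kept in memory; in practice you would pass a
# file path such as 'spam.csv'.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# With chunksize set, read_csv returns an iterator (a TextFileReader)
# that yields a DataFrame of at most `chunksize` rows per iteration.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

sizes = [len(chunk) for chunk in reader]
print(sizes)  # 10 rows in chunks of 4 -> [4, 4, 2]
```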
+     st.subheader('''**4. Takes a long time to load a huge dataset:**''')
+     st.markdown("Loading a huge dataset with pandas can take a long time.")
+     st.subheader('**Solution:**')
+     st.markdown('''Polars: a DataFrame library with a pandas-like API
+     - It is much faster than pandas on large files
+     ''')