File size: 16,111 Bytes
d819190
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
273767b
d819190
 
 
 
 
 
 
 
 
 
 
 
 
fbb0bf1
d819190
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
48ba83b
d819190
 
 
 
273767b
d819190
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
63c5d1c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6e25e3f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b45b4c8
 
 
 
 
 
 
 
54a7467
b45b4c8
 
54a7467
b45b4c8
54a7467
 
 
 
 
 
3894596
 
 
 
54a7467
 
 
 
 
 
3894596
 
 
 
 
 
 
 
63f0d8a
3894596
 
c298ce7
 
 
4ab08c6
 
 
3894596
54a7467
63f0d8a
 
 
 
 
 
 
 
 
 
4ab08c6
 
 
 
 
63f0d8a
 
 
95a82db
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
750e1d6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
ef5cd02
e4e7bd2
ef5cd02
 
 
 
 
 
 
 
95a82db
e4e7bd2
 
 
 
 
 
 
 
 
 
dac8878
 
 
 
54a7467
dac8878
 
 
 
 
 
54a7467
63bacbf
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4100473
63bacbf
 
b45b4c8
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
import streamlit as st
import pandas as pd

st.markdown("""
    <style>
    /* Set a soft background color */
    body {
        background-color: #eef2f7;
    }
    /* Style for main title */
    h1 {
        color: black;
        font-family: 'Roboto', sans-serif;
        font-weight: 700;
        text-align: center;
        margin-bottom: 25px;
    }
    /* Style for headers */
    h2 {
        color: black;
        font-family: 'Roboto', sans-serif;
        font-weight: 600;
        margin-top: 30px;
    }
    
    /* Style for subheaders */
     h3 {
        color: red;
        font-family: 'Roboto', sans-serif;
        font-weight: 500;
        margin-top: 20px;
    }
    .custom-subheader {
        color: black;
        font-family: 'Roboto', sans-serif;
        font-weight: 600;
        margin-bottom: 15px;
    }
    /* Paragraph styling */
    p {
        font-family: 'Georgia', serif;
        line-height: 1.8;
        color: black;
        margin-bottom: 20px;
    }
    /* List styling with checkmark bullets */
    .icon-bullet {
        list-style-type: none;
        padding-left: 20px;
    }
    .icon-bullet li {
        font-family: 'Georgia', serif;
        font-size: 1.1em;
        margin-bottom: 10px;
        color: black;
    }
    .icon-bullet li::before {
        content: "β—†";
        padding-right: 10px;
        color: black;
    }
    /* Sidebar styling */
    .sidebar .sidebar-content {
        background-color: #ffffff;
        border-radius: 10px;
        padding: 15px;
    }
    .sidebar h2 {
        color: #495057;
    }
    /* Custom button style */
    .streamlit-button {
        background-color: #00FFFF;
        color: #000000;
        font-weight: bold;
    }
    </style>
    """, unsafe_allow_html=True)

st.subheader("Semi-Structured Data")
st.markdown("""
    Semi-structured data is a type of data that doesn't conform to a strict schema but has organizational properties, such as tags or markers, to separate elements. Examples :
    <ul class="icon-bullet">
        <li>CSV </li>
        <li>JSON </li>
        <li>HTML </li>
        <li>XML</li>
    </ul>
""", unsafe_allow_html=True)

st.sidebar.title("Navigation 🧭")
file_type = st.sidebar.radio(
    "Choose a file type :",
    ("CSV", "XML", "JSON", "HTML"))

if file_type == "CSV":
    st.title("CSV")
    st.markdown('''
    - **CSV (Comma-Separated Values)**
    - CSV (Comma-Separated Values) is a simple file format used to store tabular data, where each line represents a row, and columns are separated by commas. 
    - It is commonly used for data exchange between applications, such as spreadsheets and databases.
    - CSV files are saved with the `.csv` extension
    ''')
    
    st.header('**Issues in CSV:**')
    st.subheader('''**1. ParserError:**''')
    st.markdown('''- This error occurs when we have extra column.
    - This error is mostly occurs when CSV is created in Text editor.
    - To overcome parse error we use a parameter known as on_bad_lines where default it takes error
    - on_bad_lines = "error" -- default
    - on_bad_lines = "skip" -- skip unnecessary rows
    - on_bad_lines = "warn" -- skip unnecessary rows but warns
    ''')
    st.subheader('**Solution:**')
    st.code('''import pandas as pd
    pd.read_csv('sample.csv',on_bad_lines='warn' or 'skip' or 'error')
                ''')
    st.write('------------------------------------------------')
    st.subheader('''**2.Encoding:**''')
    st.markdown('''- Encoding is a process of translating a character, numbers into ASCII and then binary number.
    - To preserve the information of characters and that error is Unicode-decode error 
    - If a proper enconding while reading csv is not used then the letter/characters will be decode to other binary number which will cause to loss the information.
    - Most of the csv will be in UTF-8, but not all csv.''')
    st.subheader('**Solution:**')
    st.code('''import pandas as pd
    import encodings
    l=encodings.aliases.aliases.keys() # list of all encodings
    for y in l:
        try:
            pd.read_csv('sample.csv',encoding='utf-8')
            print('{} is an correct encoding')
        except UnicodeDecodeError:
            print('{} is not an correct encoding'.format(y))
        except LookUpError:
            print('{} is not supported'.format(y))
            
                ''')
    st.write('------------------------------------------------')
    st.subheader('''**3. Out of memory:**''')
    st.markdown('''- If we dont have enough memory to load the dataset then we will divide them into chunks.
    - CSV is stored in RAM as huge file is not supported and dataset is breaked into chunks
    - Chunks are the part of the data, which takes chunksize as a number of rows.
    - If we have 100_00_000 & chunksize = 1000, this means the data will be divided in 1000 rows called as chunks.
    - Its output will be in generator.
    - Generator can return multiple values, it uses yield instead of return
    - All chunks are stored as objects and the object is iterabel''')
    st.subheader('**Solution:**')
    st.code('''                
            import pandas as pd
            pd.read_csv('spam.csv', encoding='latin', chunksize= 100)
        ''')
    st.subheader('''**4. Takes long time to load a huge dataset:**''')
    st.markdown("It takes long time to load a huge dataset")
    st.subheader('**Solution:**')
    st.markdown('''Polars : as it is replica of pandas
    - It is faster than pandas 
    ''')


elif file_type == "XML":
    st.title("XML")
    st.markdown(''' 
        - XML is an Extensible Markup Language
        - In XML, we can define our own tags 
        - XML (Extensible Markup Language) is a flexible, text-based format used for storing and transporting structured data.
        - It uses tags to define elements and attributes, making it both human-readable and machine-readable.
          as **Extensible** Markup Language
            ''')   
    
    # Example : XML Structure
    st.subheader('**XML Structure**')
    st.markdown('''
    A simple XML file
    ''')
    st.code('''
    <data>
        <person>
            <name>Harika</name>
            <age>21</age>
            <height>145</height>
        </person>
        <person>
            <name>sreeja/name>
            <age>22</age>
            <height>153</height>
        </person>
    </data>
    ''')
    
    st.code('''
    import pandas as pd
    
    # Example: Reading a XML file
    df = pd.read_xml('data.xml', xpath='/data/person')
    print(df)
    ''')
    
    st.markdown('''
    The output DataFrame will look like this:
    | name           | age        | height |
    |----------------|------------|------  |
    | Harika         |  21        |   145  |
    | sreeja         |  22        |  153   |
    ''')


    st.markdown('''
     **`xpath` parameter**:  
       - Specifies the XML path to extract specific elements.  
       - For example:  
         - `xpath='/data/person'`: Extracts all `<person>` elements from `<data>`. ''')
    
    
    # Example 2: Nested XML Structure
    st.subheader('**Nested XML Structure**')
    st.markdown('''
    A more complex XML file with nested elements and attributes.
    ''')
    st.code('''
    <company>
        <department id="1" name="HR">
            <employee>
                <name>John Doe</name>
                <position>Manager</position>
            </employee>
            <employee>
                <name>Jane Smith</name>
                <position>Assistant</position>
            </employee>
        </department>
        <department id="2" name="Engineering">
            <employee>
                <name>Emily Johnson</name>
                <position>Engineer</position>
            </employee>
        </department>
    </company>
    ''')
    
    st.code('''
    import pandas as pd
    
    # Example: Reading a nested XML file
    df = pd.read_xml(
        'nested.xml', 
        xpath='.//employee', 
        elem_cols=['name', 'position'], 
        attr_cols=['id', 'name']
    )
    print(df)
    ''')
    
    st.markdown('''
    The output DataFrame will look like this:
    | id | department name | name          | position   |
    |----|-----------------|---------------|------------|
    | 1  | HR              | John Doe      | Manager    |
    | 1  | HR              | Jane Smith    | Assistant  |
    | 2  | Engineering     | Emily Johnson | Engineer   |
    ''')

    st.markdown('''
    1. **`elem_cols` parameter**:  
       - Specifies the child tags (elements) you want to include in the DataFrame.  
       - Example:  
         - `elem_cols=['name', 'position']`: Extracts `<name>` and `<position>` from `<employee>` tags.  
    
    2. **`attr_cols` parameter**:  
       - Specifies the attributes of the parent elements to include in the DataFrame.  
       - Example:  
         - `attr_cols=['id', 'name']`: Extracts the `id` and `name` attributes from the `<department>` tag.  
    ''')
    
    st.markdown('''
    By combining `xpath`, `elem_cols`, and `attr_cols`, you can efficiently parse complex XML files into structured DataFrames.
    ''')


elif file_type == "JSON":
    st.title("JSON")
    st.markdown('''
    - JSON **(Javascript Orient Notation)**
    - It is text-based data format used to store and exchange data, structured as key-value pairs and arrays
    - JSON can be both
    - **Structured** and **Semi-structured**
    - Default json format is dictionary format 
    - Key should always be in string
    - Reading json files is ** pd.read_json() ** it takes only string
    ''')
    st.header("Structred JSON Format")
    st.markdown('''
    - A structured JSON format organizes data hierarchically using key-value pairs, arrays, and nested objects, ensuring readability
    - In structured JSON format, orient refers to the way data is organized or structured, particularly when converting between tabular data like DataFrames and JSON.
    - It determines the layout of the JSON representation.
    - Common orient types are 
    - β—† Index
    - β—† Columns
    - β—† Values
    - β—† Split
    ''')
    st.subheader("How to read Structured JSON Format?...")
    st.code('''import pandas as pd
    a = '{"name":["harii","sree"],"age":[12,13]}'
    pd.read_json(a)
    ''')
    st.header("Converting DataFrame into JSON...")
    st.subheader("Orient as Index...")
    st.markdown('''
    - In orient as index while converting DataFrame into json here keys are index and values are dictionary
    - Inside dictionary keys are column names and values are values present in the data
    ''')
    st.code('''import pandas as pd
    a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
    data = pd.read_json(a)
    ind = data.to_json(orient="index")
    ind
    ''')
    st.markdown('''
    - Output will be :
    ''')
    st.code('''
     output = '{"0":{"name":"harii","age":21,"weight":34},"1":{"name":"sree","age":24,"weight":45},"2":{"name":"gowtham","age":25,"weight":67}}'
    ''')

    st.subheader("Orient as Columns...")
    st.markdown('''
    - In orient as columns while converting DataFrame into json here keys become column names and values are dictionary
    - Inside dictionary keys are indices and values are values present in the data
    ''')
    st.code('''import pandas as pd
     a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
     data = pd.read_json(a)
     col = data.to_json(orient="columns")
     col
    ''')
    st.markdown('''
    - Output will be:
    ''')
    st.code('''
     output = '{"name":{"0":"harii","1":"sree","2":"gowtham"},"age":{"0":21,"1":24,"2":25},"weight":{"0":34,"1":45,"2":67}}'
    ''')

    st.subheader("Orient as Values...")
    st.markdown('''
    - In orient as values while converting DataFrame into json it gives you list of list (nested list)
    ''')
    st.code('''import pandas as pd
    a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
    data = pd.read_json(a)
    val = data.to_json(orient="values")
    val
    ''')
    st.markdown('''
    - Output will be:
    ''')
    st.code('''
     output = '[["harii",21,34],["sree",24,45],["gowtham",25,67]]' 
    ''')

    st.subheader("Orient as Split...")
    st.markdown('''
    - In orient as split while converting DataFrame into json it gives column names , indices and data seperately 
    - It is a dictionary of list
    ''')
    st.code('''import pandas as pd'
    a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
    data = pd.read_json(a)
    spl = data.to_json(orient="split")
    spl
    ''')
    st.markdown('''
    - Output will be:
    ''')
    st.code('''
    output = '{"columns":["name","age","weight"],"index":[0,1,2],"data":[["harii",21,34],["sree",24,45],["gowtham",25,67]]}'
    ''')
    st.subheader("**Issues in Structured JSON Format**")
    st.markdown('''
    - As in structured json format it reads only string format when the data is in heterogenous like dictionry of dictionary and list of dictionary we can't use pd.json_normalize()
    - To handle this issue we use semi-structured json format which can handle nested structures
    ''')
    st.header("Semi-structured JSON Format")
    st.markdown('''
    - A semi-structured JSON format lacks a fixed schema, allowing irregular or nested structures 
    - It takes list of dictionaries where each dict will be acting as a single row
    - Semi-structured json format has different types to convert dataframe into json
    - β—† max_level ---> how much deeper it takes to take the values of column
    - β—† record_path ---> only used when values are in list of dictionary
    - β—† meta ---> it is used to get remaining columns
    ''')
    st.subheader("How to read Semi-structured JSON Format?...")
    st.code('''import pandas as pd
    b = {"name":"a","marks":{"sem1":{"maths":22,"science":23},"sem2":{"maths":24,"science":25}}}
    pd.json_normalize(b)
    ''')
    
    st.header("Converting DataFrame into JSON...")
    st.subheader("Using max_level... ")
    st.code('''import pandas as pd
    a = {"name":"harii","age":23,"marks":{"sem1":{"hindi":10,"science":39},"sem2":{"hindi":12,"science":32}}}
    pd.json_normalize(a)
    pd.json_normalize(a,max_level=1)
    ''')
    st.markdown('''
    - **max_level** gives how much deeper it takes to take the values of column
    ''')

    st.subheader("Using record_path and meta...")
    st.code('''import pandas as pd
    x=[{"name":"p1","age":22,"marks":[{"maths":11,"hindi":41}]},{"name":"p1","age":21,"marks":[{"maths":22,"hindi":31}]}]
    pd.json_normalize(x,record_path="marks",meta=["name","age"])
    ''')
    st.markdown('''
    - **record_path** only used when values are in list of dictionary
    - **meta** is used to get remaining columns
    ''')

    st.markdown('''
    - Output will be:
    ''')
    st.markdown('''
    
    | maths | hindi | name | age |
    |-------|-------|------|-----|
    | 11	| 41	| p1   | 22  |
    | 22	| 31	| p1   | 21  |

    ''')
    
elif file_type == "HTML":
    st.title("HTML")
    st.markdown('''
    - HTML **(Hypertext Markup Language)**
    - HTML (HyperText Markup Language) is the standard language used to create and structure content on the web, using tags to define elements such as text, images, links, and other multimedia.
    ''')
    st.subheader("How to read and get the tabular data from the URLs?...")
    st.code('''import pandas as pd
    data = pd.read_html("https://en.wikipedia.org/wiki/Indian_Premier_League")
    data
    ''')
    st.markdown('''
    - It gives all the tables related to Indian_Premier_League
    - But if we want to get one particular table amongst all tables we need to give unique word related to that particular table we needed
    ''')
    st.code('''import pandas as pd
    data = pd.read_html("https://en.wikipedia.org/wiki/Indian_Premier_League",match="Mitchell Starc")
    data
    ''')
    st.markdown('''
    - It gives the particular table which has the word matching as "Mitchell Starc"
    ''')