import streamlit as st
import pandas as pd
st.markdown("""
<style>
/* Set a soft background color */
body {
background-color: #eef2f7;
}
/* Style for main title */
h1 {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 700;
text-align: center;
margin-bottom: 25px;
}
/* Style for headers */
h2 {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 600;
margin-top: 30px;
}
/* Style for subheaders */
h3 {
color: red;
font-family: 'Roboto', sans-serif;
font-weight: 500;
margin-top: 20px;
}
.custom-subheader {
color: black;
font-family: 'Roboto', sans-serif;
font-weight: 600;
margin-bottom: 15px;
}
/* Paragraph styling */
p {
font-family: 'Georgia', serif;
line-height: 1.8;
color: black;
margin-bottom: 20px;
}
/* List styling with checkmark bullets */
.icon-bullet {
list-style-type: none;
padding-left: 20px;
}
.icon-bullet li {
font-family: 'Georgia', serif;
font-size: 1.1em;
margin-bottom: 10px;
color: black;
}
.icon-bullet li::before {
content: "✔";
padding-right: 10px;
color: black;
}
/* Sidebar styling */
.sidebar .sidebar-content {
background-color: #ffffff;
border-radius: 10px;
padding: 15px;
}
.sidebar h2 {
color: #495057;
}
/* Custom button style */
.streamlit-button {
background-color: #00FFFF;
color: #000000;
font-weight: bold;
}
</style>
""", unsafe_allow_html=True)
st.subheader("Semi-Structured Data")
st.markdown("""
Semi-structured data does not conform to a strict schema but still has organizational properties, such as tags or markers, that separate its elements. Examples:
<ul class="icon-bullet">
<li>CSV </li>
<li>JSON </li>
<li>HTML </li>
<li>XML</li>
</ul>
""", unsafe_allow_html=True)
st.sidebar.title("Navigation 🧭")
file_type = st.sidebar.radio(
"Choose a file type:",
("CSV", "XML", "JSON", "HTML"))
if file_type == "CSV":
st.title("CSV")
st.markdown('''
- **CSV (Comma-Separated Values)** is a simple file format for tabular data: each line represents a row, and columns are separated by commas.
- It is commonly used to exchange data between applications such as spreadsheets and databases.
- CSV files are saved with the `.csv` extension.
''')
st.header('**Issues in CSV:**')
st.subheader('''**1. ParserError:**''')
st.markdown('''- This error occurs when a row has more fields than expected, i.e. an extra column.
- It mostly occurs when the CSV was created by hand in a text editor.
- To handle it, use the `on_bad_lines` parameter of `pd.read_csv()`:
    - `on_bad_lines="error"`: raise a `ParserError` (default)
    - `on_bad_lines="skip"`: skip the bad rows silently
    - `on_bad_lines="warn"`: skip the bad rows and emit a warning for each
''')
st.subheader('**Solution:**')
st.code('''import pandas as pd

# on_bad_lines accepts 'error' (default), 'warn' or 'skip'
pd.read_csv('sample.csv', on_bad_lines='skip')
''')
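st.markdown('''
A minimal sketch of the `skip` option, assuming a small in-memory CSV (the data below is hypothetical) in which one row has a field too many:

```python
import io
import pandas as pd

# Hypothetical CSV text: the header has 2 columns, but the "sreeja" row has 3 fields.
raw = """name,age
harika,21
sreeja,22,extra
gowtham,25
"""

# The malformed row is dropped instead of raising a ParserError.
df = pd.read_csv(io.StringIO(raw), on_bad_lines="skip")
print(df)
```

Only the rows with the expected number of fields are kept.
''')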
st.write('------------------------------------------------')
st.subheader('''**2. Encoding:**''')
st.markdown('''- Encoding is the process of translating characters into numeric codes (for example ASCII or Unicode code points) that are stored as bytes.
- If a CSV is read with the wrong encoding, pandas either raises a `UnicodeDecodeError` or decodes the bytes into the wrong characters, which loses information.
- Most CSV files are encoded as UTF-8, but not all of them.''')
st.subheader('**Solution:**')
st.code('''import pandas as pd
from encodings.aliases import aliases

# Try every encoding alias that Python knows about
for enc in aliases.keys():
    try:
        pd.read_csv('sample.csv', encoding=enc)
        print('{} is a correct encoding'.format(enc))
    except UnicodeDecodeError:
        print('{} is not a correct encoding'.format(enc))
    except LookupError:
        print('{} is not supported'.format(enc))
''')
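st.markdown('''
The two failure modes of a wrong encoding can be sketched without any file at all. This is an illustration only; the sample word is hypothetical:

```python
# "café" is a hypothetical sample value containing a non-ASCII character.
text = "café"

# Case 1: bytes written as Latin-1 are not valid UTF-8, so decoding fails loudly.
latin_bytes = text.encode("latin-1")
try:
    latin_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# Case 2: bytes written as UTF-8 decode as Latin-1 without error, but wrongly.
garbled = text.encode("utf-8").decode("latin-1")

print(decoded_ok, garbled)
```

The second case is the more dangerous one, because no exception is raised and the text is silently corrupted.
''')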
st.write('------------------------------------------------')
st.subheader('''**3. Out of memory:**''')
st.markdown('''- If there is not enough memory (RAM) to load the whole dataset at once, read it in chunks.
- A chunk is a part of the data; `chunksize` is the number of rows in each chunk.
- For example, with 10,000,000 rows and `chunksize=1000`, the data is read in 10,000 chunks of 1,000 rows each.
- With `chunksize` set, `pd.read_csv()` returns an iterator instead of a DataFrame.
- Like a generator, which uses `yield` instead of `return` to produce many values, the iterator hands back one chunk at a time.
- Each chunk is a DataFrame, and the returned object is iterable.''')
st.subheader('**Solution:**')
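st.code('''
import pandas as pd
# With chunksize set, read_csv returns an iterator of DataFrames
chunks = pd.read_csv('spam.csv', encoding='latin', chunksize=100)
for chunk in chunks:
    print(chunk.shape)  # each chunk has at most 100 rows
''')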
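st.markdown(r'''
To illustrate chunked reading, here is a small sketch that aggregates a column one chunk at a time. The in-memory data and the column name `value` are assumptions made for the example:

```python
import io
import pandas as pd

# Hypothetical data: a single "value" column holding the numbers 0..9.
raw = "\n".join(["value"] + [str(i) for i in range(10)])

total = 0
n_chunks = 0
# chunksize=4 yields DataFrames of 4, 4 and 2 rows.
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    total += int(chunk["value"].sum())
    n_chunks += 1

print(n_chunks, total)
```

Only one chunk is held in memory at a time, so the same pattern works for files that are larger than the available RAM.
''')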
st.subheader('''**4. Takes long time to load a huge dataset:**''')
st.markdown("Parsing a very large CSV file with pandas can take a long time.")
st.subheader('**Solution:**')
st.markdown('''Polars: a DataFrame library with an API similar to pandas
- It is typically faster than pandas at loading large files
''')
elif file_type == "XML":
st.title("XML")
st.markdown('''
- XML stands for **Extensible** Markup Language.
- It is "extensible" because we can define our own tags.
- XML is a flexible, text-based format for storing and transporting structured data.
- It uses tags to define elements and attributes, making it both human-readable and machine-readable.
''')
# Example : XML Structure
st.subheader('**XML Structure**')
st.markdown('''
A simple XML file
''')
st.code('''
<data>
<person>
<name>Harika</name>
<age>21</age>
<height>145</height>
</person>
<person>
<name>sreeja</name>
<age>22</age>
<height>153</height>
</person>
</data>
''')
st.code('''
import pandas as pd
# Example: Reading an XML file
df = pd.read_xml('data.xml', xpath='/data/person')
print(df)
''')
st.markdown('''
The output DataFrame will look like this:
| name | age | height |
|----------------|------------|------ |
| Harika | 21 | 145 |
| sreeja | 22 | 153 |
''')
st.markdown('''
**`xpath` parameter**:
- Specifies the XML path to extract specific elements.
- For example:
- `xpath='/data/person'`: Extracts all `<person>` elements from `<data>`. ''')
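st.markdown('''
A self-contained sketch of the same idea, assuming the XML is held in a string rather than in `data.xml`. It uses `parser="etree"` (Python's built-in parser) so that `lxml` is not required; that parser supports only a subset of XPath, so a relative path is used here:

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the simple XML document.
xml = """<data>
  <person><name>Harika</name><age>21</age><height>145</height></person>
  <person><name>sreeja</name><age>22</age><height>153</height></person>
</data>"""

# Each <person> element becomes one row; its child tags become the columns.
df = pd.read_xml(io.StringIO(xml), xpath=".//person", parser="etree")
print(df)
```
''')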
# Example 2: Nested XML Structure
st.subheader('**Nested XML Structure**')
st.markdown('''
A more complex XML file with nested elements and attributes.
''')
st.code('''
<company>
<department id="1" name="HR">
<employee>
<name>John Doe</name>
<position>Manager</position>
</employee>
<employee>
<name>Jane Smith</name>
<position>Assistant</position>
</employee>
</department>
<department id="2" name="Engineering">
<employee>
<name>Emily Johnson</name>
<position>Engineer</position>
</employee>
</department>
</company>
''')
st.code('''
import pandas as pd
import xml.etree.ElementTree as ET

# pd.read_xml only returns the nodes matched by xpath, so walk the tree
# to attach each department's attributes to its employees
root = ET.parse('nested.xml').getroot()
rows = [
    {'id': dept.get('id'), 'department name': dept.get('name'),
     'name': emp.findtext('name'), 'position': emp.findtext('position')}
    for dept in root.findall('department')
    for emp in dept.findall('employee')
]
df = pd.DataFrame(rows)
print(df)
''')
st.markdown('''
The output DataFrame will look like this:
| id | department name | name | position |
|----|-----------------|---------------|------------|
| 1 | HR | John Doe | Manager |
| 1 | HR | Jane Smith | Assistant |
| 2 | Engineering | Emily Johnson | Engineer |
''')
st.markdown('''
1. **Child elements**:
    - `emp.findtext('name')` and `emp.findtext('position')` read the text of the `<name>` and `<position>` tags inside each `<employee>`.
2. **Parent attributes**:
    - `dept.get('id')` and `dept.get('name')` read the `id` and `name` attributes of the enclosing `<department>` tag.
''')
st.markdown('''
`pd.read_xml` itself can parse the `<employee>` rows with `xpath='.//employee'`, but it does not carry attributes of a parent element into those rows, so the parent attributes are collected with `xml.etree.ElementTree` here.
''')
elif file_type == "JSON":
st.title("JSON")
st.markdown('''
- JSON **(JavaScript Object Notation)**
- It is a text-based data format used to store and exchange data, structured as key-value pairs and arrays.
- JSON can be both **structured** and **semi-structured**.
- A JSON object looks like a Python dictionary.
- Keys must always be strings.
- `pd.read_json()` reads JSON from a file path or a file-like object; wrap a literal JSON string in `io.StringIO` before passing it.
''')
st.header("Structured JSON Format")
st.markdown('''
- A structured JSON format organizes data hierarchically using key-value pairs, arrays, and nested objects, which keeps it readable.
- In pandas, `orient` describes how tabular data such as a DataFrame is laid out when it is converted to or from JSON.
- Common `orient` values are:
    - ✔ `index`
    - ✔ `columns`
    - ✔ `values`
    - ✔ `split`
''')
st.subheader("How to read Structured JSON Format?...")
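st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree"],"age":[12,13]}'
# Recent pandas versions expect a file-like object rather than a literal JSON string
pd.read_json(StringIO(a))
''')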
st.header("Converting DataFrame into JSON...")
st.subheader("Orient as Index...")
st.markdown('''
- With `orient="index"`, the keys of the JSON object are the index labels and each value is a dictionary.
- Inside each dictionary, the keys are the column names and the values are the cell values of that row.
''')
st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(StringIO(a))
ind = data.to_json(orient="index")
ind
''')
st.markdown('''
- Output will be :
''')
st.code('''
output = '{"0":{"name":"harii","age":21,"weight":34},"1":{"name":"sree","age":24,"weight":45},"2":{"name":"gowtham","age":25,"weight":67}}'
''')
st.subheader("Orient as Columns...")
st.markdown('''
- With `orient="columns"`, the keys of the JSON object are the column names and each value is a dictionary.
- Inside each dictionary, the keys are the index labels and the values are the cell values of that column.
''')
st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(StringIO(a))
col = data.to_json(orient="columns")
col
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '{"name":{"0":"harii","1":"sree","2":"gowtham"},"age":{"0":21,"1":24,"2":25},"weight":{"0":34,"1":45,"2":67}}'
''')
st.subheader("Orient as Values...")
st.markdown('''
- With `orient="values"`, the DataFrame is converted into a list of lists (a nested list) with one inner list per row; column names and index labels are dropped.
''')
st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(StringIO(a))
val = data.to_json(orient="values")
val
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '[["harii",21,34],["sree",24,45],["gowtham",25,67]]'
''')
st.subheader("Orient as Split...")
st.markdown('''
- With `orient="split"`, the column names, index labels, and data are stored separately.
- The result is a dictionary of lists.
''')
st.code('''import pandas as pd
from io import StringIO

a = '{"name":["harii","sree","gowtham"],"age":[21,24,25],"weight":[34,45,67]}'
data = pd.read_json(StringIO(a))
spl = data.to_json(orient="split")
spl
''')
st.markdown('''
- Output will be:
''')
st.code('''
output = '{"columns":["name","age","weight"],"index":[0,1,2],"data":[["harii",21,34],["sree",24,45],["gowtham",25,67]]}'
''')
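st.markdown('''
The four layouts can be compared side by side with a short sketch. The two-row DataFrame is a made-up example, and the `json` module is used only to turn each JSON string into Python objects for inspection:

```python
import io
import json
import pandas as pd

df = pd.DataFrame({"name": ["harii", "sree"], "age": [21, 24]})

# Serialize the same DataFrame with each orient and parse the result
# with the json module to inspect its shape.
layouts = {o: json.loads(df.to_json(orient=o))
           for o in ("index", "columns", "values", "split")}
for orient, layout in layouts.items():
    print(orient, layout)

# Reading back needs the same orient that was used for writing.
restored = pd.read_json(io.StringIO(df.to_json(orient="split")), orient="split")
print(restored)
```

Note that `values` drops both the column names and the index, so it cannot be read back into the same DataFrame without supplying them again.
''')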
st.subheader("**Issues in Structured JSON Format**")
st.markdown('''
- `pd.read_json()` expects JSON that already has a table-like layout. When the data is heterogeneous or nested, such as a dictionary of dictionaries or a list of dictionaries, it cannot flatten the nested values into columns.
- To handle this, we treat the data as semi-structured JSON and use `pd.json_normalize()`, which can flatten nested structures.
''')
st.header("Semi-structured JSON Format")
st.markdown('''
- A semi-structured JSON format lacks a fixed schema, allowing irregular or nested structures.
- `pd.json_normalize()` takes a dictionary or a list of dictionaries, where each dictionary becomes a single row.
- It has parameters that control how the nested JSON is flattened into a DataFrame:
    - ✔ `max_level` ---> how many levels of nesting are flattened into columns
    - ✔ `record_path` ---> path to a list of dictionaries whose items become the rows
    - ✔ `meta` ---> outer fields to carry along as extra columns on each row
''')
st.subheader("How to read Semi-structured JSON Format?...")
st.code('''import pandas as pd
b = {"name":"a","marks":{"sem1":{"maths":22,"science":23},"sem2":{"maths":24,"science":25}}}
pd.json_normalize(b)
''')
st.header("Converting JSON into DataFrame...")
st.subheader("Using max_level... ")
st.code('''import pandas as pd
a = {"name":"harii","age":23,"marks":{"sem1":{"hindi":10,"science":39},"sem2":{"hindi":12,"science":32}}}
pd.json_normalize(a)
pd.json_normalize(a,max_level=1)
''')
st.markdown('''
- **max_level** sets how many levels of nesting are flattened into columns; anything deeper stays in its column as a dictionary
''')
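st.markdown('''
As a rough illustration of `max_level`, the sketch below flattens the same nested record twice and compares the resulting column names. The variable names `full` and `shallow` are chosen for this example only:

```python
import pandas as pd

a = {"name": "harii", "age": 23,
     "marks": {"sem1": {"hindi": 10, "science": 39},
               "sem2": {"hindi": 12, "science": 32}}}

full = pd.json_normalize(a)                   # flatten every level
shallow = pd.json_normalize(a, max_level=1)   # stop after one level

print(list(full.columns))
print(list(shallow.columns))
```

With `max_level=1` the `marks.sem1` and `marks.sem2` columns still hold dictionaries, while the default flattens them down to one column per subject.
''')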
st.subheader("Using record_path and meta...")
st.code('''import pandas as pd
x=[{"name":"p1","age":22,"marks":[{"maths":11,"hindi":41}]},{"name":"p1","age":21,"marks":[{"maths":22,"hindi":31}]}]
pd.json_normalize(x,record_path="marks",meta=["name","age"])
''')
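st.markdown('''
- **record_path** names the key that holds a list of dictionaries; each item in that list becomes one row
- **meta** lists the outer fields (here `name` and `age`) that are repeated as extra columns on each of those rows
''')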
st.markdown('''
- Output will be:
''')
st.markdown('''
| maths | hindi | name | age |
|-------|-------|------|-----|
| 11 | 41 | p1 | 22 |
| 22 | 31 | p1 | 21 |
''')
elif file_type == "HTML":
st.title("HTML")
st.markdown('''
- HTML **(HyperText Markup Language)** is the standard language used to create and structure content on the web, using tags to define elements such as text, images, links, and other multimedia.
''')
st.subheader("How to read and get the tabular data from the URLs?...")
st.code('''import pandas as pd
data = pd.read_html("https://en.wikipedia.org/wiki/Indian_Premier_League")
data
''')
st.markdown('''
- `pd.read_html()` returns a list of DataFrames, one for each table found on the Indian Premier League page.
- To get one particular table, pass `match` a word or phrase that appears only in that table.
''')
st.code('''import pandas as pd
data = pd.read_html("https://en.wikipedia.org/wiki/Indian_Premier_League",match="Mitchell Starc")
data
''')
st.markdown('''
- It returns only the table(s) whose text matches "Mitchell Starc".
''')