UmaKumpatla committed
Commit a662b64 · verified · 1 Parent(s): fc6f3ad

Update pages/3.Simple EDA.py

Files changed (1)
  1. pages/3.Simple EDA.py +145 -0
pages/3.Simple EDA.py CHANGED
@@ -0,0 +1,145 @@
+ import streamlit as st
+ import pandas as pd
+ import numpy as np
+ #import re
+ #import emoji
+
+ st.markdown("""
+ <style>
+     /* Set a soft background color */
+     body {
+         background-color: #eef2f7;
+     }
+     /* Style for main title */
+     h1 {
+         color: black;
+         font-family: 'Roboto', sans-serif;
+         font-weight: 700;
+         text-align: center;
+         margin-bottom: 25px;
+     }
+     /* Style for headers */
+     h2 {
+         color: black;
+         font-family: 'Roboto', sans-serif;
+         font-weight: 600;
+         margin-top: 30px;
+     }
+     /* Style for subheaders */
+     h3 {
+         color: red;
+         font-family: 'Roboto', sans-serif;
+         font-weight: 500;
+         margin-top: 20px;
+     }
+     .custom-subheader {
+         color: black;
+         font-family: 'Roboto', sans-serif;
+         font-weight: 600;
+         margin-bottom: 15px;
+     }
+     /* Paragraph styling */
+     p {
+         font-family: 'Georgia', serif;
+         line-height: 1.8;
+         color: black;
+         margin-bottom: 20px;
+     }
+     /* List styling with diamond bullets */
+     .icon-bullet {
+         list-style-type: none;
+         padding-left: 20px;
+     }
+     .icon-bullet li {
+         font-family: 'Georgia', serif;
+         font-size: 1.1em;
+         margin-bottom: 10px;
+         color: black;
+     }
+     .icon-bullet li::before {
+         content: "◆";
+         padding-right: 10px;
+         color: black;
+     }
+     /* Sidebar styling */
+     .sidebar .sidebar-content {
+         background-color: #ffffff;
+         border-radius: 10px;
+         padding: 15px;
+     }
+     .sidebar h2 {
+         color: #495057;
+     }
+     /* Custom button style */
+     .streamlit-button {
+         background-color: #00FFFF;
+         color: #000000;
+         font-weight: bold;
+     }
+ </style>
+ """, unsafe_allow_html=True)
+
+ st.header(":red[📊 Simple EDA 💬]")
+
+ # Introduction to Simple EDA
+ st.markdown("<div class='section'>", unsafe_allow_html=True)
+ st.markdown("<h2 class='title'>🔍 Understanding Simple EDA</h2>", unsafe_allow_html=True)
+ st.markdown("<p class='subtitle'>Evaluating raw text data quality before processing</p>", unsafe_allow_html=True)
+
+ st.info("📌 **Simple EDA is a crucial step in the NLP lifecycle:**\n\n❁ Assesses raw data quality\n\n❁ Does not depend on the problem statement\n\n❁ Supports better data exploration")
+
+ st.markdown("</div>", unsafe_allow_html=True)
+
+ st.subheader(":violet[📃 Major Simple EDA Steps]")
+
+ st.markdown("❁ **Check Text Case** – Identify whether the text is in **lowercase, uppercase, or mixed case**.")
+ st.markdown("❁ **Detect HTML Tags** – Check whether the text contains leftover HTML markup.")
+ st.markdown("❁ **Identify URLs** – Find URLs so they can be preserved or removed, depending on the problem statement.")
+ st.markdown("❁ **Detect Mentions & Hashtags** – Find occurrences of `@mentions` or `#hashtags`.")
+ st.markdown("❁ **Identify Numeric Data** – Detect whether the text includes **digits or other numeric data**.")
+ st.markdown("❁ **Analyze Punctuation Usage** – Check whether the text contains punctuation marks.")
+ st.markdown("❁ **Detect Emojis** – Find emojis so that **emoji-based sentiment** is not lost.")
+ st.markdown("❁ **Analyze Date/Time Formats** – Identify date/time-related text.")
+
+ st.success("🚀 Performing **Simple EDA** reveals quality issues in raw text early, so pre-processing can produce cleaner data for NLP models!")
+
+ st.code('''
+ import pandas as pd
+ import re
+ import string
+ import emoji
+ def simple_eda(data, column):
+     lower_upper = data[column].apply(lambda x: x != x.lower() and x != x.upper()).sum()  # rows with mixed case
+     tags = data[column].apply(lambda x: bool(re.search(r"<.*?>", x))).sum()  # rows with HTML tags
+     urls = data[column].apply(lambda x: bool(re.search(r"https?://\\S+", x))).sum()  # http and https links
+     mails = data[column].apply(lambda x: bool(re.search(r"\\S+@\\S+", x))).sum()
+     mentions = data[column].apply(lambda x: bool(re.search(r"\\B[@#]\\S+", x))).sum()
+     emojis = data[column].apply(lambda x: emoji.emoji_count(x) > 0).sum()
+     digit = data[column].apply(lambda x: bool(re.search(r"\\d", x))).sum()
+     punc = data[column].apply(lambda x: bool(re.search(f"[{re.escape(string.punctuation)}]", x))).sum()
+     dates = data[column].apply(lambda x: bool(re.search(r"\\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}\\b", x))).sum()  # e.g. 12/05/2024
+     if lower_upper > 0:
+         print("Text has a combination of cases.")
+     if tags > 0:
+         print("Text contains HTML tags.")
+     if urls > 0:
+         print("Text contains URLs.")
+     if mails > 0:
+         print("Text contains email addresses.")
+     if mentions > 0:
+         print("Text contains mentions or hashtags.")
+     if emojis > 0:
+         print("Text contains emojis.")
+     if digit > 0:
+         print("Text contains digits.")
+     if punc > 0:
+         print("Text contains punctuation marks.")
+     if dates > 0:
+         print("Text contains dates.")
+ ''')
+
+ st.markdown('''
+ - This code runs a quick exploratory check of the text data.
+ - The results indicate the quality of the collected text data.
+ - Once the quality is known, we pre-process the text according to the problem statement.
+ ''')
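
The checks that `simple_eda` performs can also be tried without pandas or the `emoji` package. The following is a minimal standard-library sketch, not the code from this commit: the names `PATTERNS`, `is_mixed_case`, and `simple_eda_counts` are hypothetical, the regular expressions are illustrative assumptions, and the emoji pattern is a rough Unicode-range approximation that stands in for `emoji.emoji_count`.

```python
import re
import string

# Hypothetical pattern table; names and regexes are illustrative assumptions.
PATTERNS = {
    "html_tags": re.compile(r"<[^>]+>"),
    "urls": re.compile(r"https?://\S+"),
    "emails": re.compile(r"\S+@\S+\.\S+"),
    # \B keeps the "@" inside an email address from counting as a mention.
    "mentions_hashtags": re.compile(r"\B[@#]\w+"),
    "digits": re.compile(r"\d"),
    # re.escape makes every punctuation character safe inside the class.
    "punctuation": re.compile(f"[{re.escape(string.punctuation)}]"),
    # Matches dates such as 12/05/2024 anywhere in the text.
    "dates": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    # Rough approximation: covers common emoji blocks only.
    "emojis": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}


def is_mixed_case(text):
    """Return True when the text has both lowercase and uppercase letters."""
    return text != text.lower() and text != text.upper()


def simple_eda_counts(texts):
    """Count how many texts trigger each check."""
    counts = {name: 0 for name in PATTERNS}
    counts["mixed_case"] = 0
    for text in texts:
        if is_mixed_case(text):
            counts["mixed_case"] += 1
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


sample = [
    "Check <b>this</b> out: https://example.com #nlp",
    "mail me at user@example.com on 12/05/2024",
    "all lowercase text",
]
report = simple_eda_counts(sample)
for name, count in report.items():
    print(f"{name}: {count} of {len(sample)} texts")
```

Returning counts instead of printing fixed messages makes it easier to judge how widespread each issue is before choosing pre-processing steps.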