LakshmiHarika committed
Commit e8c3b76 · verified · 1 Parent(s): 7b48346

Create 5Text Preprocessing.py

Files changed (1)
  1. pages/5Text Preprocessing.py +310 -0
pages/5Text Preprocessing.py ADDED
@@ -0,0 +1,310 @@
+ import streamlit as st
+
+ st.markdown(
+     """
+     <style>
+     body {
+         background-color: #f9f9f9; /* Light gray background */
+         font-family: 'Arial', sans-serif;
+     }
+     @keyframes fadeIn {
+         0% { opacity: 0; transform: translateY(-20px); }
+         100% { opacity: 1; transform: translateY(0); }
+     }
+     .title {
+         text-align: center;
+         color: #2c3e50; /* Deep gray-blue */
+         font-size: 3rem;
+         font-weight: bold;
+         animation: fadeIn 1s ease-in-out;
+     }
+     .caption {
+         text-align: center;
+         font-style: italic;
+         font-size: 1.2rem;
+         color: #7f8c8d; /* Soft gray */
+         animation: fadeIn 1.5s ease-in-out;
+     }
+     .section {
+         font-size: 1.1rem;
+         text-align: justify;
+         line-height: 1.8;
+         color: #34495e; /* Muted gray */
+         background: #ffffff; /* White card-style background */
+         padding: 20px;
+         border-radius: 10px;
+         box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
+         animation: fadeIn 2s ease-in-out;
+         margin: 10px 0;
+     }
+     .image-container {
+         text-align: center;
+         margin: 20px 0;
+         animation: fadeIn 2.5s ease-in-out;
+     }
+     .image-container img {
+         border-radius: 15px;
+         box-shadow: 0 4px 8px rgba(0, 0, 0, 0.2);
+         transition: transform 0.3s ease-in-out;
+     }
+     .image-container img:hover {
+         transform: scale(1.05); /* Subtle zoom effect */
+     }
+     </style>
+     """,
+     unsafe_allow_html=True,
+ )
+ st.header(":blue[✨ Pre-processing of Text 🗺️]")
+
+ st.markdown("<div class='section'>", unsafe_allow_html=True)
+ st.markdown("<h2 class='title'>🔍 Transforming Raw Text</h2>", unsafe_allow_html=True)
+ st.markdown("<p class='subtitle'>Convert unstructured text into a clean and structured format</p>", unsafe_allow_html=True)
+
+ st.info("📌 **We preprocess text in three key ways:**\n\n✅ Cleaning - Problem-specific\n\n✅ Simple Pre-processing\n\n✅ Advanced Pre-processing")
+
+ st.markdown("</div>", unsafe_allow_html=True)
+
+
+ st.markdown("### ✨ **Essential Preprocessing Techniques:**")
+
+ st.markdown("✅ **Convert Text Case** – Convert all words to **uppercase** or **lowercase** to keep the text consistent and reduce dimensionality (fewer distinct tokens).")
+ st.markdown("✅ **Handle URLs and Tags** – Depending on the problem statement, either remove or preserve them.")
+ st.markdown("✅ **Mentions, Digits, Emails** – Generally removed unless required by the analysis.")
+ st.markdown("✅ **Preserve Emojis** – Emojis often carry sentiment, which matters for tasks such as sentiment analysis.")
+ st.markdown("✅ **Grammar Preservation** – If grammar is needed, avoid removing punctuation.")
+
+ st.success("🚀 Clean, well-structured text generally helps ML models perform better!")
+
+
+ st.markdown("<div class='section'>", unsafe_allow_html=True)
+ st.markdown("<h2 class='title'>🔍 NLP Data Preprocessing</h2>", unsafe_allow_html=True)
+ st.markdown("<p class='subtitle'>Transforming raw text into structured data for better ML performance</p>", unsafe_allow_html=True)
+
+
+ st.success("📌 **Benefits of Preprocessing:**\n\n✅ Reduces dimensionality\n\n✅ Improves ML performance\n\n✅ Converts raw text into problem-specific structured data")
+
+ st.markdown("### ✨ **Essential Preprocessing Steps:**")
+
+ st.markdown(
+     """
+     <div class='image-container'>
+         <img src="https://cdn-uploads.huggingface.co/production/uploads/66bde9bf3c885d04498227a0/HtdtNm-UJdfN057BeKSgV.png" width="400">
+     </div>
+     """,
+     unsafe_allow_html=True,
+ )
+
+
+ st.markdown("✅ **Converting Text Case** – Reduces dimensionality; which case to use depends on the problem statement.")
+ st.markdown("✅ **Removing URLs, Tags, and Mentions** – Retain only if required by the problem statement.")
+ st.markdown("✅ **Handling Emojis** – Preserve or convert emoji data based on context.")
+ st.markdown("✅ **Expanding Contractions & Acronyms** – Convert shortened forms into standard text (e.g., *don't* ➝ *do not*).")
+ st.markdown("✅ **Stop Words Removal** – Optional, useful for text simplification.")
+ st.markdown("✅ **Stemming & Lemmatization** – Perform only if grammar is **not** crucial for analysis.")
+
+ st.markdown("</div>", unsafe_allow_html=True)
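Note: the steps listed above (case conversion, contraction expansion, stop-word removal) can be sketched on a single string with the standard library alone. This is a minimal illustration, not part of the committed file: `CONTRACTIONS`, `STOP_WORDS`, and `simple_preprocess` are hypothetical names, and the two lookup tables are deliberately tiny stand-ins for a real contraction map and stop-word list.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOP_WORDS = {"the", "is", "it", "a", "an", "do", "this"}  # "not" is kept on purpose


def simple_preprocess(text: str) -> list[str]:
    """Lower-case, expand contractions, tokenize, and drop stop words."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(simple_preprocess("It's a film I don't like"))
# ['film', 'i', 'not', 'like']
```

Keeping "not" out of the stop-word set preserves the negation, which would otherwise flip the meaning of the sentence for a sentiment model.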
+
+ st.markdown("<h1 class='header-title'>🔍 Stemming & Lemmatization 💬</h1>", unsafe_allow_html=True)
+
+ st.markdown(
+     """
+     <div class='info-box'>
+         <p>📝 In English, words are often made up of three components:</p>
+         <ul>
+             <li>🔹 <span class='highlight'>Prefix</span> + <span class='highlight'>Word</span> + <span class='highlight'>Suffix</span></li>
+         </ul>
+         <p>✅ Words without a suffix are called <span class='highlight'>Root Words</span>.</p>
+         <p>✅ If a suffix is added to a root word, the resulting word is an <span class='highlight'>Inflected Word</span>:</p>
+         <ul>
+             <li>🛠️ <span class='highlight'>Root Word</span> + <span class='highlight'>Suffix</span> = Inflected Word</li>
+         </ul>
+         <p>💬 The process of removing the suffix from inflected words to get the root word is known as:</p>
+         <ul>
+             <li>✂️ <span class='highlight'>Stemming</span></li>
+             <li>🧠 <span class='highlight'>Lemmatization</span></li>
+         </ul>
+     </div>
+     """,
+     unsafe_allow_html=True
+ )
+
+ st.markdown("<h1 class='header-title'>🌿 Stemming 🔎</h1>", unsafe_allow_html=True)
+
+
+ st.markdown(
+     """
+     <div class='info-box'>
+         <p>📝 <span class='highlight'>Stemming</span> is the process of reducing an <b>inflected word</b> to its root form, known as the <span class='highlight'>stem</span>.</p>
+         <ul>
+             <li>🔹 <span class='highlight'>Inflected word ➝ Root word (Stem)</span></li>
+             <li>⚑ The <b>stem may not always be a valid English word</b>.</li>
+             <li>🚀 <span class='highlight'>Stemming is faster</span> than lemmatization.</li>
+             <li>⚑ It only <b>removes</b> suffixes; the result is not checked against a dictionary.</li>
+             <li>🔹 Stemming is a common choice for <b>retrieval systems</b> such as search, where speed matters more than readable output.</li>
+         </ul>
+     </div>
+     """,
+     unsafe_allow_html=True
+ )
+
+ st.markdown("<h2 class='sub-header'>📌 Types of Stemming</h2>", unsafe_allow_html=True)
+ st.markdown("""
+ - There are **three** major types of stemming techniques:
+     - 🔹 **Porter Stemmer** 🏛️ (Rule-based, works in 5 stages)
+     - 🔹 **Snowball Stemmer** ❄️ (Rule-based, language adaptable)
+     - 🔹 **Lancaster Stemmer** 🔍 (Iterative, aggressive removal)
+ """)
+
+ st.markdown("<h2 class='sub-header'>🏛️ Porter Stemmer</h2>", unsafe_allow_html=True)
+ st.markdown(
+     """
+     <div class='info-box'>
+         <ul>
+             <li>🔹 A rule-based algorithm for stemming.</li>
+             <li>🔹 Each word is matched against a fixed set of suffix rules.</li>
+             <li>🔹 The rules are applied in five sequential stages, each removing or rewriting a suffix, until the inflection is stripped.</li>
+             <li>🔹 Works only for the English language.</li>
+         </ul>
+     </div>
+     """,
+     unsafe_allow_html=True
+ )
+
+ st.markdown("<h2 class='sub-header'>❄️ Snowball Stemmer</h2>", unsafe_allow_html=True)
+ st.markdown(
+     """
+     <div class='info-box'>
+         <ul>
+             <li>🔹 An advanced version of the Porter Stemmer.</li>
+             <li>🔹 Can be applied to multiple languages.</li>
+         </ul>
+     </div>
+     """,
+     unsafe_allow_html=True
+ )
+
+
+ st.markdown("<h2 class='sub-header'>🔍 Lancaster Stemmer</h2>", unsafe_allow_html=True)
+ st.markdown(
+     """
+     <div class='info-box'>
+         <ul>
+             <li>🔹 An iterative algorithm for stemming.</li>
+             <li>🔹 Removes suffixes in multiple iterations.</li>
+             <li>⚠️ More aggressive removal, which might result in non-English words.</li>
+         </ul>
+     </div>
+     """,
+     unsafe_allow_html=True
+ )
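Note: the idea shared by all three stemmers, rule-based suffix stripping with no dictionary check, can be sketched in a few lines. This is a toy under stated assumptions, not the Porter, Snowball, or Lancaster rule set: `SUFFIX_RULES` and `toy_stem` are hypothetical names, and the `passes` parameter only imitates how an iterative stemmer keeps stripping.

```python
# Toy suffix rules, tried in order; NOT the Porter, Snowball, or Lancaster rules.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def toy_stem(word: str, passes: int = 1) -> str:
    """Strip at most one matching suffix per pass; more passes is more aggressive."""
    stem = word.lower()
    for _ in range(passes):
        for suffix, replacement in SUFFIX_RULES:
            # Keep at least three characters so short words survive intact.
            if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
                stem = stem[: len(stem) - len(suffix)] + replacement
                break
        else:
            break  # no rule matched, so there is nothing left to strip
    return stem


for word in ["studies", "jumping", "walked", "cats"]:
    print(word, "->", toy_stem(word))
# studies -> studi
# jumping -> jump
# walked -> walk
# cats -> cat

print(toy_stem("meetings", passes=2))
# meet
```

"studi" is not an English word, which is the trade-off described above: the stem is cheap to compute but is never validated.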
+
+ st.markdown("<h1 class='header-title'>📖 Lemmatization 🔎</h1>", unsafe_allow_html=True)
+
+ st.markdown(
+     """
+     <div class='info-box'>
+         <p>📝 <span class='highlight'>Lemmatization</span> is the process of reducing an inflected word to its root form, known as the <span class='highlight'>lemma</span>.</p>
+         <ul>
+             <li>🔹 <span class='highlight'>Inflected word ➝ Root word (Lemma)</span></li>
+             <li>✅ The lemma is a valid dictionary word.</li>
+             <li>🐢 <span class='highlight'>Lemmatization is slower</span> than stemming.</li>
+             <li>🔍 It combines suffix removal with a dictionary lookup.</li>
+             <li>📝 Used when we need to preserve grammar in text.</li>
+         </ul>
+     </div>
+     """,
+     unsafe_allow_html=True
+ )
+
+ st.markdown("<h2 class='sub-header'>📚 WordNet Lemmatizer</h2>", unsafe_allow_html=True)
+
+ st.markdown(
+     """
+     <div class='info-box'>
+         <ul>
+             <li>🔹 Takes an inflected word as input.</li>
+             <li>🗄️ Looks it up in WordNet, a large lexical database of English.</li>
+             <li>🔄 Applies suffix rules and checks each candidate against WordNet:
+                 <ul>
+                     <li>✔️ If a candidate is an actual English word in WordNet, it is returned as the lemma.</li>
+                     <li>❌ If no candidate is found, the input word is returned unchanged.</li>
+                 </ul>
+             </li>
+         </ul>
+     </div>
+     """,
+     unsafe_allow_html=True
+ )
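Note: the lookup-and-validate idea behind a dictionary-based lemmatizer can be sketched without WordNet. This is a toy under stated assumptions, not the NLTK implementation: `DICTIONARY`, `EXCEPTIONS`, `RULES`, and `toy_lemmatize` are hypothetical names, and the five-word dictionary stands in for the real lexical database.

```python
# Hypothetical mini-dictionary standing in for WordNet.
DICTIONARY = {"study", "run", "cat", "walk", "good"}
# Irregular forms that suffix rules cannot reach.
EXCEPTIONS = {"better": "good", "ran": "run"}
# (suffix, replacement) pairs tried in order.
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_lemmatize(word: str) -> str:
    """Return a dictionary-validated base form, or the input if none is found."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in DICTIONARY:
                return candidate
    return word  # nothing validated: return the input unchanged


for word in ["studies", "walked", "better", "bus"]:
    print(word, "->", toy_lemmatize(word))
# studies -> study
# walked -> walk
# better -> good
# bus -> bus
```

The dictionary check is what keeps "bus" intact: stripping the "s" gives "bu", which is rejected because it is not a known word. That extra lookup is also why lemmatization is slower than stemming.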
+
+ st.code(r'''
+ import re
+ import string
+
+ import contractions
+ import emoji
+ from nltk.corpus import stopwords
+ from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
+
+
+ # Needs the NLTK "stopwords" and "wordnet" corpora (fetch them with nltk.download).
+ def pre_process(data, col, case="lower", tags=True, url=True, mail=True, mentions=True,
+                 digits=True, dates=True, emojis=True, contraction=True, stopwordss=True,
+                 inflection="stem", stemmer="porter", punc=True):
+     # Keep "not" so that negation survives stop-word removal
+     stp = set(stopwords.words("english")) - {"not"}
+     # Option names other than "porter" are assumed; only "porter" appears in the signature
+     stemmers = {
+         "porter": PorterStemmer(),
+         "lancaster": LancasterStemmer(),
+         "snowball": SnowballStemmer(language="english"),
+     }
+     wl = WordNetLemmatizer()
+
+     ## emoji: replace each emoji with its text name, kept apart from neighbouring words
+     if emojis:
+         data[col] = data[col].apply(lambda x: emoji.demojize(x, delimiters=(" ", " ")))
+     ## case
+     if case == "lower":
+         data[col] = data[col].str.lower()
+     elif case == "upper":
+         data[col] = data[col].str.upper()
+     ## tags
+     if tags:
+         data[col] = data[col].apply(lambda x: re.sub(r"<.*?>", " ", x))
+     ## urls (http and https, any letter case)
+     if url:
+         data[col] = data[col].apply(lambda x: re.sub(r"https?://\S+", " ", x, flags=re.IGNORECASE))
+     ## mails
+     if mail:
+         data[col] = data[col].apply(lambda x: re.sub(r"\S+@\S+", " ", x))
+     ## mentions and hashtags
+     if mentions:
+         data[col] = data[col].apply(lambda x: re.sub(r"\B[@#]\S+", " ", x))
+     ## dates (must run before digit removal, otherwise nothing is left to match)
+     if dates:
+         data[col] = data[col].apply(lambda x: re.sub(r"\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}\b", " ", x))
+         data[col] = data[col].apply(lambda x: re.sub(r"\b[0-9]{4}/[0-9]{1,2}/[0-9]{1,2}\b", " ", x))
+     ## digits
+     if digits:
+         data[col] = data[col].apply(lambda x: re.sub(r"\d", " ", x))
+     ## contractions
+     if contraction:
+         data[col] = data[col].apply(contractions.fix)
+     ## punctuations
+     if punc:
+         data[col] = data[col].apply(lambda x: re.sub(f"[{re.escape(string.punctuation)}]", " ", x))
+     ## stop words (whitespace tokens; an assumed, minimal implementation of this flag)
+     if stopwordss:
+         data[col] = data[col].apply(lambda x: " ".join(w for w in x.split() if w.lower() not in stp))
+     ## inflection ("lemma" is an assumed option name; only "stem" appears in the signature)
+     if inflection == "stem":
+         stem = stemmers.get(stemmer, stemmers["porter"]).stem
+         data[col] = data[col].apply(lambda x: " ".join(stem(w) for w in x.split()))
+     elif inflection == "lemma":
+         data[col] = data[col].apply(lambda x: " ".join(wl.lemmatize(w) for w in x.split()))
+
+     return data
+ ''')
+
+ st.markdown('''
+ - `pre_process` returns the data with the text in the chosen column pre-processed.
+ - The cleaned text is then ready for feature engineering.
+ ''')
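Note: the regex-based cleaning steps of `pre_process` can be tried on a plain string without pandas, `emoji`, or `contractions`. This is a minimal sketch that is not part of the committed file: `PATTERNS` and `clean_text` are hypothetical names, the sample text is made up, and the patterns are adapted from the function (the URL pattern also accepts `http`, and the date pattern is not anchored to the whole string).

```python
import re

# (pattern, replacement) pairs adapted from the cleaning steps in pre_process.
# Dates come before digits: once digits are blanked, no date is left to match.
PATTERNS = [
    (r"<.*?>", " "),                      # HTML tags
    (r"https?://\S+", " "),               # URLs
    (r"\S+@\S+", " "),                    # e-mail addresses
    (r"\B[@#]\S+", " "),                  # @mentions and #hashtags
    (r"\b\d{1,2}/\d{1,2}/\d{4}\b", " "),  # dates such as 12/05/2024
    (r"\d", " "),                         # remaining digits
]


def clean_text(text: str) -> str:
    """Apply each pattern in order, then collapse the gaps left by removals."""
    for pattern, repl in PATTERNS:
        text = re.sub(pattern, repl, text)
    return " ".join(text.split())


sample = "<p>Mail me@site.com or ping @bob, see https://x.io/a on 12/05/2024</p>"
print(clean_text(sample))
# Mail or ping see on
```

Replacing matches with a space rather than an empty string keeps neighbouring words from being glued together; the final `split`/`join` then removes the extra whitespace.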