Harika22 committed on
Commit d8b57db · verified · 1 Parent(s): 2a35973

Update pages/5_Pre-procesing_of_text.py

Files changed (1)
  1. pages/5_Pre-procesing_of_text.py +107 -0
pages/5_Pre-procesing_of_text.py CHANGED
@@ -127,3 +127,110 @@ st.markdown(
  """,
  unsafe_allow_html=True
  )
+
+ st.markdown("<h1 class='header-title'>🌿 Stemming 🔎</h1>", unsafe_allow_html=True)
+
+
+ st.markdown(
+ """
+ <div class='info-box'>
+ <p>📝 <span class='highlight'>Stemming</span> is the process of reducing an <b>inflected word</b> to its root form, known as the <span class='highlight'>stem</span>.</p>
+ <ul>
+ <li>🔹 <span class='highlight'>Inflected word ➝ Root word (Stem)</span></li>
+ <li>⚡ The <b>stem may not always be a valid English word</b>.</li>
+ <li>🚀 <span class='highlight'>Stemming is faster</span> than lemmatization.</li>
+ <li>⚡ It works purely by <b>suffix removal</b>, with no dictionary lookup.</li>
+ <li>🔹 It is commonly used in <b>information-retrieval systems</b> such as search engines.</li>
+ </ul>
+ </div>
+ """,
+ unsafe_allow_html=True
+ )
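
The bullet points above can be illustrated with a minimal sketch of suffix stripping. This is a toy example, not the Porter algorithm: `SUFFIX_RULES` and `toy_stem` are hypothetical names invented here, and the four rules are only assumed for demonstration. It shows why a stem such as `studi` need not be a valid English word.

```python
# Toy suffix-stripping stemmer: an illustrative sketch, NOT the Porter algorithm.
# SUFFIX_RULES and toy_stem are hypothetical names chosen for this example.
SUFFIX_RULES = [
    ("ies", "i"),  # studies -> studi
    ("ing", ""),   # jumping -> jump
    ("ed", ""),    # jumped  -> jump
    ("s", ""),     # cats    -> cat
]


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule once; no dictionary lookup."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["studies", "jumping", "jumped", "cats", "is"]:
        print(w, "->", toy_stem(w))
```

Because nothing checks the result against a dictionary, the function is fast but can return non-words.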
+
+ st.markdown("<h2 class='sub-header'>📌 Types of Stemming</h2>", unsafe_allow_html=True)
+ st.markdown("""
+ - There are **three** major types of stemming techniques:
+ - 🔹 **Porter Stemmer** 🏛️ (Rule-based, works in 5 stages)
+ - 🔹 **Snowball Stemmer** ❄️ (Rule-based, language-adaptable)
+ - 🔹 **Lancaster Stemmer** 🔍 (Iterative, aggressive removal)
+ """)
+
+ st.markdown("<h2 class='sub-header'>🏛️ Porter Stemmer</h2>", unsafe_allow_html=True)
+ st.markdown(
+ """
+ <div class='info-box'>
+ <ul>
+ <li>🔹 A <b>rule-based algorithm</b> for stemming.</li>
+ <li>🔹 Each word is matched against a fixed set of suffix rules.</li>
+ <li>🔹 The rules are grouped into five stages that run in order; each stage removes or rewrites a suffix until the inflection is gone.</li>
+ <li>🔹 Works <b>only for the English language</b>.</li>
+ </ul>
+ </div>
+ """,
+ unsafe_allow_html=True
+ )
+
+ st.markdown("<h2 class='sub-header'>❄️ Snowball Stemmer</h2>", unsafe_allow_html=True)
+ st.markdown(
+ """
+ <div class='info-box'>
+ <ul>
+ <li>🔹 An <b>advanced version of the Porter Stemmer</b>.</li>
+ <li>🔹 Can be applied to <b>multiple languages</b>.</li>
+ </ul>
+ </div>
+ """,
+ unsafe_allow_html=True
+ )
+
+
+ st.markdown("<h2 class='sub-header'>🔍 Lancaster Stemmer</h2>", unsafe_allow_html=True)
+ st.markdown(
+ """
+ <div class='info-box'>
+ <ul>
+ <li>🔹 An <b>iterative algorithm</b> for stemming.</li>
+ <li>🔹 Removes suffixes in <b>multiple iterations</b>.</li>
+ <li>⚠️ <b>More aggressive removal</b>, which might result in <b>non-English words</b>.</li>
+ </ul>
+ </div>
+ """,
+ unsafe_allow_html=True
+ )
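
To make the contrast between staged and iterative stemming concrete, here is a hedged sketch of the two control flows. The rule table `STAGES` and the functions `apply_first`, `staged_stem`, and `iterative_stem` are hypothetical and heavily simplified. The real Porter and Lancaster rule sets are much larger and attach conditions to each rule. Only the control flow is meant to carry over: fixed stages run once each, versus rules reapplied until nothing matches.

```python
# Hypothetical toy rules for illustration only; real Porter and Lancaster rule
# tables are far larger and include conditions on the remaining stem.
STAGES = [
    [("sses", "ss"), ("ies", "i"), ("s", "")],  # stage 1: plurals
    [("ing", ""), ("ed", "")],                   # stage 2: verb endings
    [("ation", "ate"), ("ness", "")],            # stage 3: derivational suffixes
]
ALL_RULES = [rule for stage in STAGES for rule in stage]


def apply_first(word, rules, min_stem=2):
    """Apply the first rule whose suffix matches, keeping min_stem characters."""
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


def staged_stem(word):
    """Staged control flow: each stage runs exactly once, in order."""
    for rules in STAGES:
        word = apply_first(word, rules)
    return word


def iterative_stem(word):
    """Iterative control flow: reapply the rules until none of them matches."""
    while True:
        stemmed = apply_first(word, ALL_RULES)
        if stemmed == word:
            return word
        word = stemmed


if __name__ == "__main__":
    for w in ["classes", "cats", "fishing"]:
        print(w, "staged:", staged_stem(w), "iterative:", iterative_stem(w))
```

With these toy rules, `classes` becomes `class` under the staged flow but is cut down to `cla` under the iterative flow, which is the kind of over-stemming the warning above refers to.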
+
+ st.markdown("<h1 class='header-title'>📖 Lemmatization 🔎</h1>", unsafe_allow_html=True)
+
+ st.markdown(
+ """
+ <div class='info-box'>
+ <p>📝 <span class='highlight'>Lemmatization</span> is the process of reducing an <b>inflected word</b> to its dictionary form, known as the <span class='highlight'>lemma</span>.</p>
+ <ul>
+ <li>🔹 <span class='highlight'>Inflected word ➝ Root word (Lemma)</span></li>
+ <li>✅ The <b>lemma is always an actual English word</b>.</li>
+ <li>🐒 <span class='highlight'>Lemmatization is slower</span> than stemming.</li>
+ <li>🔍 It performs <b>both suffix removal and a dictionary check</b>.</li>
+ <li>📝 <b>Used when the reduced words must remain valid, meaningful words</b> in the text.</li>
+ </ul>
+ </div>
+ """,
+ unsafe_allow_html=True
+ )
+
+ st.markdown("<h2 class='sub-header'>📚 WordNet Lemmatizer</h2>", unsafe_allow_html=True)
+
+ st.markdown(
+ """
+ <div class='info-box'>
+ <ul>
+ <li>🔹 Takes an <b>inflected word</b> as input.</li>
+ <li>🗄️ Looks it up in <b>WordNet</b>, a large lexical database of English words.</li>
+ <li>🔄 <b>Applies suffix-removal rules</b> and checks each candidate:
+ <ul>
+ <li>✔️ If the candidate is an <b>actual English word</b> found in WordNet, it is returned as the lemma.</li>
+ <li>❌ If <b>no candidate is found</b>, the input word is returned unchanged.</li>
+ </ul></li>
+ </ul>
+ </div>
+ """,
+ unsafe_allow_html=True
+ )
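
The lookup-and-check idea described for the WordNet Lemmatizer can be sketched in a few lines. This is an assumption-laden toy, not the WordNet implementation: `TOY_LEXICON` stands in for WordNet, `IRREGULAR` for its exception lists, and `DETACHMENT_RULES` and `toy_lemmatize` are hypothetical names introduced only for this example.

```python
# Illustrative only: TOY_LEXICON stands in for WordNet, and the rule list is a
# small hypothetical subset of suffix-detachment rules.
TOY_LEXICON = {"study", "run", "cat", "jump", "be", "good"}
IRREGULAR = {"ran": "run", "is": "be", "better": "good"}  # exception list
DETACHMENT_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]


def toy_lemmatize(word: str) -> str:
    """Return a lemma found in the lexicon, or the word unchanged if none is found."""
    word = word.lower()
    if word in IRREGULAR:      # irregular forms are resolved by direct lookup
        return IRREGULAR[word]
    if word in TOY_LEXICON:    # already a dictionary form
        return word
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in TOY_LEXICON:  # accept only real dictionary entries
                return candidate
    return word                # nothing validated: leave the word as it is


if __name__ == "__main__":
    for w in ["studies", "cats", "jumping", "ran", "xyzzing"]:
        print(w, "->", toy_lemmatize(w))
```

Unlike a stemmer, every candidate must pass the dictionary check before it is returned, which is what makes lemmatization slower but keeps the output a valid word.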