Mpavan45 committed · verified
Commit a42352d · Parent(s): c495b41

Update app.py

Files changed (1): app.py (+47 −5)
app.py CHANGED
@@ -223,18 +223,60 @@ elif st.session_state.selected_page == "NLP Lifecycle":
        st.write("""
        #### 🧹 4. Text Preprocessing
        Text preprocessing prepares raw text for further analysis. This stage involves cleaning and transforming the data into a structured format that machine learning models can understand.

        - **Tokenization**: Splitting text into smaller units (e.g., words, phrases).
        - **Stop Words Removal**: Removing common words that don’t contribute much information.
        - **Lemmatization**: Converting words into their base or dictionary form.
        - **Stemming**: Cutting off prefixes or suffixes from words.
        - **Lowercasing**: Converting all characters in the text to lowercase.
-
-       **Example**: For the sentence "The quick brown fox is running fast", after preprocessing:
-       - Tokenization: ["The", "quick", "brown", "fox", "is", "running", "fast"]
-       - Stop Words Removal: ["quick", "brown", "fox", "running", "fast"]
-       - Lemmatization: ["quick", "brown", "fox", "run", "fast"]
        """)

    elif lifecycle_option == "Feature Engineering":
        st.write("""
        #### 📝 5. Text Representation
 
        st.write("""
        #### 🧹 4. Text Preprocessing
        Text preprocessing prepares raw text for further analysis. This stage involves cleaning and transforming the data into a structured format that machine learning models can understand.
+
+       **Key Steps in Text Preprocessing:**
        - **Tokenization**: Splitting text into smaller units (e.g., words, phrases).
        - **Stop Words Removal**: Removing common words that don’t contribute much information.
        - **Lemmatization**: Converting words into their base or dictionary form.
        - **Stemming**: Cutting off prefixes or suffixes from words.
        - **Lowercasing**: Converting all characters in the text to lowercase.
+       - **HTML Tag Removal**: Eliminating any HTML tags like `<p>`, `<a>`, `<b>`, etc.
+       - **URL Removal**: Stripping out URLs such as `http://example.com` or `www.example.com`.
+       - **Emoji Removal**: Removing emojis (e.g., 🙂, 🚀), as they are typically non-informative for analysis.
+       - **Hashtag Removal**: Removing hashtags (e.g., `#data`, `#AI`) that may not be relevant for textual analysis.
+       - **Special Characters Removal**: Stripping out symbols or characters that don't contribute to the meaning of the text.
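The classic steps above (tokenization, lowercasing, stop-word removal, suffix handling) can be sketched without external libraries. This is an illustrative stand-in only: the tiny `STOP_WORDS` set and the crude suffix rule are placeholders for a real stop-word corpus and a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list; real pipelines use a full corpus
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def preprocess(text):
    tokens = re.findall(r"\w+", text)                    # tokenization
    tokens = [t.lower() for t in tokens]                 # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    # crude suffix stripping as a stand-in for stemming ("running" -> "runn")
    return [re.sub(r"(ing|ed|ly|s)$", "", t) if len(t) > 4 else t
            for t in tokens]

print(preprocess("The quick brown fox is running fast"))
# ['quick', 'brown', 'fox', 'runn', 'fast']
```

The imperfect stem `runn` is exactly why real stemmers (and, better, lemmatizers, which map `running` to `run`) are preferred in practice.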
+
+       **Example**: For the sentence "The quick brown fox is running fast 🦊 #awesome http://example.com", after preprocessing:
+       - Tokenization: ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome", "http://example.com"]
+       - HTML Tag Removal: ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome", "http://example.com"] (no tags present, so unchanged)
+       - URL Removal: ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome"]
+       - Emoji Removal: ["The", "quick", "brown", "fox", "is", "running", "fast", "#awesome"]
+       - Hashtag Removal: ["The", "quick", "brown", "fox", "is", "running", "fast"]
+
+       Now, let's apply the necessary text preprocessing steps to clean up the data:
+
+       ```python
+       import re
+       from bs4 import BeautifulSoup
+
+       # Sample data
+       data = "Check out this amazing post! 😊 #awesome #data http://example.com Visit us at www.example.com! 🚀 Let's talk about AI! #AI #machinelearning"
+
+       # Remove HTML tags using BeautifulSoup
+       cleaned_data = BeautifulSoup(data, "html.parser").get_text()
+
+       # Remove URLs using a regular expression
+       cleaned_data = re.sub(r'http\S+|www\S+', '', cleaned_data)
+
+       # Remove hashtags (words starting with #) before the catch-all step below,
+       # which would otherwise strip only the '#' and leave the hashtag word behind
+       cleaned_data = re.sub(r'#\w+', '', cleaned_data)
+
+       # Remove emojis and remaining special characters (anything that is not
+       # a word character, whitespace, or comma)
+       cleaned_data = re.sub(r'[^\w\s,]', '', cleaned_data)
+
+       # Collapse the whitespace left behind by the removals
+       cleaned_data = re.sub(r'\s+', ' ', cleaned_data).strip()
+
+       st.write(f"Cleaned Data: {cleaned_data}")
+       ```
+
+       **Output**: After cleaning, the data will look like:
+       ```
+       Check out this amazing post Visit us at Lets talk about AI
+       ```
+
+       By following these preprocessing steps, the raw text is now ready for further analysis or machine learning tasks.
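A note on the catch-all `[^\w\s,]` pattern: it strips punctuation and apostrophes along with emojis. When only emojis should be removed, a Unicode-range pattern is a common alternative; the ranges below are a sketch covering the main emoji blocks, not every pictographic symbol.

```python
import re

# Main emoji blocks (not exhaustive; many symbols live elsewhere in Unicode)
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"   # emoticons (e.g. smileys)
    "\U0001F300-\U0001F5FF"   # symbols & pictographs
    "\U0001F680-\U0001F6FF"   # transport & map symbols (e.g. rockets)
    "\U0001F900-\U0001F9FF"   # supplemental symbols & pictographs (e.g. fox face)
    "\U0001F1E6-\U0001F1FF"   # regional indicator (flag) letters
    "]+"
)

def remove_emojis(text):
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("Let's talk about AI! 😊🚀"))  # punctuation survives
```

This keeps `Let's` and `!` intact, which matters when later steps (e.g. sentence splitting) rely on punctuation.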
        """)

+
    elif lifecycle_option == "Feature Engineering":
        st.write("""
        #### 📝 5. Text Representation