Update app.py
app.py CHANGED
@@ -223,18 +223,60 @@ elif st.session_state.selected_page == "NLP Lifecycle":
st.write("""
#### 🧹 4. Text Preprocessing
Text preprocessing prepares raw text for further analysis. This stage involves cleaning and transforming the data into a structured format that machine learning models can understand.

**Key Steps in Text Preprocessing:**
- **Tokenization**: Splitting text into smaller units (e.g., words, phrases).
- **Stop Words Removal**: Removing common words that don't contribute much information.
- **Lemmatization**: Converting words into their base or dictionary form.
- **Stemming**: Cutting off prefixes or suffixes from words.
- **Lowercasing**: Converting all characters in the text to lowercase.
- **HTML Tag Removal**: Eliminating any HTML tags like `<p>`, `<a>`, `<b>`, etc.
- **URL Removal**: Stripping out URLs such as `http://example.com` or `www.example.com`.
- **Emoji Removal**: Removing emojis, which are typically non-informative for analysis.
- **Hashtag Removal**: Removing hashtags (e.g., `#data`, `#AI`) that might not be relevant for textual analysis.
- **Special Characters Removal**: Stripping out symbols or characters that don't contribute to the meaning of the text.
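The first five steps above can be sketched in plain Python. The tiny stop-word list and suffix rules below are illustrative assumptions only; a real pipeline would use a library such as NLTK or spaCy instead:

```python
import re

# Illustrative-only stop-word list; a real pipeline would use a library's list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def preprocess(text):
    # Lowercasing
    text = text.lower()
    # Tokenization: split into alphanumeric word tokens
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stop words removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude suffix stripping as a stand-in for real stemming
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

print(preprocess("The quick brown fox is running fast"))  # → ['quick', 'brown', 'fox', 'runn', 'fast']
```

Note the over-stripped "runn": a real stemmer (e.g. Porter) applies more careful rules, and lemmatization would instead map "running" to its dictionary form "run".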
**Example**: For the sentence "The quick brown fox is running fast 🦊 #awesome http://example.com", the steps proceed as follows:
- Tokenization: ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome", "http://example.com"]
- HTML Tag Removal (no tags present, so nothing changes): ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome", "http://example.com"]
- URL Removal: ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome"]
- Emoji Removal: ["The", "quick", "brown", "fox", "is", "running", "fast", "#awesome"]
- Hashtag Removal: ["The", "quick", "brown", "fox", "is", "running", "fast"]

Now, let's apply the necessary text preprocessing steps to clean up the data:
```python
import re
from bs4 import BeautifulSoup

# Sample data
data = "Check out this amazing post! 😊 #awesome #data http://example.com Visit us at www.example.com! 🎉 Let's talk about AI! #AI #machinelearning"

# Remove HTML tags using BeautifulSoup
cleaned_data = BeautifulSoup(data, "html.parser").get_text()

# Remove URLs using a regular expression
cleaned_data = re.sub(r'http\S+|www\S+', '', cleaned_data)

# Remove hashtags (words starting with #) before the special-characters pass,
# otherwise the '#' would already be stripped and the hashtag words would remain
cleaned_data = re.sub(r'#\w+', '', cleaned_data)

# Remove emojis and other special characters (keep word characters, whitespace, and apostrophes)
cleaned_data = re.sub(r"[^\w\s']", '', cleaned_data)

# Collapse the extra whitespace left behind by the removals
cleaned_data = ' '.join(cleaned_data.split())

st.write(f"Cleaned Data: {cleaned_data}")
```

**Output**: After cleaning, the data will look like:
```
Check out this amazing post Visit us at Let's talk about AI
```

By following these preprocessing steps, the raw text is now ready for further analysis or machine learning tasks.
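The catch-all special-characters pattern also strips punctuation such as `!`. Where punctuation should survive, one alternative is to match emoji code points explicitly; the ranges below cover common emoji blocks but are illustrative rather than exhaustive:

```python
import re

# Match common emoji blocks explicitly (a non-exhaustive list of Unicode
# ranges) so ordinary punctuation is left untouched.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def remove_emojis(text):
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("Great post! \U0001F600 Let's talk."))  # punctuation survives; only the emoji is removed
```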
""")

elif lifecycle_option == "Feature Engineering":
    st.write("""
#### 📊 5. Text Representation