Update app.py
app.py CHANGED
@@ -223,18 +223,60 @@ elif st.session_state.selected_page == "NLP Lifecycle":
st.write("""
#### 🧹 4. Text Preprocessing
Text preprocessing prepares raw text for further analysis. This stage involves cleaning and transforming the data into a structured format that machine learning models can understand.

**Key Steps in Text Preprocessing:**
- **Tokenization**: Splitting text into smaller units (e.g., words, phrases).
- **Stop Words Removal**: Removing common words that don't contribute much information.
- **Lemmatization**: Converting words into their base or dictionary form.
- **Stemming**: Cutting off prefixes or suffixes from words.
- **Lowercasing**: Converting all characters in the text to lowercase.
- **HTML Tag Removal**: Eliminating any HTML tags like `<p>`, `<a>`, `<b>`, etc.
- **URL Removal**: Stripping out URLs such as `http://example.com` or `www.example.com`.
- **Emoji Removal**: Removing emojis, which are typically non-informative for analysis.
- **Hashtag Removal**: Removing hashtags (e.g., `#data`, `#AI`) that might not be relevant for textual analysis.
- **Special Characters Removal**: Stripping out symbols or characters that don't contribute to the meaning of the text.
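The first five steps above can be sketched in plain Python. The tiny stop-word list and suffix rules below are illustrative assumptions only; a real pipeline would use a library such as NLTK or spaCy instead:

```python
import re

# Illustrative-only stop-word list; a real pipeline would use a library's list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def preprocess(text):
    # Lowercasing
    text = text.lower()
    # Tokenization: split into alphanumeric word tokens
    tokens = re.findall(r"[a-z0-9]+", text)
    # Stop words removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude suffix stripping as a stand-in for real stemming
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

print(preprocess("The quick brown fox is running fast"))  # → ['quick', 'brown', 'fox', 'runn', 'fast']
```

Note the over-stripped "runn": a real stemmer (e.g. Porter) applies more careful rules, and lemmatization would instead map "running" to its dictionary form "run".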
**Example**: For the sentence "The quick brown fox is running fast 🦊 #awesome http://example.com", the steps proceed as follows:
- Tokenization: ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome", "http://example.com"]
- HTML Tag Removal (no tags present, so nothing changes): ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome", "http://example.com"]
- URL Removal: ["The", "quick", "brown", "fox", "is", "running", "fast", "🦊", "#awesome"]
- Emoji Removal: ["The", "quick", "brown", "fox", "is", "running", "fast", "#awesome"]
- Hashtag Removal: ["The", "quick", "brown", "fox", "is", "running", "fast"]

Now, let's apply the necessary text preprocessing steps to clean up the data:
```python
import re
from bs4 import BeautifulSoup

# Sample data
data = "Check out this amazing post! 😊 #awesome #data http://example.com Visit us at www.example.com! 🎉 Let's talk about AI! #AI #machinelearning"

# Remove HTML tags using BeautifulSoup
cleaned_data = BeautifulSoup(data, "html.parser").get_text()

# Remove URLs using a regular expression
cleaned_data = re.sub(r'http\S+|www\S+', '', cleaned_data)

# Remove hashtags (words starting with #) before the special-characters pass,
# otherwise the '#' would already be stripped and the hashtag words would remain
cleaned_data = re.sub(r'#\w+', '', cleaned_data)

# Remove emojis and other special characters (keep word characters, whitespace, and apostrophes)
cleaned_data = re.sub(r"[^\w\s']", '', cleaned_data)

# Collapse the extra whitespace left behind by the removals
cleaned_data = ' '.join(cleaned_data.split())

st.write(f"Cleaned Data: {cleaned_data}")
```

**Output**: After cleaning, the data will look like:
```
Check out this amazing post Visit us at Let's talk about AI
```

By following these preprocessing steps, the raw text is now ready for further analysis or machine learning tasks.
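The catch-all special-characters pattern also strips punctuation such as `!`. Where punctuation should survive, one alternative is to match emoji code points explicitly; the ranges below cover common emoji blocks but are illustrative rather than exhaustive:

```python
import re

# Match common emoji blocks explicitly (a non-exhaustive list of Unicode
# ranges) so ordinary punctuation is left untouched.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)

def remove_emojis(text):
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis("Great post! \U0001F600 Let's talk."))  # punctuation survives; only the emoji is removed
```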
""")

elif lifecycle_option == "Feature Engineering":
    st.write("""
#### 📊 5. Text Representation