Update pages/5_Pre-procesing_of_text.py
- pages/5_Pre-procesing_of_text.py +14 -31
pages/5_Pre-procesing_of_text.py
CHANGED
|
@@ -54,43 +54,26 @@ st.markdown(
|
|
| 54 |
""",
|
| 55 |
unsafe_allow_html=True,
|
| 56 |
)
|
| 57 |
|
| 58 |
-
st.
|
| 59 |
-
st.markdown(
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
Cleaning - which is based on the problem statement
|
| 65 |
-
|
| 66 |
-
Simple pre-processing
|
| 67 |
-
|
| 68 |
-
Advance pre-processing
|
| 69 |
-
</div>
|
| 70 |
-
''',
|
| 71 |
-
unsafe_allow_html=True,
|
| 72 |
-
)
|
| 73 |
-
st.markdown('''
|
| 74 |
-
- Take a raw text and convert every character and word into single case
|
| 75 |
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
- or lower case
|
| 79 |
-
|
| 80 |
-
- based on the problem statement
|
| 81 |
-
|
| 82 |
-
- Because ML performance decreases as dimensionality increases: ML needs tabular data in which every column is a dimension
|
| 83 |
-
- Same as with urls and tags based on the problem statement
|
| 84 |
|
| 85 |
-
- If the problem statement says to preserve the data, we shouldn't remove those URLs and tags
|
| 86 |
-
|
| 87 |
-
- Mentions, digits and email addresses can be removed
|
| 88 |
-
- Emojis, however, shouldn't be removed: they now play a key role in conveying information, so we keep them to preserve that information
|
| 89 |
-
- When the problem statement says preserve the grammar then punctuations shouldn't be removed
|
| 90 |
-
''')
|
| 91 |
|
| 92 |
|
| 93 |
|
| 94 |
|
| 95 |
st.markdown(
|
| 96 |
"""
|
| 54 |
""",
|
| 55 |
unsafe_allow_html=True,
|
| 56 |
)
|
| 57 |
+
st.header(":blue[✨ Pre-processing of Text 🗺️]")
|
| 58 |
|
| 59 |
+
st.markdown("<div class='section'>", unsafe_allow_html=True)
|
| 60 |
+
st.markdown("<h2 class='title'>Transforming Raw Text</h2>", unsafe_allow_html=True)
|
| 61 |
+
st.markdown("<p class='subtitle'>Convert unstructured text into a clean and structured format</p>", unsafe_allow_html=True)
|
| 62 |
+
|
| 63 |
+
st.info("**We preprocess text in three key ways:**\n\n✅ Cleaning - Problem-specific\n\n✅ Simple Pre-processing\n\n✅ Advanced Pre-processing")
|
| 64 |
|
| 65 |
+
st.markdown("</div>", unsafe_allow_html=True)
|
| 66 |
|
| 67 |
|
| 68 |
+
st.markdown("### ✨ **Essential Preprocessing Techniques:**")
|
| 69 |
|
| 70 |
+
st.markdown("✅ **Convert Text Case** - Convert all words to **uppercase** or **lowercase** to maintain consistency and reduce dimensions.")
|
| 71 |
+
st.markdown("✅ **Handle URLs and Tags** - Based on the problem statement, either remove or preserve them.")
|
| 72 |
+
st.markdown("✅ **Mentions, Digits, Emails** - Generally removed unless required by the analysis.")
|
| 73 |
+
st.markdown("✅ **Preserve Emojis** - Emojis carry sentiment and play a crucial role in NLP tasks.")
|
| 74 |
+
st.markdown("✅ **Grammar Preservation** - If grammar is needed, avoid removing punctuation.")
|
| 75 |
|
| 76 |
+
st.success("Well-structured and clean text significantly boosts ML model performance!")
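The techniques listed on this page can be sketched in plain Python. The snippet below is only an illustration and is not part of this commit: the function name `clean_text`, its keyword flags and the regular expressions are assumptions chosen for the example, and the patterns are deliberately simple rather than exhaustive.

```python
import re
import string

# Hypothetical patterns for illustration; real data may need stricter ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")
# ASCII punctuation only, so emojis are never matched and always survive.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def clean_text(text, *, lowercase=True, keep_urls=False,
               keep_tags=False, keep_punctuation=False):
    """Apply problem-specific cleaning; the flags mirror the problem statement."""
    if not keep_urls:
        text = URL_RE.sub(" ", text)
    if not keep_tags:
        text = TAG_RE.sub(" ", text)
    # Emails are removed before mentions so "a@b.com" is not split at "@b".
    text = EMAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    if not keep_punctuation:
        # Caveat: this also strips punctuation inside any URLs or tags
        # that were kept, so combine the keep_* flags with care.
        text = PUNCT_RE.sub(" ", text)
    if lowercase:
        # A single case means "Text" and "text" share one vocabulary entry.
        text = text.lower()
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


sample = ("Check <b>THIS</b> out @john: https://example.com/a1 "
          "costs 20 dollars 😀! Mail me at a.b@mail.com")
print(clean_text(sample))
# check this out costs dollars 😀 mail me at
```

Passing `keep_punctuation=True` covers the case where the problem statement asks for grammar to be preserved, for example `clean_text("Hello, World!", keep_punctuation=True)` gives `hello, world!`.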
|
| 77 |
|
| 78 |
st.markdown(
|
| 79 |
"""
|