Update pages/5_Pre-procesing_of_text.py
pages/5_Pre-procesing_of_text.py CHANGED +18 -23
@@ -90,29 +90,7 @@ st.markdown('''
 ''')


-
-st.markdown(
-'''
-<div class='section'>
-Converts raw data into pre-processed data
-
-which has 2 benefits:
-
-Reduce the dimensionality ---> to increase the performance of ML
-
-Raw data - preprocessed data ---> required by the problem statement
-<ul>
-<li><b>Converting into particular case</b>So that highly we can reduce the dimensionalty,if the problem statement says that grammar should be preserved then no need of conversion</li>
-<li><b>Removing URL's / tags/mails/mentions</b>Converting or preserving information should be based on the problem statement</li>
-<li><b>Handling Emoji's</b>Emoji's data should be preserved</li>
-<li><b>Contractions and acronyms</b>Both the contractions and acronyms should be converted into general text</li>
-<li><b>Stop Words</b>Stop words make the grammar very clear</li>
-<li><b>Stemming and Lemmatization</b>Both are purely based on problem statement and if problem statement wants grammatical concept don't perform stemming</li>
-</ul>
-</div>
-''',
-unsafe_allow_html=True,
-)
+

 st.markdown(
 """
@@ -121,3 +99,20 @@ st.markdown(
 unsafe_allow_html=True,
 )

+st.markdown("<div class='section'>", unsafe_allow_html=True)
+st.markdown("<h2 class='title'>NLP Data Preprocessing</h2>", unsafe_allow_html=True)
+st.markdown("<p class='subtitle'>Transforming raw text into structured data for better ML performance</p>", unsafe_allow_html=True)
+
+
+st.success("**Benefits of Preprocessing:**\n\n✅ Reduces dimensionality\n\n✅ Improves ML performance\n\n✅ Converts raw text into problem-specific structured data")
+
+st.markdown("### ✨ **Essential Preprocessing Steps:**")
+
+st.markdown("✅ **Converting Text Case** – Reduces dimensionality; case conversion depends on problem statement.")
+st.markdown("✅ **Removing URLs, Tags, and Mentions** – Retain only if required by the problem statement.")
+st.markdown("✅ **Handling Emojis** – Preserve or convert emoji data based on context.")
+st.markdown("✅ **Expanding Contractions & Acronyms** – Convert abbreviations into standard text.")
+st.markdown("✅ **Stop Words Removal** – Optional, useful for text simplification.")
+st.markdown("✅ **Stemming & Lemmatization** – Perform only if grammar is **not** crucial for analysis.")
+
+st.markdown("</div>", unsafe_allow_html=True)
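The preprocessing steps named in the added lines can be illustrated with a minimal sketch. It is not part of this commit: `preprocess`, the regular expressions, and the tiny `CONTRACTIONS` and `STOP_WORDS` tables are hypothetical names chosen for illustration, and it uses only the Python standard library. Each step sits behind a flag because the page text says the choice depends on the problem statement. Emojis are left untouched, and stemming and lemmatization are omitted because they would normally need a third-party library.

```python
import re

# Deliberately tiny lookup tables, assumed for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")
TAG_RE = re.compile(r"<[^>]+>")


def preprocess(text, lowercase=True, strip_noise=True,
               expand_contractions=True, remove_stop_words=False):
    """Apply the selected cleaning steps and return a whitespace-normalized string."""
    if strip_noise:
        # E-mails are removed before mentions so "a@b.com" is not half-stripped.
        for pattern in (URL_RE, EMAIL_RE, MENTION_RE, TAG_RE):
            text = pattern.sub(" ", text)
    if lowercase:
        text = text.lower()
    tokens = text.split()
    if expand_contractions:
        # Exact-token lookup only; a contraction followed by punctuation is not matched.
        tokens = [w for t in tokens for w in CONTRACTIONS.get(t.lower(), t).split()]
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return " ".join(tokens)  # emojis pass through unchanged


print(preprocess("Check <b>this</b> out @bob https://x.io I don't mind 🙂"))
# check this out i do not mind 🙂
```

Setting `lowercase=False` keeps the original casing for tasks where grammar or named entities matter, and `remove_stop_words=True` drops the words in the assumed `STOP_WORDS` set.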