Update pages/5_Pre-procesing_of_text.py
pages/5_Pre-procesing_of_text.py
CHANGED
@@ -70,11 +70,16 @@ st.markdown(
 st.markdown('''
 - Take raw text and convert every character and word into a single case
 - either upper case
+
 - or lower case
+
 - based on the problem statement
+
 - Because as the dimensionality increases, ML performance decreases, since ML needs tabular data where every column is a dimension
 - The same applies to URLs and tags, based on the problem statement
+
 - if the problem statement says to preserve the data, we shouldn't remove those URLs and tags
+
 - Mentions, digits and email addresses can be removed
 - Emojis shouldn't be removed, because nowadays emojis play a key role in conveying information; to preserve that information we won't remove them
 - When the problem statement says to preserve the grammar, punctuation shouldn't be removed
@@ -86,8 +91,10 @@ st.markdown(
 '''
 <div class='section'>
 Converts raw data into pre-processed data
-- which has 2 benefits
+- which has 2 benefits
+
 - Reduce the dimensionality ---> to increase the performance of ML
+
 - Raw data - preprocessed data ---> required by the problem statement
 <ul>
 <li><b>Converting into a particular case</b> So that we can greatly reduce the dimensionality. If the problem statement says that grammar should be preserved, then no conversion is needed</li>
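The rules described in the page text touched by this diff (single case, optional removal of URLs and tags, removal of mentions, digits and email addresses, emojis kept, punctuation kept only when grammar matters) could be sketched roughly as below. This is a minimal illustration and not code from the commit: the `preprocess` function, its parameter names and the regular expressions are all assumptions, and "tags" is assumed here to mean HTML tags.

```python
import re
import string

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")                       # assumes "tags" means HTML tags
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")
# ASCII punctuation only, so emojis are never touched by this table.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, lowercase=True, keep_urls_and_tags=False,
               keep_punctuation=False):
    """Clean raw text according to flags derived from the problem statement."""
    if not keep_urls_and_tags:
        text = URL_RE.sub(" ", text)
        text = TAG_RE.sub(" ", text)
    # Emails are removed before mentions so "@domain" is not left behind.
    text = EMAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    if not keep_punctuation:
        text = text.translate(PUNCT_TABLE)
    if lowercase:
        text = text.lower()
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


sample = "Check https://example.com <b>NOW</b> @bob, mail me at a.b@mail.com by 5pm 😀!"
print(preprocess(sample))  # -> check now mail me at by pm 😀
```

Passing `keep_punctuation=True` and `lowercase=False` corresponds to the "preserve the grammar" case. Preserving URLs while still stripping digits or punctuation would need the URLs masked first, which this sketch does not attempt.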