Harika22 committed · Commit 7b6ca16 (verified) · 1 Parent(s): 3ae9d34

Update pages/5_Pre-procesing_of_text.py

  1. pages/5_Pre-procesing_of_text.py +46 -0
pages/5_Pre-procesing_of_text.py CHANGED
@@ -55,3 +55,49 @@ st.markdown(
55
  unsafe_allow_html=True,
56
  )
57
 
58
+ st.header(":blue[Pre-processing of Text🗺️]")
59
+ st.markdown(
60
+ '''
61
+ <div class='section'>
62
+ We will convert raw data into pre-processed data in 3 ways:
63
+ - **Cleaning** ---> which is based on the problem statement
64
+ - **Simple pre-processing**
65
+ - **Advanced pre-processing**
66
+ </div>
67
+ ''',
68
+ unsafe_allow_html=True,
69
+ )
70
+ st.markdown('''
71
+ - Take the raw text and convert every character and word into a single case
72
+ - either upper case
73
+ - or lower case
74
+ - based on the problem statement
75
+ - Because ML performance decreases as the dimensionality increases: ML needs tabular data in which every column is a dimension, and the same word in different cases would otherwise occupy separate columns
76
+ - The same applies to URLs and tags: handle them based on the problem statement
77
+ - if the problem statement says to preserve the data, we shouldn't remove those URLs and tags
78
+ - Mentions, digits and mails can be removed
79
+ - Emojis, however, shouldn't be removed: nowadays emojis play a key role in conveying information, so we keep them to preserve that information
80
+ - When the problem statement says to preserve the grammar, punctuation shouldn't be removed
81
+ ''')
82
+
83
+
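The cleaning rules listed in the markdown above can be sketched in plain Python. This is a minimal illustration, not the author's implementation: the function name `clean_text`, its flags, and the regular expressions are all assumptions made for the example, and the patterns are intentionally loose.

```python
import re
import string

# Hypothetical, intentionally loose patterns (assumptions for illustration).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
MAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")

# Translation table that deletes ASCII punctuation only, so emojis survive.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, lowercase=True, remove_urls_and_tags=True,
               remove_punctuation=True):
    """Apply the cleaning steps; the flags stand in for the problem statement."""
    if remove_urls_and_tags:
        text = URL_RE.sub(" ", text)
        text = TAG_RE.sub(" ", text)
    # Mails are removed before mentions so an address is not half-matched.
    text = MAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    if remove_punctuation:
        text = text.translate(PUNCT_TABLE)
    if lowercase:
        text = text.lower()
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())


raw = "Hello @bob! Visit https://example.com <b>NOW</b> 😀 or mail me@site.org 123"
print(clean_text(raw))  # hello visit now 😀 or mail
```

Note that keeping URLs while still stripping digits and punctuation would mangle them, so a real pipeline that preserves URLs would need to protect those spans first.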
84
+ st.subheader(":red[Data Pre-processing]")
85
+ st.markdown(
86
+ '''
87
+ <div class='section'>
88
+ Converts raw data into pre-processed data
89
+ - which has 2 benefits:
90
+ - Reduces the dimensionality ---> to increase the performance of ML
91
+ - Shapes the raw data into the pre-processed form required by the problem statement
92
+ <ul>
93
+ <li><b>Converting into a particular case:</b> This greatly reduces the dimensionality. If the problem statement says that grammar should be preserved, no conversion is needed</li>
94
+ <li><b>Removing URLs / tags / mails / mentions:</b> Whether to remove or preserve this information should be based on the problem statement</li>
95
+ <li><b>Handling emojis:</b> Emoji data should be preserved</li>
96
+ <li><b>Contractions and acronyms:</b> Both contractions and acronyms should be expanded into general text</li>
97
+ <li><b>Stop words:</b> Stop words mainly carry the grammar of a sentence, so keep them when the problem statement needs the grammar preserved</li>
98
+ <li><b>Stemming and lemmatization:</b> Both depend purely on the problem statement; if the problem statement needs the grammar preserved, don't perform stemming</li>
99
+ </ul>
100
+ </div>
101
+ ''',
102
+ unsafe_allow_html=True,
103
+ )
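The later steps in this list (expanding contractions and acronyms, dropping stop words, stemming) might look roughly like the sketch below. Everything here is an assumption for illustration: the lookup tables are deliberately tiny, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as NLTK's `PorterStemmer`, not an implementation of one.

```python
# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
ACRONYMS = {"asap": "as soon as possible", "fyi": "for your information"}
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "as"}


def expand(text):
    """Replace contractions and acronyms with their general-text form."""
    words = []
    for word in text.lower().split():
        word = CONTRACTIONS.get(word, word)
        word = ACRONYMS.get(word, word)
        words.append(word)
    return " ".join(words)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def advanced_preprocess(text, preserve_grammar=False):
    """Expand always; skip stop-word removal and stemming if grammar matters."""
    tokens = expand(text).split()
    if preserve_grammar:
        return tokens
    return [crude_stem(t) for t in remove_stop_words(tokens)]


print(advanced_preprocess("It's raining and I can't reply asap"))
# ['rain', 'i', 'cannot', 'reply', 'soon', 'possible']
```

The `preserve_grammar` flag mirrors the rule above: when the problem statement needs the grammar intact, stop words stay and no stemming is applied.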