Spaces: Harika22/Natural_Language_Processing

Harika22 committed on Feb 1, 2025

Commit 7b6ca16 (verified) · 1 Parent(s): 3ae9d34

Update pages/5_Pre-procesing_of_text.py

Files changed (1)

pages/5_Pre-procesing_of_text.py +46 -0

@@ -55,3 +55,49 @@ st.markdown(
     unsafe_allow_html=True,
 )

+st.header(":blue[Pre-processing of Text🗺️]")
+st.markdown(
+    '''
+    <div class='section'>
+        We will convert raw data into pre-processed data in 3 ways:
+        <ul>
+            <li><b>Cleaning:</b> driven by the problem statement</li>
+            <li><b>Simple pre-processing</b></li>
+            <li><b>Advanced pre-processing</b></li>
+        </ul>
+    </div>
+    ''',
+    unsafe_allow_html=True,
+)
+st.markdown('''
+- Take the raw text and convert every character to a single case
+    - either upper case
+    - or lower case
+    - based on the problem statement
+    - Reason: ML needs tabular data where every column is a dimension, and mixed case turns "Apple" and "apple" into separate columns; as the dimensionality increases, ML performance tends to decrease
+- URLs and tags are likewise handled according to the problem statement
+    - if the problem statement says to preserve the data, we shouldn't remove those URLs and tags
+- Mentions, digits, and mail addresses can be removed
+- Emojis should not be removed: they now carry a key part of the information, so we keep them to preserve it
+- When the problem statement says to preserve the grammar, punctuation shouldn't be removed
+''')
+st.subheader(":red[Data Pre-processing]")
+st.markdown(
+    '''
+    <div class='section'>
+        Data pre-processing converts raw data into pre-processed data, which has 2 benefits:
+        <ul>
+            <li>It reduces the dimensionality, which improves ML performance</li>
+            <li>It turns the raw data into the pre-processed form required by the problem statement</li>
+        </ul>
+        The steps are:
+        <ul>
+            <li><b>Converting to a single case:</b> This greatly reduces the dimensionality. If the problem statement says the grammar should be preserved, no conversion is needed</li>
+            <li><b>Removing URLs / tags / mails / mentions:</b> Whether to remove or preserve this information depends on the problem statement</li>
+            <li><b>Handling emojis:</b> Emoji data should be preserved</li>
+            <li><b>Contractions and acronyms:</b> Both contractions and acronyms should be expanded into their full text</li>
+            <li><b>Stop words:</b> Stop words carry the grammar of a sentence, so keep them when the problem statement requires the grammar to be preserved</li>
+            <li><b>Stemming and lemmatization:</b> Both depend entirely on the problem statement; if it requires grammatically valid words, don't perform stemming</li>
+        </ul>
+    </div>
+    ''',
+    unsafe_allow_html=True,
+)
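
The page added in this commit describes the cleaning steps only in prose. A minimal sketch of how they might look in code follows, assuming plain Python with the standard `re` module. The names `clean_text`, `expand_contractions`, `keep_urls`, `keep_tags`, the regex patterns, and the tiny `CONTRACTIONS` table are all hypothetical and are not part of the commit; real data would likely need stricter patterns.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
MAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")
SPACE_RE = re.compile(r"\s+")

# Tiny example table; a real project would use a much larger mapping.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}


def expand_contractions(text, table=CONTRACTIONS):
    """Replace whitespace-separated contractions using a lookup table.

    Tokens with attached punctuation (e.g. "can't,") are left unchanged.
    """
    return " ".join(table.get(word.lower(), word) for word in text.split())


def clean_text(text, lowercase=True, keep_urls=False, keep_tags=False):
    """Apply the cleaning steps; the flags stand in for the problem statement."""
    if not keep_urls:
        text = URL_RE.sub(" ", text)
    if not keep_tags:
        text = TAG_RE.sub(" ", text)
    # Mails are removed before mentions because both contain "@".
    text = MAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Note: this also strips digits inside any URLs that were kept.
    text = DIGIT_RE.sub(" ", text)
    if lowercase:
        text = text.lower()
    # Emojis and punctuation are never touched, so they are preserved.
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    raw = "Hello @bob visit https://example.com <b>NOW</b> 😀 call 123"
    print(clean_text(raw))
    print(expand_contractions("I can't go"))
```

The boolean flags reflect the point made on the page that each removal should follow the problem statement rather than being applied unconditionally; emojis and punctuation are left alone by default for the same reason.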