Update pages/6_Feature_Engineering.py
pages/6_Feature_Engineering.py
CHANGED
@@ -703,3 +703,67 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
         unsafe_allow_html=True,
     )
 
+    st.subheader(":red[Advantages]")
+    st.markdown('''
+    - Easy to implement
+    - Can be converted into a tabular format (see the sketch below)
+    - It gives importance to both frequently occurring and rarely occurring words in the corpus
+    ''')
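+    # A minimal sketch (assumption, not from the original page): the TF-IDF output
+    # can be laid out as a table with one row per document and one column per word.
+    st.code('''
+    import pandas as pd
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    tf = TfidfVectorizer()
+    X = tf.fit_transform(["biryani is good", "biryani is too costly"])
+    print(pd.DataFrame(X.toarray(), columns=tf.get_feature_names_out()))
+    ''')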
+
+    st.subheader(":red[Disadvantages]")
+
+    st.subheader(":blue[Curse of Dimensionality]")
+    st.markdown('''
+    - Documents increase ↑, so the vocabulary increases ↑, so the dimensionality of each vector increases ↑
+    - ML performance decreases ↓, because the dimensionality depends entirely on the vocabulary, which shoots up as more and more distinct documents are added
+    - In short: as the corpus grows, the vocabulary grows, and the dimensionality grows with it (see the sketch below)
+    ''')
+
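+    # A minimal sketch (assumption, not from the original page): adding documents
+    # grows the vocabulary, and with it the dimensionality of every vector.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    small = ["biryani is good"]
+    large = small + ["biryani is not good", "biryani is too costly"]
+    print(TfidfVectorizer().fit_transform(small).shape)  # (1, 3)
+    print(TfidfVectorizer().fit_transform(large).shape)  # (3, 6) - vocabulary doubled
+    ''')
+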
+    st.subheader(":blue[Sparsity]")
+    st.markdown('''
+    - The vectors created using BOW/TF-IDF are sparse (mostly zeros)
+    - When the entire dataset is given to an algorithm, the machine learns from that data, and because the data is sparse the model becomes biased towards the zero values
+    - In ML this issue is known as overfitting
+    - It is solved in deep learning
+    ''')
+
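+    # A minimal sketch (assumption, not from the original page): count the fraction
+    # of zero entries to see how sparse the TF-IDF matrix is.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    corpus = ["biryani is good", "biryani is not good", "biryani is too costly"]
+    X = TfidfVectorizer().fit_transform(corpus).toarray()
+    print((X == 0).mean())  # fraction of zero entries in the matrix
+    ''')
+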
+    st.subheader(":blue[Out of Vocabulary Issue]")
+    st.markdown('''
+    - Documents are only converted during training time, on the dataset we supply
+    - If a word was not present in the dataset during training, it cannot be converted into vector format (a manual vocabulary lookup would raise a key error)
+    - This is solved by FastText
+    ''')
+
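+    # A minimal sketch (assumption, not from the original page): scikit-learn
+    # silently drops unseen words at transform time, so a review made entirely of
+    # out-of-vocabulary words becomes an all-zero vector.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    tf = TfidfVectorizer().fit(["biryani is good", "biryani is too costly"])
+    print(tf.transform(["pizza was tasty"]).toarray())  # all zeros: every word is OOV
+    ''')
+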
+    st.subheader(":blue[Inability to Preserve Semantic Meaning]")
+    st.markdown('''
+    - TF-IDF only slightly preserves semantic meaning (see the sketch below)
+    ''')
+
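+    # A minimal sketch (assumption, not from the original page): synonyms such as
+    # "good" and "tasty" get separate, unrelated columns, so documents with similar
+    # meanings do not end up with similar vectors.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    tf = TfidfVectorizer()
+    tf.fit(["biryani is good", "biryani is tasty"])
+    print(tf.vocabulary_)  # "good" and "tasty" are independent dimensions
+    ''')
+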
+    st.subheader(":blue[Lack of Sequential Information]")
+    st.markdown('''
+    - Sequential information is not preserved (see the sketch below)
+    - This is because in TF-IDF we give importance to individual words, since we are doing word tokenization
+    - No ML algorithm is capable of preserving sequential information
+    - This is only truly solved by deep learning
+    - But by applying a trick to BOW/BBOW/TF-IDF we can slightly preserve sequential information
+    - That technique is known as n-gram
+    ''')
+
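+    # A minimal sketch (assumption, not from the original page): two reviews with
+    # the same words in a different order get identical TF-IDF vectors.
+    st.code('''
+    import numpy as np
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    X = TfidfVectorizer().fit_transform(["biryani not good", "good biryani not"]).toarray()
+    print(np.allclose(X[0], X[1]))  # True: word order is lost
+    ''')
+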
+    st.header(":red[n-gram]")
+    st.markdown('''
+    - The default in BOW/BBOW/TF-IDF is always 1-gram (single words)
+    - The vocabulary is created based only on the chosen n-gram range
+    - n-grams are mostly used only up to 1-, 2- or 3-grams, because as the dimension increases, ML performance decreases
+    - n-gram is used to slightly preserve sequential information, as sketched after the code below
+    ''')
+
+    st.code('''
+    import pandas as pd
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    corpus = pd.DataFrame({"Review": ["biryani is is is is résume is good",
+                                      "biryani biryani biryani is not good",
+                                      "biryani is too costly"]})
+    tf = TfidfVectorizer()
+
+    # learn the vocabulary and convert each review into a TF-IDF vector
+    vector = tf.fit_transform(corpus["Review"])
+    print(vector.toarray())   # dense document-term matrix
+    print(tf.vocabulary_)     # word -> column index mapping
+    ''')
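+
+    # A minimal sketch (assumption, not from the original page): ngram_range=(1, 2)
+    # adds bigrams such as "not good" to the vocabulary, which is how n-grams
+    # slightly preserve word order.
+    st.code('''
+    from sklearn.feature_extraction.text import TfidfVectorizer
+
+    tf = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
+    tf.fit(["biryani is not good", "biryani is good"])
+    print(tf.vocabulary_)  # includes "biryani is", "is not", "not good", ...
+    ''')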
|