Update pages/6_Feature_Engineering.py
pages/6_Feature_Engineering.py
@@ -205,8 +205,8 @@ if file_type == "One-Hot Vectorization":
 
     st.subheader(":blue[Sparsity]")
     st.markdown('''
-    - The vector which is created using one-
-    - Entire data is given to any alogorithm and machine is going to learn fom data and algorithm it is
+    - The vector created using one-hot vectorization is a sparse vector
+    - When the entire dataset is given to an algorithm, the model it learns is biased towards zero values because the data is sparse
     - This issue in ML is known as overfitting
     - It is solved in Deep learning
     ''')
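To make the sparsity point concrete, here is a minimal sketch (the ten-word vocabulary is illustrative, not taken from the page): each one-hot vector carries a single 1, so almost all of it is zeros.

```python
import numpy as np

# One-hot vectors over a 10-word vocabulary: one 1 and nine 0s per vector.
vocab = ["biryani", "is", "good", "not", "too", "costly", "tasty", "very", "hot", "spicy"]
one_hot = np.eye(len(vocab), dtype=int)          # one row (vector) per word

word_vector = one_hot[vocab.index("biryani")]
print(word_vector)                               # [1 0 0 0 0 0 0 0 0 0]
print(1 - word_vector.sum() / word_vector.size)  # 0.9 -> 90% of entries are zero
```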
@@ -298,3 +298,61 @@ elif file_type == "Bag of Words(BOW)":
         "<p class='content'>Since all three vectors have the same number of dimensions, we can merge them into a tabular format:</p>",
         unsafe_allow_html=True,
     )
+
+    st.subheader(":red[Advantages]")
+    st.markdown('''
+    - Bag of Words (BOW) is easy to implement
+    - It lets us convert the text data into tabular form (a sketch of this follows the code block at the end)
+    ''')
+
+    st.subheader(":red[Disadvantages]")
+
+    st.subheader(":blue[Curse of Dimensionality]")
+    st.markdown('''
+    - As documents increase ↑, the vocabulary grows ↑ and so the vector dimensionality also increases ↑
+    - ML performance decreases ↓, since the dimensionality depends entirely on the vocabulary, which shoots up as more and more varied documents are added
+    - In short: as the corpus grows, the vocabulary grows, and the dimensionality grows with it
+    ''')
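A small sketch of this growth, using scikit-learn's CountVectorizer on a made-up corpus: every new document can add new words, so the vector dimensionality keeps climbing.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["biryani is good",
        "biryani is not good",
        "biryani is too costly",
        "pizza was very tasty"]

# Refit on progressively larger corpora and watch the dimensionality grow.
for n in range(1, len(docs) + 1):
    cv = CountVectorizer().fit(docs[:n])
    print(n, "documents ->", len(cv.vocabulary_), "dimensions")
# 1 documents -> 3 dimensions
# 2 documents -> 4 dimensions
# 3 documents -> 6 dimensions
# 4 documents -> 10 dimensions
```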
+
+
+    st.subheader(":blue[Sparsity]")
+    st.markdown('''
+    - The vector created using BOW is a sparse vector
+    - When the entire dataset is given to an algorithm, the model it learns is biased towards zero values because the data is sparse
+    - This issue in ML is known as overfitting
+    - It is solved in Deep learning
+    ''')
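The sparsity can be measured directly on the count matrix. A minimal sketch follows; with this tiny corpus the ratio is modest, but on real corpora with thousands of vocabulary words it approaches 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["biryani is is is good", "biryani is not good", "biryani is too costly"]
X = CountVectorizer().fit_transform(docs)        # stored as a scipy sparse matrix
sparsity = 1 - X.nnz / (X.shape[0] * X.shape[1])
print(X.shape, round(sparsity, 2))               # (3, 6) 0.39
```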
+
+    st.subheader(":blue[Out of Vocabulary Issue]")
+    st.markdown('''
+    - The vocabulary is learned only at training time, from our own dataset
+    - If a word was not seen during training, it cannot be converted into vector form (lookup-based models raise a KeyError)
+    - This is solved by FastText
+    ''')
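A quick demonstration of the issue. One nuance worth noting: scikit-learn's CountVectorizer silently drops unseen words rather than raising an error; the KeyError shows up with lookup-based models such as gensim's Word2Vec. Either way, the word's information is lost.

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer().fit(["biryani is good", "biryani is not good"])
print(cv.get_feature_names_out())    # ['biryani' 'good' 'is' 'not']

# "delicious" was never seen during fit, so it simply gets no column.
print(cv.transform(["biryani is delicious"]).toarray())   # [[1 0 1 0]]
```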
+
+    st.subheader(":blue[Inability to Preserve Semantic Meaning]")
+    st.markdown('''
+    - It can't completely preserve semantic meaning (it only slightly preserves it)
+    - Because it is based on counts (the number of times a particular word occurs), it can sometimes preserve semantic meaning
+    - How well meaning is preserved depends on the uniqueness of the words
+    - The more unique words two documents have, the farther apart they will be
+    - The fewer unique words, the closer they will be to each other
+    ''')
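A sketch of how count overlap drives this notion of closeness (documents made up for the demo): the first two reviews share most of their words and score a high cosine similarity despite having opposite meanings, while the unrelated third document scores 0.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["biryani is good", "biryani is not good", "the movie was long"]
X = CountVectorizer().fit_transform(docs)
print(cosine_similarity(X).round(2))
# [[1.   0.87 0.  ]
#  [0.87 1.   0.  ]
#  [0.   0.   1.  ]]
```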
+
+    st.subheader(":blue[Lack of Sequential Information]")
+    st.markdown('''
+    - The order of the words (sequential information) is not preserved
+    ''')
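Because only counts survive, sentences with opposite word order (and opposite meaning) collapse to the same vector; a minimal sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good not bad biryani", "bad not good biryani"]
X = CountVectorizer().fit_transform(docs).toarray()
print((X[0] == X[1]).all())   # True -> identical bag-of-words vectors
```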
+
+    st.code('''
+    import pandas as pd
+    from sklearn.feature_extraction.text import CountVectorizer
+
+    corpus = pd.DataFrame({"Review": ["biryani is is is good", "biryani is not good", "biryani is too costly"]})
+
+    ## create an object of the CountVectorizer class
+    ## (stp is the stop-word list defined earlier in the page)
+    cv = CountVectorizer(lowercase=True, strip_accents="unicode", analyzer="word",
+                         stop_words=stp, token_pattern=r"(?u)\b\w\w+\b")
+    cv.fit(corpus["Review"])                 ### learns the vocabulary
+    vector = cv.transform(corpus["Review"])  ### converts each review into a count vector using the learned vocabulary
+    vector.toarray()
+    ''')
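As a follow-up to the Advantages section, here is a sketch of the tabular view of the count matrix. It uses scikit-learn's built-in English stop-word list in place of the page's stp (an assumption, since stp is defined elsewhere in the file).

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = pd.DataFrame({"Review": ["biryani is is is good",
                                  "biryani is not good",
                                  "biryani is too costly"]})
cv = CountVectorizer(stop_words="english")   # stand-in for the page's stp
vector = cv.fit_transform(corpus["Review"])

# Label each column with its vocabulary word to get a table.
print(pd.DataFrame(vector.toarray(), columns=cv.get_feature_names_out()))
#    biryani  costly  good
# 0        1       0     1
# 1        1       0     1
# 2        1       1     0
```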
|