Spaces:

Harika22
/

Natural_Language_Processing

Harika22 committed on Feb 1

Commit

9c5b037

1 Parent(s): dfceb96

Update pages/6_Feature_Engineering.py

pages/6_Feature_Engineering.py +30 -2

@@ -190,5 +190,33 @@ if file_type == "One-Hot Vectorization":
         -  This method is useful for transforming text into a numerical format for Machine Learning tasks.
     """)

+    st.subheader(":green[Advantages]")
+    st.markdown('''
+    - One-Hot Vectorization is easy to implement
+    ''')
+    st.subheader(":green[Disadvantages]")
+    st.markdown('''
+    - 1.Every document has a different number of words (here we're converting each word to a vector, not each document to a vector)
+        - So we can't convert the data into tabular form
+        - Converting to tabular data becomes possible when each document is converted into a single vector (this is solved by Bag of Words (BoW))
+    - 2.**Sparsity** - One-hot vectorization produces sparse vectors (every entry is zero except one)
+        - The entire dataset is given to the algorithm, and because the data is sparse, what the machine learns is biased towards zero values
+        - In ML this tends to show up as overfitting
+        - Deep learning mitigates this with dense, learned embeddings
+    - 3.**Curse of Dimensionality**
+        - Documents increase ↑, vocabulary increases ↑, vector length increases ↑, so dimensionality also increases ↑
+        - ML performance decreases ↓, because dimensionality depends entirely on vocabulary size, which shoots up as more and more varied documents are added
+    - 4.**Out of Vocabulary**
+        - The vocabulary is built only at training time, from our own dataset
+        - If a word was not present in the dataset during training, it can't be converted into a vector, which results in a KeyError
+        - This is addressed by fastText, which builds word vectors from character n-grams
+    - 5.**Unable to preserve the semantic meaning of words**
+        - While converting text → vector format, the relationships between words should be preserved
+        - We need to convert documents into vectors in such a way that semantic relationships are preserved
+        - Similarity ⬆️ and Distance ⬇️
+        - Similarity ∝ 1 / Distance
+        - The distance between vectors of similar words should be very small
+        - If this is satisfied, the technique preserves semantic meaning well
+        - One-hot vectors fail this test: any two distinct words are exactly the same distance apart
+    - 6.**No Sequential Information**
+        - Word order is not preserved: each word is encoded independently of its position in the sentence
+    ''')
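
Several of the disadvantages listed in the added text (sparsity, vocabulary-driven dimensionality, out-of-vocabulary words, and the lack of semantic distance) can be reproduced in a few lines of plain Python. This is a minimal sketch, not code from the commit: `build_vocab`, `one_hot`, and `euclidean` are hypothetical helpers, and the toy documents are made up for illustration.

```python
import math


def build_vocab(documents):
    """Map each unique lowercase token to an index, in first-seen order."""
    vocab = {}
    for doc in documents:
        for token in doc.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def one_hot(word, vocab):
    """Return a one-hot list for `word`; raises KeyError for unseen words."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical training documents (illustrative only).
docs = ["the cat sat", "the dog barked loudly"]
vocab = build_vocab(docs)

cat = one_hot("cat", vocab)
dog = one_hot("dog", vocab)
sat = one_hot("sat", vocab)

# Sparsity: all entries but one are zero.
zero_fraction = cat.count(0) / len(cat)

# No semantic meaning: every pair of distinct words is equally far apart,
# so "cat" is no closer to "dog" than it is to "sat".
d_cat_dog = euclidean(cat, dog)
d_cat_sat = euclidean(cat, sat)

# Out of vocabulary: a word never seen while building the vocabulary
# cannot be looked up.
try:
    one_hot("bird", vocab)
    oov_failed = False
except KeyError:
    oov_failed = True

# Curse of dimensionality: adding a document with new words grows the
# vocabulary, and with it the length of every vector.
bigger_vocab = build_vocab(docs + ["a bird sang"])

print(len(vocab), len(bigger_vocab), zero_fraction, d_cat_dog, d_cat_sat, oov_failed)
```

The dictionary lookup in `one_hot` is what produces the `KeyError` mentioned under Out of Vocabulary; a real pipeline would more likely reserve an "unknown" index or switch to a subword-based method.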