Update pages/6_Feature_Engineering.py
pages/6_Feature_Engineering.py
CHANGED
|
@@ -190,5 +190,33 @@ if file_type == "One-Hot Vectorization":
|
|
| 190 |
- This method is useful for transforming text into a numerical format for Machine Learning tasks.
|
| 191 |
""")
|
| 192 |
|
| 193 |
-
|
| 194 |
-
| 190 |
- This method is useful for transforming text into a numerical format for Machine Learning tasks.
|
| 191 |
""")
|
| 192 |
|
| 193 |
+
st.subheader(":green[Advantages]")
|
| 194 |
+
st.markdown('''
|
| 195 |
+
- One-Hot Vectorization is easy to implement
|
| 196 |
+
''')
|
| 197 |
+
st.subheader(":green[Disadvantages]")
|
| 198 |
+
st.markdown('''
|
| 199 |
+
- 1.Every document has a different number of words (here we are converting each word to a vector, not the whole document to a vector)
|
| 200 |
+
- The result can't be converted into tabular data, because each document yields a different number of word vectors
|
| 201 |
+
- Conversion into tabular data becomes possible when each document is converted into a single vector (this is solved by Bag of Words (BoW))
|
| 202 |
+
- 2.**Sparsity** - One-hot vectorization produces sparse vectors (almost every entry is zero)
|
| 203 |
+
- When the entire dataset is given to an algorithm, the model learns from data that is mostly zeros, so it becomes biased towards zero values
|
| 204 |
+
- In ML, this issue can lead to overfitting
|
| 205 |
+
- It is solved in Deep learning
|
| 206 |
+
- 3.**Curse of Dimensionality**
|
| 207 |
+
- As the number of documents increases, the vocabulary grows, the vectors get longer, and so the dimensionality also increases
|
| 208 |
+
- ML performance decreases, because the dimensionality depends entirely on the vocabulary size, which shoots up as more documents with different words are added
|
| 209 |
+
- 4.**Out of Vocabulary**
|
| 210 |
+
- The vocabulary is built only at training time, from the dataset we provide
|
| 211 |
+
- If a word was not present in the training dataset, it can't be converted into vector format, which results in a key error
|
| 212 |
+
- This is solved by fastText
|
| 213 |
+
- 5.**Unable to preserve the semantic meaning of words**
|
| 214 |
+
- While converting text → vector format, the same relationships between words should be preserved
|
| 215 |
+
- We need to convert a document into a vector in such a way that semantic relationships are preserved
|
| 216 |
+
- As similarity increases, distance decreases
|
| 217 |
+
- Similarity ∝ 1 / Distance
|
| 218 |
+
- The distance between the vectors of similar words should be very small
|
| 219 |
+
- If this is satisfied, then the technique preserves semantic meaning well
|
| 220 |
+
- 6.**No Sequential information**
|
| 221 |
+
- Sequential information is not preserved
|
| 222 |
+
''')
|
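A minimal sketch of three of the disadvantages listed in this change (sparsity, out of vocabulary, and lost semantic meaning), assuming a simple whitespace tokenizer and a toy three-document corpus. The helper names `build_vocabulary`, `one_hot`, and `euclidean` are hypothetical and are not part of `pages/6_Feature_Engineering.py`; this is an illustration, not the page's implementation.

```python
import math


def build_vocabulary(documents):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def one_hot(word, vocabulary):
    """Return the one-hot vector for `word`; raises KeyError for unseen words."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy corpus, used only for illustration.
train_docs = ["good movie", "great movie", "bad plot"]
vocab = build_vocabulary(train_docs)

# Sparsity and dimensionality: each vector has a single 1, and its length
# equals the vocabulary size, so it grows with every new distinct word.
good = one_hot("good", vocab)
great = one_hot("great", vocab)
bad = one_hot("bad", vocab)

# No semantic meaning: any two distinct words are the same distance apart,
# so "good" is no closer to "great" than it is to "bad".
d_good_great = euclidean(good, great)
d_good_bad = euclidean(good, bad)

# Out of vocabulary: a word never seen during training has no index.
try:
    one_hot("excellent", vocab)
    oov_failed = False
except KeyError:
    oov_failed = True
```

In this sketch the distance between any two distinct one-hot vectors is the same, which is why the "Similarity ∝ 1 / Distance" condition cannot be met by this technique.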