Harika22 committed on
Commit
c0f6e9a
·
verified ·
1 Parent(s): 9c5b037

Update pages/6_Feature_Engineering.py

Files changed (1)
  1. pages/6_Feature_Engineering.py +44 -27
pages/6_Feature_Engineering.py CHANGED
@@ -190,33 +190,50 @@ if file_type == "One-Hot Vectorization":
     - This method is useful for transforming text into a numerical format for Machine Learning tasks.
     """)
 
-    st.subheader(":green[Advantages]")
+    st.subheader(":red[Advantages]")
     st.markdown('''
     - One-Hot Vectorization is easy to implement
     ''')
-    st.subheader(":green[Disadvantages]")
-    st.markdown('''
-    - 1.Every document have different no.of words (here we're not converting document to vector , we're converting word to vector)
-    - We can't convert into tabular data
-    - It would be possible to convert into tabular data when we're converting document into vector(this is solved by Bag of Words(BOW))
-    - 2.**Sparsity** - The vector which is created using one-hhot vectorization gives sparse vector
-    - Entire data is given to any alogorithm and machine is going to learn fom data and algorithm it is biasd towards zero values as the data is sparse data
-    - This issue in ML is known as overfitting
-    - It is solved in Deep learning
-    - 3.**Curse of Dimensionality**
-    - Document increases ↑ Vocabulary ↑ and vector increases ↑ dimensionality also increases ↑
-    - Ml performance decreases ↓ - as the dimensionality totally depends on vocabulary and it shootup as the document increases and different
-    - 4.**Out of Vocabulary**
-    - Document only converted during training time and we're giving our own dataset
-    - If the word is not present in our dataset while training it can't convert into vector format results in key error
-    - This is solved by Fasttext
-    - 5.**Unable to preserve semantic meaning of the words
-    - While converting text → vector format (same relationship should be preserved)
-    - We need to convert document into vector in such a way that semantic relationship should be preserved
-    - Similarity ⬆️ and Distance ⬇️
-    - Similarity ∝ 1 / Distance
-    - Distance between vectors should be very small
-    - If this is satisfied then the technique has good semantic meaning
-    - 6.**No Sequential information**
-    - Sequential information is not preserved
-    ''')
+    st.subheader(":red[Disadvantages]")
+    st.markdown("<h1 class='title'>Challenges in One-Hot Vectorization</h1>", unsafe_allow_html=True)
+
+    st.markdown("<h2 class='subtitle'>Different Document Lengths</h2>", unsafe_allow_html=True)
+    st.markdown(
+        "<p class='content'>Every document contains a different number of words. Here, we are not converting the entire document into a vector, but rather each word separately. "
+        "This makes it difficult to structure the data into a tabular format. Converting entire documents into vectors, which is addressed by Bag of Words (BOW), solves this issue.</p>",
+        unsafe_allow_html=True,
+    )
+
+    st.markdown("<h2 class='subtitle'>Sparsity</h2>", unsafe_allow_html=True)
+    st.markdown(
+        "<p class='content'>The vectors created using one-hot encoding tend to be sparse. When data is given to any algorithm, the model may become biased towards zero values, "
+        "leading to an issue in machine learning known as overfitting. This problem is primarily addressed in deep learning.</p>",
+        unsafe_allow_html=True,
+    )
+
+    st.markdown("<h2 class='subtitle'>Curse of Dimensionality</h2>", unsafe_allow_html=True)
+    st.markdown(
+        "<p class='content'>As the number of documents increases, the vocabulary size grows, leading to an increase in dimensionality. This negatively impacts machine learning performance "
+        "because the dimensionality of vectors is directly dependent on vocabulary size, which grows as more documents are introduced.</p>",
+        unsafe_allow_html=True,
+    )
+
+    st.markdown("<h2 class='subtitle'>Out of Vocabulary Issue</h2>", unsafe_allow_html=True)
+    st.markdown(
+        "<p class='content'>One-hot encoding only converts words that were present in the dataset at the time of training. If a new word appears during inference and was not included in the "
+        "training dataset, it cannot be converted into a vector, causing a key error. This issue is effectively solved by FastText.</p>",
+        unsafe_allow_html=True,
+    )
+
+    st.markdown("<h2 class='subtitle'>Inability to Preserve Semantic Meaning</h2>", unsafe_allow_html=True)
+    st.markdown(
+        "<p class='content'>When converting text into vector format, the relationships between words should be preserved. Ideally, similar words should be represented by similar vectors, "
+        "meaning the distance between their vectors should be small. If this is achieved, the vectorization method successfully preserves semantic meaning.</p>",
+        unsafe_allow_html=True,
+    )
+
+    st.markdown("<h2 class='subtitle'>Lack of Sequential Information</h2>", unsafe_allow_html=True)
+    st.markdown(
+        "<p class='content'>One-hot encoding does not preserve sequential information in text. The order of words, which is crucial in natural language, is completely lost in this encoding method.</p>",
+        unsafe_allow_html=True,
+    )