Harika22 committed on
Commit df9cb07 · verified · 1 Parent(s): c0f6e9a

Update pages/6_Feature_Engineering.py

Files changed (1)
  1. pages/6_Feature_Engineering.py +6 -7
pages/6_Feature_Engineering.py CHANGED
@@ -195,44 +195,43 @@ if file_type == "One-Hot Vectorization":
     - One-Hot Vectorization is easy to implement
     ''')
     st.subheader(":red[Disadvantages]")
-    st.markdown("<h1 class='title'>Challenges in One-Hot Vectorization</h1>", unsafe_allow_html=True)
 
-    st.markdown("<h2 class='subtitle'>Different Document Lengths</h2>", unsafe_allow_html=True)
+    st.subheader(":blue[Different Document Lengths]")
     st.markdown(
         "<p class='content'>Every document contains a different number of words. Here, we are not converting the entire document into a vector, but rather each word separately. "
         "This makes it difficult to structure the data into a tabular format. Converting entire documents into vectors, which is addressed by Bag of Words (BOW), solves this issue.</p>",
         unsafe_allow_html=True,
     )
 
-    st.markdown("<h2 class='subtitle'>Sparsity</h2>", unsafe_allow_html=True)
+    st.subheader(":blue[Sparsity]")
     st.markdown(
         "<p class='content'>The vectors created using one-hot encoding tend to be sparse. When data is given to any algorithm, the model may become biased towards zero values, "
         "leading to an issue in machine learning known as overfitting. This problem is primarily addressed in deep learning.</p>",
         unsafe_allow_html=True,
     )
 
-    st.markdown("<h2 class='subtitle'>Curse of Dimensionality</h2>", unsafe_allow_html=True)
+    st.subheader(":blue[Curse of Dimensionality]")
     st.markdown(
         "<p class='content'>As the number of documents increases, the vocabulary size grows, leading to an increase in dimensionality. This negatively impacts machine learning performance "
         "because the dimensionality of vectors is directly dependent on vocabulary size, which grows as more documents are introduced.</p>",
         unsafe_allow_html=True,
     )
 
-    st.markdown("<h2 class='subtitle'>Out of Vocabulary Issue</h2>", unsafe_allow_html=True)
+    st.subheader(":blue[Out of Vocabulary Issue]")
     st.markdown(
         "<p class='content'>One-hot encoding only converts words that were present in the dataset at the time of training. If a new word appears during inference and was not included in the "
         "training dataset, it cannot be converted into a vector, causing a key error. This issue is effectively solved by FastText.</p>",
         unsafe_allow_html=True,
     )
 
-    st.markdown("<h2 class='subtitle'>Inability to Preserve Semantic Meaning</h2>", unsafe_allow_html=True)
+    st.subheader(":blue[Inability to Preserve Semantic Meaning]")
     st.markdown(
         "<p class='content'>When converting text into vector format, the relationships between words should be preserved. Ideally, similar words should be represented by similar vectors, "
         "meaning the distance between their vectors should be small. If this is achieved, the vectorization method successfully preserves semantic meaning.</p>",
         unsafe_allow_html=True,
     )
 
-    st.markdown("<h2 class='subtitle'>Lack of Sequential Information</h2>", unsafe_allow_html=True)
+    st.subheader(":blue[Lack of Sequential Information]")
     st.markdown(
         "<p class='content'>One-hot encoding does not preserve sequential information in text. The order of words, which is crucial in natural language, is completely lost in this encoding method.</p>",
         unsafe_allow_html=True,
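
For readers following the page's explanation rather than the diff itself, the sketch below (not part of this commit; the corpus and the `one_hot` helper are illustrative assumptions) shows word-level one-hot encoding exhibiting the drawbacks the updated copy lists: mostly-zero vectors, dimensionality equal to vocabulary size, equidistant vectors with no semantic similarity, and a KeyError for out-of-vocabulary words.

```python
# Illustrative sketch only: the corpus, names, and printed values are
# assumptions, not code from pages/6_Feature_Engineering.py.
corpus = ["the cat sat", "the dog sat"]

# The vocabulary, and therefore the vector length, grows with every new
# document: the curse of dimensionality described above.
vocab = {word: idx
         for idx, word in enumerate(sorted({w for doc in corpus
                                            for w in doc.split()}))}

def one_hot(word):
    """One-hot encode a single word; unseen words raise KeyError."""
    vec = [0] * len(vocab)  # a single 1 among len(vocab) zeros: sparsity
    vec[vocab[word]] = 1
    return vec

print(one_hot("cat"))  # [1, 0, 0, 0] for the 4-word vocabulary above
print(one_hot("dog"))  # [0, 1, 0, 0]; every pair of one-hot vectors is
                       # equally distant, so no semantic meaning is preserved

try:
    one_hot("bird")    # never seen during "training"
except KeyError as err:
    print("out-of-vocabulary word:", err)  # the key error the page mentions
```

As for the edit itself, `st.subheader(":blue[...]")` uses Streamlit's built-in colored-text markdown, which keeps these headings consistent with the existing `:red[Disadvantages]` subheader and removes several uses of raw HTML with `unsafe_allow_html=True`.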