Update pages/6_Feature_Engineering.py
pages/6_Feature_Engineering.py
CHANGED
@@ -195,44 +195,43 @@ if file_type == "One-Hot Vectorization":
     - One-Hot Vectorization is easy to implement
     ''')
     st.subheader(":red[Disadvantages]")
-    st.markdown("<h1 class='title'>Challenges in One-Hot Vectorization</h1>", unsafe_allow_html=True)
 
-    st.
+    st.subheader(":blue[Different Document Length]")
     st.markdown(
         "<p class='content'>Every document contains a different number of words. Here, we are not converting the entire document into a vector, but rather each word separately. "
         "This makes it difficult to structure the data into a tabular format. Converting entire documents into vectors, which is addressed by Bag of Words (BOW), solves this issue.</p>",
         unsafe_allow_html=True,
     )
 
-    st.
+    st.subheader(":blue[Sparsity]")
     st.markdown(
         "<p class='content'>The vectors created using one-hot encoding tend to be sparse. When data is given to any algorithm, the model may become biased towards zero values, "
         "leading to an issue in machine learning known as overfitting. This problem is primarily addressed in deep learning.</p>",
         unsafe_allow_html=True,
     )
 
-    st.
+    st.subheader(":blue[Curse of Dimensionality]")
     st.markdown(
         "<p class='content'>As the number of documents increases, the vocabulary size grows, leading to an increase in dimensionality. This negatively impacts machine learning performance "
         "because the dimensionality of vectors is directly dependent on vocabulary size, which grows as more documents are introduced.</p>",
         unsafe_allow_html=True,
     )
 
-    st.
+    st.subheader(":blue[Out of Vocabulary Issue]")
     st.markdown(
         "<p class='content'>One-hot encoding only converts words that were present in the dataset at the time of training. If a new word appears during inference and was not included in the "
         "training dataset, it cannot be converted into a vector, causing a key error. This issue is effectively solved by FastText.</p>",
         unsafe_allow_html=True,
     )
 
-    st.
+    st.subheader(":blue[Inability to Preserve Semantic Meaning]")
     st.markdown(
         "<p class='content'>When converting text into vector format, the relationships between words should be preserved. Ideally, similar words should be represented by similar vectors, "
         "meaning the distance between their vectors should be small. If this is achieved, the vectorization method successfully preserves semantic meaning.</p>",
         unsafe_allow_html=True,
     )
 
-    st.
+    st.subheader(":blue[Lack of Sequential Information]")
     st.markdown(
         "<p class='content'>One-hot encoding does not preserve sequential information in text. The order of words, which is crucial in natural language, is completely lost in this encoding method.</p>",
         unsafe_allow_html=True,
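The drawbacks described in the added copy (sparsity, vocabulary-sized dimensionality, and the out-of-vocabulary key error) can be demonstrated with a minimal sketch. This toy vectorizer is not part of the PR; the documents and helper names are made up for illustration.

```python
# Toy one-hot vectorizer illustrating the disadvantages listed above.

def build_vocab(docs):
    """Map each unique word to an index; vector length equals vocabulary size."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Return the one-hot vector for a word; raises KeyError for unseen words."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1  # KeyError here if `word` was not seen during training
    return vec

docs = ["the cat sat", "the dog ran"]
vocab = build_vocab(docs)

# Sparsity: exactly one 1, every other entry is 0.
print(one_hot("cat", vocab))   # [0, 1, 0, 0, 0]

# Curse of dimensionality: vector length tracks vocabulary size,
# which keeps growing as more documents are added.
print(len(vocab))              # 5

# Out-of-vocabulary issue: an unseen word cannot be encoded.
try:
    one_hot("bird", vocab)
except KeyError:
    print("OOV word 'bird' has no vector")
```

Note also that a one-hot lookup carries no notion of word order or similarity, which is the basis of the "semantic meaning" and "sequential information" points in the diff.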