Update stages/feature_engineering.py
stages/feature_engineering.py
CHANGED
|
@@ -1,5 +1,26 @@
|
|
| 1 |
-
import streamlit as st
|
| 2 |
-
|
| 3 |
-
def main():
|
| 4 |
-
    st.title("Feature Engineering")
|
| 5 |
main()
|
|
|
|
| 1 |
+
import streamlit as st
|
| 2 |
+
|
| 3 |
+
def main():
|
| 4 |
+
    st.title("Step 6: Feature Engineering")
|
| 5 |
+
|
| 6 |
+
    st.markdown("""
|
| 7 |
+
### **:mag: What is Text Vectorization?** :bar_chart:
|
| 8 |
+
|
| 9 |
+
**Feature Engineering** for text data mainly involves **Text Vectorization**: transforming unstructured text into numerical vectors that machine learning models can learn from.
|
| 10 |
+
|
| 11 |
+
**:brain: Why is Vectorization Necessary?**
|
| 12 |
+
- **Machine Learning Models Understand Only Numbers**: Algorithms operate on numerical inputs, so raw text must be converted into numbers before a model can process it.
|
| 13 |
+
- **Makes Text Usable for Models**: Vectorization translates the meaning of words, sentences, and documents into a format that models can interpret. This allows them to identify patterns, relationships, and insights.
|
| 14 |
+
|
| 15 |
+
**:bulb: Think about it this way**: Text is like a language, but models only speak in numbers. Vectorization is the translator that helps them communicate effectively!
|
| 16 |
+
|
| 17 |
+
**Common Vectorization Techniques:**
|
| 18 |
+
- **Bag of Words (BoW)**: Counts how often each vocabulary word occurs in a document, producing one count vector per document. Word order is ignored.
|
| 19 |
+
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weights each word by how often it appears in a document and how rare it is across all documents, so words that occur everywhere count for less.
|
| 20 |
+
- **Word Embeddings**: Uses models like **Word2Vec**, **GloVe**, and **FastText** to represent words as dense vectors, so that words used in similar contexts end up with similar vectors.
|
| 21 |
+
|
| 22 |
+
**:key: In Short**: Vectorization is crucial because it transforms raw text into a format that machine learning models can process and understand. Without it, models would have no way to interpret the meaning hidden in the text data.
|
| 23 |
+
""")
|
| 24 |
+
|
| 25 |
+
    st.divider()
|
| 26 |
main()
|
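The Bag of Words and TF-IDF techniques named in the added markdown can be illustrated with a minimal, standard-library-only sketch. This is not part of the commit: the function names `tokenize`, `bag_of_words`, and `tf_idf` and the sample corpus are hypothetical, the tokenizer is deliberately naive, and the weighting assumes the plain textbook form `tf = count / document length` and `idf = log(N / df)`. Libraries typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count vectors), one count vector per document."""
    tokenized = [tokenize(d) for d in docs]
    # Sorted vocabulary gives every term a stable column index.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) using unsmoothed textbook weights."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for vec in counts if vec[i] > 0) for i in range(len(vocab))]
    vectors = []
    for vec in counts:
        total = sum(vec) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(c / total) * math.log(n_docs / df[i]) for i, c in enumerate(vec)]
        )
    return vocab, vectors


# Hypothetical sample corpus, for illustration only.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)
```

In this sample, "the" appears in every document, so its inverse document frequency is `log(3 / 3) = 0` and its TF-IDF weight is zero, while "dog" appears in only one document and receives the largest weight in that document's vector.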