# NLPHub/stages/feature_engineering.py
import streamlit as st
def main():
    st.title("Step 6: Feature Engineering")
    st.markdown("""
### :mag: What is Text Vectorization? :bar_chart:

**Feature engineering** for text data centres on **text vectorization**: the process of transforming unstructured text into numerical form so that machine learning models can learn from it. Why is this needed?

**:brain: Why is Vectorization Necessary?**

- **Machine learning models understand only numbers**: Algorithms operate on numerical data, so raw text must be converted into numbers before a model can process it.
- **It makes text usable for models**: Vectorization translates the meaning of words, sentences, and documents into a format models can interpret, letting them identify patterns, relationships, and insights.

**:bulb: Think of it this way**: Text is a language, but models only speak in numbers. Vectorization is the translator that lets them communicate effectively!

**Common Vectorization Techniques:**

- **Bag of Words (BoW)**: Counts word occurrences in a document, turning each document into a numerical vector.
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weighs words by how often they appear in a document and how rare they are across all documents.
- **Word Embeddings**: Models such as **Word2Vec**, **GloVe**, and **FastText** represent words as dense vectors that capture semantic meaning and context.

**:key: In Short**: Vectorization transforms raw text into a format that machine learning models can process and understand. Without it, models would have no way to interpret the meaning hidden in text data.
""")
st.divider()
main()
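The word-embedding idea can be illustrated without a trained model. The toy lookup table below uses hand-picked 3-dimensional vectors (made up for this sketch, not real Word2Vec or GloVe output) to show how dense vectors let us compare words by cosine similarity:

```python
import numpy as np

# Toy embedding table: illustrative 3-d vectors. Real Word2Vec/GloVe
# embeddings typically have 100-300 dimensions learned from large corpora.
embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, ~0.0 for orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Semantically related words sit close together in the vector space
print(cosine(embeddings["cat"], embeddings["dog"]))  # high
print(cosine(embeddings["cat"], embeddings["car"]))  # low
```

This geometric closeness is what BoW and TF-IDF cannot capture: in those representations, "cat" and "dog" are just two unrelated columns.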