import streamlit as st


def main():
    st.title("Step 6: Feature Engineering")

    st.markdown("""
    ### **:mag: What is Text Vectorization?**

    :bar_chart: **Feature Engineering** for text data mainly involves **Text Vectorization**, which is the key process of transforming unstructured text data into numerical form so that machine learning models can understand and learn from it. Why do we need this?

    **:brain: Why is Vectorization Necessary?**

    - **Machine Learning Models Understand Only Numbers**: Algorithms can only work with numerical data, so raw text needs to be converted into numbers for the model to process it.
    - **Makes Text Usable for Models**: Vectorization translates the meaning of words, sentences, and documents into a format that models can interpret. This allows them to identify patterns, relationships, and insights.

    **:bulb: Think about it this way**: Text is like a language, but models only speak in numbers. Vectorization is the translator that helps them communicate effectively!

    **Common Vectorization Techniques:**

    - **Bag of Words (BoW)**: Counts word occurrences in a document, turning them into a numerical vector.
    - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Weighs words based on how often they appear in the document and how unique they are across all documents.
    - **Word Embeddings**: Uses models like **Word2Vec**, **GloVe**, and **FastText** to represent words as dense vectors, capturing semantic meaning and context.

    **:key: In Short**: Vectorization is crucial because it transforms raw text into a format that machine learning models can process and understand. Without it, models would have no way to interpret the meaning hidden in the text data.
    """)

    st.divider()


main()
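
The Bag of Words and TF-IDF descriptions in the page text can be illustrated with a small standard-library sketch. This is not part of the Streamlit app: the helper names (`tokenize`, `bag_of_words`, `tf_idf`), the whitespace tokenizer, and the three sample sentences are assumptions made for illustration, and the weighting used here (`tf = count / document length`, `idf = log(N / df)`) is only one common textbook variant. Libraries such as scikit-learn apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [tokenize(d) for d in docs]
    # The vocabulary is every distinct token, sorted for a stable column order.
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) with tf = count / doc length
    and idf = log(N / df). Assumed variant, without smoothing."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for vec in counts if vec[i] > 0) for i in range(len(vocab))]
    vectors = []
    for vec in counts:
        total = sum(vec)
        vectors.append([
            (c / total) * math.log(n_docs / df[i]) if total else 0.0
            for i, c in enumerate(vec)
        ])
    return vocab, vectors


# Hypothetical sample corpus.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, bow = bag_of_words(docs)
_, weights = tf_idf(docs)

print(vocab)
for counts_row, weight_row in zip(bow, weights):
    print(counts_row, [round(w, 3) for w in weight_row])
```

In this corpus "the" occurs in every document, so its IDF is `log(3 / 3) = 0` and its TF-IDF weight vanishes, while a word such as "dog", which appears in only one document, keeps a positive weight. That is the "how unique they are across all documents" behavior the page describes.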