import streamlit as st

st.header("Vectorization 🧭")
st.markdown(
    """

Vectorization is the process of converting text into numerical vectors.

This allows ML models to process text data effectively.
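As a minimal sketch in plain Python (a toy count-based vectorizer, not a real library API), a document can be turned into a vector of word counts:

```python
# Toy bag-of-words vectorization: each document becomes a vector of
# word counts over a shared vocabulary.
corpus = ["apple is good", "biryani is not good"]

# Unique words across the corpus, sorted for a stable ordering.
vocab = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc):
    # Count how many times each vocabulary term appears in the document.
    words = doc.split()
    return [words.count(term) for term in vocab]

vectors = [to_count_vector(doc) for doc in corpus]
```

Each position in the vector corresponds to one vocabulary word; this is the simplest form of vectorization, before any semantic meaning is preserved.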

""", unsafe_allow_html=True ) st.markdown(""" There are advance vectorization techniques.They are : """, unsafe_allow_html=True) st.sidebar.title("Navigation 🧭") file_type = st.sidebar.radio( "Choose a Vectorization technique :", ("Word2Vec", "Fasttext")) st.header("Word Embedding Technique") st.markdown(''' - It is a advanced vectorization technique it converts text into vectors in such a way that it preserves semantic meaning - All the techniques which preserves semantic meaning while converting text into vector is word embedding technique - There are 2 word embedding techniques: - Word2Vec - Fasttext ''') if file_type == "Word2Vec": st.title(":red[Word2Vec]") st.markdown( """

📌 How Word2Vec Works?

        { w1: [v1], w2: [v2], w3: [v3] }
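A toy sketch of that mapping in plain Python (the 3-dimensional vectors are invented for illustration; real models learn 100-300 dimensions):

```python
# After training, a Word2Vec model is essentially a lookup table
# from each vocabulary word to its learned vector.
# The values below are made up for illustration.
trained_vocab = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.9],
    "apple": [0.1, 0.2, 0.8],
}

def get_vector(word):
    # Plain Word2Vec has no vector for unseen words (raises KeyError).
    return trained_vocab[word]

king_vec = get_vector("king")
```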
        
""", unsafe_allow_html=True, ) st.markdown( """

βš™οΈ Training vs. Test Time

""", unsafe_allow_html=True, ) st.markdown( """

πŸ” How Does It Preserve Meaning?

""", unsafe_allow_html=True, ) st.markdown( """

📚 Why Is the Corpus Important?

""", unsafe_allow_html=True, ) st.markdown(''' - Word2Vec is not converting document into vector, it is converting word to vector - There are 2 techniques by using which we can convert entire document into vector - They are : - Average Word2Vec - TIF-IDF Word2Vec ''') st.subheader(":blue[Average Word2Vec]") st.markdown( """

📌 Step-by-Step Process
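The averaging step can be sketched in plain Python (toy 3-dimensional vectors, values invented for illustration):

```python
# Average Word2Vec: the document vector is the element-wise mean
# of the vectors of its words. The vectors below are made up.
word_vectors = {
    "apple": [0.1, 0.2, 0.8],
    "is":    [0.5, 0.5, 0.5],
    "good":  [0.9, 0.1, 0.3],
}

def average_word2vec(doc):
    # Collect each word's vector, then average dimension by dimension.
    vecs = [word_vectors[w] for w in doc.split()]
    dims = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dims)]

doc_vec = average_word2vec("apple is good")
```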

""", unsafe_allow_html=True, ) st.markdown( """

⚠️ Problem: Equal Importance to Every Word

""", unsafe_allow_html=True, ) st.markdown( """ Word2Vec averages word meanings, but lacks weightage for important words! """, unsafe_allow_html=True, ) st.subheader(":blue[TF-IDF Word2Vec]") st.markdown( """

⚠️ Issue with Average Word2Vec

""", unsafe_allow_html=True, ) st.markdown( """

🚀 Solution: Adding Weightage

""", unsafe_allow_html=True, ) st.markdown( """ Final Weighted Representation:
        v_final = (TF-IDF(w1) * v1 + TF-IDF(w2) * v2 + TF-IDF(w3) * v3) 
                 / (TF-IDF(w1) + TF-IDF(w2) + TF-IDF(w3))
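The formula above can be sketched directly in plain Python (the TF-IDF scores and word vectors are invented for illustration):

```python
# TF-IDF Word2Vec: weight each word vector by its TF-IDF score,
# then divide by the sum of the weights. All values are made up.
tfidf = {"apple": 1.2, "is": 0.1, "good": 0.8}
word_vectors = {
    "apple": [0.1, 0.2, 0.8],
    "is":    [0.5, 0.5, 0.5],
    "good":  [0.9, 0.1, 0.3],
}

def tfidf_word2vec(doc):
    words = doc.split()
    total_weight = sum(tfidf[w] for w in words)
    dims = len(word_vectors[words[0]])
    return [
        sum(tfidf[w] * word_vectors[w][d] for w in words) / total_weight
        for d in range(dims)
    ]

doc_vec = tfidf_word2vec("apple is good")
```

Rare, informative words (high TF-IDF, like "apple") now pull the document vector more strongly than common words like "is".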
        
""", unsafe_allow_html=True, ) st.subheader("How to train our own W2V model") st.markdown(''' - At training time Corpus + W2V algorithm can be implemented by 2 techniques - They are: - Skip-gram - CBOW ''') st.subheader(":red[CBOW]") st.markdown( """

What is CBOW?

CBOW (Continuous Bag of Words) is a technique where we use surrounding words (context) to predict the target word (focus word).

""", unsafe_allow_html=True, ) st.markdown( """

📂 Example Corpus

We first preprocess the data to extract meaningful relationships.

""", unsafe_allow_html=True, ) st.markdown( """

📌 Steps to Process the Data

""", unsafe_allow_html=True, ) st.markdown( """

Handling Variable Context Length

""", unsafe_allow_html=True, ) st.markdown( """ Mathematical Representation:
        y = f(xi)
        where,
        y = Focus Word (Target)
        xi = Context Words (Neighbors)
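The (context → focus) training pairs behind this mapping can be generated with a sliding window (toy window of 1 word on each side; real models often use 2-5):

```python
# CBOW training data: for every position in the sentence, the
# surrounding words are the context (xi) and the middle word
# is the focus (y).
sentence = "apple is good for health".split()
window = 1  # toy context size; real models often use 2-5

pairs = []
for i, focus in enumerate(sentence):
    # Words up to `window` positions to the left and right.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    pairs.append((context, focus))
```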
        
""", unsafe_allow_html=True, ) st.markdown( """

Training with Artificial Neural Networks

The tabular data is passed to an Artificial Neural Network (ANN) which learns:

""", unsafe_allow_html=True, ) st.subheader(":red[Skipgram]") st.markdown( """

What is Skip-gram?

Skip-gram is a technique where we use the focus word to predict its context words.

""", unsafe_allow_html=True, ) st.markdown( """

📂 Example Corpus

We first preprocess the data to extract meaningful relationships.

""", unsafe_allow_html=True, ) st.markdown( """

📌 Steps to Process the Data

""", unsafe_allow_html=True, ) st.markdown( """

Handling Variable Context Length

""", unsafe_allow_html=True, ) st.markdown( """ Mathematical Representation:
        y = f(xi)
        where,
        y = Context Words
        xi = Focus Word
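The reversed (focus → context) training pairs can be generated the same way (toy window of 1):

```python
# Skip-gram training data: each (focus word, one context word)
# combination becomes its own training pair.
sentence = "apple is good for health".split()
window = 1  # toy context size

pairs = []
for i, focus in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    for ctx in context:
        pairs.append((focus, ctx))
```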
        
""", unsafe_allow_html=True, ) st.markdown( """

Training with Artificial Neural Networks

The tabular data is passed to an Artificial Neural Network (ANN) which learns:

""", unsafe_allow_html=True, ) elif file_type == "Fasttext": st.title(":red[Fasttext]") st.markdown( """

FastText is an advanced word vectorization technique that enhances word embeddings by considering subword information.

It is a simple extension of Word2Vec, which converts words into vectors.

""", unsafe_allow_html=True, ) st.markdown( """

Implementing FastText

FastText can be implemented using:

""", unsafe_allow_html=True, ) st.markdown( """ CBOW Representation:
        y = f(xi)
        where,
        y = Focus Word
        xi = Context Words
        
Skip-gram Representation:
        y = f(xi)
        where,
        y = Context Words
        xi = Focus Word
        
""", unsafe_allow_html=True, ) st.markdown( """

Problem: Out-of-Vocabulary (OOV)

Traditional word embedding techniques fail when encountering new or rare words.

FastText overcomes this issue by breaking words into subword units (character n-grams).
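A sketch of the subword split (FastText actually uses a range of n-gram sizes, with `<` and `>` marking word boundaries):

```python
# Character n-grams with boundary markers, as FastText uses them.
# n is fixed at 3 here for simplicity.
def char_ngrams(word, n=3):
    marked = "<" + word + ">"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

grams = char_ngrams("apple")
```

Because an unseen word still shares n-grams with known words, it can be assigned a vector instead of failing.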

""", unsafe_allow_html=True, ) st.markdown( """

Implementing CBOW with Character N-Grams

A tabular format is created with context words and focus words.

""", unsafe_allow_html=True, ) st.markdown( """ ## Example Sentences: - **d1:** "apple is good for health" - **d2:** "biryani is not good for health" This application creates a table for **context words** and **focus words** using **character 2-grams**. """ ) st.markdown(''' -Character 2-Gram Table: - "Context Words": ["ap", "pp", "pl", "le", "is"] - "Focus Words": ["go", "oo", "od"] ''') st.markdown( """ - This representation provides an **average 2D vector** for words. """ ) st.markdown( """

Vocabulary

The vocabulary consists of unique character n-grams.

        { keys: values }
        where,
        - Keys: Character n-grams
        - Values: Vector representations
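A toy sketch of how a word vector is assembled from this n-gram vocabulary (the 2-gram vectors are invented; real FastText additionally hashes n-grams into buckets):

```python
# The vocabulary maps character n-grams (keys) to vectors (values);
# a word's vector is the average of its n-grams' vectors.
# All values below are made up for illustration.
ngram_vectors = {
    "ap": [0.2, 0.4],
    "pp": [0.6, 0.0],
    "pl": [0.1, 0.5],
    "le": [0.3, 0.1],
}

def word_vector(word, n=2):
    # Split the word into character n-grams, keep the known ones,
    # and average their vectors dimension by dimension.
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    known = [ngram_vectors[g] for g in grams if g in ngram_vectors]
    dims = 2  # dimensionality of the toy vectors above
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]

vec = word_vector("apple")  # uses "ap", "pp", "pl", "le"
```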
        
""", unsafe_allow_html=True, ) st.markdown( """

FastText Model

""", unsafe_allow_html=True, )