import streamlit as st
st.markdown("""
""", unsafe_allow_html=True)
st.header("Vectorizationπ§")
st.markdown(
"""
Vectorization is the process of converting text into numerical vectors.
This allows ML models to process text data effectively.
""",
unsafe_allow_html=True
)
st.markdown("""
There are advanced vectorization techniques. They are:
- Word Embedding
- Word2Vec
- FastText
""", unsafe_allow_html=True)
st.sidebar.title("Navigation 🧭")
file_type = st.sidebar.radio(
"Choose a Vectorization technique :",
("Word2Vec", "Fasttext"))
st.header("Word Embedding Technique")
st.markdown('''
- It is an advanced vectorization technique that converts text into vectors in a way that preserves semantic meaning
- Any technique that preserves semantic meaning while converting text into vectors is a word embedding technique
- There are 2 word embedding techniques:
- Word2Vec
- FastText
''')
if file_type == "Word2Vec":
st.title(":red[Word2Vec]")
st.markdown(
"""
📌 How Word2Vec Works
- After training, we obtain the final Word2Vec model
- The model stores a dictionary with word-vector pairs:
{ w1: [v1], w2: [v2], w3: [v3] }
""",
unsafe_allow_html=True,
)
st.markdown(
"""
⚙️ Training vs. Test Time
- Training Time: Corpus + Deep Learning Algorithm β Generates Model
- Test Time: Word β Looked up in Dictionary β Returns Vector Representation
""",
unsafe_allow_html=True,
)
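st.markdown('''
A minimal sketch of this train-then-look-up flow, assuming the `gensim` library (the toy corpus and parameters below are illustrative):
''')
st.code('''
from gensim.models import Word2Vec

# Training time: each document in the corpus is a list of tokens
corpus = [
    ["apple", "is", "good", "for", "health"],
    ["biryani", "is", "not", "good", "for", "health"],
]

# Training builds the word -> vector dictionary
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

# Test time: look the word up in the dictionary to get its vector
vector = model.wv["apple"]
print(vector.shape)  # (50,)
''', language="python")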
st.markdown(
"""
📌 How Does It Preserve Meaning?
- It learns from the context of words in the corpus
- When given a word, it checks in the dictionary and retrieves the semantic vector
- Unlike other models, dimensions are not words, but their meanings
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Why Is the Corpus Important?
- The Word2Vec algorithm is completely dependent on the corpus
- Better corpus β Better word representation
- It preserves semantic meaning using neighborhood words (context)
""",
unsafe_allow_html=True,
)
st.markdown('''
- Word2Vec does not convert a document into a vector; it converts each word into a vector
- There are 2 techniques for converting an entire document into a vector
- They are:
- Average Word2Vec
- TF-IDF Word2Vec
''')
st.subheader(":blue[Average Word2Vec]")
st.markdown(
"""
📌 Step-by-Step Process
- Convert each word in the document into its Word2Vec vector
- Take the element-wise average of all these word vectors
- The resulting single vector represents the entire document
""",
unsafe_allow_html=True,
)
st.markdown(
"""
⚠️ Problem: Equal Importance to Every Word
- Averaging assigns equal weight to all words
- No emphasis on important words that carry significant meaning
- This limits the effectiveness in understanding word importance
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Average Word2Vec averages word meanings, but gives no extra weight to important words!
""",
unsafe_allow_html=True,
)
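st.markdown('''
A minimal sketch of Average Word2Vec, assuming a trained `gensim` model like the one above (the helper name is illustrative):
''')
st.code('''
import numpy as np

def average_word2vec(tokens, model):
    """Average the Word2Vec vectors of all in-vocabulary tokens."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vector = average_word2vec(["apple", "is", "good"], model)
''', language="python")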
st.subheader(":blue[TF-IDF Word2Vec]")
st.markdown(
"""
⚠️ Issue with Average Word2Vec
- Gives equal importance to every word
- Even words that appear frequently in a document but rarely in the corpus get equal weight
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Solution: Adding Weightage
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Final Weighted Representation:
v_final = (TF-IDF(w1) * v1 + TF-IDF(w2) * v2 + TF-IDF(w3) * v3)
/ (TF-IDF(w1) + TF-IDF(w2) + TF-IDF(w3))
""",
unsafe_allow_html=True,
)
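st.markdown('''
A minimal sketch of this weighted average, assuming `scikit-learn` for the TF-IDF scores and the trained `gensim` model from above (names are illustrative):
''')
st.code('''
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple is good for health", "biryani is not good for health"]
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_  # word -> column index

def tfidf_word2vec(doc_index, tokens, model):
    """Weight each word vector by its TF-IDF score, then normalize."""
    weighted_sum = np.zeros(model.vector_size)
    total_weight = 0.0
    for t in tokens:
        if t in model.wv and t in vocab:
            w = tfidf_matrix[doc_index, vocab[t]]
            weighted_sum += w * model.wv[t]
            total_weight += w
    return weighted_sum / total_weight if total_weight else weighted_sum

doc_vec = tfidf_word2vec(0, "apple is good for health".split(), model)
''', language="python")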
st.subheader("How to train our own W2V model")
st.markdown('''
- At training time, the Corpus + Word2Vec algorithm step can be implemented using 2 techniques
- They are:
- Skip-gram
- CBOW
''')
st.subheader(":red[CBOW]")
st.markdown(
"""
What is CBOW?
CBOW (Continuous Bag of Words) is a technique where we use surrounding words (context) to predict the target word (focus word).
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Example Corpus
- d1: w1, w2, w3, w4, w5, w4
- d2: w3, w4, w5, w2, w1, w2, w3, w4
We first preprocess the data to extract meaningful relationships.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Steps to Process the Data
- Create a vocabulary from the entire corpus:
{w1, w2, w3, w4, w5}
- Generate a tabular dataset with:
- Feature variables (Context Words)
- Class variables (Target Words)
- Apply a window size of 2 (how many neighbors we consider).
- Slide the window over the text with slide = 1.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Handling Variable Context Length
- To ensure a consistent feature length, we use zero-padding when needed.
- The model tries to understand relationships based on the surrounding context words.
""",
unsafe_allow_html=True,
)
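st.markdown('''
A minimal sketch of building these (context, target) pairs with window size 2 and zero-padding, written from the description above (pure Python; the `<PAD>` token is an assumption):
''')
st.code('''
def cbow_pairs(tokens, window=2):
    """For each focus word, collect up to `window` neighbors on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Zero-pad so every example has the same feature length (2 * window)
        context += ["<PAD>"] * (2 * window - len(context))
        pairs.append((context, target))
    return pairs

for context, target in cbow_pairs(["w1", "w2", "w3", "w4", "w5", "w4"]):
    print(context, "->", target)
''', language="python")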
st.markdown(
"""
Mathematical Representation:
y = f(xi)
where,
y = Focus Word (Target)
xi = Context Words (Neighbors)
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Training with Artificial Neural Networks
The tabular data is passed to an Artificial Neural Network (ANN) which learns:
- How context words are related to focus words.
""",
unsafe_allow_html=True,
)
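st.markdown('''
In practice a library handles the network; a minimal `gensim` sketch where `sg=0` selects CBOW (parameters are illustrative):
''')
st.code('''
from gensim.models import Word2Vec

corpus = [["w1", "w2", "w3", "w4", "w5", "w4"],
          ["w3", "w4", "w5", "w2", "w1", "w2", "w3", "w4"]]

# sg=0 -> CBOW: context words predict the focus word
cbow_model = Word2Vec(sentences=corpus, vector_size=10, window=2,
                      min_count=1, sg=0)
print(cbow_model.wv["w1"])
''', language="python")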
st.subheader(":red[Skipgram]")
st.markdown(
"""
What is Skipgram?
Skipgram is a technique where we use the focus word to predict its context words.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Example Corpus
- d1: w1, w2, w3, w4, w5, w4
- d2: w3, w4, w5, w2, w1, w2, w3, w4
We first preprocess the data to extract meaningful relationships.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Steps to Process the Data
- Create a vocabulary from the entire corpus:
{w1, w2, w3, w4, w5}
- Generate a tabular dataset with:
- Feature variables (Focus Words)
- Class variables (Context Words)
- Apply a window size of 2 (how many neighbors we consider).
- Slide the window over the text with slide = 1.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Handling Variable Context Length
- To ensure a consistent feature length, we use zero-padding when needed.
- The model tries to understand relationships between focus words and their context words.
""",
unsafe_allow_html=True,
)
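st.markdown('''
A minimal sketch of building (focus, context) pairs for Skip-gram, mirroring the CBOW sketch above (pure Python, names illustrative):
''')
st.code('''
def skipgram_pairs(tokens, window=2):
    """For each focus word, emit one (focus, context) pair per neighbor."""
    pairs = []
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

for focus, context in skipgram_pairs(["w1", "w2", "w3", "w4", "w5", "w4"]):
    print(focus, "->", context)
''', language="python")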
st.markdown(
"""
Mathematical Representation:
y = f(xi)
where,
y = Context Words
xi = Focus Word
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Training with Artificial Neural Networks
The tabular data is passed to an Artificial Neural Network (ANN) which learns:
- How focus words are related to context words.
""",
unsafe_allow_html=True,
)
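st.markdown('''
As with CBOW, a library call covers the training; a minimal `gensim` sketch where `sg=1` selects Skip-gram (parameters are illustrative):
''')
st.code('''
from gensim.models import Word2Vec

corpus = [["w1", "w2", "w3", "w4", "w5", "w4"],
          ["w3", "w4", "w5", "w2", "w1", "w2", "w3", "w4"]]

# sg=1 -> Skip-gram: the focus word predicts its context words
sg_model = Word2Vec(sentences=corpus, vector_size=10, window=2,
                    min_count=1, sg=1)
print(sg_model.wv.most_similar("w1"))
''', language="python")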
elif file_type == "FastText":
st.title(":red[FastText]")
st.markdown(
"""
FastText is an advanced word vectorization technique that enhances word embeddings by considering subword information.
It is a simple extension of Word2Vec, which converts words into vectors.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Implementing FastText
FastText can be implemented using:
- CBOW (Continuous Bag of Words)
- Skip-gram
""",
unsafe_allow_html=True,
)
st.markdown(
"""
CBOW Representation:
y = f(xi)
where,
y = Focus Word
xi = Context Words
Skip-gram Representation:
y = f(xi)
where,
y = Context Words
xi = Focus Word
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Problem: Out-of-Vocabulary (OOV)
Traditional word embedding techniques fail when encountering new or rare words.
FastText overcomes this issue by breaking words into subword units (character n-grams).
""",
unsafe_allow_html=True,
)
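st.markdown('''
A minimal sketch of FastText building a vector for an unseen word, assuming `gensim` (corpus, parameters, and the query word are illustrative):
''')
st.code('''
from gensim.models import FastText

corpus = [["apple", "is", "good", "for", "health"],
          ["biryani", "is", "not", "good", "for", "health"]]

# min_n / max_n control the character n-gram sizes used for subwords
model = FastText(sentences=corpus, vector_size=10, window=2,
                 min_count=1, min_n=2, max_n=4)

# "apples" never appears in the corpus, but FastText can still build
# a vector for it from the character n-grams it shares with "apple"
print(model.wv["apples"])
''', language="python")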
st.markdown(
"""
Implementing CBOW with Character N-Grams
- Window Size: 5
- Window: 2
- Slide: 1
A tabular format is created with context words and focus words.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
## Example Sentences:
- **d1:** "apple is good for health"
- **d2:** "biryani is not good for health"
This application creates a table for **context words** and **focus words** using **character 2-grams**.
"""
)
st.markdown('''
- Character 2-Gram Table:
- "Context Words": ["ap", "pp", "pl", "le", "is"]
- "Focus Words": ["go", "oo", "od"]
''')
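st.markdown('''
A minimal sketch of producing such character 2-grams (pure Python; note that the real FastText implementation also adds `<` and `>` word-boundary markers before splitting):
''')
st.code('''
def char_ngrams(word, n=2):
    """Split a word into overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("apple"))  # ['ap', 'pp', 'pl', 'le']
print(char_ngrams("good"))   # ['go', 'oo', 'od']
''', language="python")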
st.markdown(
"""
- This representation provides an **averaged vector** for each word, built from its character 2-gram vectors.
"""
)
st.markdown(
"""
Vocabulary
The vocabulary consists of unique character n-grams.
{ keys: values }
where,
- Keys: Character n-grams
- Values: Vector representations
""",
unsafe_allow_html=True,
)
st.markdown(
"""
FastText Model
- The dictionary created is the FastText model.
- Text is broken down into character n-grams to generate vector representations.
- It follows element-wise addition of the n-gram vectors, giving an averaged representation of the word.
""",
unsafe_allow_html=True,
)