import streamlit as st
st.markdown("""
""", unsafe_allow_html=True)
st.header("Vectorizationπ§")
st.markdown(
"""
Vectorization is the process of converting text into numerical vectors.
This allows ML models to process text data effectively.
""",
unsafe_allow_html=True
)
st.markdown("""
There are advanced vectorization techniques. They are:
- Word Embedding
- Word2Vec
- FastText
""", unsafe_allow_html=True)
st.sidebar.title("Navigation 🧭")
file_type = st.sidebar.radio(
"Choose a Vectorization technique :",
("Word2Vec", "Fasttext"))
st.header("Word Embedding Technique")
st.markdown('''
- It is an advanced vectorization technique that converts text into vectors in a way that preserves semantic meaning
- Any technique that preserves semantic meaning while converting text into vectors is a word embedding technique
- There are 2 word embedding techniques:
- Word2Vec
- FastText
''')
if file_type == "Word2Vec":
st.title(":red[Word2Vec]")
st.markdown(
"""
📌 How Word2Vec Works
- After training, we obtain the final Word2Vec model
- The model stores a dictionary with word-vector pairs:
{ w1: [v1], w2: [v2], w3: [v3] }
""",
unsafe_allow_html=True,
)
st.markdown(
"""
⚙️ Training vs. Test Time
- Training Time: Corpus + Deep Learning Algorithm β Generates Model
- Test Time: Word β Looked up in Dictionary β Returns Vector Representation
""",
unsafe_allow_html=True,
)
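st.markdown('''
A minimal sketch of this train-then-look-up flow, assuming the `gensim` library (the toy corpus and parameters below are illustrative):
''')
st.code('''
from gensim.models import Word2Vec

# Training time: each document in the corpus is a list of tokens
corpus = [
    ["apple", "is", "good", "for", "health"],
    ["biryani", "is", "not", "good", "for", "health"],
]

# Training builds the word -> vector dictionary
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

# Test time: look the word up in the dictionary to get its vector
vector = model.wv["apple"]
print(vector.shape)  # (50,)
''', language="python")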
st.markdown(
"""
📌 How Does It Preserve Meaning?
- It learns from the context of words in the corpus
- When given a word, it checks in the dictionary and retrieves the semantic vector
- Unlike other models, dimensions are not words, but their meanings
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Why Is the Corpus Important?
- The Word2Vec algorithm is completely dependent on the corpus
- Better corpus β Better word representation
- It preserves semantic meaning using neighborhood words (context)
""",
unsafe_allow_html=True,
)
st.markdown('''
- Word2Vec does not convert a document into a vector; it converts each word into a vector
- There are 2 techniques for converting an entire document into a vector
- They are:
- Average Word2Vec
- TF-IDF Word2Vec
''')
st.subheader(":blue[Average Word2Vec]")
st.markdown(
"""
📌 Step-by-Step Process
- Convert each word in the document into its Word2Vec vector
- Take the element-wise average of all these word vectors
- The resulting single vector represents the entire document
""",
unsafe_allow_html=True,
)
st.markdown(
"""
⚠️ Problem: Equal Importance to Every Word
- Averaging assigns equal weight to all words
- No emphasis on important words that carry significant meaning
- This limits the effectiveness in understanding word importance
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Average Word2Vec averages word meanings, but gives no extra weight to important words!
""",
unsafe_allow_html=True,
)
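st.markdown('''
A minimal sketch of Average Word2Vec, assuming a trained `gensim` model like the one above (the helper name is illustrative):
''')
st.code('''
import numpy as np

def average_word2vec(tokens, model):
    """Average the Word2Vec vectors of all in-vocabulary tokens."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

doc_vector = average_word2vec(["apple", "is", "good"], model)
''', language="python")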
st.subheader(":blue[TF-IDF Word2Vec]")
st.markdown(
"""
⚠️ Issue with Average Word2Vec
- Gives equal importance to every word
- Even words that appear frequently in a document but rarely in the corpus get equal weight
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Solution: Adding Weightage
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Final Weighted Representation:
v_final = (TF-IDF(w1) * v1 + TF-IDF(w2) * v2 + TF-IDF(w3) * v3)
/ (TF-IDF(w1) + TF-IDF(w2) + TF-IDF(w3))
""",
unsafe_allow_html=True,
)
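st.markdown('''
A minimal sketch of this weighted average, assuming `scikit-learn` for the TF-IDF scores and the trained `gensim` model from above (names are illustrative):
''')
st.code('''
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple is good for health", "biryani is not good for health"]
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_  # word -> column index

def tfidf_word2vec(doc_index, tokens, model):
    """Weight each word vector by its TF-IDF score, then normalize."""
    weighted_sum = np.zeros(model.vector_size)
    total_weight = 0.0
    for t in tokens:
        if t in model.wv and t in vocab:
            w = tfidf_matrix[doc_index, vocab[t]]
            weighted_sum += w * model.wv[t]
            total_weight += w
    return weighted_sum / total_weight if total_weight else weighted_sum

doc_vec = tfidf_word2vec(0, "apple is good for health".split(), model)
''', language="python")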
st.subheader("How to train our own W2V model")
st.markdown('''
- At training time, the Corpus + Word2Vec algorithm step can be implemented using 2 techniques
- They are:
- Skip-gram
- CBOW
''')
st.subheader(":red[CBOW]")
st.markdown(
"""
What is CBOW?
CBOW (Continuous Bag of Words) is a technique where we use surrounding words (context) to predict the target word (focus word).
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Example Corpus
- d1: w1, w2, w3, w4, w5, w4
- d2: w3, w4, w5, w2, w1, w2, w3, w4
We first preprocess the data to extract meaningful relationships.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Steps to Process the Data
- Create a vocabulary from the entire corpus:
{w1, w2, w3, w4, w5}
- Generate a tabular dataset with:
- Feature variables (Context Words)
- Class variables (Target Words)
- Apply a window size of 2 (how many neighbors we consider).
- Slide the window over the text with slide = 1.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Handling Variable Context Length
- To ensure a consistent feature length, we use zero-padding when needed.
- The model tries to understand relationships based on the surrounding context words.
""",
unsafe_allow_html=True,
)
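st.markdown('''
A minimal sketch of building these (context, target) pairs with window size 2 and zero-padding, written from the description above (pure Python; the `<PAD>` token is an assumption):
''')
st.code('''
def cbow_pairs(tokens, window=2):
    """For each focus word, collect up to `window` neighbors on each side."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # Zero-pad so every example has the same feature length (2 * window)
        context += ["<PAD>"] * (2 * window - len(context))
        pairs.append((context, target))
    return pairs

for context, target in cbow_pairs(["w1", "w2", "w3", "w4", "w5", "w4"]):
    print(context, "->", target)
''', language="python")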
st.markdown(
"""
Mathematical Representation:
y = f(xi)
where,
y = Focus Word (Target)
xi = Context Words (Neighbors)
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Training with Artificial Neural Networks
The tabular data is passed to an Artificial Neural Network (ANN) which learns:
- How context words are related to focus words.
""",
unsafe_allow_html=True,
)
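st.markdown('''
In practice a library handles the network; a minimal `gensim` sketch where `sg=0` selects CBOW (parameters are illustrative):
''')
st.code('''
from gensim.models import Word2Vec

corpus = [["w1", "w2", "w3", "w4", "w5", "w4"],
          ["w3", "w4", "w5", "w2", "w1", "w2", "w3", "w4"]]

# sg=0 -> CBOW: context words predict the focus word
cbow_model = Word2Vec(sentences=corpus, vector_size=10, window=2,
                      min_count=1, sg=0)
print(cbow_model.wv["w1"])
''', language="python")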
st.subheader(":red[Skipgram]")
st.markdown(
"""
What is Skipgram?
Skipgram is a technique where we use the focus word to predict its context words.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Example Corpus
- d1: w1, w2, w3, w4, w5, w4
- d2: w3, w4, w5, w2, w1, w2, w3, w4
We first preprocess the data to extract meaningful relationships.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
📌 Steps to Process the Data
- Create a vocabulary from the entire corpus:
{w1, w2, w3, w4, w5}
- Generate a tabular dataset with:
- Feature variables (Focus Words)
- Class variables (Context Words)
- Apply a window size of 2 (how many neighbors we consider).
- Slide the window over the text with slide = 1.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Handling Variable Context Length
- To ensure a consistent feature length, we use zero-padding when needed.
- The model tries to understand relationships between focus words and their context words.
""",
unsafe_allow_html=True,
)
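st.markdown('''
A minimal sketch of building (focus, context) pairs for Skip-gram, mirroring the CBOW sketch above (pure Python, names illustrative):
''')
st.code('''
def skipgram_pairs(tokens, window=2):
    """For each focus word, emit one (focus, context) pair per neighbor."""
    pairs = []
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

for focus, context in skipgram_pairs(["w1", "w2", "w3", "w4", "w5", "w4"]):
    print(focus, "->", context)
''', language="python")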
st.markdown(
"""
Mathematical Representation:
y = f(xi)
where,
y = Context Words
xi = Focus Word
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Training with Artificial Neural Networks
The tabular data is passed to an Artificial Neural Network (ANN) which learns:
- How focus words are related to context words.
""",
unsafe_allow_html=True,
)
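st.markdown('''
As with CBOW, a library call covers the training; a minimal `gensim` sketch where `sg=1` selects Skip-gram (parameters are illustrative):
''')
st.code('''
from gensim.models import Word2Vec

corpus = [["w1", "w2", "w3", "w4", "w5", "w4"],
          ["w3", "w4", "w5", "w2", "w1", "w2", "w3", "w4"]]

# sg=1 -> Skip-gram: the focus word predicts its context words
sg_model = Word2Vec(sentences=corpus, vector_size=10, window=2,
                    min_count=1, sg=1)
print(sg_model.wv.most_similar("w1"))
''', language="python")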
elif file_type == "FastText":
st.title(":red[FastText]")
st.markdown(
"""
FastText is an advanced word vectorization technique that enhances word embeddings by considering subword information.
It is a simple extension of Word2Vec, which converts words into vectors.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Implementing FastText
FastText can be implemented using:
- CBOW (Continuous Bag of Words)
- Skip-gram
""",
unsafe_allow_html=True,
)
st.markdown(
"""
CBOW Representation:
y = f(xi)
where,
y = Focus Word
xi = Context Words
Skip-gram Representation:
y = f(xi)
where,
y = Context Words
xi = Focus Word
""",
unsafe_allow_html=True,
)
st.markdown(
"""
Problem: Out-of-Vocabulary (OOV)
Traditional word embedding techniques fail when encountering new or rare words.
FastText overcomes this issue by breaking words into subword units (character n-grams).
""",
unsafe_allow_html=True,
)
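st.markdown('''
A minimal sketch of FastText building a vector for an unseen word, assuming `gensim` (corpus, parameters, and the query word are illustrative):
''')
st.code('''
from gensim.models import FastText

corpus = [["apple", "is", "good", "for", "health"],
          ["biryani", "is", "not", "good", "for", "health"]]

# min_n / max_n control the character n-gram sizes used for subwords
model = FastText(sentences=corpus, vector_size=10, window=2,
                 min_count=1, min_n=2, max_n=4)

# "apples" never appears in the corpus, but FastText can still build
# a vector for it from the character n-grams it shares with "apple"
print(model.wv["apples"])
''', language="python")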
st.markdown(
"""
Implementing CBOW with Character N-Grams
- Window Size: 5
- Window: 2
- Slide: 1
A tabular format is created with context words and focus words.
""",
unsafe_allow_html=True,
)
st.markdown(
"""
## Example Sentences:
- **d1:** "apple is good for health"
- **d2:** "biryani is not good for health"
This application creates a table for **context words** and **focus words** using **character 2-grams**.
"""
)
st.markdown('''
- Character 2-Gram Table:
- "Context Words": ["ap", "pp", "pl", "le", "is"]
- "Focus Words": ["go", "oo", "od"]
''')
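st.markdown('''
A minimal sketch of producing such character 2-grams (pure Python; note that the real FastText implementation also adds `<` and `>` word-boundary markers before splitting):
''')
st.code('''
def char_ngrams(word, n=2):
    """Split a word into overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("apple"))  # ['ap', 'pp', 'pl', 'le']
print(char_ngrams("good"))   # ['go', 'oo', 'od']
''', language="python")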
st.markdown(
"""
- This representation provides an **averaged vector** for each word, built from its character 2-gram vectors.
"""
)
st.markdown(
"""
Vocabulary
The vocabulary consists of unique character n-grams.
{ keys: values }
where,
- Keys: Character n-grams
- Values: Vector representations
""",
unsafe_allow_html=True,
)
st.markdown(
"""
FastText Model
- The dictionary created is the FastText model.
- Text is broken down into character n-grams to generate vector representations.
- It follows element-wise addition of the n-gram vectors, giving an averaged representation of the word.
""",
unsafe_allow_html=True,
)