| import streamlit as st | |
| st.markdown(""" | |
| <style> | |
| /* Set a soft background color */ | |
| body { | |
| background-color: #eef2f7; | |
| } | |
| /* Style for main title */ | |
| h1 { | |
| color: black; | |
| font-family: 'Roboto', sans-serif; | |
| font-weight: 700; | |
| text-align: center; | |
| margin-bottom: 25px; | |
| } | |
| /* Style for headers */ | |
| h2 { | |
| color: black; | |
| font-family: 'Roboto', sans-serif; | |
| font-weight: 600; | |
| margin-top: 30px; | |
| } | |
| /* Style for subheaders */ | |
| h3 { | |
| color: red; | |
| font-family: 'Roboto', sans-serif; | |
| font-weight: 500; | |
| margin-top: 20px; | |
| } | |
| .custom-subheader { | |
| color: black; | |
| font-family: 'Roboto', sans-serif; | |
| font-weight: 600; | |
| margin-bottom: 15px; | |
| } | |
| /* Paragraph styling */ | |
| p { | |
| font-family: 'Georgia', serif; | |
| line-height: 1.8; | |
| color: black; | |
| margin-bottom: 20px; | |
| } | |
| /* List styling with checkmark bullets */ | |
| .icon-bullet { | |
| list-style-type: none; | |
| padding-left: 20px; | |
| } | |
| .icon-bullet li { | |
| font-family: 'Georgia', serif; | |
| font-size: 1.1em; | |
| margin-bottom: 10px; | |
| color: black; | |
| } | |
| .icon-bullet li::before { | |
| content: "◆"; | |
| padding-right: 10px; | |
| color: black; | |
| } | |
| /* Sidebar styling */ | |
| .sidebar .sidebar-content { | |
| background-color: #ffffff; | |
| border-radius: 10px; | |
| padding: 15px; | |
| } | |
| .sidebar h2 { | |
| color: #495057; | |
| } | |
| .step-box { | |
| font-size: 18px; | |
| background-color: #F0F8FF; | |
| padding: 15px; | |
| border-radius: 10px; | |
| box-shadow: 2px 2px 8px #D3D3D3; | |
| line-height: 1.6; | |
| } | |
| .box { | |
| font-size: 18px; | |
| background-color: #F0F8FF; | |
| padding: 15px; | |
| border-radius: 10px; | |
| box-shadow: 2px 2px 8px #D3D3D3; | |
| line-height: 1.6; | |
| } | |
| .title { | |
| font-size: 26px; | |
| font-weight: bold; | |
| color: #E63946; | |
| text-align: center; | |
| margin-bottom: 15px; | |
| } | |
| .formula { | |
| font-size: 20px; | |
| font-weight: bold; | |
| color: #2A9D8F; | |
| background-color: #F7F7F7; | |
| padding: 10px; | |
| border-radius: 5px; | |
| text-align: center; | |
| margin-top: 10px; | |
| } | |
| /* Custom button style */ | |
| .streamlit-button { | |
| background-color: #00FFFF; | |
| color: #000000; | |
| font-weight: bold; | |
| } | |
| </style> | |
| """, unsafe_allow_html=True) | |
| st.header("Vectorization🧭") | |
| st.markdown( | |
| """ | |
| <div class='info-box'> | |
| <p>Vectorization is the process of converting text into vector.</p> | |
| <p>This allows ML models to process text data effectively.</p> | |
| </div> | |
| """, | |
| unsafe_allow_html=True | |
| ) | |
| st.markdown(""" | |
| There are advance vectorization techniques.They are : | |
| <ul class="icon-bullet"> | |
| <li>Word Embedding </li> | |
| <li>Word2Vec </li> | |
| <li>Fasttext</li> | |
| </ul> | |
| """, unsafe_allow_html=True) | |
| st.sidebar.title("Navigation 🧭") | |
| file_type = st.sidebar.radio( | |
| "Choose a Vectorization technique :", | |
| ("Word2Vec", "Fasttext")) | |
| st.header("Word Embedding Technique") | |
| st.markdown(''' | |
| - It is a advanced vectorization technique it converts text into vectors in such a way that it preserves semantic meaning | |
| - All the techniques which preserves semantic meaning while converting text into vector is word embedding technique | |
| - There are 2 word embedding techniques: | |
| - Word2Vec | |
| - Fasttext | |
| ''') | |
| if file_type == "Word2Vec": | |
| st.title(":red[Word2Vec]") | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>📌 How Word2Vec Works?</h3> | |
| <ul> | |
| <li>After <strong>training</strong>, we obtain the final <span class='highlight'>Word2Vec model</span></li> | |
| <li>The model stores a <strong>dictionary</strong> with word-vector pairs:</li> | |
| </ul> | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| { w1: [v1], w2: [v2], w3: [v3] } | |
| </pre> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>⚙️ Training vs. Test Time</h3> | |
| <ul> | |
| <li><strong>Training Time</strong>: <span class='highlight'>Corpus + Deep Learning Algorithm</span> → Generates Model</li> | |
| <li><strong>Test Time</strong>: <span class='highlight'>Word</span> → Looked up in Dictionary → Returns <span class='highlight'>Vector Representation</span></li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>🔍 How Does It Preserve Meaning?</h3> | |
| <ul> | |
| <li>It learns from the <strong>context</strong> of words in the <span class='highlight'>corpus</span></li> | |
| <li>When given a word, it checks in the dictionary and retrieves the <strong>semantic vector</strong></li> | |
| <li>Unlike other models, <span class='highlight'>dimensions are not words</span>, but their meanings</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>📚 Why is Corpus Important?</h3> | |
| <ul> | |
| <li>The <strong>Word2Vec algorithm</strong> is completely dependent on the corpus</li> | |
| <li>Better corpus → Better word representation</li> | |
| <li>It <strong>preserves semantic meaning</strong> using neighborhood words (context)</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown(''' | |
| - Word2Vec is not converting document into vector, it is converting word to vector | |
| - There are 2 techniques by using which we can convert entire document into vector | |
| - They are : | |
| - Average Word2Vec | |
| - TIF-IDF Word2Vec | |
| ''') | |
| st.subheader(":blue[Average Word2Vec]") | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>📌 Step-by-Step Process</h3> | |
| <ul> | |
| <li>Given a document <span class='highlight'>d1</span>: <strong>w1, w2, w3</strong></li> | |
| <li>Retrieve vector representations <strong>v1, v2, v3</strong> from Word2Vec</li> | |
| <li>Perform <span class='highlight'>element-wise addition</span> of vectors: | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| v_total = v1 + v2 + v3 | |
| </pre> | |
| </li> | |
| <li>Normalize by dividing by the total number of words (element-wise division): | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| v_avg = v_total / len(d1) | |
| </pre> | |
| </li> | |
| <li>Final representation contains the <span class='highlight'>average meaning</span> of all words</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>⚠️ Problem: Equal Importance to Every Word</h3> | |
| <ul> | |
| <li>Word2Vec assigns <span class='highlight'>equal weight</span> to all words</li> | |
| <li>No emphasis on <strong>important words</strong> that carry significant meaning</li> | |
| <li>This limits the effectiveness in understanding <span class='highlight'>word importance</span></li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <strong>Word2Vec averages word meanings, but lacks weightage for important words! </strong> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.subheader(":blue[TF-IDF Word2Vec]") | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>⚠️ Issue with Word2Vec</h3> | |
| <ul> | |
| <li>Gives equal importance to every word</li> | |
| <li>Even words that appear frequently in a document but rarely in the corpus get equal weight</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>🚀 Solution: Adding Weightage</h3> | |
| <ul> | |
| <li>Consider a document with 3 words: <strong>w1, w2, w3</strong></li> | |
| <li>Each word has a vector representation: | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| w1 → v1, w2 → v2, w3 → v3 | |
| </pre> | |
| </li> | |
| <li>We use <span class='highlight'>two models</span>: | |
| <ul> | |
| <li><strong>TF-IDF</strong> → Computes weightage for each word</li> | |
| <li><strong>Word2Vec</strong> → Converts words into vectors</li> | |
| </ul> | |
| </li> | |
| <li>For each word, multiply its TF-IDF value with its vector</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <strong>Final Weighted Representation:</strong> | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| v_final = (TF-IDF(w1) * v1 + TF-IDF(w2) * v2 + TF-IDF(w3) * v3) | |
| / (TF-IDF(w1) + TF-IDF(w2) + TF-IDF(w3)) | |
| </pre> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.subheader("How to train our own W2V model") | |
| st.markdown(''' | |
| - At training time Corpus + W2V algorithm can be implemented by 2 techniques | |
| - They are: | |
| - Skip-gram | |
| - CBOW | |
| ''') | |
| st.subheader(":red[CBOW]") | |
| st.markdown( | |
| """ | |
| <div class='box'> | |
| <h3 style='color: #6A0572;'>What is CBOW?</h3> | |
| <p><strong>CBOW (Continuous Bag of Words)</strong> is a technique where we use surrounding words (context) to predict the target word (focus word).</p> | |
| </div> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>📂 Example Corpus</h3> | |
| <ul> | |
| <li><strong>d1:</strong> w1, w2, w3, w4, w5, w4</li> | |
| <li><strong>d2:</strong> w3, w4, w5, w2, w1, w2, w3, w4</li> | |
| </ul> | |
| <p>We first preprocess the data to extract meaningful relationships.</p> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>📌 Steps to Process the Data</h3> | |
| <ul> | |
| <li>Create a <span class='highlight'>vocabulary</span> from the entire corpus: <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">{w1, w2, w3, w4, w5}</pre></li> | |
| <li>Generate a <strong>tabular dataset</strong> with: | |
| <ul> | |
| <li><strong>Feature variables (Context Words)</strong></li> | |
| <li><strong>Class variables (Target Words)</strong></li> | |
| </ul> | |
| </li> | |
| <li>Apply a <span class='highlight'>window size</span> of 2 (how many neighbors we consider).</li> | |
| <li>Slide the window over the text with <span class='highlight'>slide = 1</span>.</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'> Handling Variable Context Length</h3> | |
| <ul> | |
| <li>To ensure a consistent feature length, we use <strong>zero-padding</strong> when needed.</li> | |
| <li>The model tries to understand relationships based on the surrounding <span class='highlight'>context words</span>.</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <strong>Mathematical Representation:</strong> | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| y = f(xi) | |
| where, | |
| y = Focus Word (Target) | |
| xi = Context Words (Neighbors) | |
| </pre> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'> Training with Artificial Neural Networks</h3> | |
| <p>The tabular data is passed to an <strong>Artificial Neural Network (ANN)</strong> which learns:</p> | |
| <ul> | |
| <li>How <span class='highlight'>context words</span> are related to <span class='highlight'>focus words</span>.</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.subheader(":red[Skipgram]") | |
| st.markdown( | |
| """ | |
| <div class='box'> | |
| <h3 style='color: #6A0572;'>What is Skipgram?</h3> | |
| <p><strong>Skipgram</strong> is a technique where we use focus words to predict the context words.</p> | |
| </div> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>📂 Example Corpus</h3> | |
| <ul> | |
| <li><strong>d1:</strong> w1, w2, w3, w4, w5, w4</li> | |
| <li><strong>d2:</strong> w3, w4, w5, w2, w1, w2, w3, w4</li> | |
| </ul> | |
| <p>We first preprocess the data to extract meaningful relationships.</p> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>📌 Steps to Process the Data</h3> | |
| <ul> | |
| <li>Create a <span class='highlight'>vocabulary</span> from the entire corpus: <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">{w1, w2, w3, w4, w5}</pre></li> | |
| <li>Generate a <strong>tabular dataset</strong> with: | |
| <ul> | |
| <li><strong>Feature variables (Focus Words)</strong></li> | |
| <li><strong>Class variables (Context Words)</strong></li> | |
| </ul> | |
| </li> | |
| <li>Apply a <span class='highlight'>window size</span> of 2 (how many neighbors we consider).</li> | |
| <li>Slide the window over the text with <span class='highlight'>slide = 1</span>.</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'> Handling Variable Context Length</h3> | |
| <ul> | |
| <li>To ensure a consistent feature length, we use <strong>zero-padding</strong> when needed.</li> | |
| <li>The model tries to understand relationships<span class='highlight'>focus words</span>.</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <strong>Mathematical Representation:</strong> | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| y = f(xi) | |
| where, | |
| y = Context Word | |
| xi = Focus Words | |
| </pre> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'> Training with Artificial Neural Networks</h3> | |
| <p>The tabular data is passed to an <strong>Artificial Neural Network (ANN)</strong> which learns:</p> | |
| <ul> | |
| <li>How <span class='highlight'>focus words</span> are related with <span class='highlight'>context words</span>.</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| elif file_type == "Fasttext": | |
| st.title(":red[Fasttext]") | |
| st.markdown( | |
| """ | |
| <p><strong>FastText</strong> is an advanced word vectorization technique that enhances word embeddings by considering subword information.</p> | |
| <p>It is a <span class='highlight'>simple extension</span> of Word2Vec, which converts words into vectors.</p> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'> Implementing FastText</h3> | |
| <p>FastText can be implemented using:</p> | |
| <ul> | |
| <li><strong>CBOW (Continuous Bag of Words)</strong></li> | |
| <li><strong>Skip-gram</strong></li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <strong>CBOW Representation:</strong> | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| y = f(xi) | |
| where, | |
| y = Focus Word | |
| xi = Context Words | |
| </pre> | |
| <strong>Skip-gram Representation:</strong> | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| y = f(xi) | |
| where, | |
| y = Context Words | |
| xi = Focus Word | |
| </pre> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'> Problem: Out-of-Vocabulary (OOV)</h3> | |
| <p>Traditional word embedding techniques fail when encountering new or rare words.</p> | |
| <p><span class='highlight'>FastText overcomes this issue</span> by breaking words into subword units (character n-grams).</p> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>Implementing CBOW with Character N-Grams</h3> | |
| <ul> | |
| <li><span class='highlight'>Window Size</span>: 5</li> | |
| <li><span class='highlight'>Window</span>: 2</li> | |
| <li><span class='highlight'>Slide</span>: 1</li> | |
| </ul> | |
| <p>A tabular format is created with <strong>context words</strong> and <strong>focus words</strong>.</p> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| ## Example Sentences: | |
| - **d1:** "apple is good for health" | |
| - **d2:** "biryani is not good for health" | |
| This application creates a table for **context words** and **focus words** using **character 2-grams**. | |
| """ | |
| ) | |
| st.markdown(''' | |
| -Character 2-Gram Table: | |
| - "Context Words": ["ap", "pp", "pl", "le", "is"] | |
| - "Focus Words": ["go", "oo", "od"] | |
| ''') | |
| st.markdown( | |
| """ | |
| - This representation provides an **average 2D vector** for words. | |
| """ | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'>Vocabulary</h3> | |
| <p>The vocabulary consists of <span class='highlight'>unique character n-grams</span>.</p> | |
| <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;"> | |
| { keys: values } | |
| where, | |
| - Keys: Character n-grams | |
| - Values: Vector representations | |
| </pre> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |
| st.markdown( | |
| """ | |
| <h3 style='color: #6A0572;'> FastText Model</h3> | |
| <ul> | |
| <li>The dictionary created is the <span class='highlight'>FastText model</span>.</li> | |
| <li>Text is broken down into <strong>character n-grams</strong> to generate vector representations.</li> | |
| <li>It follows <span class='highlight'>element-wise addition</span>, giving an <strong>average 2D representation</strong> of the word.</li> | |
| </ul> | |
| """, | |
| unsafe_allow_html=True, | |
| ) | |