Spaces: Harika22/Natural_Language_Processing


Harika22 committed on Feb 2, 2025

Commit

2e651b3


1 Parent(s): 6f30acf

Update pages/7_Advance_vectorization_techniques.py


Files changed (1)

pages/7_Advance_vectorization_techniques.py +74 -9

pages/7_Advance_vectorization_techniques.py

@@ -278,29 +278,94 @@ if file_type == "Word2Vec":
     st.markdown(
     """
-    <div class='formula'>
         <strong>Final Weighted Representation:</strong>
         <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
         v_final = (TF-IDF(w1) * v1 + TF-IDF(w2) * v2 + TF-IDF(w3) * v3)
                  / (TF-IDF(w1) + TF-IDF(w2) + TF-IDF(w3))
         </pre>
-    </div>
     """,
     unsafe_allow_html=True,
     )
     st.markdown(
     """
     <div class='box'>
-        <h3 style='color: #6A0572;'> Why This Works?</h3>
         <ul>
-            <li><span class='highlight'>Instead of equal weighting (1)</span>, we use TF-IDF values</li>
-            <li>Gives <strong>more importance</strong> to words that are key in the document</li>
-            <li>Improves the <strong>semantic representation</strong> of text</li>
         </ul>
-    </div>
     """,
     unsafe_allow_html=True,
-)

     st.markdown(
     """
         <strong>Final Weighted Representation:</strong>
         <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
         v_final = (TF-IDF(w1) * v1 + TF-IDF(w2) * v2 + TF-IDF(w3) * v3)
                  / (TF-IDF(w1) + TF-IDF(w2) + TF-IDF(w3))
         </pre>
     """,
     unsafe_allow_html=True,
     )
+    st.subheader("How to train our own W2V model")
+    st.markdown('''
+    - At training time, the W2V algorithm is run on the corpus and can be implemented with one of two techniques
+    - They are:
+        - Skip-gram
+        - CBOW
+    ''')
+    st.subheader(":red[CBOW]")
     st.markdown(
     """
     <div class='box'>
+        <h3 style='color: #6A0572;'>What is CBOW?</h3>
+        <p><strong>CBOW (Continuous Bag of Words)</strong> is a technique where we use surrounding words (context) to predict the target word (focus word).</p>
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+        <h3 style='color: #6A0572;'>📂 Example Corpus</h3>
         <ul>
+            <li><strong>d1:</strong> w1, w2, w3, w4, w5, w4</li>
+            <li><strong>d2:</strong> w3, w4, w5, w2, w1, w2, w3, w4</li>
         </ul>
+        <p>We first preprocess the corpus into pairs of context words and target words that the model can learn from.</p>
     """,
     unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+        <h3 style='color: #6A0572;'>📌 Steps to Process the Data</h3>
+        <ul>
+            <li>Create a <span class='highlight'>vocabulary</span> from the entire corpus: <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">{w1, w2, w3, w4, w5}</pre></li>
+            <li>Generate a <strong>tabular dataset</strong> with:
+                <ul>
+                    <li><strong>Feature variables (Context Words)</strong></li>
+                    <li><strong>Class variables (Target Words)</strong></li>
+                </ul>
+            </li>
+            <li>Apply a <span class='highlight'>window size</span> of 2 (how many neighbors we consider).</li>
+            <li>Slide the window over the text with <span class='highlight'>stride = 1</span>.</li>
+        </ul>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+        <h3 style='color: #6A0572;'>Handling Variable Context Length</h3>
+        <ul>
+            <li>To ensure a consistent feature length, we use <strong>zero-padding</strong> when needed.</li>
+            <li>The model learns to predict the focus word from its surrounding <span class='highlight'>context words</span>.</li>
+        </ul>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+        <strong>Mathematical Representation:</strong>
+        <pre style="background-color:#F7F7F7; padding: 10px; border-radius: 5px;">
+        y = f(xi)
+        where,
+        y = Focus Word (Target)
+        xi = Context Words (Neighbors)
+        </pre>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+        <h3 style='color: #6A0572;'>Training with Artificial Neural Networks</h3>
+        <p>The tabular data is passed to an <strong>Artificial Neural Network (ANN)</strong> which learns:</p>
+        <ul>
+            <li>How <span class='highlight'>context words</span> are related to <span class='highlight'>focus words</span>.</li>
+        </ul>
+    """,
+    unsafe_allow_html=True,
+    )
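
The CBOW preprocessing steps described in this commit (vocabulary, window, stride of 1, padding) can be sketched in plain Python as shown here. This is an illustrative sketch and not part of the commit. It assumes that a window size of 2 means two neighbors on each side of the focus word, which the page text does not state explicitly. The `cbow_pairs` helper and the `<pad>` token (standing in for the zero-padding mentioned above) are hypothetical names introduced only for this example.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical placeholder standing in for zero-padding


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context words, focus word) rows for one document.

    Assumption: `window` counts neighbors on EACH side of the focus word.
    """
    rows = []
    for i, target in enumerate(tokens):  # stride = 1: every position is a focus word
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # Pad both sides so every row has exactly 2 * window context slots.
        left = [PAD] * (window - len(left)) + left
        right = right + [PAD] * (window - len(right))
        rows.append((left + right, target))
    return rows


# Example corpus taken from the page text.
corpus = {
    "d1": ["w1", "w2", "w3", "w4", "w5", "w4"],
    "d2": ["w3", "w4", "w5", "w2", "w1", "w2", "w3", "w4"],
}

# Vocabulary built from the entire corpus.
vocabulary = sorted({w for doc in corpus.values() for w in doc})

# Tabular dataset: feature variables (context) and class variable (target).
dataset = [row for doc in corpus.values() for row in cbow_pairs(doc)]

for context, target in dataset[:3]:
    print(context, "->", target)
```

Each row of `dataset` corresponds to one line of the tabular dataset: the context list is the feature vector `xi` and the target is `y` in `y = f(xi)`. Before training a neural network on these rows, the words would still need to be encoded numerically (for example, one-hot over `vocabulary`), which this sketch leaves out.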