Harika22/Natural_Language_Processing


Harika22 committed on Feb 2, 2025

Commit f9c5382 (verified) · 1 parent: 6ac7719

Update pages/6_Feature_Engineering.py

Files changed (1)

pages/6_Feature_Engineering.py +108 -34

pages/6_Feature_Engineering.py

@@ -458,38 +458,112 @@ elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     """,
     unsafe_allow_html=True,
     )
-    st.markdown('''Example of TF-IDF
-    - In a corpus there are 3 documents d1, d2, d3
-        - d1 ➡️ w1, w2, w3, w1 ➡️ v1
-        - d1 ➡️ w1, w2, w2, w3, w4, w2, w3 ➡️ v2
-        - d1 ➡️ w1, w5 ➡️ v3
-    - values are product of two values
-    - wi = ith representation of word
-    - Vocabulary = {w1, w2, w3, w4, w5}
-    - len(voc) = 5
-        - TF(w1, d1) = 2/4
-        - TF(w2, d1) = 1/4
-        - TF(w3, d1) = 1/4
-        - TF(w4, d1) = 0/4
-        - TF(w5, d1) = 0/4
-    - TF value for every word will be going on changing as the document changes
-    - TF lies between 0 and 1 [0 ... 1] ( sort of probability)
-    - Case-1 : TF = 0 → that wi is not present in particular di
-    - Case-2 : TF = 1 → that wi is the only word present in particular di
-    - IDF(wi, C) = log(N/n)
-    - n= total no.of documents which contains wi
-    - N = total no.of documents
-    - IDF values lies between >=0 to ∞(infinite)
-        - IDF(w1, C) = log(3/3)
-        - IDF(w2, C) = log(3/2)
-        - IDF(w3, C) = log(3/2)
-        - IDF(w4, C) = log(3/1)
-        - IDF(w5, C) = log(3/1)
-    - Tf(wi, di) is calculated and stored in memory
-    - Converting document to vector by product of TF and IDF
-    - d1:v1 [0,0.04,0.04,0,0] → TF*IDF values
-    - TF * IDF values can be low or high or zero
-    ''')
+    st.markdown("<h1 class='title'>📌 Example of TF-IDF</h1>", unsafe_allow_html=True)
+    st.markdown(
+    """
+    <div class='box'>
+        <strong>Given a corpus with 3 documents:</strong><br><br>
+        <strong>d1:</strong> w1, w2, w3, w1 → v1 <br>
+        <strong>d2:</strong> w1, w2, w2, w3, w4, w2, w3 → v2 <br>
+        <strong>d3:</strong> w1, w5 → v3 <br><br>
+        <strong>Vocabulary:</strong> {w1, w2, w3, w4, w5} <br>
+        <strong>Vocabulary Size:</strong> 5 (each document vector has d = 5 dimensions)
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown("<h2 style='color: #6A0572;'>📊 Term Frequency (TF) Calculation</h2>", unsafe_allow_html=True)
+    st.markdown(
+    """
+    <div class='box'>
+        <ul>
+            <li>TF measures how often a word appears in a document.</li>
+            <li>Formula: <span class='highlight'>TF(wᵢ, dⱼ) = (Occurrences of wᵢ in dⱼ) / (Total words in dⱼ)</span></li>
+            <li>TF values change based on the document.</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+    <div class='formula'>
+        TF(w1, d1) = 2/4 = 0.5 <br>
+        TF(w2, d1) = 1/4 = 0.25 <br>
+        TF(w3, d1) = 1/4 = 0.25 <br>
+        TF(w4, d1) = 0/4 = 0 <br>
+        TF(w5, d1) = 0/4 = 0 <br>
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+    <div class='box'>
+        <ul>
+            <li>TF values always range from <strong>0 to 1</strong>.</li>
+            <li>Case-1: <span class='highlight'>TF = 0</span> → Word is not present in the document.</li>
+            <li>Case-2: <span class='highlight'>TF = 1</span> → Word is the only word in the document.</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown("<h2 style='color: #6A0572;'>📉 Inverse Document Frequency (IDF) Calculation</h2>", unsafe_allow_html=True)
+    st.markdown(
+    """
+    <div class='box'>
+        <ul>
+            <li>IDF measures how rare a word is across the entire corpus; rarer words get higher values.</li>
+            <li>Formula: <span class='highlight'>IDF(wᵢ, C) = log(N/n)</span></li>
+            <li>N = Total number of documents.</li>
+            <li>n = Number of documents containing wᵢ.</li>
+            <li>IDF values range from <strong>0</strong> (wᵢ appears in every document) to <strong>log(N)</strong> (wᵢ appears in only one document).</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown("<h2 style='color: #6A0572;'>📌 TF-IDF Calculation</h2>", unsafe_allow_html=True)
+    st.markdown(
+    """
+    <div class='box'>
+        <ul>
+            <li>We calculate TF-IDF by multiplying TF and IDF values.</li>
+            <li>Formula: <span class='highlight'>TF-IDF = TF * IDF</span></li>
+            <li>TF-IDF down-weights words that appear in many documents and keeps high weights for words concentrated in a few documents.</li>
+        </ul>
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+    <div class='formula'>
+        d1 → v1 = [0, 0.04, 0.04, 0, 0] (TF * IDF values, rounded, using log base 10)
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+    st.markdown(
+    """
+    <div class='box'>
+        The final TF-IDF values may be low, high, or even zero depending on term frequency and document frequency.
+    </div>
+    """,
+    unsafe_allow_html=True,
+    )
+st.markdown("<p style='text-align: center; font-size: 18px;'><strong>TF-IDF effectively balances word significance and document relevance! 🚀</strong></p>", unsafe_allow_html=True)
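
The worked example in this diff can be checked with a few lines of plain Python. This is a minimal sketch, not part of the commit. The helper names (`term_frequency`, `inverse_document_frequency`, `tfidf_vector`) are hypothetical. It assumes a base-10 logarithm: the page does not state the base, but base 10 is the one that reproduces the 0.04 entries in v1.

```python
import math
from collections import Counter

# Toy corpus taken from the example in the diff (d1..d3, w1..w5).
corpus = {
    "d1": ["w1", "w2", "w3", "w1"],
    "d2": ["w1", "w2", "w2", "w3", "w4", "w2", "w3"],
    "d3": ["w1", "w5"],
}
docs = list(corpus.values())

# Vocabulary = every distinct word in the corpus, in a fixed order.
vocabulary = sorted({word for doc in docs for word in doc})


def term_frequency(word, doc):
    """Occurrences of `word` in `doc` divided by the total words in `doc`."""
    return Counter(doc)[word] / len(doc)


def inverse_document_frequency(word, docs):
    """log10(N / n), where n is the number of documents containing `word`.

    Assumes `word` occurs in at least one document, so n is never zero.
    """
    n = sum(1 for doc in docs if word in doc)
    return math.log10(len(docs) / n)


def tfidf_vector(doc, docs, vocab):
    """One TF * IDF value per vocabulary word, in vocabulary order."""
    return [
        term_frequency(word, doc) * inverse_document_frequency(word, docs)
        for word in vocab
    ]


v1 = tfidf_vector(corpus["d1"], docs, vocabulary)
print([round(value, 2) for value in v1])  # [0.0, 0.04, 0.04, 0.0, 0.0]
```

The w1 entry is zero because w1 appears in all three documents, so its IDF is log(3/3) = 0. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, natural-log IDF and normalize each vector by default, so their numbers will differ from this hand calculation.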