Update pages/6_Feature_Engineering.py
pages/6_Feature_Engineering.py
CHANGED
@@ -67,6 +67,24 @@ st.markdown("""
     .sidebar h2 {
         color: #495057;
     }
+    .step-box {
+        font-size: 18px;
+        background-color: #F0F8FF;
+        padding: 15px;
+        border-radius: 10px;
+        box-shadow: 2px 2px 8px #D3D3D3;
+        line-height: 1.6;
+    }
+    .formula {
+        font-size: 20px;
+        font-weight: bold;
+        color: #2A9D8F;
+        background-color: #F7F7F7;
+        padding: 10px;
+        border-radius: 5px;
+        text-align: center;
+        margin-top: 10px;
+    }
     /* Custom button style */
     .streamlit-button {
         background-color: #00FFFF;

@@ -378,4 +396,84 @@ elif file_type == "Bag of Words(BOW)":
 
 elif file_type == "Term Frequency - Inverse Document Frequency(TF-IDF)":
     st.title(":red[Term Frequency - Inverse Document Frequency(TF-IDF)]")
+    st.markdown("""
+    ### 📌 What is TF-IDF?
+    - It is a vectorization technique that converts text into numerical vectors, weighting each word by its frequency in a document against its rarity across the corpus.
+    """)
+
+    st.subheader(":violet[🛠️ Steps in TF-IDF]")
+
+    st.markdown(
+        """
+        <div class='step-box'>
+        <ul>
+            <li><strong>Create a vocabulary:</strong> A set of unique words from the corpus.</li>
+            <li><strong>Convert each document into a vector:</strong> A d-dimensional representation.</li>
+            <li><strong>Calculate Term Frequency (TF):</strong> Measures the importance of a word within a document.</li>
+        </ul>
+        </div>
+        """,
+        unsafe_allow_html=True,
+    )
+
+    st.markdown("<div class='formula'>TF(wᵢ, dᵢ) = (Occurrences of wᵢ in dᵢ) / (Total words in dᵢ)</div>", unsafe_allow_html=True)
+
+    st.markdown(
+        """
+        <div class='step-box'>
+        <ul>
+            <li><strong>Compute Inverse Document Frequency (IDF):</strong> Measures how important a word is across all documents.</li>
+            <li><strong>For every word in the vocabulary, apply IDF:</strong></li>
+        </ul>
+        </div>
+        """,
+        unsafe_allow_html=True,
+    )
+
+    st.markdown("<div class='formula'>IDF(wᵢ, C) = log(N/n)</div>", unsafe_allow_html=True)
+
+    st.markdown(
+        """
+        <div class='step-box'>
+        - <strong>N:</strong> Total number of documents in the corpus.<br>
+        - <strong>n:</strong> Number of documents containing the word wᵢ.<br>
+        - TF-IDF captures word significance while reducing the impact of commonly used words.
+        </div>
+        """,
+        unsafe_allow_html=True,
+    )
+    st.markdown('''Example of TF-IDF
+    - A corpus contains 3 documents: d1, d2, d3
+    - d1 ➡️ w1, w2, w3, w1 ➡️ v1
+    - d2 ➡️ w1, w2, w2, w3, w4, w2, w3 ➡️ v2
+    - d3 ➡️ w1, w5 ➡️ v3
+    - Each vector entry is the product of two values, TF and IDF
+    - wᵢ = the i-th word of the vocabulary
+    - Vocabulary = {w1, w2, w3, w4, w5}
+    - len(vocab) = 5
+    - TF(w1, d1) = 2/4
+    - TF(w2, d1) = 1/4
+    - TF(w3, d1) = 1/4
+    - TF(w4, d1) = 0/4
+    - TF(w5, d1) = 0/4
+    - The TF value of a word changes from document to document
+    - TF lies between 0 and 1 [0 ... 1] (a sort of probability)
+    - Case-1: TF = 0 → wᵢ is not present in that particular dᵢ
+    - Case-2: TF = 1 → wᵢ is the only word present in that particular dᵢ
+    - IDF(wᵢ, C) = log(N/n)
+    - n = number of documents that contain wᵢ
+    - N = total number of documents
+    - IDF values lie in the range [0, ∞)
+    - IDF(w1, C) = log(3/3)
+    - IDF(w2, C) = log(3/2)
+    - IDF(w3, C) = log(3/2)
+    - IDF(w4, C) = log(3/1)
+    - IDF(w5, C) = log(3/1)
+    - TF(wᵢ, dᵢ) is calculated once and stored in memory
+    - Each document is converted to a vector by the product of TF and IDF
+    - d1 : v1 = [0, 0.04, 0.04, 0, 0] → TF·IDF values
+    - TF·IDF values can be low, high, or zero
+
+    ''')
+
 
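
The worked example added in this diff can be checked numerically. Below is a minimal standalone sketch (plain Python, not part of the PR); it assumes base-10 logarithms, which is what the example's 0.04 values imply:

```python
import math

# Corpus from the worked example: three documents over vocabulary {w1..w5}.
docs = {
    "d1": ["w1", "w2", "w3", "w1"],
    "d2": ["w1", "w2", "w2", "w3", "w4", "w2", "w3"],
    "d3": ["w1", "w5"],
}
vocab = ["w1", "w2", "w3", "w4", "w5"]
N = len(docs)  # total number of documents

def tf(word, doc):
    # Occurrences of the word in the document / total words in the document
    return doc.count(word) / len(doc)

def idf(word):
    # log(N / n), where n = number of documents that contain the word
    n = sum(1 for doc in docs.values() if word in doc)
    return math.log10(N / n)

# TF-IDF vector for d1: each entry is the product TF * IDF
v1 = [round(tf(w, docs["d1"]) * idf(w), 2) for w in vocab]
print(v1)  # [0.0, 0.04, 0.04, 0.0, 0.0]
```

Note that w1 appears in every document, so IDF(w1) = log(3/3) = 0 and it contributes nothing to the vector, which is exactly the "reducing the impact of commonly used words" behaviour described in the added section. Libraries such as scikit-learn use a smoothed, natural-log IDF, so their values will differ from this hand computation.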